Rory Bremner vs speech recognition software: we now know who would win | Radio Times

I like to think I have a reasonable ear for voices. My career as an impressionist, after all, is based, in large part, on an ability to differentiate between accents, voice types and characteristics. I tend to think of this ability as instinctive; a gift, a party trick, even. That was before I met the forensic phoneticians for my Radio 4 documentary.

Forensic phoneticians are the linguistic scientists and speech analysts whose study of people’s voices puts my instinctive ear to shame, not least by being so, well, forensic. Whereas my impressions and caricatures (for that’s what they are; they’re not an exact reproduction) are done for comic or satirical effect, the professionals’ analysis, involving speaker identification or profiling, is often used as evidence in criminal cases. In counter-terrorism, too, voice analysis is a vital tool, as security services analyse thousands of hours’ worth of speech recordings.

More fascinating than wondering whether my Donald Trump is close to the original, and funny enough (or indeed both), are the questions that form the work of the phoneticians: is the voice of the suspect the same as the one whispering a bomb threat on a police tape recording? From which part of Wales is the person who made the blackmail call? Was the pilot under the influence of alcohol while talking to air traffic control just before the plane crash?

As soon as a criminal case involves human speech or acoustic clues of any kind, the expertise of a forensic phonetician is required, and these are the types of questions that an expert in speaker recognition grapples with every day. In fact, there are between 500 and 600 criminal cases every year in the UK where voice-related data is used as evidence. And, like any other expert evidence, it is regulated by the Home Office Forensic Science Regulator and recognised by Parliament as an expert area.

It’s a specialism that covers a wide range of areas: speaker profiling (who is this speaker? What information can we glean from the voice?); voice comparisons, where a known sample of a voice is compared to that of the suspect (a known sample of preacher Abu Hamza’s voice was compared to secretly recorded cassette tapes of other racial-hatred sermons to determine if the speaker was the same person); speech enhancement; tape authentication; and helping the police carry out voice line-ups: the aural equivalent of an identity parade.

All this work is carried out by combining the skills of trained phoneticians with increasingly sophisticated automated speaker recognition systems (ASRS), which now have the power to analyse the human voice to an unprecedented degree.

But, tellingly, it is still the human analyst – the individual phonetician – whose expertise makes all the difference. Indeed, one of the most notable cases in forensic phonetics was one that didn’t use any kind of machine. Dialectologist and phonetician Stanley Ellis famously analysed the “Wearside Jack” tape that derailed the Yorkshire Ripper inquiry. By painstakingly researching and analysing the speech of ordinary people across the north of England, Ellis was able to pinpoint the accent of the hoaxer down to within a few miles on the north side of the Wear in Sunderland.

This remarkable human expertise is something to which the British still cling. While most other European countries recognise the legal validity of automatic speaker recognition software, the UK tradition has always been to use a skilled dialectician who would analyse, one at a time, the sound of the vowels, the rise and fall of the voice and its melody, using the notation system of the International Phonetic Alphabet.

But it’s not an exact science – the scientific community is divided as to the most effective method for identifying voices, whether through automated systems or through the expertise of the phonetician or, as seems the current best practice, both.

In addition, our voices vary: if we have a cold; if we’re drunk; if we’re nervous. As evidence, then, voice analysis is still only corroborative rather than conclusive in itself.

But what of the impressionists? We can fool some of the people some of the time, it seems. But we can’t fool the equipment. We had fun on the programme comparing my impression of Trump’s voice to the original. It amuses the scientists, but doesn’t kid the technology.

Not that the technology is perfect. Earlier this year, a BBC reporter succeeded in fooling HSBC’s security software by getting his twin to mimic his voice. But he didn’t make any money out of it. I guess I’ll have to stick to comedy.

The Race to Fingerprint the Human Voice is on Radio 4 on Wednesday at 9pm