
Speech Recognition

Discussions of the rise of deep learning typically compare the accuracy of automated approaches against a gold standard of flawless human output. In reality, human performance is often quite poor at the kinds of tasks being considered for AI automation. Cataloging imagery, reviewing videos and transcribing speech are all tasks where humans have the potential for very high accuracy, but long, repetitive, mind-numbing hours in front of a screen mean that human accuracy fades rapidly and can vary dramatically from day to day and even hour to hour. For all their accuracy issues, automated systems promise far more consistent results.

Speech recognition is an area where humans at their best still typically outperform machines. In real-world, real-time transcription tasks like generating closed captioning for television news, however, commercially available systems like Google’s Speech-to-Text API turn out to be almost as accurate as their human counterparts and far more faithful in their renditions of what was said.

Look more closely at the captioning of some stations and an interesting pattern emerges: the quality of the human captioning can vary from day to day and even over the course of a single day.

Real-time transcription is typically outsourced to third-party companies that employ contractors to type up what they hear. Quality can vary dramatically between contractors, and even the same individual might perform better in the morning when more rested, or might simply have a bad day.

Different transcriptionists tend to make different kinds of errors, meaning the same word can be spelled correctly for part of the day and appear with far more typographical errors during the rest of the day.

Some stations tape their morning shows and rebroadcast them as-is directly from tape in the afternoon, but may choose to retranscribe them in the afternoon on the off chance that breaking news forces the station to interrupt the taped show. This means that the exact same show may have different typographical errors in the afternoon than it did in the morning.
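
One way to put a number on how much two captionings of the same rebroadcast differ is a word-level edit distance, the same idea that underlies the word error rate commonly used to score speech recognition. The sketch below is only an illustration, not a method described here: the function names are hypothetical and the two caption strings are made-up examples rather than real broadcast captions.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over word tokens (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # word missing from hyp
                            curr[j - 1] + 1,       # extra word in hyp
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1]


def word_disagreement_rate(transcript_a, transcript_b):
    """Edit distance between two transcripts, normalized by the length of the first."""
    a = transcript_a.lower().split()
    b = transcript_b.lower().split()
    if not a:
        raise ValueError("first transcript is empty")
    return word_edit_distance(a, b) / len(a)


# Made-up example: the same sentence captioned twice, with one typo the second time.
morning = "the senate voted on the budget today"
afternoon = "the senate voted on the buget today"
rate = word_disagreement_rate(morning, afternoon)
print(f"{rate:.3f}")
```

Treating the morning captions as the reference is an arbitrary choice, since neither transcript is guaranteed to be correct. The result measures disagreement between the two, not the accuracy of either.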

