Update: We've come a long way since we wrote this post 5 years ago. You can now reliably convert clearly recorded audio to text with about 90% accuracy!
Here's how you can transcribe your audio »
Given how painful transcribing audio is, people repeatedly ask us why there is still no software that can automatically take an audio file and spit out an accurate transcript.
Now, it’s not entirely true that there is no such software - there are a couple, but they don’t help with real-world audio, which typically involves multiple voices and all kinds of background noise.
For example:
After more than 20 years of development, Nuance’s Dragon NaturallySpeaking can just barely turn speech into text, and only if the recording is nearly pristine and the user speaks clearly and carefully.
Furthermore, converting dictation to text is a long way from automatically transcribing conversation, which makes up the bulk of recorded audio.
Siri, too, has its problems.
Training a machine to recognize human speech has proven very difficult because of the sheer variation in how people speak a language. Even English, the most widely spoken language in the world, sounds considerably different from one region to another.
Even if everyone spoke a language exactly the same way, there would still be the added difficulty of training the system on different voices - from young to old, male to female, hoarse to soft - you get the drift. Even the same person tends to speak differently in different situations, for example in a moment of excitement or when they have a cold.
Let’s also not forget that some people speak faster while others speak slower, sprinkling in plenty of ums, ers and uhs, which aren’t even real words! Arriving at a speech model that can handle all these variations (the way humans do) is really tricky.
Interpreting speech requires a good understanding of the overall context. Humans are gifted with the ability to interpret fuzzy data and automatically deduce the missing parts based on the context. Machines are really bad at disambiguating the meaning of words and phrases as they lack the ability to comprehend the bigger picture.
A homophone is a word that is pronounced the same way as another word but differs in meaning. For example, take the following two phrases:

“They are getting ready for the sail.”
“They are getting ready for the sale.”

The words “sail” and “sale” sound exactly the same, and there is no way to distinguish between the two without first understanding the overall context. Such a “context” might not even be available until later on in the speech!
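To make the role of context concrete, here is a minimal sketch in Python, using a made-up four-sentence corpus, of how a recognizer can choose between homophones with a simple statistical language model: each candidate spelling is scored by how often it follows the surrounding words in training text.

```python
from collections import Counter

# A made-up toy corpus; real language models train on billions of words.
corpus = (
    "we raised the sail before the wind picked up . "
    "the boat sale ends on friday . "
    "hoist the sail and head out . "
    "the store announced a big sale ."
).split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    # Add-one smoothing so unseen word pairs keep a small probability.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(unigrams))

def score(sentence):
    # Product of bigram probabilities over consecutive words.
    words = sentence.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

# Both candidates sound identical; the surrounding words decide the spelling.
for candidate in ("hoist the sail", "hoist the sale"):
    print(f"{candidate}: {score(candidate):.5f}")
```

Because “the sail” appears in the toy corpus and “the sale” in that position does not, the first candidate scores higher - exactly the kind of contextual nudge a human applies without thinking.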
Human speech tends to be continuous, with no natural pauses between words. This poses a difficult challenge: where should a waveform be split to form meaningful words? Given a sequence of sounds, realigning them along different word boundaries can produce vastly different sentences:

“It’s hard to recognize speech.”
“It’s hard to wreck a nice beach.”
Once again, an accurate transcription requires an understanding of what the speaker is trying to say in the context of the full speech.
If the words are spoken slowly, with a clear pause after every word, machines stand a better chance. This is another reason why today’s technology is better at handling dictation and transcription of short sentences or commands than conversational audio.
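For the curious, here is a toy sketch of that search problem in Python. It uses a small hypothetical dictionary and works on letters rather than sounds to keep things short; a real recognizer searches over phoneme sequences, where “recognize speech” and “wreck a nice beach” genuinely collide.

```python
from functools import lru_cache

# A hypothetical mini-dictionary; real systems search a vocabulary of
# hundreds of thousands of words over phonemes, not letters.
WORDS = {"recognize", "speech", "wreck", "a", "an", "nice", "ice", "beach"}

@lru_cache(maxsize=None)
def segmentations(stream):
    """Every way `stream` can be split into dictionary words."""
    if not stream:
        return ("",)
    results = []
    for i in range(1, len(stream) + 1):
        head = stream[:i]
        if head in WORDS:  # try every dictionary word that starts the stream
            for rest in segmentations(stream[i:]):
                results.append((head + " " + rest).strip())
    return tuple(results)

print(segmentations("recognizespeech"))  # ('recognize speech',)
print(segmentations("wreckanicebeach"))  # ('wreck a nice beach', 'wreck an ice beach')
```

Even this tiny dictionary already yields two readings of the same stream; scale that up to a real vocabulary and noisy audio, and the need for context becomes obvious.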
Speech recognition is an incredibly complex and resource-intensive process. It requires a lot of tagged audio samples to train a system to recognize the plethora of variations in human speech. The fact that there are, at the very least, a quarter of a million distinct English words does not help. In addition, storing and processing such large amounts of high-quality audio requires significant engineering resources.
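To get a feel for the scale involved, here is a back-of-the-envelope calculation in Python. It assumes a typical speech-recognition recording format (16 kHz, 16-bit, mono) and a hypothetical corpus size; the point is the order of magnitude, not the exact numbers.

```python
# Typical ASR recording format: 16,000 samples/second, 2 bytes each, mono.
SAMPLE_RATE = 16_000
BYTES_PER_SAMPLE = 2

bytes_per_hour = SAMPLE_RATE * BYTES_PER_SAMPLE * 3600   # ~115 MB per hour
hours = 100_000  # a hypothetical training corpus of tagged audio

total_tb = bytes_per_hour * hours / 1e12
print(f"{bytes_per_hour / 1e6:.1f} MB per hour -> {total_tb:.1f} TB for {hours:,} hours")
```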
Until a few years ago, there was a dry spell during which no major breakthrough in speech recognition materialized. In recent years, however, thanks to advances in an approach to machine learning called deep learning, there is renewed hope that machines might just do a better job of automatic transcription in the future.
Deep learning is a technique that loosely mimics the way the human brain works, and it is becoming a mainstream technology for speech recognition. Google’s large-scale deep learning project famously taught itself to recognize cats (surprise!) by analyzing huge numbers of images, and Microsoft is also actively using deep learning to build systems that can understand human speech better.
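To give a flavor of what deep learning looks like in code, here is a minimal PyTorch sketch of the basic building block: a small stack of layers that maps one frame of audio features to a phoneme guess. The sizes and labels are invented for illustration; production systems are vastly larger and trained on enormous labeled datasets.

```python
import torch
import torch.nn as nn

# Invented sizes: 13 acoustic features per audio frame, 40 phoneme classes.
N_FEATURES, N_PHONEMES = 13, 40

model = nn.Sequential(
    nn.Linear(N_FEATURES, 128),   # learn combinations of raw features
    nn.ReLU(),
    nn.Linear(128, 128),          # learn higher-level patterns on top
    nn.ReLU(),
    nn.Linear(128, N_PHONEMES),   # one score per candidate phoneme
)

frame = torch.randn(1, N_FEATURES)       # a random stand-in for a real frame
predicted = model(frame).argmax(dim=1)   # the network's current best guess
print(predicted.item())
```

Training adjusts the layers’ weights over millions of such labeled frames until the guesses line up with what was actually said.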
While there are still significant hurdles to cross, machines are definitely getting better at this. In fact, we recently launched our very own automatic transcription solution. If your audio is clearly recorded and free of background noise, automatic transcription will give you about 90% accuracy. Learn more »
If automatic transcription does not work well for your audio, there is still hope! You can always transcribe using our Dictation engine.
This is part of our attempt to build the definitive guide to dictation, transcription & recording.