STT/ASR - A brief history and intro
John Jacob

Automatic Speech Recognition, or ASR as it is commonly known, is becoming ubiquitous. Speech to text has gained increasing prevalence in the current age, and the recent global pandemic, which turned physical meetings into virtual ones, has only accelerated advances in the ASR space.

What is ASR?

In simple terms, ASR is the process of transcribing voice audio into text, a format that is easier to consume. It is the machine learning pipeline that converts speech to text, and it has gained prevalence across technologically driven social and digital platforms such as podcasts, webinars, video conferences, and cloud meetings.

A Little History About ASR

Interestingly enough, ASR technology was introduced a lot earlier than most of us realize. It was made available to the world by Bell Labs in 1952 under the name "Audrey", though it could only recognize spoken digits. A little later, in the early 1960s, IBM came up with a new system called "Shoebox", which was slightly more advanced than Audrey: it could understand not only digits but also simple arithmetic commands.

The 1970s brought another approach, the Hidden Markov Model (HMM), which uses probability to detect the most likely words present in a voice audio file. Even though its efficiency and accuracy were not as good as today's systems, it laid a path for future ASR technology. The concept was to divide a piece of audio into tiny fragments called phonemes and process them statistically to extract the most likely word, as the toy example below illustrates.
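
To make the idea concrete, here is a minimal sketch of Viterbi decoding over an HMM whose hidden states are phonemes. Everything here is illustrative: the phoneme set and the transition and emission probabilities are hand-made numbers, not values from any real model.

```python
import numpy as np

# Toy HMM: hidden states are phonemes, observations are acoustic frames.
# All probabilities below are made up for illustration.
states = ["h", "eh", "l", "ow"]           # phonemes for "hello"
start_p = np.array([0.7, 0.1, 0.1, 0.1])  # P(first phoneme)
trans_p = np.array([                      # P(next phoneme | current)
    [0.2, 0.6, 0.1, 0.1],
    [0.1, 0.2, 0.6, 0.1],
    [0.1, 0.1, 0.3, 0.5],
    [0.1, 0.1, 0.1, 0.7],
])
emit_p = np.array([                       # P(frame t | phoneme), 4 frames
    [0.80, 0.10, 0.05, 0.05],
    [0.10, 0.70, 0.10, 0.10],
    [0.05, 0.10, 0.70, 0.15],
    [0.05, 0.10, 0.15, 0.70],
])

def viterbi(n_frames: int) -> list[str]:
    """Return the most probable phoneme path for n_frames observations."""
    v = start_p * emit_p[:, 0]            # path probabilities at frame 0
    back = []
    for t in range(1, n_frames):
        scores = v[:, None] * trans_p     # score of every prev->next move
        back.append(scores.argmax(axis=0))
        v = scores.max(axis=0) * emit_p[:, t]
    path = [int(v.argmax())]
    for ptr in reversed(back):            # trace the best path backwards
        path.append(int(ptr[path[-1]]))
    return [states[i] for i in reversed(path)]

print(viterbi(4))  # ['h', 'eh', 'l', 'ow']
```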

In the following years ASR reached a peak of innovation for its time, with neural networks introduced in the late 1980s. Their addition had great impact, marking a considerable jump from simple template-oriented pattern-recognition models to a more statistical approach. Of course, this would not have been possible without the emergence of faster computers and quicker GPU processing. The more data we feed into neural networks, the better the quality of the transcripts.

Types of ASR variants

The two main variants of Automatic Speech Recognition (ASR) are:

  • Directed Dialogue Conversation
  • Natural Language Conversation

Directed Dialogue Conversation: Directed dialogue conversation is the more elementary variant of the two. The machine needs you to respond with a specific word from a set list of choices and can only process such directed responses, for example: “Do you wish to re-activate a service, transfer to another service, or speak to a voice executive?” A minimal matching sketch follows.
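
At its simplest, a directed dialogue system just matches the transcribed reply against its pre-set list of choices. A minimal sketch, with a hypothetical menu:

```python
# Hypothetical directed-dialogue menu: the system only accepts
# responses containing one of these pre-listed keywords.
MENU = {
    "re-activate": "run the reactivation flow",
    "transfer": "run the transfer flow",
    "executive": "hand off to a human agent",
}

def route(transcript: str) -> str:
    """Map a transcribed reply onto one of the directed choices."""
    text = transcript.lower()
    for keyword, action in MENU.items():
        if keyword in text:
            return action
    return "re-prompt the caller with the menu"

print(route("I'd like to transfer to another service"))  # run the transfer flow
```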

The system is further trained with the help of humans-in-the-loop (HITL), who manually update the system’s vocabulary by going through conversation logs and identifying frequently used words that have not yet been listed in it, which in turn helps the system understand a broader range of responses. This process is called “tuning”, and a sketch of what it might look like follows.
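
As a rough illustration, a tuning pass could surface frequent out-of-vocabulary words from the logs for a human reviewer. The vocabulary, directory name, and frequency threshold below are all hypothetical:

```python
from collections import Counter
from pathlib import Path

# Tiny stand-in for the system's current word list.
vocabulary = {"do", "you", "wish", "to", "transfer", "service", "speak"}

def oov_candidates(log_dir: str, min_count: int = 25) -> list[str]:
    """Find frequent words in conversation logs that the system does
    not yet know, so a human reviewer can consider adding them."""
    counts = Counter()
    for log in Path(log_dir).glob("*.txt"):
        for word in log.read_text().lower().split():
            if word.isalpha() and word not in vocabulary:
                counts[word] += 1
    return [w for w, c in counts.most_common() if c >= min_count]

# A human-in-the-loop reviews these before they enter the vocabulary.
print(oov_candidates("conversation_logs/"))
```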

Natural Language Conversation: Natural language conversation is the more advanced variant of the two. It is an amalgam of natural language understanding and automatic speech recognition, using natural language processing (NLP) technology to imitate a real-world, open-ended chat conversation: the system can interpret responses from a wide range of possibilities without posing a narrow question first, opening instead with “May I be of any assistance?”.

Other examples of such systems are Amazon’s Alexa and Apple’s Siri. They are trained with a slightly different method called active learning, where the software is programmed to learn and adopt new words autonomously, constantly expanding the ASR vocabulary by storing valuable data from previous conversations. An average vocabulary of an ASR-NLP system consists of more than fifty thousand words.

With deep learning, ASR can be trained to recognize different accents and dialects from around the world; it can also distinguish between different voices, which helps with speaker diarization.

How it works

Simply put, ASR follows a set of steps, sketched in code after this list:

  1. Recorded or live speech audio is fed into the ASR software.
  2. The device receiving the audio processes it and converts the raw audio signal into spectrograms.
  3. An acoustic model then uses the spectrograms to generate the probabilities of different characters over time.
  4. A language model may be used to further improve on the acoustic model’s predictions, and a decoder finally outputs the words we see as transcripts.
  5. NLP models, such as punctuation and capitalization models, are then applied to enhance the output’s readability or to perform specific tasks such as entity detection, keyword recognition, or question answering.
  6. Once the sentences are clearly understood, the ASR software initiates a valid response to the speech audio it was fed.
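
The sketch below walks through steps 2 to 4 in miniature: librosa computes a mel-spectrogram, a stand-in “acoustic model” (random numbers in place of a trained network) emits per-frame character probabilities, and a greedy CTC-style decoder collapses them into text. The audio file name is a placeholder, and a production system would add language-model rescoring:

```python
import numpy as np
import librosa

ALPHABET = list(" abcdefghijklmnopqrstuvwxyz") + ["<blank>"]
BLANK = len(ALPHABET) - 1

# Step 2: raw audio -> mel-spectrogram. "speech.wav" is a placeholder path.
audio, sr = librosa.load("speech.wav", sr=16000)
spec = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=80)

def acoustic_model(spec: np.ndarray) -> np.ndarray:
    """Step 3: map each spectrogram frame to character probabilities.
    A real system uses a trained neural network; this stand-in just
    emits random probabilities of the right shape."""
    logits = np.random.randn(spec.shape[1], len(ALPHABET))
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)   # softmax per frame

def greedy_ctc_decode(probs: np.ndarray) -> str:
    """Step 4: collapse repeated characters and drop blanks.
    (A production decoder would also rescore with a language model.)"""
    chars, prev = [], None
    for idx in probs.argmax(axis=1):
        if idx != prev and idx != BLANK:
            chars.append(ALPHABET[idx])
        prev = idx
    return "".join(chars)

print(greedy_ctc_decode(acoustic_model(spec)))
```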

Key features of ASR

  1. Live assistant: ASR captioning and live assistance can be very useful during online meetings, as they allow you to focus on the meeting rather than on manual note-taking.
  2. Sentiment Analysis: Classifies a particular fragment of speech in an audio file as generally positive, neutral, or negative.
  3. Custom Vocabulary: Often called word boost, a custom vocabulary increases the accuracy of a particular list of phrases or keywords while transcribing an audio file.
  4. Speaker Diarization: Also known as speaker labelling, this is the process of assigning portions of an input audio stream to the detected speakers according to each speaker’s identity (a request sketch showing these options follows this list).
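
Hosted Speech-to-Text APIs typically expose these features as request options. The sketch below is a generic illustration only: the endpoint, field names, and token are hypothetical, not any particular vendor’s real API:

```python
import requests

# Hypothetical transcription API request that enables the features above.
response = requests.post(
    "https://api.example-asr.com/v1/transcripts",   # hypothetical endpoint
    headers={"Authorization": "Bearer YOUR_API_TOKEN"},
    json={
        "audio_url": "https://example.com/meeting.mp3",
        "speaker_diarization": True,               # label who said what
        "sentiment_analysis": True,                # positive / neutral / negative
        "word_boost": ["Zuddl", "diarization"],    # custom vocabulary
    },
    timeout=30,
)
transcript = response.json()
for utterance in transcript.get("utterances", []):
    print(utterance["speaker"], utterance["text"], utterance["sentiment"])
```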

Applications of ASR

ASR has seen tremendous advancements in relation to Speech-to-Text APIs, and a wide range of businesses all around the world use ASR technology for Speech-to-Text applications. Some examples are:

Voice: Anything to do with voice is applicable here: voice calls over the phone, voice calls over comms platforms, contact-center calls, podcast sessions, etc. They all need precise transcriptions, as well as insight-generating features like call analytics and speaker diarization.

Voice Assistants: Common voice assistants include Amazon’s Alexa, Apple’s Siri, Microsoft’s Cortana, and Google’s Google Assistant. Nowadays they are used even in in-car infotainment systems to improve the driver and passenger experience.

Video: Video editing platforms built around online meetings need precise transcription to improve app accessibility and search capabilities.

Virtual Meetings: Meeting platforms like Google Meet, WebEx, Zoom, Zuddl, etc. all need precise transcriptions and content-analysis capabilities to derive key insights.

Transcription: Multiple industries depend heavily on speech-to-text transcription services. These services are useful for transcribing customer voice calls in sales, customer meetings, interviews, and podcasts, and for capturing medical notes between a doctor and their patient.

Education: ASR can be a valuable tool for educational purposes, especially at colleges and universities, providing precise transcriptions or serving as a live assistant for inattentive students or students who have missed a class. It can also serve students who are differently abled, non-native speakers, or have other needs.

Media monitoring: Broadcast networks, podcasts, radio, etc. all need features like brand detection and extraction of key points discussed, for marketing and advertising purposes.

Security: Voice-enabled security devices use ASR-based voice recognition to provide improved security at critical access points.

Legal: Due to legal requirements, it is imperative to capture every word of recorded meetings in transcripts, and ASR can help here.

Health care: Doctors are utilizing ASR to transcribe medical notes from their meetings with patients.

Media: Media houses use ASR to show live captions and to generate transcripts for all the media content they produce, in line with the required media guidelines.

Corporate: Companies are utilizing ASR for live captions and transcription across multiple domains.

Challenges faced today

One of the major challenges with ASR today is achieving a human-like level of transcription accuracy. While both ASR approaches, the traditional hybrid approach and the modern deep learning approach, are more precise than ever before, neither has come close to 100% human accuracy without leveraging domain-focused, customer-specific data. This is due to the many different accents and dialects spoken around the world, which even the best DL models cannot handle without remarkable effort.

Exemplary AI’s Intelligent Warehouse facilitates continual learning and helps improve transcript quality with custom vocabularies, by learning from corrections made to transcriptions and annotations, and by using additional domain data and text to improve performance.

What will the future look like? Where is all this heading?

As ASR technology grows exponentially in the years to come, the one thing we can expect for sure is the ubiquity of its applications in all aspects of our lives. We can expect multi-language ASR that seamlessly shifts between languages and is also more accessible and cheaper. ASR will be the foundation upon which Natural Language Understanding is layered.

For today, try out ASR services like AWS Transcribe, which is easy to use without vendor lock-in, or Deepgram, which offers to train speech models for high accuracy with support for custom models. A minimal sketch of the former follows.
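
As a concrete starting point, here is a minimal Amazon Transcribe job submitted through boto3. The bucket, file, region, and job name are placeholders, and the job runs asynchronously, so the sketch polls for completion:

```python
import time
import boto3

# Start an asynchronous transcription job. The S3 URI is a placeholder.
transcribe = boto3.client("transcribe", region_name="us-east-1")
transcribe.start_transcription_job(
    TranscriptionJobName="demo-asr-job",
    Media={"MediaFileUri": "s3://your-bucket/meeting.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
)

# Poll until the job finishes, then print where the transcript lives.
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName="demo-asr-job")
    status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(5)

print(status, job["TranscriptionJob"].get("Transcript", {}).get("TranscriptFileUri"))
```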
