Going beyond Speech-To-Text
Johann Verghese
We are living in a world of online meetings and asynchronous communication. Post-Covid, meeting time captured on platforms such as Zoom, Teams, and Webex has grown from billions to trillions of minutes, with over a billion daily users across these platforms! There is a wealth of information and insight within this data, but most of it is lost because people have limited time and capacity to analyze it.
With all this data, speech recognition technology has made significant strides, and providers now offer varying support for languages, accents, and features (Catalog of STT Providers). Speech-to-text transcription applications have burgeoned as well, catering to needs such as:
- Transcription of Meetings
- Voicemail & Chat App Messaging Transcription
- Sales Coaching / Onboarding
- UX Research
- Subtitling & Closed Captioning of Videos
- Note Taking and Lecture Transcription
- Podcast Transcription
- Journalism & Academic Research
- Recruiting
- Legal & Medical Transcription
Platforms such as Zoom and Slack also provide live transcription, which is a step in the right direction: it enables automated transcripts of Zoom Meetings & Webinars and Slack Huddles.
Understanding ASR/STT
Before we expand on how we can go beyond Speech-To-Text with Natural Language Processing / Understanding and Large Language Models, let's take a quick look at the core technology: Automatic Speech Recognition (ASR).
Typically, an ASR pipeline comprises a few stages (a minimal code sketch follows this list):
- Language detection
- Speech Recognition
- Speaker Diarization
- Language Model
- Punctuation & Capitalization Model
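To make the flow concrete, here is a minimal sketch of how these stages might chain together. Every function here is a stub with an illustrative signature, not any particular provider's API; a real system plugs an actual model into each stage.

```python
# Illustrative only: each stage is a stub; real systems plug in actual models.

def detect_language(audio: bytes) -> str:
    return "en"  # stub: a real system runs a language-identification model

def recognize_speech(audio: bytes, language: str) -> list[dict]:
    # stub: a real recognizer emits word hypotheses with timestamps
    return [{"start": 0.3, "end": 1.1, "text": "hello everyone"}]

def diarize(audio: bytes, segments: list[dict]) -> list[dict]:
    return [{**s, "speaker": "A"} for s in segments]  # stub: who spoke when

def rescore_with_language_model(segments: list[dict]) -> list[dict]:
    return segments  # stub: prefer fluent word sequences (sketched further below)

def punctuate_and_capitalize(segments: list[dict]) -> list[dict]:
    return [{**s, "text": s["text"].capitalize() + "."} for s in segments]

def transcribe_recording(audio: bytes) -> list[dict]:
    language = detect_language(audio)
    segments = recognize_speech(audio, language)
    segments = diarize(audio, segments)
    segments = rescore_with_language_model(segments)
    return punctuate_and_capitalize(segments)

print(transcribe_recording(b"..."))
# [{'start': 0.3, 'end': 1.1, 'text': 'Hello everyone.', 'speaker': 'A'}]
```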
These stages can be combined with other NLP / NLU tasks. An important thing to remember, however, is that transcript and diarization quality are core to the performance of all downstream NLP / NLU tasks such as summarization, detecting and assigning action items, and entity detection.
If speech recognition isn't accurate, the generated transcripts aren't any good, and consequently all downstream tasks will yield poor results. A typical metric for measuring transcript quality is the Word Error Rate (WER).
As WER varies depending on the model being used, the language being spoken, and the accents of the speakers, one of the main challenges for today's ASR is maintaining a good WER across different domains and use cases.
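Concretely, WER is the word-level edit distance between the reference transcript and the hypothesis, divided by the number of reference words: WER = (S + D + I) / N, where S, D, and I count substitutions, deletions, and insertions. A minimal Python sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance at the word level (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("schedule the demo for friday", "schedule a demo friday"))  # 0.4
```

A WER of 0.4 means two of the five reference words were substituted, deleted, or required an insertion to match.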
One way to improve speech recognition accuracy is by using a language model. This can be done in two ways:
- Increasing the size of the language model
- Training the model on more data
Both of these help to improve the accuracy of speech recognition. However, increasing the size of the language model can be computationally expensive and time-consuming unless you use the right tooling (more on this soon). Training the model on more data can be equally tricky, as it requires a lot of labelled and annotated data; instead, smaller amounts of data can be used to fine-tune a model for an individual use case.
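To illustrate how a language model lifts accuracy in practice, here is a common pattern: rescoring the recognizer's n-best hypotheses with LM scores. The toy unigram table below stands in for a real neural LM; all names and scores are illustrative.

```python
# Illustrative n-best rescoring: combine the recognizer's acoustic score with a
# language-model score and pick the hypothesis with the best weighted total.
# A real system would use a neural LM's log-probabilities, not a unigram table.
LM_LOG_PROBS = {"recognize": -4.0, "speech": -3.5, "wreck": -7.0, "a": -1.5,
                "nice": -4.5, "beach": -5.0}

def lm_score(text: str) -> float:
    return sum(LM_LOG_PROBS.get(w, -10.0) for w in text.split())

def rescore(nbest: list[tuple[str, float]], lm_weight: float = 0.5) -> str:
    # nbest: (hypothesis, acoustic log-score) pairs from the recognizer
    return max(nbest, key=lambda h: h[1] + lm_weight * lm_score(h[0]))[0]

nbest = [("wreck a nice beach", -10.0), ("recognize speech", -10.5)]
print(rescore(nbest))  # "recognize speech" wins once the LM weighs in
```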
Jargon is another common reason for inaccuracy. For industries like engineering, medicine, or law, we may need to train our language model on recordings containing language from that particular domain. Additionally, we can boost specific words, use a domain-specific vocabulary, or even retrain the language model.
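Many hosted STT APIs expose this kind of domain adaptation as a simple request option. The payload below shows the general shape only; field names like `word_boost` and `boost_strength` vary by provider, so treat them as placeholders and check your provider's documentation.

```python
import json

# Illustrative request payload for a hosted STT API with vocabulary boosting.
# The field names here are placeholders, not any specific provider's schema.
request = {
    "audio_url": "https://example.com/recordings/cardiology-rounds.wav",
    "language": "en",
    "word_boost": ["stent", "angioplasty", "troponin", "echocardiogram"],
    "boost_strength": "high",
}
print(json.dumps(request, indent=2))
```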
Going beyond Speech-To-Text
While running speech-to-text on recordings is a good start, surfacing searchable transcripts from what would otherwise be opaque data such as meeting recordings, it is just an initial step toward unlocking the true value for many applications. Further insights can be extracted from the generated transcripts using Natural Language Processing (NLP) / Natural Language Understanding (NLU) tasks such as the following (a code sketch of one such task follows the list):
- Summarization
- Sentiment
- Topics
- Q&A
- Filler Words / Disfluencies
- Detecting events of note such as Churn Warnings, Sales Blockers, etc.
- Data extraction from transcripts (such as the budget of a department, or creating a new contact with name and role)
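As a concrete example of one such task, summarization, here is a minimal sketch using the open-source Hugging Face transformers library. The model choice is illustrative, and long transcripts would need to be split into chunks first, since summarization models accept limited input lengths.

```python
from transformers import pipeline

# Summarize a (short) transcript chunk with an off-the-shelf model.
# Model choice is illustrative; long meetings need chunking first.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

transcript_chunk = (
    "Speaker A: Let's finalize the Q3 budget today. Speaker B: Marketing is "
    "asking for an extra $50k for the product launch. Speaker A: Approved, "
    "but we need the revised forecast by Friday."
)
summary = summarizer(transcript_chunk, max_length=60, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```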
Use of Large Language Models (LLMs)
Since the release of OpenAI's GPT-3, we have seen a proliferation of large language models, with companies such as Cohere and AI21 coming out with their own super-sized model variations and even open-source models coming out of EleutherAI's research efforts.
Large language models are a natural fit for generative tasks such as summaries. But with some prompt engineering, they can be used for much more than that, such as extracting informational segments and annotating moments from transcriptions. More on this soon.
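For instance, a prompt template plus a thin wrapper is often enough to turn an LLM into an action-item extractor. In the sketch below, `call_llm` is a hypothetical stand-in for whichever provider's completion API you use; the prompt and output format are illustrative.

```python
# A minimal prompt-engineering sketch for pulling action items out of a
# transcript. `call_llm` is a hypothetical stand-in for your LLM provider's
# completion API (OpenAI, Cohere, AI21, a local model, ...).

PROMPT_TEMPLATE = """Extract every action item from the meeting transcript below.
Return one per line as: <owner> | <task> | <due date or "none">.

Transcript:
{transcript}

Action items:"""

def extract_action_items(transcript: str, call_llm) -> list[str]:
    completion = call_llm(PROMPT_TEMPLATE.format(transcript=transcript))
    return [line.strip() for line in completion.splitlines() if line.strip()]

# Toy stand-in so the sketch runs end to end; replace with a real API call.
fake_llm = lambda prompt: "B | send revised forecast | Friday"
print(extract_action_items("A: Budget approved. B: I'll send the revised "
                           "forecast by Friday.", fake_llm))
```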
There's tremendous potential in combining NLP, NLU, and LLM processes with high-quality transcripts. To combine ASR, NLP, and NLU tasks from several providers through a simple API, please check out our playground.