ASR Solutions: Building In-house vs using a SaaS provider (Part 1)
Shantanu Nair
Speech to text has advanced rapidly in the last few years. As developers gain greater access to these technologies, new avenues are opening up for ML-backed startups and companies to build powerful solutions on top of them. ASR-backed solutions are already changing everything from sales enablement, advertising, and accessibility to SEO improvements for podcasts and videos and meeting assistants, and the ability to build them is now accessible to more developers than ever before. In the months and years to come, this space is only going to grow further, complemented by the current revolution in Language AI and Large Language Models. In this blog piece, I will explore a key decision that can make or break your ability to execute when embarking on such a journey: do we maintain an in-house ASR/Speech-to-Text (STT) engine, or should we use an existing SaaS provider?
While building ASR-backed tooling and experiences at Exemplary, we spent time exploring both these options, and also spoke with several end users, ML practitioners, and founders using and building ASR-backed solutions. While some of the insights we share apply to any ML “serving” focused application, I think our ASR/STT-focused findings are the kind one only discovers after diving into the deep end of ASR providers and building an in-house ASR stack. In this article, I'd like to share some of these insights and help you develop an intuition for what exactly existing SaaS providers may save you the trouble of, and what your experience could be like if you decide to build and maintain ASR solutions in-house. Stay tuned for future articles that will dive into each option and offer more specific learnings depending on the path you choose. For now, here are some quick learnings that should help you make an informed decision while building ASR-backed solutions.
In part 1 of this two-part blog, I'll explore what it's like to work with the existing SaaS providers, and what they tend to offer.
In part 2 of this blog, I'll discuss what your considerations should be should you decide to maintain your ASR stack in-house, and contrast that experience, and where your focus would go while building, with going with an established ASR provider.
SaaS Providers – Going with the established big players
Usage and Pricing
There are two distinct categories of ASR providers: large cloud providers such as AWS Transcribe, Azure, and Google, and domain-focused companies such as Deepgram, AssemblyAI, and Rev. Both provide tried and tested API solutions that are in use and trusted by a growing list of companies. A typical ASR provider offers the standard flow: you obtain an API key, then use it with their API in a straightforward job-scheduling flow. You make a request providing the source for your audio along with provider-specific options, which offer varying degrees of customization for your ASR output and the ability to run common downstream NLP tasks on the generated transcripts.
A few providers like Rev offer rather straightforward and simple APIs that expose plain transcriptions, without any bells and whistles, along with the option to opt for (much) more expensive human-generated transcripts, should your margins and use case allow for it. You purchase credits up front as needed, or use a pay-as-you-go scheme, and never have to worry about the costs of running discrete GPUs and maintaining ML serving infrastructure. The providers handle scale and reliability while you focus on iterating on your product and building out other aspects of your business or application.
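The job-scheduling flow described above can be sketched roughly as follows. This is a minimal illustration, not any particular provider's API: the `/v1/transcripts` path, the `audio_url`, `status`, and `text` fields, and the `punctuate` option are all hypothetical, and the `send` transport is injected so the sketch runs offline against an in-memory stand-in. With a real provider, `send` would wrap an authenticated HTTP call using your API key.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical transport: (method, path, json_body) -> parsed JSON response.
Transport = Callable[[str, str, Dict[str, Any]], Dict[str, Any]]


def transcribe(send: Transport, audio_url: str, options: Dict[str, Any],
               poll_seconds: float = 5.0, max_polls: int = 120) -> str:
    """Submit a batch job, then poll until it completes, fails, or times out."""
    # Endpoint path and field names are assumptions for illustration only.
    job = send("POST", "/v1/transcripts", {"audio_url": audio_url, **options})
    job_id = job["id"]
    for _ in range(max_polls):
        status = send("GET", f"/v1/transcripts/{job_id}", {})
        if status["status"] == "completed":
            return status["text"]
        if status["status"] == "error":
            raise RuntimeError(status.get("error", "transcription failed"))
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish in time")


def make_fake_send() -> Transport:
    """In-memory stand-in for a provider: the job completes on the third poll."""
    state = {"polls": 0}

    def send(method: str, path: str, body: Dict[str, Any]) -> Dict[str, Any]:
        if method == "POST":
            return {"id": "job-1", "status": "queued"}
        state["polls"] += 1
        if state["polls"] < 3:
            return {"id": "job-1", "status": "processing"}
        return {"id": "job-1", "status": "completed", "text": "hello world"}

    return send


transcript = transcribe(make_fake_send(), "https://example.com/audio.mp3",
                        {"punctuate": True}, poll_seconds=0)
```

Polling is the simplest way to wait on a job; where a provider supports completion callbacks, those avoid the repeated status requests.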
Features & Out-of-the-box NLP support
With most providers, the source media you provide is either a live stream of audio, for when your use case requires live transcriptions, or a pre-recorded audio/video file, for when you want to use ASR in a more “post-processing” fashion. This is the first decision you will face. While both live transcripts and pre-recorded transcripts (from here on referred to as batch or offline transcripts) are priced per hour or per minute, you will notice that live transcripts are often priced higher, with some constraints on additional features. Batch transcriptions, although not live, tend to offer more accurate transcripts, more downstream NLP features, and the ability to take advantage of volume-based discount pricing.
Domain-focused companies such as Deepgram and AssemblyAI provide job status update webhooks, allowing you to integrate them into your solution. AWS Transcribe, however, requires you to be locked into the AWS ecosystem to obtain callbacks after a job completes (for example, integrating with AWS EventBridge to receive events when a job is complete). Live transcriptions are usually served over the WebSocket protocol, and most of the job queuing, retry logic, and so on is configurable via the provider's dashboard along with developer SDKs and APIs, each in its own flavor and with prescribed workflows for how you should use it.
If diarization (detecting speakers, and who spoke when) is important to you, you will come to realize that live transcriptions don't offer as much in the way of diarization or speaker detection. Certain NLP features, such as summarization, are also, understandably, only available once the whole transcript has been processed. Depending on the provider, these features are either priced into the transcript costs or charged per word processed.
Live transcriptions also tend to be less accurate than batch/offline transcription “jobs”. This is because batch transcription can take advantage of the entire transcript's context to make a more educated guess at what a speaker intended to say. Punctuation and capitalization, another rather standard transcription feature available from most providers, also performs better in batch transcriptions, as it's easier for the model to guess the right punctuation when given a complete speech segment.
Going with an ASR provider means that, depending on the provider you choose, you will likely have access to a curated catalog of out-of-the-box NLP features that can be run on your transcript. Note that you will have to spend time going through the myriad NLP features each ASR provider offers, each with its own branding, style of usage, and limitations. You will spend a good deal of time comparing these catalogs to make sure you can support the current and potential use cases of your solution. Some providers offer significantly more out-of-the-box NLP features than others. Deepgram, for example, offers only more basic features such as profanity filtering and word search, while others like AssemblyAI offer a more complete feature set, such as Sentiment Analysis and Topic and Entity Detection, under their Audio Intelligence suite. You will notice that the additional feature set comes with a trade-off in speed: Deepgram is much faster at transcribing than its counterpart AssemblyAI, which offers more NLP features.
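To get a feel for how per-hour pricing and volume discounts interact, a back-of-the-envelope estimate can help. The sketch below is arithmetic only; the prices and discount tier are made-up placeholders, not any provider's actual rates, and `monthly_cost` is a hypothetical helper.

```python
from typing import Iterable, Tuple


def monthly_cost(hours: float, price_per_hour: float,
                 discount_tiers: Iterable[Tuple[float, float]] = ()) -> float:
    """Estimate a monthly bill.

    discount_tiers holds (min_hours, discount_fraction) pairs; the largest
    discount whose min_hours threshold is met applies to the whole volume.
    """
    discount = 0.0
    for min_hours, fraction in discount_tiers:
        if hours >= min_hours:
            discount = max(discount, fraction)
    return hours * price_per_hour * (1.0 - discount)


# Placeholder numbers for illustration only.
HOURS = 1000
live = monthly_cost(HOURS, price_per_hour=1.50)              # no volume discount
batch = monthly_cost(HOURS, price_per_hour=1.00,
                     discount_tiers=[(500, 0.10)])           # 10% off from 500 h
print(f"live: ${live:.2f}  batch: ${batch:.2f}")
```

Plugging in the real rates and tiers from each provider's pricing page makes the live versus batch gap concrete for your expected volume.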
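Receiving a job status webhook might look something like the sketch below, using only the Python standard library. The payload shape (`job_id` and `status` fields) is an assumption for illustration; each provider documents its own payload format and may also sign requests, which a real handler should verify.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for a database: job_id -> last reported status.
job_statuses = {}


def record_job_event(raw_body: bytes) -> int:
    """Parse a webhook body and return the HTTP status code to reply with."""
    try:
        event = json.loads(raw_body)
        # Field names are hypothetical; adapt them to your provider's payload.
        job_id, status = event["job_id"], event["status"]
    except (ValueError, KeyError, TypeError):
        return 400  # malformed JSON or missing fields
    job_statuses[job_id] = status
    return 200


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        code = record_job_event(self.rfile.read(length))
        self.send_response(code)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Start a blocking HTTP server; call this from your own entry point."""
    HTTPServer(("0.0.0.0", port), WebhookHandler).serve_forever()
```

Keeping the parsing logic in `record_job_event`, separate from the HTTP handler, makes it easy to test without starting a server.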
Data Privacy and Locality
The other thing of note when using SaaS is that your data leaves your premises and enters the SaaS provider's. Providers often use your data to train and improve their models unless you sign a custom contract forbidding it, or unless they provide GPU boxes that you can run within your own premises. Please make sure to go through the terms of any provider you may consider using.
So while you do get a lot of out-of-the-box features with the big ASR providers, once you choose a provider, you are locked in with them to a fair degree. If you ever plan on performing downstream processing on the text, or even running experiments to measure accuracy, don't be surprised to find that the nuances of how a particular provider deals with edge cases, such as out-of-vocabulary words or disfluencies (umms, ahs, and uhhs in your audio), make comparisons between providers tedious. The current industry-standard metrics for comparing accuracy (such as WER, word error rate) need more standardization and more dimensions to reflect a transcription provider's accuracy and intricacies. Don't take accuracy claims purely at face value: you must test providers' ASR capabilities on audio representative of your use case to make sure you make worthwhile trade-offs when selecting a provider to build with.
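To make the comparison problem concrete, here is a minimal WER sketch. WER is commonly computed as (substitutions + deletions + insertions) divided by the number of words in the reference, using a word-level edit distance. The normalization step and the `FILLERS` list are assumptions chosen for illustration; the point is that the same transcript pair scores differently depending on how disfluencies are treated.

```python
import re
from typing import List

# Illustrative filler list; real evaluations need an agreed-upon convention.
FILLERS = {"um", "umm", "uh", "uhh", "ah", "ahh"}


def normalize(text: str, drop_fillers: bool = True) -> List[str]:
    """Lowercase, strip punctuation, and optionally drop filler words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if not (drop_fillers and w in FILLERS)]


def wer(reference: str, hypothesis: str, drop_fillers: bool = True) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = normalize(reference, drop_fillers)
    hyp = normalize(hypothesis, drop_fillers)
    if not ref:
        raise ValueError("reference has no words after normalization")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "Um, so the meeting starts at ten"
hypothesis = "so the meeting starts at ten"
print(wer(reference, hypothesis, drop_fillers=True))   # fillers ignored
print(wer(reference, hypothesis, drop_fillers=False))  # dropped "um" counts as an error
```

A provider that silently removes disfluencies looks perfect under the first scoring and is penalized under the second, which is why it helps to fix one normalization convention before comparing providers.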
Improving Accuracy and Custom Datasets
When it comes to improving accuracy, providers typically have services, or even programmatic endpoints, that let you supply your own audio or text datasets. The provider uses these to fine-tune their models and give you custom acoustic and language models optimized for your domain and use case. This can be of tremendous value when it comes to getting transcripts of acceptable quality, especially since, as you'll find out in the next part of this blog, attempting this yourself is rather cost and time prohibitive. When running an ASR model in-house, you bear the cost of the entire pipeline: collecting datasets, annotating them, and finally establishing what effect the training had, whether it improved transcription quality for only a subset of your use case or for the majority of it. That's a hard task, and an expensive one at that. With ASR providers, you can rely on their in-house skill and expertise to streamline this process, without the significant expertise and capital you would need to implement it yourself. In general, customizing ASR in-house involves multiple specialized roles: data collection and cleaning, human transcriptionists for ground-truth transcripts, ML model training experts, and ASR specialist practitioners, plus the work of deploying the trained model cost-efficiently in production.
So, when going with ASR providers, you gain ease of use, access to their catalog of NLP and transcription features, and their streamlined processes and services for improving accuracy. You also gain the ability to iterate fast with a more focused, less diverse team, building towards a solution without focusing on the nitty-gritty that is MLOps (ouch), data curation, and inference optimization, among many other highly involved tasks. You do, however, face some level of lock-in. This is often only an issue if your value proposition involves building proprietary ASR specialty technology, where you need absolute control of your stack and can make use of the fine-grained flexibility and customization you gain by going the more hardcore ML engineering route of building your own stack. Another reason you may want to go in-house is if a provider you are considering does not support on-premises processing of transcripts, and on-premises processing is a requirement for your use case or customers.
Part 2 — Building ASR In-house
In the next part of this blog, I'll go through the considerations and tradeoffs involved in building an ASR stack in-house, and how that contrasts with using an existing ASR provider.