ASR Solutions: Building In-house vs using a SaaS provider (Part 2)
Shantanu Nair


Overview

Choosing to build your ASR stack in-house may give you just the edge you need over your competitors, offering enhanced accuracy and customizability, and letting you take advantage of in-house ASR and ML expertise. On the other hand, if executed poorly, it may well cost you significant engineering resources, time, and capital that could've otherwise been put to better use, and it could leave you with a solution that offers little real advantage for the effort put into it.

In Part 1 of this blog, I explained how at Exemplary we have investigated both of these options, and offered some insight into what it's like, and what to expect, if you choose to go with an existing SaaS ASR provider such as Deepgram, AWS Transcribe, or AssemblyAI. In this part, I'll go through the considerations involved in deciding to run your own in-house ASR stack, and offer some insight into the hurdles you may face, some unique to ML serving in general, others more specific to ASR.

When is it worth it?

Now, why would you even consider running an in-house ML stack, especially when you are operating with limited capital or a smaller team? Some good reasons are:

  • Your data is sensitive, and you cannot send it to a SaaS provider
  • You need a more cost-effective solution because you transcribe tens of thousands of hours of audio
  • You want to make use of a custom ML pipeline, and existing SaaS providers don't have an equivalent offering, for example:
    • Lack of support for a particular language or accent.
    • Lack of a pre-trained model that performs well with the data you have.
    • Lack of features that you need as part of the ASR pipeline.
  • You have a range of in-house expertise and teams, and you're confident you can take advantage of their ability to tweak the latency, accuracy, or cost-effectiveness of your ASR-backed solution.
  • You are investigating new technologies in the space and want to run experiments.
  • You have access to hardware (GPUs/TPUs), rented or purchased, that you intend to take advantage of.

Depending on your use case, maintaining an in-house ASR solution will require you to balance latency, throughput, and cost-effectiveness while also taking care of ops and maintaining scalability, which can prove challenging. If you intend to build out these pieces in-house, you will need to hire or have access to teams with roles such as ML Engineers, Data Engineers, and ASR Researchers, with a strong MLOps engineer or team backing you. Even once deployed, ASR engines need continual work to counter issues such as model/data drift and ensure your solution's performance doesn't deteriorate.

Flexibility and Customization

If you have the right teams and expertise, and are willing to invest engineering hours and capital, it may be worthwhile to run your own ASR stack. Maybe you want to pick and choose custom models, or even build proprietary models, train them yourself, or fine-tune an existing model on custom data which you collected, cleaned, and annotated. You can test out that latest language model which you believe could show promising results with your data and use case, or swap it out for another that your research shows works well under certain resource constraints. All of this is possible with the flexibility and control you gain when running your own ASR stack. Remember, though, that in the end you will have to be able to gauge whether you're moving in the direction you want with respect to performance and effectiveness.
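To make the fine-tuning path concrete, here is a minimal sketch using Hugging Face transformers and datasets, assuming a pre-trained Wav2Vec2 checkpoint and a CSV manifest of audio/transcript pairs. The manifest path, column names, and hyperparameters are all illustrative, not a production recipe.

```python
# Minimal sketch: fine-tune a pre-trained Wav2Vec2 checkpoint on custom data.
# Manifest path, column names, and hyperparameters are illustrative.
from datasets import load_dataset, Audio
from transformers import (Trainer, TrainingArguments,
                          Wav2Vec2ForCTC, Wav2Vec2Processor)

checkpoint = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)

# A CSV manifest with "audio" (file path) and "text" (transcript) columns.
ds = load_dataset("csv", data_files={"train": "train_manifest.csv"})
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(example):
    audio = example["audio"]
    example["input_values"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    example["labels"] = processor.tokenizer(example["text"]).input_ids
    return example

ds = ds.map(prepare, remove_columns=ds["train"].column_names)

def collate(features):
    # Pad variable-length audio and labels; mask padded label positions
    # with -100 so they are ignored by the CTC loss.
    batch = processor.pad(
        [{"input_values": f["input_values"]} for f in features],
        padding=True, return_tensors="pt")
    labels = processor.pad(
        labels=[{"input_ids": f["labels"]} for f in features],
        padding=True, return_tensors="pt")
    batch["labels"] = labels["input_ids"].masked_fill(
        labels.attention_mask.ne(1), -100)
    return batch

trainer = Trainer(
    model=model,
    args=TrainingArguments("w2v2-custom", per_device_train_batch_size=8,
                           num_train_epochs=3, learning_rate=3e-5),
    train_dataset=ds["train"],
    data_collator=collate,
)
trainer.train()
```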

If you are running multiple ASR or audio related tasks on the same hardware, such as, say, diarization in addition to generating transcripts, you may be able to reuse certain components of your ASR pipeline, increasing your GPU utilization for improved throughput and effectiveness. This is only possible with the control that comes with running your own ASR solution in-house.

You must remember that while we have access to serverless CPU compute nowadays, the same isn't true of serverless GPU compute. GPUs are generally harder to isolate as a resource, and model spin-up/spin-down takes much longer than currently offered serverless compute allows for. A significant obstacle startups and smaller teams commonly face should they attempt to deploy an ASR solution in-house is GPU utilization. Without a steady stream of incoming audio that can saturate your GPU and justify your GPU rates, you will almost certainly be losing money to dead compute, capital that could otherwise go to hiring another engineer, or elsewhere.
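One common way to fight dead compute is dynamic micro-batching: let concurrent requests wait a few milliseconds so they can share a single batched forward pass. The sketch below shows only the queueing pattern; run_model is a stand-in for your actual batched GPU inference, and the batch and wait limits are illustrative.

```python
# Minimal sketch of dynamic micro-batching: group concurrent requests into
# one batch instead of mapping each request to its own GPU.
import asyncio

MAX_BATCH = 16      # illustrative limits
MAX_WAIT_S = 0.05

def run_model(audio_batch):
    # Placeholder: one batched forward pass on the GPU would go here.
    return [f"transcript of {a}" for a in audio_batch]

async def batcher(queue):
    while True:
        batch = [await queue.get()]  # block until the first request arrives
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        audios, futures = zip(*batch)
        for fut, text in zip(futures, run_model(list(audios))):
            fut.set_result(text)

async def transcribe(queue, audio):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((audio, fut))
    return await fut

async def main():
    queue = asyncio.Queue()
    worker = asyncio.create_task(batcher(queue))
    clips = [f"clip{i}.wav" for i in range(40)]
    results = await asyncio.gather(*(transcribe(queue, c) for c in clips))
    print(len(results), results[0])
    worker.cancel()

asyncio.run(main())
```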

Dataset Collection/Annotation

Collecting and annotating data is very expensive and cumbersome, and you pay that cost before you can tell if it's worth it. Assuming you can even get your hands on the amount of data required to train a model on a custom dataset, you will need the help of many, many human transcriptionists, as well as data engineers to maintain your data warehouse. You can sometimes purchase such datasets, where available, but they come at a steep price.

Training and Infra Costs

With today's absolutely giant models, and the massive amounts of data they need to train on, you may find that even the infrastructure required to train them is cost-prohibitive. If you don't have extremely competitive rates on GPU hardware, or are paying on-demand or even reserved rates with your cloud provider, you may quickly find that the cost of running training cycles and their experiments is sky-high. Training can sometimes take months of continuous, high-count multi-GPU setups running 24x7.
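A quick back-of-envelope calculation shows why. All numbers here are assumptions for illustration: four 8-GPU nodes at an on-demand rate of roughly $30 per node-hour, running continuously for two months, already lands in six figures.

```python
# Back-of-envelope training cost under assumed, illustrative numbers:
# four 8-GPU nodes at ~$30/node-hour, running 24x7 for two months.
nodes, dollars_per_node_hour, hours = 4, 30, 24 * 60   # 60 days
print(f"${nodes * dollars_per_node_hour * hours:,}")    # -> $172,800
```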

Model Versioning and Explainability

While running these experiments, changes to your models and pipelines need to be logged and stored, just as you would track your codebase in version control, while keeping track of how well each model performs and which hyperparameters got it there.
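Experiment-tracking tools make this manageable. As a minimal sketch, here is what logging a run might look like with MLflow; the run name, parameters, and metric values are illustrative.

```python
# Minimal sketch of experiment tracking with MLflow: record the
# hyperparameters and resulting metrics of each run so models can be
# compared and reproduced later. All names and values are illustrative.
import mlflow

with mlflow.start_run(run_name="w2v2-finetune-lr3e-5"):
    mlflow.log_params({
        "base_checkpoint": "facebook/wav2vec2-base-960h",
        "learning_rate": 3e-5,
        "batch_size": 8,
        "epochs": 3,
    })
    # ... training happens here ...
    mlflow.log_metric("dev_wer", 0.142)   # word error rate on the dev set
    mlflow.log_artifact("w2v2-custom/config.json")  # version the exact config
```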

Productionizing and Model Serving

Oftentimes, we have come across cases where smaller teams have spent time and money developing and training a model or pipeline and are ready to put it into production, only to realize that productionizing their work is not so straightforward. Early-stage startups and smaller teams often make the mistake of handling ASR requests by mapping each request to its own separate GPU. This very quickly becomes a bottleneck, especially when you have to scale up to even a few parallel requests. You will need to invest in ensuring your inference platform can handle a satisfactory number of concurrent requests, optimizing for throughput and latency; otherwise you are simply paying large bills for underutilized compute. Productionizing your pipeline might also involve optimizing the trained model for inference by tuning it for a particular “inference runtime” or for the particular hardware available. You will need to make several technical considerations, build towards ensuring the pipeline can scale and stay healthy and cost-efficient, and distribute these tasks among multiple people spanning multiple specialized roles.
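As one example of the "inference runtime" step, here is a minimal sketch of exporting a trained PyTorch checkpoint to ONNX and running it with ONNX Runtime. The checkpoint path, input shape, and opset are illustrative, and a real deployment would add batching and hardware-specific tuning on top.

```python
# Minimal sketch: export a trained PyTorch ASR model to ONNX and serve it
# with ONNX Runtime. Checkpoint path and shapes are illustrative.
import torch
import onnxruntime as ort
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("w2v2-custom").eval()
model.config.return_dict = False           # export a plain tuple of outputs

dummy = torch.randn(1, 16_000)             # one second of 16 kHz audio

torch.onnx.export(
    model, dummy, "asr.onnx",
    input_names=["input_values"], output_names=["logits"],
    dynamic_axes={"input_values": {0: "batch", 1: "samples"},
                  "logits": {0: "batch", 1: "frames"}},
    opset_version=14,
)

session = ort.InferenceSession(
    "asr.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
logits = session.run(["logits"], {"input_values": dummy.numpy()})[0]
print(logits.shape)                         # (1, frames, vocab_size)
```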

Concluding notes

All in all, I would say: consider building an in-house ASR stack when your requirements, be they privacy, compliance, or an enormous volume of data, necessitate an in-house solution. Only if you are at the scale where it seems obvious to invest in it, or if you have access to the required teams and want to build a customized ASR solution because you absolutely need that competitive edge, does it make sense to invest in building and maintaining an ASR solution in-house. However, nowadays you can also consider building your ASR solution with NVIDIA Riva.

Whether you'd like to use an in-house solution built using NVIDIA Riva or a SaaS provider, do take a look at our unified STT API offering. It abstracts away the provider-specific details and gives you the ability to switch between providers to take advantage of their individual models, benefits, and customization options, via a standard workflow, all while ensuring a consistent and enjoyable developer experience.
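To illustrate the idea, this is a hypothetical sketch, not Exemplary's actual API: such an abstraction boils down to one shared interface with per-provider adapters behind it, so swapping providers becomes a configuration change.

```python
# Hypothetical sketch of a provider-agnostic transcription interface.
# Class and method names are illustrative, not a real SDK.
from typing import Protocol

class Transcriber(Protocol):
    def transcribe(self, audio_url: str, language: str = "en") -> str: ...

class DeepgramTranscriber:
    def transcribe(self, audio_url: str, language: str = "en") -> str:
        # Call Deepgram here and normalize its response to plain text.
        raise NotImplementedError

class RivaTranscriber:
    def transcribe(self, audio_url: str, language: str = "en") -> str:
        # Call an in-house NVIDIA Riva deployment and normalize likewise.
        raise NotImplementedError

def get_transcriber(provider: str) -> Transcriber:
    # Swapping providers becomes a one-line configuration change.
    return {"deepgram": DeepgramTranscriber, "riva": RivaTranscriber}[provider]()
```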
