Speech to Text With Google Cloud

by Johannes Filter, Sep 8, 2020

Nobody should manually transcribe long interviews in the year 2021. Machine learning has already solved this problem. Google offers a simple API that works for over 120 languages. Check out their demo (it only works for sound clips up to 60 seconds).

The API is free for the first hour of audio. After that, they charge 0.006 per 15 seconds (1.44 Euro per hour).
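To get a feeling for the cost, here is a rough sketch of the arithmetic. It assumes that the first hour is free and that billing happens in 15-second blocks of 0.006 each, which is what matches the 1.44 per hour figure. The function name `estimate_cost` is made up for this example, and actual prices may differ.

```shell
# Rough cost estimate for a recording, given its length in seconds.
# Assumptions: first 3600 seconds are free, then 0.006 per started
# 15-second block. Amounts are computed in thousandths to stay in
# integer arithmetic.
estimate_cost() {
  total_seconds=$1
  free_seconds=3600

  if [ "$total_seconds" -gt "$free_seconds" ]; then
    billable=$((total_seconds - free_seconds))
  else
    billable=0
  fi

  # Round up to whole 15-second blocks.
  blocks=$(( (billable + 14) / 15 ))
  mills=$((blocks * 6))

  printf '%d.%03d\n' $((mills / 1000)) $((mills % 1000))
}

estimate_cost 7200   # two-hour interview, prints 1.440
```

So a two-hour interview would come to roughly 1.44 under these assumptions, and anything up to one hour costs nothing.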

You need to work with the command line to use the API; AFAIK there is no frontend. The official tutorials are fine, but I stumbled upon a few obstacles, so I am documenting some things in this blog post.

Follow the official tutorial and set up a project, enable billing, etc.

If you are on a Mac, install the gcloud CLI with Homebrew.

brew install --cask google-cloud-sdk

If you want to process longer files, you first need to upload them to a Google Cloud Storage bucket. So create a bucket (as part of the already created Google project) and upload your audio file to it.
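Creating the bucket and uploading the file can also be done from the command line with `gsutil`, which ships with the Google Cloud SDK. This is only a minimal sketch: the function name `upload_audio` and the bucket and file names are placeholders, and it assumes your gcloud project is already configured.

```shell
# Create a bucket and copy an audio file into it.
# Usage: upload_audio <bucket-name> <local-file>
upload_audio() {
  bucket=$1
  file=$2

  # "mb" = make bucket. If the bucket already exists, this step
  # reports an error and the upload below still runs.
  gsutil mb "gs://$bucket"

  # Copy the local file into the bucket.
  gsutil cp "$file" "gs://$bucket/"
}

# Example call (placeholder names):
# upload_audio your-bucket-name the-audio-file.flac
```

Bucket names are global, so you may have to pick something less generic than the placeholder.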

You also need to use a slightly different API endpoint because large files are processed asynchronously. Check out their tutorial.

I am not sure about all the audio formats the API accepts, but using FLAC made it easy. So first convert your audio to a FLAC file. NB: The FLAC file should only have one channel. You may have to customize your conversion settings when using VLC (for macOS: Custom > Audio Codec > set Channels to 1).
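If you prefer the command line over VLC, the same conversion should also be possible with ffmpeg. Note that this is a swapped-in alternative, not what I used above. The sketch assumes ffmpeg is installed, and the function name `to_mono_flac` is made up for this example.

```shell
# Convert an audio file to a single-channel FLAC file next to it.
# Usage: to_mono_flac <input-file>
to_mono_flac() {
  input=$1
  # Swap the file extension for .flac, e.g. interview.mp3 -> interview.flac
  output="${input%.*}.flac"

  # -ac 1 downmixes the audio to one channel (mono);
  # the FLAC encoder is picked from the output file extension.
  ffmpeg -i "$input" -ac 1 "$output"
}

# Example call (placeholder file name):
# to_mono_flac interview.mp3
```

Do not run this on a file that already ends in .flac, because input and output would then be the same file.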

gcloud ml speech recognize-long-running gs://your-bucket-name/the-audio-file.flac --language-code=de-DE --async

The command should return an ID if everything is working as intended. Use the ID to wait for the job to finish.

gcloud ml speech operations wait 12121212121212 > text.json

The text is stored in chunks in a JSON format. In my example, the chunks were split when a speaker switched.
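To get plain text out of the JSON, something like the following jq sketch may help. It assumes that jq is installed and that the file has a top-level `results` list in which every chunk carries `alternatives` with a `transcript` field, as in the Speech-to-Text response format. If your file is nested differently, the filter needs adjusting. The function name `transcript_from_json` is made up for this example.

```shell
# Print the first (most likely) transcript of every chunk, one per line.
# Usage: transcript_from_json <json-file>
transcript_from_json() {
  jq -r '.results[].alternatives[0].transcript' "$1"
}

# Example call, writing the plain text to a new file:
# transcript_from_json text.json > text.txt
```

Each chunk ends up on its own line, so speaker switches stay visible as line breaks if the chunks were split that way.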