Audio encoding for the Google Cloud Speech-to-Text API

The Speech recognition BLOCK calls the Google Cloud Speech-to-Text API, which supports the following audio encodings:

Encoding type   Explanation
LINEAR16        16-bit linear PCM: uncompressed, signed samples in little-endian byte order.
FLAC            Recommended for the Cloud Speech-to-Text API because its compression is lossless.
MULAW           8-bit samples that use G.711 PCMU/mu-law to compand 14-bit audio samples.
AMR             Adaptive Multi-Rate Narrowband speech codec. The sample rate of the voice data must be 8,000 Hz.
AMR_WB          Adaptive Multi-Rate Wideband speech codec. The sample rate of the voice data must be 16,000 Hz.

FLAC and LINEAR16 are the recommended encoding types for best results with the Cloud Speech-to-Text API.

Audio files in a lossy encoding that is not listed above (MP3, AAC, etc.) must first be converted to FLAC or LINEAR16. Be aware, however, that conversion does not restore the data discarded by lossy compression, so the Cloud Speech-to-Text API may recognize such audio less accurately than audio that was lossless from the start.
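To decide whether a file needs converting, it can help to look at its current encoding first. The sketch below assumes SoX (introduced in the next section) is installed, since the soxi command ships with it. The function name describe_audio is a hypothetical helper, not part of SoX, and the exact wording soxi prints for each encoding depends on the file.

```shell
# describe_audio is a hypothetical helper name; soxi is installed with SoX.
# Prints the encoding, sample rate (Hz), channel count, and bits per sample
# of the file given as the first argument.
describe_audio() {
  printf 'encoding=%s rate=%s channels=%s bits=%s\n' \
    "$(soxi -e "$1")" "$(soxi -r "$1")" "$(soxi -c "$1")" "$(soxi -b "$1")"
}

# Usage: describe_audio input.mp3
```

If the reported encoding is not one of the types in the table, or the sample rate does not match what the encoding requires, the file probably needs converting.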

Converting audio data

Here, we’ll demonstrate a simple method for converting audio file formats using SoX (Sound eXchange).

The How to use the Speech recognition BLOCK page explains how to record FLAC-encoded audio with the application Audacity. Audacity can also import audio and export it in FLAC format, so you can use it instead of SoX if you prefer.

SoX is a command-line tool for Windows, macOS, and Linux that makes it easy to edit and convert audio files.

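For example, to convert an MP3 file into FLAC, we would enter the following sox command into the Command Prompt or Windows PowerShell (Windows) or Terminal (macOS, Linux). The options placed before the output file name set its sample rate to 16 kHz, its sample size to 16 bits, and its channel count to one (mono):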

sox input.mp3 --rate 16k --bits 16 --channels 1 output.flac
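The same approach should work for the other recommended encoding, LINEAR16. The sketch below is a variation on the command above rather than a documented procedure: to_linear16 is a hypothetical helper name, and it assumes that 16-bit signed PCM written to a WAV file is acceptable as LINEAR16 input, which is worth confirming against the Cloud Speech-to-Text documentation.

```shell
# to_linear16 is a hypothetical helper name, not a SoX command.
# Writes 16 kHz, 16-bit, mono signed-integer PCM to the output file.
# Assumption: the output name ends in .wav so SoX writes a WAV container.
to_linear16() {
  sox "$1" --rate 16k --bits 16 --channels 1 --encoding signed-integer "$2"
}

# Usage: to_linear16 input.mp3 output.wav
```

Wrapping the command in a function keeps the option list in one place, so every converted file ends up with the same sample rate, sample size, and channel count.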

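We can also trim audio files when we convert them. In this example, we’ll take an MP3 file, trim it so it starts one second into the recording and runs for five seconds (the trim 1 5 portion of the command), and convert it to the FLAC format. Note that SoX reads the second value as a duration measured from the first position, not as an end time, so the clip ends six seconds into the original recording: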

sox input.mp3 --rate 16k --bits 16 --channels 1 output.flac trim 1 5
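SoX's trim effect takes a start position followed by a length, so cutting between two known timestamps means subtracting the start from the end. The sketch below does that arithmetic in the shell. trim_to_flac is a hypothetical helper name, and it assumes both times are whole seconds.

```shell
# trim_to_flac is a hypothetical helper name, not a SoX command.
# Arguments: input file, output file, start second, end second (whole numbers).
trim_to_flac() {
  length=$(( $4 - $3 ))   # trim expects a length, not an end time
  if [ "$length" -le 0 ]; then
    echo "end must be greater than start" >&2
    return 1
  fi
  sox "$1" --rate 16k --bits 16 --channels 1 "$2" trim "$3" "$length"
}

# Usage: keep the audio between the 1-second and 5-second marks
# trim_to_flac input.mp3 output.flac 1 5
```

With the usage shown, the helper runs sox with trim 1 4, which keeps the four seconds between the two marks.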

Because the MP3 format discards a considerable amount of audio data, a FLAC file converted from MP3 can still give poor recognition results with the Cloud Speech-to-Text API. In general, it is best to avoid MP3 and other lossy encoding formats entirely.