Building Effective Custom Vocabulary

Speech-to-text engines are trained on a vast collection of sample recordings and texts. As a result, they perform well when your source resembles “average” speech, i.e. typical conversations on common topics that use the everyday vocabulary and phrases of your language. They perform less well when your source contains many unique words, specialized terms, or phrases that the engine has not encountered before.

Using custom vocabulary allows you to inform the speech engine about unique words and phrases that are likely to occur in your source audio, so that it is more likely (but not guaranteed) to recognize them correctly.

Ideally, you should include proper nouns, acronyms, specialized terms, and short phrases that occur frequently in your audio but are not part of typical everyday conversation.

The downside of using custom vocabulary is that the words and phrases you include will be prioritized over more commonly spoken words and phrases. Using a large custom vocabulary with too many unnecessary words or phrases can actually decrease the accuracy of the results.

Following the tips below will help improve transcription accuracy.

Do Include

  • Proper nouns (names of people, places, companies, etc.) – especially if they are from a different language, non-dictionary words, or words with an unusual spelling
  • Acronyms (company names, abbreviations, etc.)
  • Short phrases (100 characters or fewer) which are unique to your source and are repeated often (e.g. a catchphrase often spoken by a character, or industry terminology)

Do Not Include

  • Words or phrases that are unlikely to occur in your source file
  • Phrases longer than 100 characters
  • Long phrases, sentences, or paragraphs that occur only once in your source
  • Long lists of unrelated or unlikely words or phrases
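
One way to apply the two lists above is to check each candidate term against a sample of text that resembles your audio (for example, an earlier transcript or a script) and keep only the terms that actually recur. The sketch below is an illustration only and is not part of any engine's API: the function name `select_terms`, the `min_count` threshold, and the idea of using a reference text are all assumptions.

```python
import re

MAX_PHRASE_CHARS = 100  # per-phrase limit stated in this guide


def select_terms(candidates, reference_text, min_count=2):
    """Keep candidates that recur in reference_text and fit the length limit.

    Hypothetical helper: matching is case-insensitive and whole-word,
    and min_count=2 is an assumed reading of "repeated often".
    """
    selected = []
    seen = set()
    for term in candidates:
        term = term.strip()
        key = term.lower()
        # Skip blanks, case-insensitive duplicates, and over-long phrases.
        if not term or key in seen or len(term) > MAX_PHRASE_CHARS:
            continue
        seen.add(key)
        # Whole-word match so a short term is not counted inside a longer word.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        hits = len(re.findall(pattern, reference_text, flags=re.IGNORECASE))
        if hits >= min_count:
            selected.append(term)
    return selected
```

Terms that never appear in the sample, or appear only once, are dropped, which keeps the vocabulary focused on words the engine is actually likely to hear.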

Format

The custom vocabulary should be specified as a comma-separated list of words or short phrases.
Longer phrases can be placed on separate lines (line-delimited).
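
As a rough illustration of this format, the sketch below joins short terms with commas on one line and places longer phrases on their own lines. The function name `format_vocabulary` and the `long_threshold` cutoff are assumptions made for this example (the guide does not define how long a "longer" phrase is), as is the choice to join terms with a bare comma and no space.

```python
def format_vocabulary(terms, long_threshold=30):
    """Return vocabulary text: short terms comma-separated, long phrases one per line.

    long_threshold is an assumed cutoff, not a value defined by the engine.
    """
    short, long_phrases = [], []
    for term in terms:
        term = term.strip()
        if not term:
            continue
        if "," in term:
            # A comma inside a term would be read as a separator.
            raise ValueError(f"term contains a comma: {term!r}")
        if len(term) > long_threshold:
            long_phrases.append(term)
        else:
            short.append(term)
    lines = []
    if short:
        lines.append(",".join(short))
    lines.extend(long_phrases)
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_vocabulary([
        "Acme Corp",
        "gRPC",
        "the quick onboarding checklist for new hires",
    ]))
```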

Limits

Note the following limits, which are enforced by the API engine. A vocabulary file that exceeds any of them can cause the job to fail.
These limits are high enough that if you are approaching them, you should review the recommendations above.

  • Number of phrases: 5,000
  • Characters per phrase: 100
  • Total characters: 100,000
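
Because exceeding a limit can fail the job, it may be worth checking a vocabulary before submitting it. The following is a minimal sketch of such a check, not the engine's own validation: the function name `check_limits` is hypothetical, and it assumes that the total counts only the characters of the phrases themselves (not separators) and that a character is one Python string element.

```python
MAX_PHRASES = 5000         # number of phrases
MAX_PHRASE_CHARS = 100     # characters per phrase
MAX_TOTAL_CHARS = 100_000  # total characters


def check_limits(phrases):
    """Return a list of problems found; an empty list means within limits."""
    problems = []
    if len(phrases) > MAX_PHRASES:
        problems.append(
            f"{len(phrases)} phrases exceeds the limit of {MAX_PHRASES}"
        )
    for phrase in phrases:
        if len(phrase) > MAX_PHRASE_CHARS:
            problems.append(
                f"phrase of {len(phrase)} characters exceeds the limit of "
                f"{MAX_PHRASE_CHARS}: {phrase[:20]!r}..."
            )
    # Assumption: separators (commas, newlines) are not counted here.
    total = sum(len(phrase) for phrase in phrases)
    if total > MAX_TOTAL_CHARS:
        problems.append(
            f"{total} total characters exceeds the limit of {MAX_TOTAL_CHARS}"
        )
    return problems
```

Calling `check_limits(["Acme Corp", "gRPC"])` returns an empty list, while a list containing a 101-character phrase returns one problem describing it.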