Getting the Best Results from Auto-Transcription

Using Telestream Cloud for auto-transcription of media files is a great way to get started on a caption or subtitle project.

All submissions result in an automatically generated transcript timed to match your media file. The timed transcript can be reviewed and edited on the fly in the Telestream Cloud console or populated directly into your MacCaption or CaptionMaker project. The more accurate the results, the less cleanup and editing is needed to make your transcript perfect. Below are some best practices and tips that can help increase the accuracy of the auto-generated transcript from Timed Text Speech.

Isolating the Spoken Word
For original content or media with multichannel audio, the dialogue-only track can be isolated to eliminate noise, music, and sound effects. This isolation can be done with any video editing software or audio production tool. By submitting audio with only spoken words to the Timed Text Speech engine, accuracy of the auto-generated transcripts can be greatly increased.

To accomplish this, a video editor can open the project in Adobe Premiere or Avid, silence all audio tracks that do not contain dialogue, and then export an audio-only file. Timed Text Speech can handle audio files such as .mp3, .aiff, and .wav.

In some cases your media file, such as a QuickTime or .MXF file, may contain multiple audio tracks. This could be a 5.1 mix, or isolated tracks for archival or transcoding purposes. Within MacCaption or CaptionMaker, users have the option to select any of the audio tracks within the video file before submitting their project to Timed Text Speech. By default the software will submit tracks 1 and 2. If a different audio track in the file contains the spoken words to transcribe, the software can be configured to submit that track instead. The spoken-word-only track is then the one processed, and the results will in turn be much more accurate.
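
For users who prepare the audio outside MacCaption or CaptionMaker, the same track selection can be sketched with a command-line tool. The example below assumes the open-source ffmpeg tool is installed; ffmpeg is not part of Telestream Cloud, and the function name and file names are hypothetical. It only builds the command, which pulls one audio stream out of a multi-track file and writes it as an audio-only file.

```python
def build_extract_command(source, destination, audio_index=0):
    """Build an ffmpeg command that copies out one audio stream.

    audio_index is zero-based, so 0 is the first audio stream in the file.
    This selects whole streams only; it does not split the channels
    inside a single multichannel stream (for example a 5.1 mix).
    """
    return [
        "ffmpeg",
        "-i", source,                  # input media file (assumed name)
        "-map", f"0:a:{audio_index}",  # pick one audio stream from input 0
        "-vn",                         # make sure no video is written
        destination,                   # output format follows the extension
    ]


# Example: take the third audio stream of a hypothetical QuickTime file.
command = build_extract_command("program.mov", "dialogue.wav", audio_index=2)
print(" ".join(command))
```

The resulting list can be passed to Python's subprocess.run to perform the extraction. Note that ffmpeg counts audio streams from 0, while track numbering in editing software usually starts at 1.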

Training the Speech Engine
In many cases media files which require transcription may contain names, phrases, and acronyms that are not common. The speech engine may consistently get these wrong, causing users to manually correct the results again and again. To remedy this, Telestream Cloud’s console offers a way to train the Timed Text Speech engine by uploading a corpus text file, or by manually entering these uncommon words.

This file is a simple plain text .txt document that contains a list of names and phrases that are used in a project. Users can upload this .txt document to any specific project that may require training to increase accuracy. We recommend that each line of the .txt document contain a phrase rather than only an individual word.

For example, an effective corpus text document would look like this:

John Galveston
CEO of the Company
Working with CDN providers
Transcoding and captioning solutions
MacCaption Software

A corpus text file that lists only individual words, one per line, is not effective.

Phrases tell the speech engine what to expect and which other words are typically used alongside the new vocabulary. This added context makes results that involve the custom vocabulary considerably more accurate.
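
The phrase recommendation can be checked before a corpus file is uploaded. The following is a minimal sketch, not a Telestream tool: the function name is hypothetical, and it simply separates multi-word phrases from single-word entries so that the latter can be expanded with some context.

```python
def split_corpus_lines(lines):
    """Separate multi-word phrases from single-word entries.

    Returns a (phrases, single_words) tuple. Blank lines are ignored.
    """
    phrases, single_words = [], []
    for raw in lines:
        entry = raw.strip()
        if not entry:
            continue
        if len(entry.split()) > 1:
            phrases.append(entry)
        else:
            single_words.append(entry)
    return phrases, single_words


corpus = [
    "John Galveston",
    "CEO of the Company",
    "CDN",  # a lone word gives the engine no surrounding context
    "MacCaption Software",
]
phrases, single_words = split_corpus_lines(corpus)
print("Entries to expand into phrases:", single_words)
```

Any entry reported in single_words is a candidate for rewriting as a short phrase, such as "Working with CDN providers".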

In some cases, creating a corpus text file for training is very easy and takes very little time. Some users simply repurpose old transcripts or caption files from the same TV program or project. For example, if a broadcaster needs to create a transcript for season 3 of a TV program, they can open the caption files from seasons 1 and 2 using MacCaption or CaptionMaker and export a corpus file which can be used for training Timed Text Speech. These two previous seasons contain the names and phrases that would greatly expand the vocabulary.
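
Where older captions are only available as plain caption files, a similar corpus can be assembled by hand. The sketch below assumes the captions are in SubRip (.srt) format, which is an assumption for illustration only; the corpus export in MacCaption or CaptionMaker may work differently. The function name is hypothetical. It keeps the caption text, drops cue numbers and timing lines, and removes duplicate lines.

```python
import re

# Matches SRT timing lines such as "00:00:01,000 --> 00:00:03,000".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s+-->\s+")


def srt_to_corpus_lines(srt_text):
    """Return unique caption text lines from SRT content, in first-seen order.

    Caveat: a caption line made up only of digits is treated as a cue
    number and skipped.
    """
    seen = set()
    lines = []
    for raw in srt_text.splitlines():
        line = raw.strip()
        # Skip blank lines, cue counters, and timing lines.
        if not line or line.isdigit() or TIMING.match(line):
            continue
        # Drop simple formatting tags such as <i>...</i>.
        line = re.sub(r"<[^>]+>", "", line).strip()
        if line and line not in seen:
            seen.add(line)
            lines.append(line)
    return lines


sample = """1
00:00:01,000 --> 00:00:03,000
Welcome back, I'm John Galveston.

2
00:00:03,500 --> 00:00:06,000
<i>Working with CDN providers</i>

3
00:00:06,500 --> 00:00:08,000
Welcome back, I'm John Galveston.
"""
corpus_lines = srt_to_corpus_lines(sample)
print("\n".join(corpus_lines))
```

Writing the returned lines to a .txt file, one per line, gives a phrase-based corpus document of the kind described above.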

Users can also leverage the vocabulary training of Timed Text Speech when a rough transcript of the media file is already available before submission to Telestream Cloud. This rough transcript will also contain the names and key phrases for the project. Timed Text Speech would then automatically time the rough transcript and fill in the text that is missing.

Content Types Best Suited for Automatic Speech Recognition (ASR)
The type of video content plays an important role in the level of accuracy achieved using auto-transcription software. For example, a news show with a professional announcer and clear studio audio will be transcribed far more accurately than a video shot outdoors in a noisy environment on a mobile phone. In addition, loud music, singing, and shouting will bring down the level of accuracy. There are also cases where speakers may change their voice to provide dialogue for children’s programming or for dramatic effect. A speech engine that is designed and trained for standard voices may not be able to understand these voice tones. Generally speaking, projects with clear studio-quality recordings, minimal background noise, and a professional speaker will result in the best accuracy.

Creating a Proxy
Professional video companies generally work with high quality video master files called mezzanine files. These files are used the same way tape masters were used in the old days. In many processing workflows, the original video must be uncompressed or high bitrate when submitted. This is not the case for Timed Text Speech workflows. Because Telestream Cloud requires only the audio content for auto-transcription, users can submit a low bitrate MP4, or just the audio file. As long as the audio quality is good, you will achieve high quality results; the video quality or resolution does not affect your results.
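
To give a rough sense of how much smaller a proxy can be, the sketch below estimates file size from bitrate and duration. The bitrates and the program length are illustrative assumptions, not Telestream requirements, and the function name is hypothetical.

```python
def size_megabytes(bitrate_kbps, duration_seconds):
    """Approximate file size in megabytes (10^6 bytes) at a constant bitrate."""
    bits = bitrate_kbps * 1000 * duration_seconds
    return bits / 8 / 1_000_000


duration = 30 * 60  # a 30-minute program, in seconds

mezzanine_mb = size_megabytes(100_000, duration)  # assumed 100 Mbps master file
audio_mb = size_megabytes(128, duration)          # assumed 128 kbps audio file

print(f"mezzanine: {mezzanine_mb:,.0f} MB, audio only: {audio_mb:.1f} MB")
```

Under these assumed figures, the master file is about 22,500 MB while the audio-only file is about 28.8 MB, so the proxy is much faster to upload.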

Voiceover for the Purpose of ASR
For video editing workflows, it’s quite common for editors to record a rough voiceover when editing, prior to bringing voice talent into the studio to record the final audio. This also provides an opportunity for video editors to re-speak any portions of the video project that do not have clear audio. This rough voiceover can then be exported from the video editing system and submitted to Timed Text Speech for processing. The results will be more accurate than those produced from the original audio.