What Are Captions?
Captions are a text representation of spoken words and other important audio information, synced with a video.
The National Association of the Deaf defines captioning as “the process of converting the audio content of a television broadcast, webcast, film, video, CD-ROM, DVD, live event, or other productions into text and displaying the text on a screen, monitor, or other visual display system.
“Captions not only display words as the textual equivalent of spoken dialogue or narration, but they also include speaker identification, sound effects, and music description.”
There are some important things to understand about the different types of captions. Seeing text on a video doesn’t mean the video has a fully accessible caption file. Captions should be a separate text file associated with the video so that they can be adjusted to viewer preferences.
Types of Captions
Closed captions can be turned on and off according to the viewer's preferences. Open captions are burned into a video. While open captions are better than nothing, closed captions provide users with the most accessible experience.
Open vs. Closed Captions
There are some challenges that come with open captions:
- They are burned into the video and cannot be edited
- The captions cannot be turned off
- They cannot be moved
- A viewer cannot change the font size or text color of the captions
- The captions are not searchable
Closed captions are added to a video as a separate text file and provide several benefits for those who access your content:
- The font size and color of the text can be adjusted
- The captions can be turned on or off by the viewer
- The caption file can be edited
- The captions can be searched
- The captions can be moved on screen
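To make the idea of a separate caption file concrete, here is a minimal sketch that builds one in WebVTT, a common text format for closed captions on the web. The choice of WebVTT is an assumption, since this article does not name a file format, and the `Cue` class, the helper functions, and the example cues are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str     # caption text, including any speaker or sound labels


def format_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def build_webvtt(cues: list[Cue]) -> str:
    """Build the contents of a WebVTT caption file from a list of cues."""
    blocks = ["WEBVTT"]  # required header on the first line of the file
    for cue in cues:
        timing = f"{format_timestamp(cue.start)} --> {format_timestamp(cue.end)}"
        blocks.append(f"{timing}\n{cue.text}")
    # Blocks in a WebVTT file are separated by blank lines.
    return "\n\n".join(blocks) + "\n"


# Hypothetical cues, invented for illustration only.
example_cues = [
    Cue(0.0, 2.5, "[upbeat music]"),                      # music description
    Cue(2.5, 6.0, "<v Narrator>Welcome to the course."),  # speaker identification
    Cue(6.0, 8.25, "[door closes]"),                      # sound effect
]

vtt_text = build_webvtt(example_cues)
print(vtt_text)
```

Because the result is plain text stored apart from the video, it can be edited and searched, and a video player can let the viewer turn it on or off and restyle it, which is not possible with captions burned into the picture.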
Machine (Auto-Generated) Captions vs. Professional Captions
Video captions also vary in quality. Professional captions are reviewed by a human editor who makes sure they meet federal quality standards and have a high accuracy rate. Machine (auto-generated) captions use automatic speech recognition (ASR) software to create closed captions for a video. This is frequently a built-in feature on platforms such as YouTube. While they can be a helpful starting point, machine captions do not meet the USU Video and Audio Accessibility Standards. 3Play Media has found that “solely using ASR to generate auto-captions for recorded videos is detrimental to the accuracy of the captions”.
You have probably seen examples of machine caption errors when watching a video.
Machine captions can sometimes be fairly accurate, but many factors have to be just right for that to happen, including high-quality audio, no background noise, few or no grammatical errors, and few mispronunciations. If any of these is off, the accuracy of the captions can drop as low as 50 percent and lead to a poor user experience. 3Play Media states, “The key to 99% or higher caption accuracy is human interaction” such as that available from professional captioning.
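To give a sense of what an accuracy percentage can mean, here is a minimal sketch of one common way to score captions: count the word-level errors against a human-verified reference and subtract the error rate from 100 percent. This is an assumption for illustration. The article does not say how 3Play Media or any standard measures accuracy, and the function names and sample sentences below are hypothetical.

```python
def word_edit_distance(reference: list[str], hypothesis: list[str]) -> int:
    """Minimum number of word substitutions, insertions, and deletions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # word missing from the captions
                current[j - 1] + 1,      # extra word in the captions
                previous[j - 1] + cost,  # matching or substituted word
            ))
        previous = current
    return previous[-1]


def caption_accuracy(reference_text: str, caption_text: str) -> float:
    """Return word-level accuracy between 0.0 and 1.0 (1 minus the word error rate)."""
    # Lowercasing means capitalization errors are NOT counted in this sketch.
    reference = reference_text.lower().split()
    hypothesis = caption_text.lower().split()
    if not reference:
        raise ValueError("reference text must contain at least one word")
    errors = word_edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


# Hypothetical example: one wrong word out of six.
verified = "please turn on the closed captions"
machine = "please turn on the clothes captions"
print(f"{caption_accuracy(verified, machine):.1%}")  # prints 83.3%
```

Even a single substituted word in a six-word sentence pulls this score down to about 83 percent, which illustrates how far a typical auto-captioned video can fall from a 99 percent target. A real quality review would also look at punctuation, capitalization, timing, and non-speech information, none of which this simplified measure checks.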
Auto-generated captions can also hurt your brand. Capitalization and punctuation errors can make it difficult for a viewer to understand what your content is actually about. Auto-captions are also well known for dropping words or substituting one word for another, and you never want to be seen as saying something you don’t mean. This could cause embarrassment for both you and your brand.
In addition, auto-captions can make viewers who need captions feel like they are not valued by the brand. Without captions, some viewers have no choice but to stop watching the video. Providing high-quality captions helps all viewers feel included and valued.
Commonly Mistaken as Captions
Besides captions, there are other text representations of audio content that are sometimes mistaken for captions. These can be helpful in different use cases.
Subtitles include only the spoken words of a video or audio recording and are geared toward viewers who do not understand or are not familiar with the language of the video. They do not contain the non-speech descriptions, such as music or sound effects, that are necessary for accessibility.
Transcripts are text files containing the spoken words of a video or audio recording. They may or may not include non-speech descriptions. Transcripts are particularly useful for podcasts or other content that only has audio.