-
Captions are super-important.
-
Also very important for engagement and social media content, since you never know whether the viewer has their audio on or off. @JonnyBurger said that they are working on the Recorder project, which will come with a solution for captions/subtitles. I'm looking forward to getting my hands on that, but in the meantime I'm wondering how others have solved this problem.
-
@gverri Our current stance:
-
Hey there, I've been looking for the same solution for a while and found a few good repos that seem worth trying, so I'd like to share them with you. To create subtitle files from the audio/video file:
To parse subtitle files:
Hope this helps. Let's hack on this and add subtitles to Remotion.
-
Here is my code guys.

Generate timestamped subtitle from audio

from faster_whisper import WhisperModel
import json
model_size = "medium"
model = WhisperModel(model_size)
def extract_subtitle(file_path, output_path):
"""Extract subtitle from file_path."""
# ===========================
# NOTE Extracting text here
# ===========================
segments, info = model.transcribe(
file_path,
word_timestamps=True,
append_punctuations="\"'“¿([{-",
prepend_punctuations="\"'.。,,!!??::”)]}、",
)
segments = list(segments)
wordlevel_info = []
for segment in segments:
for word in segment.words:
wordlevel_info.append(
{
"word": word.word,
"start": word.start,
"end": word.end,
"duration": word.end - word.start,
}
)
# ===========================
# NOTE Saving the subtitle
# ===========================
print(f"Saving subtitle to {output_path}")
with open(output_path, "w") as f:
json.dump(wordlevel_info, f)

Animate subtitles

import {TransitionSeries} from '@remotion/transitions';
import {Easing, interpolate, useCurrentFrame} from 'remotion';
interface TimestampedWord {
start: number;
end: number;
word: string;
duration: number;
}
interface TimestampedLine {
start: number;
end: number;
words: string;
data: TimestampedWord[];
}
interface HighlightedLine {
words: TimestampedWord[];
duration: number;
}
interface SubtitleAnimationProps {
timestampedWords: any;
maxCharacterByLine: number;
maxWordByLine: number;
}
export const SubtitlesAnimation = ({
timestampedWords,
maxCharacterByLine,
maxWordByLine,
}: SubtitleAnimationProps) => {
const cleanSubtitles = aggregateSubtitleWords(timestampedWords);
const frame = useCurrentFrame();
const cleanSubtitlesWithWhite = addWhite(cleanSubtitles);
const subtitleLines: TimestampedLine[] = generateSubtitleLines(
cleanSubtitlesWithWhite,
maxCharacterByLine,
maxWordByLine
);
const subtitlesWithHighlight: HighlightedLine[] =
generateWordHighlight(subtitleLines);
let startFrame = 0;
let endFrame = 0;
return (
<>
{subtitlesWithHighlight.map((lines, index) => {
endFrame += lines['duration'] * 30;
const scale = interpolate(frame, [startFrame, endFrame], [0, 1], {
easing: Easing.bounce,
extrapolateRight: 'clamp',
extrapolateLeft: 'clamp',
});
const opacity = interpolate(
frame,
[endFrame - 0.00001, endFrame],
[1, 0],
{
extrapolateLeft: 'clamp',
extrapolateRight: 'clamp',
}
);
startFrame = endFrame;
return (
<div
key={index}
style={{
fontSize: '65px',
color: 'black',
fontFamily: 'Futura',
position: 'absolute',
textTransform: 'uppercase',
opacity,
transform: `scale(${scale}) `,
}}
>
<span
style={{
backgroundColor: 'transparent',
borderRadius: '30px',
padding: '20px',
}}
>
{lines['words']}
</span>
</div>
);
})}
</>
);
};
const generateWordHighlight = (subtitleLines: TimestampedLine[]) => {
const subtitlesWithHighlight = [];
for (let i = 0; i < subtitleLines.length; i++) {
const line = subtitleLines[i];
const data = line.data;
for (let j = 0; j < data.length; j++) {
const lineWithHighlight = data.map((wordsMetadata, index) => {
return index == j ? (
<span
style={{
color: 'yellow',
textShadow: '#FC0 0 0 25px',
}}
>
{wordsMetadata.word}
</span>
) : (
<span
style={{
backgroundColor: 'transparent',
color: 'white',
// backgroundColor: 'black',
}}
>
{wordsMetadata.word}
</span>
);
});
let duration = data[j]['duration'];
if (duration == 0) {
duration = 0.0001;
}
subtitlesWithHighlight.push({
words: lineWithHighlight,
duration,
});
}
}
return subtitlesWithHighlight;
};
const generateSubtitleLines = (
subtitles: TimestampedWord[],
max_character_by_line: number,
maxWordByLine: number
): TimestampedLine[] => {
let lines: TimestampedLine[] = [];
let currentLine = '';
let currentLineDuration = 0;
let currentLineStart = 0;
let currentLineEnd = 0;
let data = [];
for (let i = 0; i < subtitles.length; i++) {
const word = subtitles[i].word;
const duration = subtitles[i].duration;
const start = subtitles[i].start;
const end = subtitles[i].end;
const condDuration = currentLineDuration + duration > 14;
const length = currentLine.length + word.length > max_character_by_line;
const condMaxWordByLine = data.length >= maxWordByLine;
if (
condDuration ||
condMaxWordByLine ||
length ||
i == subtitles.length - 1
) {
lines.push({
words: currentLine,
start: currentLineStart,
end: currentLineEnd,
data,
});
data = [];
data.push(subtitles[i]);
currentLine = word;
currentLineDuration = duration;
currentLineStart = start;
currentLineEnd = end;
} else {
data.push(subtitles[i]);
currentLine += ' ' + word;
currentLineDuration += duration;
currentLineEnd = end;
}
// If this was the last word, push the line
if (i == subtitles.length - 1) {
lines.push({
words: currentLine,
start: currentLineStart,
end: currentLineEnd,
data,
});
}
}
return lines;
};
function aggregateSubtitleWords(subtitles: TimestampedWord[]) {
let aggregatedWords: TimestampedWord[] = [];
let temp: TimestampedWord;
let inAggregation = false;
for (let i = 0; i < subtitles.length; i++) {
// NOTE Current word
const start = subtitles[i].start;
const duration = subtitles[i].duration;
const word = subtitles[i].word;
// NOTE Next word
let nextWord;
let nextEnd;
let nextDuration;
if (i == subtitles.length - 1) {
nextWord = '';
nextEnd = 0;
nextDuration = 0;
} else {
nextWord = subtitles[i + 1]['word'];
nextEnd = subtitles[i + 1]['end'];
nextDuration = subtitles[i + 1]['duration'];
}
if (
nextWord.startsWith('-') ||
nextWord.startsWith("'") ||
nextWord.startsWith(',') ||
nextWord.startsWith('.')
) {
if (inAggregation) {
temp = {
word: temp!.word + nextWord,
start: temp!.start,
end: nextEnd,
duration: temp!.duration + nextDuration,
};
// If this was the second-to-last word and the next one got aggregated, push the word and stop the loop
if (i == subtitles.length - 2) {
console.log(temp);
aggregatedWords.push(temp);
break;
}
} else {
inAggregation = true;
temp = {
word: word + nextWord,
start: start,
end: nextEnd,
duration: duration + nextDuration,
};
// If this was the second-to-last word and the next one got aggregated, push the word and stop the loop
if (i == subtitles.length - 2) {
aggregatedWords.push(temp);
break;
}
}
} else {
if (inAggregation) {
aggregatedWords.push(temp!);
inAggregation = false;
} else {
aggregatedWords.push(subtitles[i]);
}
}
}
return aggregatedWords;
}
function addWhite(subtitles: TimestampedWord[]) {
const newSubtitles = [];
for (let i = 0; i < subtitles.length; i++) {
if (i < subtitles.length - 1) {
if (subtitles[i]['end'] != subtitles[i + 1]['start']) {
newSubtitles.push(subtitles[i]);
newSubtitles.push({
word: '',
start: subtitles[i]['end'],
end: subtitles[i + 1]['start'],
duration: subtitles[i + 1]['start'] - subtitles[i]['end'],
});
} else {
newSubtitles.push(subtitles[i]);
}
} else {
newSubtitles.push(subtitles[i]);
}
}
return newSubtitles;
}
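For context, here's a rough sketch of how this could be wired into a composition. The JSON import path, fps, dimensions, and duration below are placeholders I made up, not part of the code above:

```tsx
import React from 'react';
import {Composition} from 'remotion';
// Assumed path: the word-level JSON written by the Python script above.
import timestampedWords from './subtitles.json';
import {SubtitlesAnimation} from './SubtitlesAnimation';

export const RemotionRoot: React.FC = () => {
  return (
    <Composition
      id="Subtitled"
      component={SubtitlesAnimation}
      durationInFrames={30 * 60} // assuming a 60s video at 30fps
      fps={30}
      width={1080}
      height={1920}
      defaultProps={{
        timestampedWords,
        maxCharacterByLine: 20,
        maxWordByLine: 4,
      }}
    />
  );
};
```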
-
Here is the render: Enregistrement.de.l.ecran.2024-02-15.a.15.16.28.mov
-
Check out my new package: you only need SRT text data, then use it with automatically created sequences.
-
Remotion implements the slider effect: english3_subtitled.mp4
-
#16 describes the need for captioning support for remotion to make its video outputs more accessible. In this discussion, I describe potential use cases for generated captioned remotion videos and some potential implementation paths to satisfy those use cases.
I'll go over definitions, remotion captioning output scenarios, and caption collection options.
Definitions
Want to set some terms so we're all on the same page.
Caption vs Subtitle
The term "caption" typically refers to text accompanying a video that describes the content of the video. This can include spoken word, textual descriptions of non-spoken audio (sound effects, music), and sometimes textual descriptions of visual content.
The term "subtitle" typically refers to the textual translation of spoken word or displayed text that's in a different language than what the presumptive viewer understands. This is common for multi-language films or for films distributed to audiences that do not understand the primary language of the film.
For simplicity, I'll use Caption throughout since the concerns for video output are the same for both.
Open Captions vs Closed Captions
"Open Captions" refer to text that is baked in to the image video output. Viewers cannot configure open captions since the text is embedded in the video frames. Additionally, the textual information is lost and cannot be read by assistive technology. The compositor of the video frames is responsible for setting and styling the displayed text.
"Closed Captions" refer to text that is embedded as metadata in a video. The metadata is any number of caption files (one for each desired locale) that tie text to timestamp ranges of the video when the text should be displayed. Viewers can enable captions for a locale (such as English or Spanish) or disable them. Video players can read a wide number of caption file formats and handle styling and displaying the text at the correct time. Assistive technology can also consume the caption data and provide it in alternative manners, such as through Text-To-Speech.
Open Captions were developed before video technology improved to allow things like Closed Captions. They can improve video accessibility for sighted people, but since the textual representations aren't available outside the visual medium, they exclude a great many individuals. In nearly all cases, Closed Captions should be the preferred method for capturing textual representation of video or audio content. For this reason, I primarily focus on Closed Captions for the purposes of this discussion.*
Caption File Formats
There are a lot of different file formats for structured captions, such as .srt, .sbv, .vtt, and more. For brevity, I'll just write .vtt instead of "structured caption file"; assume I mean all such file types. They're all pretty similar and I think we can treat them the same for the purposes of this discussion.
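For readers who haven't seen one, a minimal .vtt (WebVTT) file looks like this; the cue text and timings are made up:

```
WEBVTT

00:00:00.000 --> 00:00:02.500
Captions are super-important.

00:00:02.500 --> 00:00:05.000
You never know if the viewer has audio on.
```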
Remotion Captioning Output Scenarios
The following describes what I think are the end-state captioning use cases for remotion output and their respective levels of effort. Each assumes we've already handled providing or parsing caption data and will be able to access the desired caption for a given frame and, for closed captions, the complete video's .vtt. Considerations for getting the caption content itself are covered later.
Open Captions For Video Output With @remotion/renderer
Assuming an API for fetching the caption of a given frame, this won't be much more work than absolutely positioning the caption text. Remotion could provide default styling for captions and make it configurable.
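As a rough sketch of what that could look like (getCaptionAtFrame and the caption data shape are my assumptions here, not an existing remotion API):

```tsx
import React from 'react';
import {AbsoluteFill, useCurrentFrame} from 'remotion';

type Caption = {from: number; durationInFrames: number; caption: string};

// Hypothetical helper: find the caption that covers the current frame.
const getCaptionAtFrame = (captions: Caption[], frame: number) =>
  captions.find((c) => frame >= c.from && frame < c.from + c.durationInFrames);

export const OpenCaptions: React.FC<{captions: Caption[]}> = ({captions}) => {
  const frame = useCurrentFrame();
  const current = getCaptionAtFrame(captions, frame);
  if (!current) return null;
  return (
    <AbsoluteFill style={{justifyContent: 'flex-end', alignItems: 'center'}}>
      <div style={{fontSize: 48, color: 'white', textShadow: '0 0 8px black', marginBottom: 60}}>
        {current.caption}
      </div>
    </AbsoluteFill>
  );
};
```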
Open Captions For Video Output With @remotion/player
No changes would be necessary beyond "Open Captions For Video Output With @remotion/renderer".
Closed Captions For Video Output With @remotion/renderer
Remotion would not be responsible for rendering closed captions. Instead, it would simply need to pass any generated/provided caption files to FFmpeg to be embedded in the video. I believe this should be the highest priority.
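For illustration, muxing a caption file into an MP4 is essentially one FFmpeg invocation; something like the following, where the file paths are placeholders and mov_text is the embedded subtitle format MP4 containers use:

```ts
import {execFileSync} from 'node:child_process';

// Mux an existing caption file into the rendered video as a closed-caption track.
execFileSync('ffmpeg', [
  '-i', 'out/video.mp4',   // rendered remotion output (placeholder path)
  '-i', 'captions.en.vtt', // user-provided caption file (placeholder path)
  '-c:v', 'copy',          // don't re-encode video
  '-c:a', 'copy',          // don't re-encode audio
  '-c:s', 'mov_text',      // convert the cues to MP4's embedded subtitle format
  '-metadata:s:s:0', 'language=eng',
  'out/video-captioned.mp4',
]);
```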
Closed Captions For Video Output With @remotion/player
This one's quite tricky and it might be impossible to do correctly, at least in the sense of conventional caption accessibility. Videos rendered with @remotion/player are essentially just animated DOM elements and do not use any true video primitives. Whereas a <video /> could have child <track /> elements with associated captions, browsers offer no similar captioning support for arbitrary and controllable sequences of DOM manipulations.
The best I can think to do right now is to have the remotion player element include an empty <video /> (or, more simply, an empty <audio />) that's synchronized with the player state and includes the generated .vtts. This might work for fixed-length videos with fixed caption content. However, since the remotion video is dynamic, we'd need to regenerate the fake media and captions on the fly. Since we don't have control over how the consuming assistive technology reacts to swapped-out media and captions, we probably can't ensure a good experience.
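To make the idea concrete, here's a rough sketch of generating a VTT blob and attaching it to a hidden media element. All names are illustrative, and keeping the element's currentTime in sync with the player state (the hard part) is only noted in a comment:

```tsx
import React, {useMemo} from 'react';

type Caption = {startSeconds: number; endSeconds: number; text: string};

const toTimestamp = (s: number) =>
  new Date(s * 1000).toISOString().substring(11, 23); // HH:MM:SS.mmm

const buildVtt = (captions: Caption[]) =>
  'WEBVTT\n\n' +
  captions
    .map((c) => `${toTimestamp(c.startSeconds)} --> ${toTimestamp(c.endSeconds)}\n${c.text}`)
    .join('\n\n');

export const HiddenCaptionTrack: React.FC<{captions: Caption[]}> = ({captions}) => {
  const vttUrl = useMemo(() => {
    const blob = new Blob([buildVtt(captions)], {type: 'text/vtt'});
    return URL.createObjectURL(blob);
  }, [captions]);
  // The media element stays empty and invisible; its only job is to expose the
  // caption track to the browser and to assistive technology.
  // Syncing its currentTime with the remotion player state is the unsolved part.
  return (
    <video muted style={{display: 'none'}}>
      <track default kind="captions" srcLang="en" src={vttUrl} />
    </video>
  );
};
```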
Caption Collection Options
Here are some ideas on how we can collect caption data from remotion users. As I stated earlier, my priorities are in support of Closed Captions, but these options would also benefit Open Captions. I'd love to hear some more ideas :)
Accept .vtts As CLI Arguments
Keeping it basic: we could just take any number of .vtts as arguments to remotion render and add them to the FFmpeg call.
Pros:
Cons:
- The user has to write a .vtt for the entire video, and they may not know how long their video is until it's rendered. The experience is kinda goofy: render the video, go back and write/generate the .vtt, then re-render w/ the .vtt.
Accept .vtts As Remotion Props
We could have the sequence-y (Audio/Composition/Sequence/Video) components accept .vtts. The render process could somehow patch these together to generate a full-video .vtt.
Pros:
- A .vtt can be provided for something like an audio file or video file.
Cons:
- The timestamps in the .vtts have to be offset: if a captioned bit of audio starts 20s worth of frames into the video, we'd have to add 20s to all of that audio's caption entries. Not a huge deal, but different caption file formats have different timestamp formats.
Accept Structured Caption JSON As Remotion Props
This is similar to the above (and perhaps in addition to it), but with this approach we would rely on the user to pass deserialized caption data as something like Array<{from: number, durationInFrames: number, caption: string}>. We could rely on raw user input and optionally expose caption parsers: <Audio captions={remotion.parseVtt(vttFileContent, videoConfig.fps)} /> (a sketch of such a parser follows after the options below).
Pros:
Cons:
Captions Component
We could create an explicit component for a series of captions over a time range, similar to Audio, accepting either or both of the props mentioned above. A special case might handle a single caption value.
Pros:
- More explicit than adding a captions prop to an otherwise empty <Sequence />.
Cons:
- It largely overlaps with Sequence. Is it worth expanding the package API?
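Riffing on the structured-caption-JSON option above, a caption parser could be quite small. A minimal sketch (parseVtt here is hypothetical, not an existing remotion API, and it ignores cue settings and styling):

```ts
type Caption = {from: number; durationInFrames: number; caption: string};

const toSeconds = (stamp: string) => {
  // "00:01:02.500" or "01:02.500" -> seconds
  const parts = stamp.split(':').map(Number);
  return parts.reduce((total, part) => total * 60 + part, 0);
};

export const parseVtt = (vttFileContent: string, fps: number): Caption[] => {
  const cues: Caption[] = [];
  // Cues are separated by blank lines; the first block is the "WEBVTT" header.
  for (const block of vttFileContent.split(/\r?\n\r?\n/)) {
    const lines = block.trim().split(/\r?\n/);
    const timingLine = lines.find((l) => l.includes('-->'));
    if (!timingLine) continue;
    const [start, end] = timingLine.split('-->').map((s) => s.trim().split(' ')[0]);
    const text = lines.slice(lines.indexOf(timingLine) + 1).join('\n');
    cues.push({
      from: Math.round(toSeconds(start) * fps),
      durationInFrames: Math.round((toSeconds(end) - toSeconds(start)) * fps),
      caption: text,
    });
  }
  return cues;
};
```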
Summary
Heh sorry, this kind of got away from me 😅 Kept having more thoughts based on research and rubber ducking while typing this out.
I'm really excited about the potential of remotion as a tool and would love to see it empowered with support for standard video accessibility capabilities 💃.
I'm only as experienced with subtitles and videos as the last few hours have made me, but I'm happy to try to contribute to some of this support. I can at least take a stab at the command line input I mentioned above, but would love to talk through the rest!
Cheers ♥
*: An Open Caption-style implementation may be helpful for remotion video development so that the user isn't reliant on the rendering process to debug caption issues, but I would be very hesitant to add "official" caption support to player without a proper answer for assistive technology.