-
Captions are super-important.
-
Also very important for engagement and social media content, since you never know whether the viewer has their audio on or off. @JonnyBurger said that they are working on the Recorder project, which will come with a solution for captions/subtitles. I'm looking forward to getting my hands on that, but in the meantime I'm wondering how others have solved this problem.
-
@gverri Our current stance:
-
Hey there, I've been looking for the same solution for a while and found a few good repos that seem worth trying, so I'd like to share them with you. To create subtitle files from the audio/video file:
To parse subtitle files:
Hope this helps. Let's hack on this and add subtitles to Remotion.
-
Here is my code guys.

Generate timestamped subtitle from audio

from faster_whisper import WhisperModel
import json
model_size = "medium"
model = WhisperModel(model_size)
def extract_subtitle(file_path, output_path):
"""Extract subtitle from file_path."""
# ===========================
# NOTE Extracting text here
# ===========================
segments, info = model.transcribe(
file_path,
word_timestamps=True,
append_punctuations="\"'“¿([{-",
prepend_punctuations="\"'.。,,!!??::”)]}、",
)
segments = list(segments)
wordlevel_info = []
for segment in segments:
for word in segment.words:
wordlevel_info.append(
{
"word": word.word,
"start": word.start,
"end": word.end,
"duration": word.end - word.start,
}
)
# ===========================
# NOTE Saving the subtitle
# ===========================
print(f"Saving subtitle to {output_path}")
with open(output_path, "w") as f:
json.dump(wordlevel_info, f)

Animate subtitles

import {TransitionSeries} from '@remotion/transitions';
import {Easing, interpolate, useCurrentFrame} from 'remotion';
interface TimestampedWord {
start: number;
end: number;
word: string;
duration: number;
}
interface TimestampedLine {
start: number;
end: number;
words: string;
data: TimestampedWord[];
}
interface HighlightedLine {
words: TimestampedWord[];
duration: number;
}
interface SubtitleAnimationProps {
timestampedWords: any;
maxCharacterByLine: number;
maxWordByLine: number;
}
export const SubtitlesAnimation = ({
timestampedWords,
maxCharacterByLine,
maxWordByLine,
}: SubtitleAnimationProps) => {
const cleanSubtitles = aggregateSubtitleWords(timestampedWords);
const frame = useCurrentFrame();
const cleanSubtitlesWithWhite = addWhite(cleanSubtitles);
const subtitleLines: TimestampedLine[] = generateSubtitleLines(
cleanSubtitlesWithWhite,
maxCharacterByLine,
maxWordByLine
);
const subtitlesWithHighlight: HighlightedLine[] =
generateWordHighlight(subtitleLines);
let startFrame = 0;
let endFrame = 0;
return (
<>
{subtitlesWithHighlight.map((lines, index) => {
endFrame += lines['duration'] * 30;
const scale = interpolate(frame, [startFrame, endFrame], [0, 1], {
easing: Easing.bounce,
extrapolateRight: 'clamp',
extrapolateLeft: 'clamp',
});
const opacity = interpolate(
frame,
[endFrame - 0.00001, endFrame],
[1, 0],
{
extrapolateLeft: 'clamp',
extrapolateRight: 'clamp',
}
);
startFrame = endFrame;
return (
<div
key={index}
style={{
fontSize: '65px',
color: 'black',
fontFamily: 'Futura',
position: 'absolute',
textTransform: 'uppercase',
opacity,
transform: `scale(${scale}) `,
}}
>
<span
style={{
backgroundColor: 'transparent',
borderRadius: '30px',
padding: '20px',
}}
>
{lines['words']}
</span>
</div>
);
})}
</>
);
};
const generateWordHighlight = (subtitleLines: TimestampedLine[]) => {
const subtitlesWithHighlight = [];
for (let i = 0; i < subtitleLines.length; i++) {
const line = subtitleLines[i];
const data = line.data;
for (let j = 0; j < data.length; j++) {
const lineWithHighlight = data.map((wordsMetadata, index) => {
return index == j ? (
<span
style={{
color: 'yellow',
textShadow: '#FC0 0 0 25px',
}}
>
{wordsMetadata.word}
</span>
) : (
<span
style={{
backgroundColor: 'transparent',
color: 'white',
// backgroundColor: 'black',
}}
>
{wordsMetadata.word}
</span>
);
});
let duration = data[j]['duration'];
if (duration == 0) {
duration = 0.0001;
}
subtitlesWithHighlight.push({
words: lineWithHighlight,
duration,
});
}
}
return subtitlesWithHighlight;
};
const generateSubtitleLines = (
subtitles: TimestampedWord[],
max_character_by_line: number,
maxWordByLine: number
): TimestampedLine[] => {
let lines: TimestampedLine[] = [];
let currentLine = '';
let currentLineDuration = 0;
let currentLineStart = 0;
let currentLineEnd = 0;
let data = [];
for (let i = 0; i < subtitles.length; i++) {
const word = subtitles[i].word;
const duration = subtitles[i].duration;
const start = subtitles[i].start;
const end = subtitles[i].end;
const condDuration = currentLineDuration + duration > 14;
const length = currentLine.length + word.length > max_character_by_line;
const condMaxWordByLine = data.length >= maxWordByLine;
if (
condDuration ||
condMaxWordByLine ||
length ||
i == subtitles.length - 1
) {
lines.push({
words: currentLine,
start: currentLineStart,
end: currentLineEnd,
data,
});
data = [];
data.push(subtitles[i]);
currentLine = word;
currentLineDuration = duration;
currentLineStart = start;
currentLineEnd = end;
} else {
data.push(subtitles[i]);
currentLine += ' ' + word;
currentLineDuration += duration;
currentLineEnd = end;
}
// If this was the last word, push the line
if (i == subtitles.length - 1) {
lines.push({
words: currentLine,
start: currentLineStart,
end: currentLineEnd,
data,
});
}
}
return lines;
};
function aggregateSubtitleWords(subtitles: TimestampedWord[]) {
let aggregatedWords: TimestampedWord[] = [];
let temp: TimestampedWord;
let inAggregation = false;
for (let i = 0; i < subtitles.length; i++) {
// NOTE Current word
const start = subtitles[i].start;
const duration = subtitles[i].duration;
const word = subtitles[i].word;
// NOTE Next word
let nextWord;
let nextEnd;
let nextDuration;
if (i == subtitles.length - 1) {
nextWord = '';
nextEnd = 0;
nextDuration = 0;
} else {
nextWord = subtitles[i + 1]['word'];
nextEnd = subtitles[i + 1]['end'];
nextDuration = subtitles[i + 1]['duration'];
}
if (
nextWord.startsWith('-') ||
nextWord.startsWith("'") ||
nextWord.startsWith(',') ||
nextWord.startsWith('.')
) {
if (inAggregation) {
temp = {
word: temp!.word + nextWord,
start: temp!.start,
end: nextEnd,
duration: temp!.duration + nextDuration,
};
// If this was the second-to-last word and the next one got aggregated, push the word and stop the loop
if (i == subtitles.length - 2) {
console.log(temp);
aggregatedWords.push(temp);
break;
}
} else {
inAggregation = true;
temp = {
word: word + nextWord,
start: start,
end: nextEnd,
duration: duration + nextDuration,
};
// If this was the second-to-last word and the next one got aggregated, push the word and stop the loop
if (i == subtitles.length - 2) {
aggregatedWords.push(temp);
break;
}
}
} else {
if (inAggregation) {
aggregatedWords.push(temp!);
inAggregation = false;
} else {
aggregatedWords.push(subtitles[i]);
}
}
}
return aggregatedWords;
}
function addWhite(subtitles: TimestampedWord[]) {
const newSubtitles = [];
for (let i = 0; i < subtitles.length; i++) {
if (i < subtitles.length - 1) {
if (subtitles[i]['end'] != subtitles[i + 1]['start']) {
newSubtitles.push(subtitles[i]);
newSubtitles.push({
word: '',
start: subtitles[i]['end'],
end: subtitles[i + 1]['start'],
duration: subtitles[i + 1]['start'] - subtitles[i]['end'],
});
} else {
newSubtitles.push(subtitles[i]);
}
} else {
newSubtitles.push(subtitles[i]);
}
}
return newSubtitles;
}
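For context, here's a rough sketch of how this could be wired into a composition. The JSON import path, fps, dimensions, and duration below are placeholders I made up, not part of the code above:

```tsx
import React from 'react';
import {Composition} from 'remotion';
// Assumed path: the word-level JSON written by the Python script above.
import timestampedWords from './subtitles.json';
import {SubtitlesAnimation} from './SubtitlesAnimation';

export const RemotionRoot: React.FC = () => {
  return (
    <Composition
      id="Subtitled"
      component={SubtitlesAnimation}
      durationInFrames={30 * 60} // assuming a 60s video at 30fps
      fps={30}
      width={1080}
      height={1920}
      defaultProps={{
        timestampedWords,
        maxCharacterByLine: 20,
        maxWordByLine: 4,
      }}
    />
  );
};
```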
-
Here is the render: Enregistrement.de.l.ecran.2024-02-15.a.15.16.28.mov
-
Check out my new package: you only need SRT text data, then use it with automatically created sequences.
-
Remotion implements the slider effect: english3_subtitled.mp4
-
#16 describes the need for captioning support for remotion to make its video outputs more accessible. In this discussion, I describe potential use cases for generated captioned remotion videos and some potential implementation paths to satisfy those use cases.
I'll go over definitions, remotion captioning output scenarios, and caption collection options.
Definitions
Want to set some terms so we're all on the same page.
Caption vs Subtitle
The term "caption" typically refers to text accompanying a video that describes the content of the video. This can include spoken word, textual descriptions of non-spoken audio (sound effects, music), and sometimes textual descriptions of visual content.
The term "subtitle" typically refers to the textual translation of spoken word or displayed text that's in a different language than what the presumptive viewer understands. This is common for multi-language films or for films distributed to audiences that do not understand the primary language of the film.
For simplicity, I'll use Caption throughout since the concerns for video output are the same for both.
Open Captions vs Closed Captions
"Open Captions" refer to text that is baked in to the image video output. Viewers cannot configure open captions since the text is embedded in the video frames. Additionally, the textual information is lost and cannot be read by assistive technology. The compositor of the video frames is responsible for setting and styling the displayed text.
"Closed Captions" refer to text that is embedded as metadata in a video. The metadata is any number of caption files (one for each desired locale) that tie text to timestamp ranges of the video when the text should be displayed. Viewers can enable captions for a locale (such as English or Spanish) or disable them. Video players can read a wide number of caption file formats and handle styling and displaying the text at the correct time. Assistive technology can also consume the caption data and provide it in alternative manners, such as through Text-To-Speech.
Open Captions were developed before video technology improved to allow things like Closed Captions. They can improve video accessibility for sighted people, but since the textual representations aren't available outside the visual medium, they exclude a great many individuals. In nearly all cases, Closed Captions should be the preferred method for capturing textual representation of video or audio content. For this reason, I primarily focus on Closed Captions for the purposes of this discussion.*
Caption File Formats
There are a lot of different file formats for structured captions, such as .srt, .sbv, .vtt, and more. For brevity, I'll just write .vtt instead of "structured caption file"; assume I mean all such file types. They're all pretty similar and I think we can treat them the same for the purposes of this discussion.
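For readers who haven't seen one, a minimal .vtt (WebVTT) file looks like this; the cue text and timings are made up:

```
WEBVTT

00:00:00.000 --> 00:00:02.500
Captions are super-important.

00:00:02.500 --> 00:00:05.000
You never know if the viewer has audio on.
```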
Remotion Captioning Output Scenarios
The following describes what I think are the end-state captioning use cases for remotion output and their respective levels of effort. Each assumes we've already handled providing or parsing caption data and will be able to access the desired caption for a given frame and, for closed captions, the complete video's .vtt. Considerations for getting the caption content itself are covered later.
Open Captions For Video Output With @remotion/renderer
Assuming an API for fetching the caption of a given frame, this won't be much more work than absolutely positioning the caption text. Remotion could provide default styling for captions and make it configurable.
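As a rough sketch of what that could look like (getCaptionAtFrame and the caption data shape are my assumptions here, not an existing remotion API):

```tsx
import React from 'react';
import {AbsoluteFill, useCurrentFrame} from 'remotion';

type Caption = {from: number; durationInFrames: number; caption: string};

// Hypothetical helper: find the caption that covers the current frame.
const getCaptionAtFrame = (captions: Caption[], frame: number) =>
  captions.find((c) => frame >= c.from && frame < c.from + c.durationInFrames);

export const OpenCaptions: React.FC<{captions: Caption[]}> = ({captions}) => {
  const frame = useCurrentFrame();
  const current = getCaptionAtFrame(captions, frame);
  if (!current) return null;
  return (
    <AbsoluteFill style={{justifyContent: 'flex-end', alignItems: 'center'}}>
      <div style={{fontSize: 48, color: 'white', textShadow: '0 0 8px black', marginBottom: 60}}>
        {current.caption}
      </div>
    </AbsoluteFill>
  );
};
```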
Open Captions For Video Output With @remotion/player
No changes would be necessary beyond "Open Captions For Video Output With @remotion/renderer".
Closed Captions For Video Output With @remotion/renderer
Remotion would not be responsible for rendering closed captions. Instead, it would simply need to pass any generated/provided caption files to FFmpeg to be embedded in the video. I believe this should be the highest priority.
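For illustration, muxing a caption file into an MP4 is essentially one FFmpeg invocation; something like the following, where the file paths are placeholders and mov_text is the embedded subtitle format MP4 containers use:

```ts
import {execFileSync} from 'node:child_process';

// Mux an existing caption file into the rendered video as a closed-caption track.
execFileSync('ffmpeg', [
  '-i', 'out/video.mp4',   // rendered remotion output (placeholder path)
  '-i', 'captions.en.vtt', // user-provided caption file (placeholder path)
  '-c:v', 'copy',          // don't re-encode video
  '-c:a', 'copy',          // don't re-encode audio
  '-c:s', 'mov_text',      // convert the cues to MP4's embedded subtitle format
  '-metadata:s:s:0', 'language=eng',
  'out/video-captioned.mp4',
]);
```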
Closed Captions For Video Output With @remotion/player
This one's quite tricky and it might be impossible to do correctly, at least in the sense of conventional caption accessibility. Videos rendered with @remotion/player are essentially just animated DOM elements and do not use any true video primitives. Whereas a <video /> could have child <track /> elements with associated captions, browsers offer no similar captioning support for arbitrary and controllable sequences of DOM manipulations.
The best I can think to do right now is to have the remotion player element include an empty <video /> (or, more simply, an empty <audio />) that's synchronized with the player state and includes the generated .vtts. This might work for fixed-length videos with fixed caption content. However, since the remotion video is dynamic, we'd need to regenerate the fake media and captions on the fly. Since we don't have control over how the consuming assistive technology reacts to swapped-out media and captions, we probably can't ensure a good experience.
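To make the idea concrete, here's a rough sketch of generating a VTT blob and attaching it to a hidden media element. All names are illustrative, and keeping the element's currentTime in sync with the player state (the hard part) is only noted in a comment:

```tsx
import React, {useMemo} from 'react';

type Caption = {startSeconds: number; endSeconds: number; text: string};

const toTimestamp = (s: number) =>
  new Date(s * 1000).toISOString().substring(11, 23); // HH:MM:SS.mmm

const buildVtt = (captions: Caption[]) =>
  'WEBVTT\n\n' +
  captions
    .map((c) => `${toTimestamp(c.startSeconds)} --> ${toTimestamp(c.endSeconds)}\n${c.text}`)
    .join('\n\n');

export const HiddenCaptionTrack: React.FC<{captions: Caption[]}> = ({captions}) => {
  const vttUrl = useMemo(() => {
    const blob = new Blob([buildVtt(captions)], {type: 'text/vtt'});
    return URL.createObjectURL(blob);
  }, [captions]);
  // The media element stays empty and invisible; its only job is to expose the
  // caption track to the browser and to assistive technology.
  // Syncing its currentTime with the remotion player state is the unsolved part.
  return (
    <video muted style={{display: 'none'}}>
      <track default kind="captions" srcLang="en" src={vttUrl} />
    </video>
  );
};
```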
Caption Collection Options
Here are some ideas on how we can collect caption data from remotion users. As I stated earlier, my priorities are in support of Closed Captions, but these options would also benefit Open Captions. I'd love to hear some more ideas :)
Accept .vtts As CLI Arguments
Keeping it basic: we could just take any number of .vtts as arguments to remotion render and add them to the FFmpeg call.
Pros:
Cons:
- The user has to write a .vtt for the entire video, and they may not know how long their video is until it's rendered. The experience is kinda goofy: render the video, go back and write/generate the .vtt, then re-render w/ the .vtt.
Accept .vtts As Remotion Props
We could have the sequence-y (Audio/Composition/Sequence/Video) components accept .vtts. The render process could somehow patch these together to generate a full-video .vtt.
Pros:
- A .vtt can be provided for something like an audio file or video file.
Cons:
- The timestamps in the .vtts have to be offset: if a captioned bit of audio starts 20s worth of frames into the video, we'd have to add 20s to all of that audio's caption entries. Not a huge deal, but different caption file formats have different timestamp formats.
Accept Structured Caption JSON As Remotion Props
This is similar to the above (and perhaps in addition to it), but with this approach we would rely on the user to pass deserialized caption data as something like Array<{from: number, durationInFrames: number, caption: string}>. We could rely on raw user input and optionally expose caption parsers: <Audio captions={remotion.parseVtt(vttFileContent, videoConfig.fps)} /> (a sketch of such a parser follows after the options below).
Pros:
Cons:
Captions Component
We could create an explicit component for a series of captions over a time range, similar to Audio, accepting either or both of the props mentioned above. A special case might handle a single caption value.
Pros:
- More explicit than adding a captions prop to an otherwise empty <Sequence />.
Cons:
- It largely overlaps with Sequence. Is it worth expanding the package API?
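Riffing on the structured-caption-JSON option above, a caption parser could be quite small. A minimal sketch (parseVtt here is hypothetical, not an existing remotion API, and it ignores cue settings and styling):

```ts
type Caption = {from: number; durationInFrames: number; caption: string};

const toSeconds = (stamp: string) => {
  // "00:01:02.500" or "01:02.500" -> seconds
  const parts = stamp.split(':').map(Number);
  return parts.reduce((total, part) => total * 60 + part, 0);
};

export const parseVtt = (vttFileContent: string, fps: number): Caption[] => {
  const cues: Caption[] = [];
  // Cues are separated by blank lines; the first block is the "WEBVTT" header.
  for (const block of vttFileContent.split(/\r?\n\r?\n/)) {
    const lines = block.trim().split(/\r?\n/);
    const timingLine = lines.find((l) => l.includes('-->'));
    if (!timingLine) continue;
    const [start, end] = timingLine.split('-->').map((s) => s.trim().split(' ')[0]);
    const text = lines.slice(lines.indexOf(timingLine) + 1).join('\n');
    cues.push({
      from: Math.round(toSeconds(start) * fps),
      durationInFrames: Math.round((toSeconds(end) - toSeconds(start)) * fps),
      caption: text,
    });
  }
  return cues;
};
```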
Summary
Heh sorry, this kind of got away from me 😅 Kept having more thoughts based on research and rubber ducking while typing this out.
I'm really excited about the potential of remotion as a tool and would love to see it empowered with support for standard video accessibility capabilities 💃.
I'm only as experienced with subtitles and videos as the last few hours have made me, but I'm happy to try to contribute to some of this support. I can at least take a stab at the command line input I mentioned above, but would love to talk through the rest!
Cheers ♥
*: An Open Caption-style implementation may be helpful for remotion video development so that the user isn't reliant on the rendering process to debug caption issues, but I would be very hesitant to add "official" caption support to player without a proper answer for assistive technology.