tl;dr I created Manim Voiceover, a plugin for the Python math animation library Manim that lets you add voiceovers to your Manim videos directly in Python, using either AI voices or your own recordings.

This makes it possible to create “fully code-driven” educational videos in pure Python. Videos can be developed like software, taking advantage of version-controlled, git-based workflows (i.e. no more Final.final.final.mp4 :).

It also makes it possible to use AI to automate all sorts of things. For example, I have created a pipeline for translating videos into other languages automatically with i18n (gettext) and machine translation (DeepL).

Follow my Twitter to get updates on Manim Voiceover.

A little background

For those who are not familiar, Manim is a Python library, created by Grant Sanderson (a.k.a. 3blue1brown), that lets you create animations programmatically. His visual explainers are highly acclaimed and breathtakingly good (to see an example, click here for his introduction to neural networks).

Manim was originally built for animating math, but it is already being used in other domains such as physics, chemistry, and computer science.

Creating any video is a very time-consuming process. Creating an explainer that needs to be mathematically exact is even more so, because the visuals often need to be precise to convey knowledge efficiently. That is why Manim was created: to automate the animation process. It turns out programming mathematical structures is easier than trying to animate them in a video editor.

However, this results in a workflow that is split between the text editor (writing Python code) and the video editor (editing the final video), with a lot of back and forth in between. The main reason is that the animation needs to be synced with the voiceovers, which are recorded separately.

In this post, I will try to demonstrate how we can take this even further by making voiceovers a part of the code itself with Manim Voiceover, and why this is so powerful.

The traditional workflow

Creating a video with Manim is very tedious. The steps involved are usually as follows:

  1. Plan: Come up with a script and a screenplay.
  2. Record: Record the voiceover with a microphone.
  3. Animate: Write the Python code for each scene that will generate the animation videos.
  4. Edit: Overlay and synchronize the voiceover and animations in a video editor, such as Adobe Premiere.

The workflow is often not linear. The average video requires you to rewrite, re-record, re-animate and re-sync multiple scenes:

The less experience you have making videos, the more takes you will need. Creating such an explainer has a very steep learning curve. It can take up to 1 month for a beginner to create their first few minutes of video.

Enter Manim Voiceover

I am a developer by trade, and when I first tried to create a video with the traditional workflow, I found it harder than it should be. We developers are spoiled, because we get to enjoy automating our work. Imagine that you had to manually compile your code using a hex editor every time you made a change. That is what it felt like to create a video using a video editor. The smallest change in the script meant that I had to re-animate, re-record and re-sync parts of the video, the main culprit being the voiceover.

To overcome this, I thought of a simple idea: create an API that lets you add voiceovers directly in Python. Manim Voiceover does exactly that, providing a comprehensive framework for automating voiceovers. Once the entire production can be done in Python, editing in the video editor becomes mostly unnecessary. The workflow becomes:

  1. Plan: Same as before.
  2. Animate: Develop the video with an AI-generated voiceover, all in Python.
  3. Record: When the final revision is ready, record the actual voiceover with Manim Voiceover’s recorder utility. The audio is transcribed with timestamps and inserted at the right times automatically.

A little demo of how a video looks at the end of step (2):

And watch below to see how it looks at the end of step (3), with my own voice:

I explain why this is so powerful below:

Zero-cost revisions

With the traditional workflow, modifying the script has a cost, because you need to re-record the voiceover and readjust the scenes in the video editor. Here, making modifications is as easy as renaming a variable, since the AI voiceover is regenerated from the code automatically. This saves a lot of time in the production process:

This lets videos created with Manim be “fully code-driven” and take advantage of open source, collaborative, git-based workflows. No manual video editing is needed, and there is no need to pay for overpriced video editing software:

(Or at least a drastically reduced need for it.)

Increased production speed

From personal experience and talking to others who have used it, Manim Voiceover increases production speed by a factor of at least 2x, compared to manual recording and editing.

Note: The current major bottlenecks are developing the scene itself and waiting for the render. Regarding render speed: Manim CE’s Cairo renderer is much slower than ManimGL’s OpenGL renderer. Manim Voiceover currently only supports Manim CE, but adding support for ManimGL is on my roadmap.

The API in a nutshell

This all sounds great, but what does it look like in practice? Let’s take a look at the API. Here is a “Hello World” example for Manim, drawing a circle:

from manim import *

class Example(Scene):
    def construct(self):
        circle = Circle()
        self.play(Create(circle))
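
If you have never rendered a Manim scene before: assuming this file is saved as example.py, running manim -pql example.py Example renders the scene in low quality and previews it.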

Here is the same scene, with a voiceover that uses Google Translate’s free text-to-speech service:

from manim import *
from manim_voiceover import VoiceoverScene
from manim_voiceover.services.gtts import GTTSService

class VoiceoverExample(VoiceoverScene):
    def construct(self):
        self.set_speech_service(GTTSService(lang="en"))

        circle = Circle()
        with self.voiceover(text="This circle is drawn as I speak."):
            self.play(Create(circle))

Notice the with statement. You can chain such blocks back to back, and Manim will vocalize them in sequence:

with self.voiceover(text="This circle is drawn as I speak."):
    self.play(Create(circle))

with self.voiceover(text="Let's shift it to the left 2 units."):
    self.play(circle.animate.shift(2 * LEFT))
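
The voiceover block also yields a tracker that reports the duration of the generated audio. Here is a minimal sketch of using it to stretch an animation so that it lasts exactly as long as the narration (see the documentation for the exact interface):

with self.voiceover(text="This circle is drawn as I speak.") as tracker:
    # Stretch the animation to match the narration length.
    self.play(Create(circle), run_time=tracker.duration)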

The code for videos made with Manim Voiceover generally looks cleaner, since it is compartmentalized into blocks with voiceovers acting as annotations on top of each block.

See how this is rendered:

Record

To record an actual voiceover, you simply import RecorderService and swap a single line of code:

from manim_voiceover.services.recorder import RecorderService

# self.set_speech_service(GTTSService(lang="en"))  # Comment this out
self.set_speech_service(RecorderService())          # Add this line

Currently, rendering with RecorderService starts up a voice recorder implemented as a command line utility. The recorder prompts you to record each voiceover in the scene one by one and inserts audio at appropriate times. In the future, a web app could make this process even more seamless.

Check out the documentation for more examples and the API specification.

Auto-translating videos

Having a machine-readable source for voiceovers unlocks another superpower: automatic translation. Manim Voiceover can automatically translate your videos into other languages and even generate subtitles in them. This lets educational content creators reach a much wider audience.

Here is an example of the demo translated to Turkish and rendered with my own voice:

To create this video, I followed these steps:

  1. I wrapped translatable strings in my demo inside _() per gettext convention. For example, I changed text="Hey Manim Community!" to text=_("Hey Manim Community!"). (A sketch of what this looks like follows this list.)
  2. I ran manim_translate blog-translation-demo.py -s en -t tr -d blog-translation-demo, which created the locale folder, called DeepL’s API to translate the strings, and saved them under locale/tr/LC_MESSAGES/blog-translation-demo.po.
    • Here, -s stands for source language,
    • -t stands for target language,
    • and -d stands for the gettext domain.
  3. I edited the .po file manually, because the translation was still a bit off.
  4. I ran manim_render_translation blog-translation-demo.py -s BlogTranslationDemo -d blog-translation-demo -l tr -qh, which rendered the final video.
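
To make step (1) concrete, here is a minimal sketch of what a translatable scene can look like. I am binding _() with Python’s built-in gettext module here; the translation docs describe the recommended way to set this up, so treat the class name, the animation, and this wiring as placeholders:

import gettext

from manim import *
from manim_voiceover import VoiceoverScene
from manim_voiceover.services.gtts import GTTSService

# Bind _() to the compiled catalog for the "blog-translation-demo" domain.
# fallback=True returns the original English strings when no catalog is
# found, so the same code still renders in English.
translation = gettext.translation(
    "blog-translation-demo", localedir="locale", languages=["tr"], fallback=True
)
_ = translation.gettext

class TranslatableExample(VoiceoverScene):
    def construct(self):
        self.set_speech_service(GTTSService(lang="tr"))
        # The narration is now a translatable string; gettext swaps in
        # the Turkish text from the .po file at render time.
        with self.voiceover(text=_("Hey Manim Community!")):
            self.play(Create(Circle()))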

Check out the translation page in the docs for more details. You can also find the source code for this demo here.

Here is a Japanese translation, created the same way but with an AI voiceover:

Note that I have very little knowledge of Japanese, so the translation might be off, but I was still able to create it with services that are freely available online. This foreshadows how communities could create and translate educational videos in the future:

  1. Video is created using Manim/Manim Voiceover and is open-sourced.
  2. The repo is connected to a CI/CD service that tracks the latest changes, re-renders and deploys the video to a permalink.
  3. When a translation in a language is requested, said service automatically generates it using AI translation and voiceover.
  4. The community can then review the translation and voiceover, make changes if necessary, and record a human voiceover if they want to.
  5. All the different versions and translations of the video are seamlessly deployed, similar to how ReadTheDocs deploys software documentation.

That is the main idea of my next project, GitMovie. If this excites you, leave your email address on the form on the website to get notified when it launches.

Conclusion

While using Manim Voiceover might seem tedious to those who already pair Manim with a video editor, I guarantee that it is overall more convenient for adding voiceovers to scenes. Feel free to create an issue if you have a use case that is currently not covered by Manim Voiceover.

Even more interesting, Manim Voiceover can give AI models such as GPT-4 a convenient way to generate mathematically precise videos. Khan Academy has recently debuted a private release of Khanmigo, their GPT-4-based AI teacher. Imagine that Khanmigo could create a 3blue1brown-level explainer in a matter of minutes, for any question you ask! (I already tried to make GPT-4 output Manim code, but it is not quite there yet.)

To see why this is powerful, check out my video rendering of Euclid’s Elements using Manim Voiceover (part 1):

This video itself is pedagogically not very effective because books do not necessarily translate into good video scripts. But it serves as preparation for the point that I wanted to make with this post:

Having a machine-readable source and being able to program voiceovers allowed me to generate over 10 hours of video in just a few days. In a few years, AI models will make such approaches 1000 times easier, faster and cheaper for everyone.

Imagine being able to auto-generate the “perfect explainer” for every article on Wikipedia, every paper on arXiv, every technical specification that would otherwise be too dense. In every language, available instantly around the globe. Universal knowledge, accessible to anyone who is willing to learn. Thanks to 3blue1brown, Manim and similar open source projects, all of this will be just a click away!