Ever wished you could turn a simple audio recording into a polished video—with subtitles in multiple languages—all automatically? Whether you’re a content creator, educator, or localizer, you probably know how time-consuming it is to manually transcribe, translate, and sync subtitles.
What if there was a tool that could handle all of this for you in one go?
DeovidLang does exactly that. It takes your audio files, transcribes them using OpenAI’s Whisper, translates the text, generates bilingual subtitles, adds image overlays at specific timestamps, and produces a ready-to-share MP4 video.
DeovidLang automates the entire video creation pipeline: transcription → translation → subtitle generation → image overlay → final video output. All from a single Python script!
What You’ll Learn
- How DeovidLang works under the hood
- How to set up and run the tool
- How to use different Whisper models
- How to customize language and output settings
- Real-world use cases for content creators
Key Features
🎙️ Audio Transcription
Powered by OpenAI’s Whisper model, DeovidLang accurately transcribes audio in multiple languages. Choose from tiny (fastest) to large (most accurate) models.
🌍 Multi-Language Translation
Translate transcribed text into your desired language. Perfect for reaching international audiences with localized content.
📝 Bilingual Subtitles
Generate SRT subtitle files showing both original and translated text—ideal for language learners and multilingual audiences.
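To make the bilingual format concrete, here is a minimal sketch of how paired cues could be written out as SRT text. The function names and the `(start, end, original, translated)` tuple layout are assumptions chosen for illustration; they are not taken from DeovidLang's source.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def bilingual_srt(cues):
    """Build SRT text from (start, end, original, translated) tuples.

    Each cue shows the original line directly above its translation.
    """
    blocks = []
    for index, (start, end, original, translated) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{original}\n{translated}\n"
        )
    # SRT cues are separated by a blank line.
    return "\n".join(blocks)
```

Calling `bilingual_srt([(0, 1.5, "Привет", "Hello")])` yields a single numbered cue with both lines under one timestamp range.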
🖼️ Image Overlay
Add images at specific timestamps during video playback. Great for adding slides, diagrams, or branding to your videos.
🎬 Video Generation
FFmpeg combines audio, subtitles, and image overlays into a single MP4 file. Configurable resolution and format options.
📁 Automated Organization
Output files are automatically organized in date-based directories, keeping your project structured and easy to navigate.
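A date-based layout like this takes only a few lines with the standard library. The sketch below assumes an ISO-formatted folder name (`YYYY-MM-DD`), matching the `output/2024-01-15/` tree shown under Example Output; the helper name `dated_output_dir` is hypothetical.

```python
from datetime import date
from pathlib import Path


def dated_output_dir(base: Path, day: date) -> Path:
    """Return base/YYYY-MM-DD, creating the directory if it does not exist."""
    out_dir = base / day.isoformat()
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir
```

For a run happening today, `dated_output_dir(Path("output"), date.today())` would give the folder to write the SRT and MP4 files into.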
How It Works
DeovidLang follows a straightforward pipeline:
Audio File → Whisper Transcription → Translation → SRT Subtitles → Image Overlay → Final MP4 Video
Step-by-Step Process
- Input: You provide an audio file (M4A, MP3, WAV, etc.)
- Transcription: Whisper converts speech to text
- Translation: Translated text is generated in your target language
- Subtitle Generation: Both original and translated text become SRT files
- Image Overlay: Specified images appear at timestamps you define
- Video Assembly: FFmpeg combines everything into the final MP4
Image overlays happen at specific timestamps—perfect for matching slides to narration in presentations or tutorials.
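One way to get timed overlays out of FFmpeg is its `overlay` filter combined with an `enable='between(t,START,END)'` expression, which switches the overlay on only inside a time window. The sketch below builds such a filter graph and a matching command line. It is an illustration under stated assumptions, not DeovidLang's actual FFmpeg invocation: the black background source, the input ordering, and both function names are hypothetical.

```python
def overlay_filtergraph(windows, first_image_input=2, base="0:v"):
    """Chain one overlay per image, each enabled only in its time window.

    windows: list of (start_seconds, end_seconds), one per image input.
    Returns (filtergraph, label_of_final_video_stream).
    """
    parts = []
    current = base
    for i, (start, end) in enumerate(windows):
        out = f"v{i + 1}"
        parts.append(
            f"[{current}][{first_image_input + i}:v]"
            f"overlay=enable='between(t,{start},{end})'[{out}]"
        )
        current = out
    return ";".join(parts), current


def ffmpeg_command(audio, images, windows, output, size="1280x720"):
    """Build an argv list: black background + audio + timed image overlays."""
    if not images or len(images) != len(windows):
        raise ValueError("need one (start, end) window per image")
    graph, label = overlay_filtergraph(windows)
    # Input 0: generated black background; input 1: the audio track.
    cmd = ["ffmpeg", "-y", "-f", "lavfi", "-i", f"color=c=black:s={size}", "-i", audio]
    for image in images:  # inputs 2, 3, ... are the overlay images
        cmd += ["-i", image]
    # -shortest stops encoding when the audio ends (the background is endless).
    cmd += ["-filter_complex", graph, "-map", f"[{label}]", "-map", "1:a", "-shortest", output]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Burning in subtitles would mean appending FFmpeg's `subtitles` filter to the end of the chain, which is left out here to keep the example short.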
Installation
Step 1: Install System Dependencies
FFmpeg is required for video processing:
# Ubuntu/Debian
sudo apt install ffmpeg
# macOS (Homebrew)
brew install ffmpeg
# Windows: download from https://ffmpeg.org/download.html
# or use winget:
winget install ffmpeg
Step 2: Install Python Dependencies
pip install pydub pillow
Step 3: Install Whisper
pip install git+https://github.com/openai/whisper.git
Step 4: Clone the Repository
git clone https://gitlab.com/krafi/deovidlang.git
cd deovidlang
Usage
Basic Command
python deovidlang.py <directory> <audio_file> --model <model> --language <language> --task <task>
Example
python deovidlang.py myproject myproject/speech.m4a --model medium --language en --task translate
Parameters
| Parameter | Options | Default | Description |
|---|---|---|---|
| --model | tiny, base, small, medium, large | tiny | Whisper model size |
| --language | Language code (en, ru, es, etc.) | ru | Source audio language |
| --task | transcribe, translate, both | both | Operation to perform |
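For readers who want to see how such an interface maps onto Python, here is a minimal `argparse` sketch that mirrors the positional arguments, options, and defaults from the table above. It is a reconstruction for illustration only; the real `deovidlang.py` may define its arguments differently, and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command-line parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(prog="deovidlang.py")
    parser.add_argument("directory", help="project directory (assumed meaning)")
    parser.add_argument("audio_file", help="path to the input audio file")
    parser.add_argument("--model", default="tiny",
                        choices=["tiny", "base", "small", "medium", "large"],
                        help="Whisper model size")
    parser.add_argument("--language", default="ru",
                        help="source audio language code, e.g. en, ru, es")
    parser.add_argument("--task", default="both",
                        choices=["transcribe", "translate", "both"],
                        help="operation to perform")
    return parser
```

With this parser, `build_parser().parse_args(["myproject", "myproject/speech.m4a"])` falls back to `tiny`, `ru`, and `both`, while an unknown model name is rejected before any processing starts.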
Whisper Model Comparison
| Model | Parameters | Download size (approx.) | Speed | Accuracy |
|---|---|---|---|---|
| tiny | 39 M | 75 MB | Fastest | Lower |
| base | 74 M | 140 MB | Fast | Good |
| small | 244 M | 460 MB | Medium | Better |
| medium | 769 M | 1.5 GB | Slow | High |
| large | 1550 M | 2.9 GB | Slowest | Highest |
Start with “tiny” for quick testing. Use “medium” or “large” for production-quality transcriptions.
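Switching models comes down to the name passed to Whisper's loader. The sketch below uses the `openai-whisper` package's `load_model` and `transcribe` calls; the two wrapper functions are hypothetical helpers, not part of DeovidLang. Note that Whisper's own `translate` task produces English text, so translation into other languages would have to come from a separate step.

```python
def transcribe_segments(model, audio_path, language="ru", task="transcribe"):
    """Run a Whisper-style model and return (start, end, text) tuples."""
    result = model.transcribe(audio_path, language=language, task=task)
    return [(seg["start"], seg["end"], seg["text"].strip())
            for seg in result["segments"]]


def load_and_transcribe(audio_path, model_name="tiny", language="ru"):
    """Load a Whisper model by name and transcribe a single audio file."""
    import whisper  # imported lazily; requires the openai-whisper package

    # The model weights are downloaded on first use and cached afterwards.
    model = whisper.load_model(model_name)
    return transcribe_segments(model, audio_path, language=language)
```

Running `load_and_transcribe("myproject/speech.m4a", model_name="medium", language="en")` trades speed for accuracy compared with the `tiny` default.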
Real-World Use Cases
1. Educational Content Localization
Create bilingual videos for language learners. Transcribe your English lecture, translate to Spanish, and generate subtitles showing both languages simultaneously.
2. Podcast to Video
Turn podcast episodes into YouTube videos. Add cover art or branding images at the beginning, and let DeovidLang handle the rest.
3. Tutorial Video Creation
Record your voiceover, then automatically generate subtitles and create a professional video with slides matching your narration.
4. Accessibility
Make your content accessible to deaf or hard-of-hearing viewers with accurate auto-generated subtitles in multiple languages.
Example Output
After running DeovidLang, you’ll get:
output/
├── 2024-01-15/
│ ├── original.srt # Original language subtitles
│ ├── translated.srt # Translated subtitles
│ └── final.mp4 # Final video with everything
The final MP4 includes:
- Your original audio
- Image overlays at specified timestamps
- Burned-in bilingual subtitles
Troubleshooting
FFmpeg not found?
- Verify FFmpeg is installed: ffmpeg -version
- Add FFmpeg to your system PATH
- Restart your terminal
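The same check can be done from Python before starting a long run. This is a small sketch using only the standard library; the function name `ffmpeg_version` is hypothetical.

```python
import shutil
import subprocess


def ffmpeg_version(executable="ffmpeg"):
    """Return the first line of `ffmpeg -version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run([path, "-version"],
                            capture_output=True, text=True, check=False)
    return result.stdout.splitlines()[0] if result.stdout else None
```

If this returns `None`, the terminal that launched Python cannot see FFmpeg, which usually points to a PATH problem rather than a broken install.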
Whisper model download slow?
The first run downloads the selected model, from under 100 MB for tiny up to about 2.9 GB for large. Use a stable internet connection; the model is cached for subsequent runs.
Subtitles not showing in video?
- Check that SRT files were generated
- Verify image timestamps don’t overlap
- Try a different output format
Audio quality issues?
Ensure your input audio is clear. Whisper works best with:
- Minimal background noise
- Clear speech
- Sample rate of 16 kHz or higher
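To check the sample rate before feeding a file in, the standard library's `wave` module is enough for WAV input. The helper below is a hypothetical sketch that only handles WAV; for M4A or MP3 you would need a decoder such as pydub, which is already among the dependencies.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate
```

A first element below 16000 suggests the recording may be worth re-exporting at a higher sample rate.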
Why This Project is Useful
DeovidLang solves several pain points for content creators:
- Saves Hours: Manual transcription takes 4-6x the audio length. DeovidLang does it in minutes.
- No Expensive Tools: No need for Adobe Premiere, Final Cut, or subscription services.
- Multilingual Ready: Reach global audiences with translated subtitles.
- Automated Workflow: One command handles the entire pipeline.
Combine DeovidLang with my other project WhisperWeb (covered in a previous blog) for a complete audio-to-video workflow!
Conclusion
DeovidLang is a powerful yet simple tool that automates video creation from audio. Whether you’re an educator, podcaster, or content creator, it handles the heavy lifting (transcription, translation, subtitles, and video assembly) so you can focus on creating content.
Give it a try and transform your audio files into shareable videos in minutes!
Source Code
View and contribute to the project: DeovidLang on GitLab (https://gitlab.com/krafi/deovidlang)
Happy automating!