
Offline Transcribing and Summarizing Audio with Whisper, Phi, FastAPI, Docker on NVIDIA GPU

In this blog post, we’ll dive into how I built a comprehensive offline audio transcription and summarization system using OpenAI Whisper (medium) for transcription, the Microsoft Phi 3.5 Large Language Model (LLM) for summarization, FastAPI for the REST API, and Docker for containerization. Audio content can be dense and long, so having an automated way to extract transcripts and high-level summaries can be a game-changer for meetings, interviews, podcasts, and beyond!

GitHub – askaresh/LocalAudioTran-LLM-Summar: Offline Audio Transcription (Whisper) and LLM-based (Phi-3.5) Summarization

Why Use LLMs for Audio Summaries?

Traditional speech-to-text solutions focus on generating transcripts. However, reading pages of raw transcript text is time-consuming, and conversational text that made sense during a live call rarely reads well afterward. Bringing an LLM-based summarizer into the pipeline changes the entire perspective.

High-Level Architecture

Design Decisions

Following is the list of design decisions made for this project:

Implementation Details

  1. FastAPI for the REST Endpoints
    • /transcribe: Receives an audio file, calls Whisper for transcription, and returns the text.
    • /summarize: Takes the transcribed text, calls the LLM, and returns a summary.
    • Health checks (/health) keep the container orchestration informed of readiness. (A minimal endpoint sketch follows this list.)
  2. Whisper
    • We use the GPU for speed when available (if torch.cuda.is_available()).
    • For smaller hardware or faster inference, you can opt for "tiny" or "small" model sizes. (A loading sketch follows this list.)
  3. LLM Summarization
    • This could be any open-source LLM (like Llama 2, GPT-NeoX, etc.) or something hosted. We are using Microsoft Phi 3.5 (phi3.5:3.8b-mini-instruct-q4_K_M) served through Ollama.
    • Direct Processing: The transcript is processed in a single pass using the Phi model. The biggest reason to choose a large context window is to ensure the model can process the entire transcript without truncation, chunking, or overlapping sections, since quality deteriorates with chunking.
    • Structured Output: Summary organized into clear sections:
      •    Overview
      •    Main Points
      •    Key Insights
      •    Action Items / Decisions
      •    Open Questions / Next Steps
      •    Conclusions
    • The system prompt does all the magic for summarization. I highly recommend spending time crafting and tuning it. (See the Ollama call sketch after this list.)
  4. Docker
    • A Dockerfile that installs Python, PyTorch, Whisper, plus your LLM dependencies.
    • The container also runs Uvicorn for FastAPI.
    • If GPU acceleration is needed, use an NVIDIA CUDA base image (e.g., nvidia/cuda:12.1.0-runtime-ubuntu22.04) and pass --gpus all to docker run.
  5. Optional: Streamlit UI
    • If you want a friendly front-end, spin up a UI to upload audio, track progress, and view results in real time. (A Streamlit sketch follows this list.)
    • Alternatively, you could just expose the endpoints in FastAPI and have your favorite front-end call them.
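
To make step 1 concrete, here is a minimal FastAPI sketch of the three endpoints. The helper names run_whisper and run_llm_summary are hypothetical stand-ins for the repo's transcription and summarization services, not its exact code:

# Minimal FastAPI sketch (illustrative, not the repo's exact code)
from fastapi import FastAPI, UploadFile, File
from pydantic import BaseModel

app = FastAPI()

class SummarizeRequest(BaseModel):
    transcript: str

@app.get("/health")
def health():
    # Lets Docker health checks confirm the service is ready
    return {"status": "ok"}

@app.post("/transcribe")
async def transcribe(file: UploadFile = File(...)):
    audio_bytes = await file.read()
    text = run_whisper(audio_bytes)  # hypothetical helper wrapping Whisper
    return {"transcript": text}

@app.post("/summarize")
def summarize(req: SummarizeRequest):
    summary = run_llm_summary(req.transcript)  # hypothetical helper calling Phi 3.5
    return {"summary": summary}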
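
The Whisper loading from step 2 boils down to a few lines; swap "medium" for "tiny" or "small" on constrained hardware:

import torch
import whisper

# Use the GPU when present, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# "medium" balances accuracy and speed; "tiny"/"small" trade accuracy for speed
model = whisper.load_model("medium", device=device)

result = model.transcribe("meeting.wav")
print(result["text"])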
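
For step 3, the summarization call is a single POST to Ollama's local REST API. The system prompt below is a shortened illustration of the structured-sections idea (the real prompt in the repo is richer), and num_ctx = 8192 is an assumed context-window size:

import requests

# Shortened, illustrative system prompt -- the repo's real prompt is more detailed
SYSTEM_PROMPT = (
    "You are a meeting summarizer. Organize your summary into these sections: "
    "Overview, Main Points, Key Insights, Action Items / Decisions, "
    "Open Questions / Next Steps, Conclusions."
)

def summarize_transcript(transcript: str) -> str:
    # num_ctx is an assumed context size, chosen large enough to avoid chunking
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "phi3.5:3.8b-mini-instruct-q4_K_M",
            "system": SYSTEM_PROMPT,
            "prompt": transcript,
            "stream": False,
            "options": {"num_ctx": 8192},
        },
        timeout=600,
    )
    response.raise_for_status()
    return response.json()["response"]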
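
And for the optional step 5 front-end, a Streamlit page driving the two endpoints can be this small (the backend URL and response fields are assumptions matching the endpoint sketch above):

import requests
import streamlit as st

API_URL = "http://localhost:8000"  # assumed backend address; adjust for docker-compose

st.title("Offline Audio Transcription & Summarization")

uploaded = st.file_uploader("Upload an audio file", type=["wav", "mp3", "m4a"])
if uploaded is not None and st.button("Transcribe & Summarize"):
    with st.spinner("Transcribing..."):
        r = requests.post(f"{API_URL}/transcribe",
                          files={"file": (uploaded.name, uploaded.getvalue())})
        transcript = r.json()["transcript"]
    st.subheader("Transcript")
    st.text_area("Transcript", transcript, height=200)

    with st.spinner("Summarizing..."):
        r = requests.post(f"{API_URL}/summarize", json={"transcript": transcript})
    st.subheader("Summary")
    st.markdown(r.json()["summary"])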

Key Challenges and Lessons

  1. Timeouts for Large Audio
    • Whisper or LLM summarization might take a while for hour-long recordings.
    • We increased the request timeout or used asynchronous background tasks (see the sketch after this list).
  2. GPU Memory Constraints
    • Large LLMs and Whisper can each use significant VRAM.
    • Consider smaller quantized (Q2, etc.) LLMs or chunk-based summarization.
  3. Accuracy vs. Speed
    • The "medium" or "large" Whisper model is more accurate but slower.
    • Summaries tend to be more coherent with bigger LLMs, but performance can suffer.
  4. Logging & Error Handling
    • Detailed logs ensure you catch issues early (e.g., partial transcripts, AI inference errors).
    • A robust system logs whether a GPU is found, model load times, and inference performance metrics (a logging sketch follows this list).
  5. Security & Data Privacy
    • Audio recordings may contain sensitive information.
    • Ensure your container or environment has proper access controls.
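
For the timeout problem in challenge 1, one option, sketched here with FastAPI's built-in BackgroundTasks rather than the repo's exact approach, is to return a job id immediately and let the client poll for the result. run_whisper and run_llm_summary are the same hypothetical helpers as above:

import uuid
from fastapi import BackgroundTasks, FastAPI, UploadFile, File

app = FastAPI()
jobs: dict[str, dict] = {}  # in-memory store; use Redis or a DB in production

def process_audio(job_id: str, audio_bytes: bytes):
    # Long-running work: Whisper transcription, then LLM summarization
    transcript = run_whisper(audio_bytes)   # hypothetical helper
    summary = run_llm_summary(transcript)   # hypothetical helper
    jobs[job_id] = {"status": "done", "transcript": transcript, "summary": summary}

@app.post("/jobs")
async def create_job(background_tasks: BackgroundTasks, file: UploadFile = File(...)):
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "processing"}
    audio_bytes = await file.read()
    # The response returns immediately; the heavy work runs afterwards
    background_tasks.add_task(process_audio, job_id, audio_bytes)
    return {"job_id": job_id}

@app.get("/jobs/{job_id}")
def job_status(job_id: str):
    return jobs.get(job_id, {"status": "unknown"})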
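
For challenge 4, the kind of startup logging that saves the most debugging time looks roughly like this (a sketch, assuming the standard logging module):

import logging
import time

import torch
import whisper

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(name)s %(levelname)s %(message)s")
logger = logging.getLogger("audio-pipeline")

# Record up front whether a GPU was found
logger.info("CUDA available: %s", torch.cuda.is_available())

# Time the model load so slow cold starts are visible in the logs
start = time.perf_counter()
model = whisper.load_model("medium")
logger.info("Whisper loaded in %.1fs", time.perf_counter() - start)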

Validation and Examples

  1. Meeting Transcription + Summary
    • A 30-minute internal meeting is processed, producing a transcript of 6,000 words.
    • LLM Summaries: A concise bullet-point list of decisions, tasks, and key insights.
  2. Podcast or Interview
    • Summarize multi-speaker dialogues to highlight important quotes or topics.
    • Optionally split each speaker’s segments, then unify them in the final text.
  3. Conference Keynote
    • Summaries merged into an “executive summary” with top-level takeaways.

Project Structure

LocalAudioTran-LLM-Summar/
├─ .dockerignore
├─ .env
├─ .gitignore
├─ README.md
├─ docker-compose.yml
├─ Dockerfile
├─ backend/
│  ├─ requirements.txt
│  └─ app/
│     ├─ main.py
│     ├─ services/
│     │  ├─ transcription.py
│     │  ├─ summarization.py
│     │  └─ __init__.py
│     ├─ utils/
│     │  └─ logger.py
│     ├─ models/
│     │  ├─ schemas.py
│     │  └─ __init__.py
│     └─ __init__.py
├─ frontend/
│  ├─ requirements.txt
│  └─ src/
│     └─ app.py
└─ logs/

Conclusion

By combining OpenAI Whisper (or any speech-to-text engine) with a Large Language Model (Phi 3.5 mini) summarizer inside a Docker container, we’ve built a unified pipeline for turning raw audio into manageable transcripts and actionable summaries. Whether you’re automating meeting minutes or analyzing podcast content, this approach saves countless hours. Feel free to experiment with chunking, smaller models, or advanced summarization prompts. Let me know how it goes!

Happy transcribing and summarizing!
Aresh Sarkari

Following is a list of helpful links:

Microsoft Phi 3.5 Model Page – microsoft/Phi-3.5-mini-instruct · Hugging Face
OpenAI Whisper Model – openai/whisper-medium · Hugging Face
Ollama Model Card Details – phi3.5:3.8b-mini-instruct-q4_K_M
NVIDIA Docker Images (Contains Container Engine) – nvidia/cuda – Docker Image | Docker Hub
IDE Editor of my choice – Cursor – The AI Code Editor