
Offline Transcribing and Summarizing Audio with Whisper, Phi, FastAPI, Docker on NVIDIA GPU

In this blog post, we’ll dive into how I built a comprehensive offline audio transcription and summarization system using OpenAI Whisper (medium) for transcription, the Microsoft Phi 3.5 Large Language Model (LLM) for summarization, FastAPI for the REST API, and Docker for containerization. Audio content can be dense and long, so having an automated way to extract transcripts and high-level summaries can be a game-changer for meetings, interviews, podcasts, and beyond!

GitHub – askaresh/LocalAudioTran-LLM-Summar: Offline Audio Transcription (Whisper) and LLM-based (Phi-3.5) Summarization

Why Use LLMs for Audio Summaries?

Traditional speech-to-text solutions focus on generating transcripts. However, reading pages of raw transcript text is time-consuming, and conversational text that made sense during a live call rarely reads well afterward. Bringing an LLM-based summarizer into the pipeline changes the entire perspective.

High-Level Architecture

Design Decisions

Following is the list of design decisions made for this project:

Implementation Details

  1. FastAPI for the REST Endpoints
    • /transcribe: Receives an audio file, calls Whisper for transcription, and returns the text.
    • /summarize: Takes the transcribed text, calls the LLM, and returns a summary.
    • Health checks (/health) keep the container orchestration informed of readiness. (A minimal endpoint sketch follows this list.)
  2. Whisper
    • We use the GPU for speed when available (if torch.cuda.is_available()).
    • For smaller hardware or faster inference, you can opt for "tiny" or "small" model sizes. (A loading sketch follows this list.)
  3. LLM Summarization
    • This could be any open-source LLM (like Llama 2, GPT-NeoX, etc.) or something hosted. We are using Microsoft Phi 3.5 (phi3.5:3.8b-mini-instruct-q4_K_M) served through Ollama.
    • Direct Processing: The transcript is processed in a single pass using the Phi model. The biggest reason to choose a large context window is to ensure the model can process the entire transcript without truncation, chunking, or overlapping sections, since quality deteriorates with chunking.
    • Structured Output: Summary organized into clear sections:
      •    Overview
      •    Main Points
      •    Key Insights
      •    Action Items / Decisions
      •    Open Questions / Next Steps
      •    Conclusions
    • The system prompt does all the magic for summarization. I highly recommend spending time crafting and tuning it. (See the Ollama call sketch after this list.)
  4. Docker
    • A Dockerfile that installs Python, PyTorch, Whisper, plus your LLM dependencies.
    • The container also runs Uvicorn for FastAPI.
    • If GPU acceleration is needed, use an NVIDIA CUDA base image (e.g., nvidia/cuda:12.1.0-runtime-ubuntu22.04) and pass --gpus all to docker run.
  5. Optional: Streamlit UI
    • If you want a friendly front-end, spin up a UI to upload audio, track progress, and view results in real time. (A Streamlit sketch follows this list.)
    • Alternatively, you could just expose the endpoints in FastAPI and have your favorite front-end call them.
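
To make step 1 concrete, here is a minimal FastAPI sketch of the three endpoints. The helper names run_whisper and run_llm_summary are hypothetical stand-ins for the repo's transcription and summarization services, not its exact code:

# Minimal FastAPI sketch (illustrative, not the repo's exact code)
from fastapi import FastAPI, UploadFile, File
from pydantic import BaseModel

app = FastAPI()

class SummarizeRequest(BaseModel):
    transcript: str

@app.get("/health")
def health():
    # Lets Docker health checks confirm the service is ready
    return {"status": "ok"}

@app.post("/transcribe")
async def transcribe(file: UploadFile = File(...)):
    audio_bytes = await file.read()
    text = run_whisper(audio_bytes)  # hypothetical helper wrapping Whisper
    return {"transcript": text}

@app.post("/summarize")
def summarize(req: SummarizeRequest):
    summary = run_llm_summary(req.transcript)  # hypothetical helper calling Phi 3.5
    return {"summary": summary}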
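
The Whisper loading from step 2 boils down to a few lines; swap "medium" for "tiny" or "small" on constrained hardware:

import torch
import whisper

# Use the GPU when present, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# "medium" balances accuracy and speed; "tiny"/"small" trade accuracy for speed
model = whisper.load_model("medium", device=device)

result = model.transcribe("meeting.wav")
print(result["text"])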
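
For step 3, the summarization call is a single POST to Ollama's local REST API. The system prompt below is a shortened illustration of the structured-sections idea (the real prompt in the repo is richer), and num_ctx = 8192 is an assumed context-window size:

import requests

# Shortened, illustrative system prompt -- the repo's real prompt is more detailed
SYSTEM_PROMPT = (
    "You are a meeting summarizer. Organize your summary into these sections: "
    "Overview, Main Points, Key Insights, Action Items / Decisions, "
    "Open Questions / Next Steps, Conclusions."
)

def summarize_transcript(transcript: str) -> str:
    # num_ctx is an assumed context size, chosen large enough to avoid chunking
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "phi3.5:3.8b-mini-instruct-q4_K_M",
            "system": SYSTEM_PROMPT,
            "prompt": transcript,
            "stream": False,
            "options": {"num_ctx": 8192},
        },
        timeout=600,
    )
    response.raise_for_status()
    return response.json()["response"]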
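
And for the optional step 5 front-end, a Streamlit page driving the two endpoints can be this small (the backend URL and response fields are assumptions matching the endpoint sketch above):

import requests
import streamlit as st

API_URL = "http://localhost:8000"  # assumed backend address; adjust for docker-compose

st.title("Offline Audio Transcription & Summarization")

uploaded = st.file_uploader("Upload an audio file", type=["wav", "mp3", "m4a"])
if uploaded is not None and st.button("Transcribe & Summarize"):
    with st.spinner("Transcribing..."):
        r = requests.post(f"{API_URL}/transcribe",
                          files={"file": (uploaded.name, uploaded.getvalue())})
        transcript = r.json()["transcript"]
    st.subheader("Transcript")
    st.text_area("Transcript", transcript, height=200)

    with st.spinner("Summarizing..."):
        r = requests.post(f"{API_URL}/summarize", json={"transcript": transcript})
    st.subheader("Summary")
    st.markdown(r.json()["summary"])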

Key Challenges and Lessons

  1. Timeouts for Large Audio
    • Whisper or LLM summarization might take a while for hour-long recordings.
    • We increased the request timeout or used asynchronous background tasks (see the sketch after this list).
  2. GPU Memory Constraints
    • Large LLMs and Whisper can each use significant VRAM.
    • Consider smaller quantized (Q2, etc.) LLMs or chunk-based summarization.
  3. Accuracy vs. Speed
    • The "medium" or "large" Whisper model is more accurate but slower.
    • Summaries tend to be more coherent with bigger LLMs, but performance can suffer.
  4. Logging & Error Handling
    • Detailed logs ensure you catch issues early (e.g., partial transcripts, AI inference errors).
    • A robust system logs whether a GPU is found, model load times, and inference performance metrics (a logging sketch follows this list).
  5. Security & Data Privacy
    • Audio recordings may contain sensitive information.
    • Ensure your container or environment has proper access controls.
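
For the timeout problem in challenge 1, one option, sketched here with FastAPI's built-in BackgroundTasks rather than the repo's exact approach, is to return a job id immediately and let the client poll for the result. run_whisper and run_llm_summary are the same hypothetical helpers as above:

import uuid
from fastapi import BackgroundTasks, FastAPI, UploadFile, File

app = FastAPI()
jobs: dict[str, dict] = {}  # in-memory store; use Redis or a DB in production

def process_audio(job_id: str, audio_bytes: bytes):
    # Long-running work: Whisper transcription, then LLM summarization
    transcript = run_whisper(audio_bytes)   # hypothetical helper
    summary = run_llm_summary(transcript)   # hypothetical helper
    jobs[job_id] = {"status": "done", "transcript": transcript, "summary": summary}

@app.post("/jobs")
async def create_job(background_tasks: BackgroundTasks, file: UploadFile = File(...)):
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "processing"}
    audio_bytes = await file.read()
    # The response returns immediately; the heavy work runs afterwards
    background_tasks.add_task(process_audio, job_id, audio_bytes)
    return {"job_id": job_id}

@app.get("/jobs/{job_id}")
def job_status(job_id: str):
    return jobs.get(job_id, {"status": "unknown"})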
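
For challenge 4, the kind of startup logging that saves the most debugging time looks roughly like this (a sketch, assuming the standard logging module):

import logging
import time

import torch
import whisper

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(name)s %(levelname)s %(message)s")
logger = logging.getLogger("audio-pipeline")

# Record up front whether a GPU was found
logger.info("CUDA available: %s", torch.cuda.is_available())

# Time the model load so slow cold starts are visible in the logs
start = time.perf_counter()
model = whisper.load_model("medium")
logger.info("Whisper loaded in %.1fs", time.perf_counter() - start)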

Validation and Examples

  1. Meeting Transcription + Summary
    • A 30-minute internal meeting is processed, producing a transcript of 6,000 words.
    • LLM Summaries: A concise bullet-point list of decisions, tasks, and key insights.
  2. Podcast or Interview
    • Summarize multi-speaker dialogues to highlight important quotes or topics.
    • Optionally split each speaker’s segments, then unify them in the final text.
  3. Conference Keynote
    • Summaries merged into an “executive summary” with top-level takeaways.

Project Structure

LocalAudioTran-LLM-Summar/
├─ .dockerignore
├─ .env
├─ .gitignore
├─ README.md
├─ docker-compose.yml
├─ Dockerfile
├─ backend/
│  ├─ requirements.txt
│  └─ app/
│     ├─ main.py
│     ├─ services/
│     │  ├─ transcription.py
│     │  ├─ summarization.py
│     │  └─ __init__.py
│     ├─ utils/
│     │  └─ logger.py
│     ├─ models/
│     │  ├─ schemas.py
│     │  └─ __init__.py
│     └─ __init__.py
├─ frontend/
│  ├─ requirements.txt
│  └─ src/
│     └─ app.py
└─ logs/

Conclusion

By combining OpenAI Whisper (or any speech-to-text engine) with a Large Language Model (Phi 3.5 mini) summarizer inside a Docker container, we’ve built a unified pipeline for turning raw audio into manageable transcripts and actionable summaries. Whether you’re automating meeting minutes or analyzing podcast content, this approach saves countless hours. Feel free to experiment with chunking, smaller models, or advanced summarization prompts. Let me know how it goes!

Happy transcribing and summarizing!
Aresh Sarkari

Following is a list of helpful links:

Microsoft Phi 3.5 Model Page – microsoft/Phi-3.5-mini-instruct · Hugging Face
OpenAI Whisper Model – openai/whisper-medium · Hugging Face
Ollama Model Card Details – phi3.5:3.8b-mini-instruct-q4_K_M
NVIDIA Docker Images (Contains Container Engine) – nvidia/cuda – Docker Image | Docker Hub
IDE Editor of my choice – Cursor – The AI Code Editor