Offline Transcribing and Summarizing Audio with Whisper, Phi, FastAPI, Docker on NVIDIA GPU

9 Jan

In this blog post, we’ll dive into how I built a comprehensive offline audio transcription and summarization system using OpenAI Whisper (medium) for transcription, the Microsoft Phi 3.5 Large Language Model (LLM) for summarization, FastAPI for the REST API, and Docker for containerization. Audio content can be dense and long, so having an automated way to extract transcripts and high-level summaries can be a game-changer for meetings, interviews, podcasts, and beyond!

GitHub – askaresh/LocalAudioTran-LLM-Summar: Offline Audio Transcription (Whisper) and LLM-based (Phi-3.5) Summarization

Why Use LLMs for Audio Summaries?

Traditional speech-to-text solutions focus on generating transcripts. However, reading pages of raw transcript text is time-consuming. Conversational text is tedious to read and often only makes sense in the context of the live call. Bringing an LLM-based summarizer into the pipeline changes this:

  • High-Level Summaries: Quickly get the core ideas or key actions from a meeting.
  • Contextual Understanding: LLMs handle nuance like speaker changes, main topics, and action items.
  • Reduced Human Effort: Saves time sifting through entire transcripts.

High-Level Architecture

  • Audio Ingestion: The user uploads an audio file (e.g., .mp3, .wav).
  • Transcription: OpenAI Whisper medium model transcribes the audio into text.
  • LLM Summarization: A large language model (e.g., Microsoft Phi 3.5) processes the transcript and produces a condensed summary.
  • RESTful API: Built with FastAPI, handling file uploads and returning structured JSON responses.
  • Docker: Containerizes the entire application for easy deployment anywhere with a GPU.

Design Decisions

Following is the list of design decisions around this project:

  • Offline Processing – All processing runs locally, with no calls to external services. My setup uses two NVIDIA GPUs (A4000 and RTX 3090).
  • Audio Transcription – Using OpenAI Whisper (medium) is an obvious choice, as the transcription output is quite accurate, and the model size is efficient for offline running. I tried the large model, but the output did not justify the increased GPU VRAM requirements.
  • Summarization – This aspect took the most time to refine. I initially experimented with FLAN-T5 and BERT models, but their summarization outputs were subpar, to the point that the project did not feel worthwhile. While I believe these models could perform better with additional training, I preferred an out-of-the-box solution. Therefore, I chose Microsoft Phi 3.5 (phi3.5:3.8b-mini-instruct) as my model.
  • Context Window – I quickly learned that a model with a large context window is best for generating good summaries. I selected Phi 3.5 for its 128K context window.
  • LLM Model Quantization – My NVIDIA A4000 has 16GB of VRAM. To use the Phi-3.5 model effectively, I opted for the quantized phi3.5:3.8b-mini-instruct-q4_K_M model, which balances performance and output quality. However, I noted that the KV cache still overflows into system RAM. I also experimented with Q8 LLaMA models, but I found Q4 to be the best fit.
    • Because I am using the quantized model in GGUF format, I ended up running it in the Ollama container, which was the most straightforward implementation.
  • API/Containers – The API is built on FastAPI (GET/POST endpoints), and everything runs in containers for modularity.
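
As a rough sketch of how the quantized model might be called through Ollama's REST API, the snippet below builds a non-streaming request for the /api/generate endpoint. The host and port (Ollama's default, localhost:11434), the num_ctx value, and the helper names build_payload and summarize are assumptions for illustration, not taken from the project.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed: Ollama's default port
MODEL = "phi3.5:3.8b-mini-instruct-q4_K_M"           # model tag named in this post


def build_payload(transcript: str, system_prompt: str, num_ctx: int = 32768) -> dict:
    """Build the JSON body for a single, non-streaming generate call."""
    return {
        "model": MODEL,
        "system": system_prompt,
        "prompt": transcript,
        "stream": False,                  # one JSON object instead of a token stream
        "options": {"num_ctx": num_ctx},  # context length to allocate (assumed value)
    }


def summarize(transcript: str, system_prompt: str, timeout: float = 600.0) -> str:
    """Send the transcript to Ollama and return the generated summary text."""
    body = json.dumps(build_payload(transcript, system_prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

A larger num_ctx lets longer transcripts fit in one pass, but the KV cache grows with it, which is consistent with the overflow into system RAM noted above.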

Implementation Details

  1. FastAPI for the REST Endpoints
    • /transcribe: Receives an audio file, calls Whisper for transcription, and returns the text.
    • /summarize: Takes the transcribed text, calls the LLM, and returns a summary.
    • Health checks (/health) keep the container orchestration informed of readiness.
  2. Whisper
    • Whisper runs on the GPU for speed when one is available (checked with torch.cuda.is_available()).
    • For smaller hardware or faster inference, you can opt for "tiny" or "small" model sizes.
  3. LLM Summarization
    • Could be an open-source LLM (like Llama 2, GPT-NeoX, etc.) or something hosted. We are using Microsoft Phi 3.5 (phi3.5:3.8b-mini-instruct-q4_K_M).
    • Direct Processing: The transcript is processed in a single pass by the Phi model. The main reason to choose a large context window is so the model can process the entire transcript without truncation, chunking, or overlapping sections, since summary quality deteriorates with chunking.
    • Structured Output: Summary organized into clear sections:
      •    Overview
      •    Main Points
      •    Key Insights
      •    Action Items / Decisions
      •    Open Questions / Next Steps
      •    Conclusions
    • The system prompt does most of the work in summarization. I highly recommend spending time on writing and refining it.
  4. Docker
    • A Dockerfile that installs Python, PyTorch, Whisper, plus your LLM dependencies.
    • The container also runs Uvicorn for FastAPI.
    • For GPU acceleration, we used an NVIDIA CUDA base image (e.g., nvidia/cuda:12.1.0-runtime-ubuntu22.04) and passed --gpus all to docker run.
  5. Optional: Streamlit UI
    • If you want a friendly front-end, spin up a UI to upload audio, track progress, and view results in real-time.
    • Alternatively, you could just expose the endpoints in FastAPI and have your favorite front-end call them.
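
The endpoints, Whisper loading, and structured prompt described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the SummarizeRequest schema, the JSON field names, and the prompt wording are all hypothetical. Heavy libraries are imported inside the functions so the module can be loaded without them.

```python
import os
import tempfile
from typing import Callable

# Section headings listed in this post; the prompt wording below is hypothetical.
SECTIONS = ["Overview", "Main Points", "Key Insights", "Action Items / Decisions",
            "Open Questions / Next Steps", "Conclusions"]


def build_system_prompt() -> str:
    """Ask the model to organize its summary under the fixed headings."""
    headings = "\n".join(f"- {name}" for name in SECTIONS)
    return "Summarize the transcript under these headings:\n" + headings


def load_whisper_transcriber(model_size: str = "medium") -> Callable[[str], str]:
    """Load Whisper once and return a function mapping an audio path to text."""
    import torch    # lazy import: only needed when the model is actually loaded
    import whisper  # the openai-whisper package

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = whisper.load_model(model_size, device=device)
    return lambda path: model.transcribe(path)["text"]


def create_app(transcribe: Callable[[str], str], summarize: Callable[[str], str]):
    """Build the FastAPI app around injected transcribe/summarize callables."""
    from fastapi import FastAPI, File, UploadFile
    from pydantic import BaseModel

    class SummarizeRequest(BaseModel):  # hypothetical request schema
        text: str

    app = FastAPI()

    @app.get("/health")
    def health() -> dict:
        return {"status": "ok"}

    @app.post("/transcribe")
    def transcribe_audio(file: UploadFile = File(...)) -> dict:
        # Whisper reads from a path, so spool the upload to a temporary file.
        suffix = os.path.splitext(file.filename or "")[1]
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
            tmp.write(file.file.read())
        try:
            return {"text": transcribe(tmp.name)}
        finally:
            os.remove(tmp.name)

    @app.post("/summarize")
    def summarize_text(request: SummarizeRequest) -> dict:
        return {"summary": summarize(request.text)}

    return app
```

Passing the transcribe and summarize callables into create_app keeps the HTTP layer separate from the models, so the endpoints can be exercised with simple stand-in functions before a GPU is involved.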

Key Challenges and Lessons

  1. Timeouts for Large Audio
    • Whisper or LLM summarization might take a while for hour-long recordings.
    • We increased the request timeout or used asynchronous background tasks.
  2. GPU Memory Constraints
    • Large LLMs and Whisper can each use significant VRAM.
    • Consider more aggressively quantized LLMs (Q2, etc.) or chunk-based summarization.
  3. Accuracy vs. Speed
    • The "medium" or "large" Whisper model is more accurate but slower.
    • Summaries can get more coherent using bigger LLMs, but performance can suffer.
  4. Logging & Error Handling
    • Detailed logs ensure you catch issues early (e.g., partial transcripts, AI inference errors).
    • A robust system logs whether GPU is found, load times, and inference performance metrics.
  5. Security & Data Privacy
    • Audio recordings may contain sensitive information.
    • Ensure your container or environment has proper access controls.
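
For the chunk-based fallback mentioned under GPU memory constraints, here is a minimal sketch. It assumes simple word-based splitting with a fixed overlap, followed by a second pass that summarizes the partial summaries; the function names and default sizes are hypothetical, and the summarize callable stands in for whatever LLM call you use.

```python
from typing import Callable, List


def chunk_words(text: str, chunk_size: int = 1500, overlap: int = 150) -> List[str]:
    """Split text into word-based chunks where neighbours share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


def summarize_in_chunks(text: str, summarize: Callable[[str], str], **chunk_options) -> str:
    """Summarize each chunk, then summarize the concatenated partial summaries."""
    partials = [summarize(chunk) for chunk in chunk_words(text, **chunk_options)]
    if len(partials) <= 1:
        return partials[0] if partials else ""
    return summarize("\n\n".join(partials))
```

As noted earlier, a single pass over the whole transcript gave better summaries in this project, so this is best treated as a fallback for when VRAM or context length runs out.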

Validation and Examples

  1. Meeting Transcription + Summary
    • A 30-minute internal meeting is processed, producing a transcript of 6,000 words.
    • LLM Summaries: A concise bullet-point list of decisions, tasks, and key insights.
  2. Podcast or Interview
    • Summarize multi-speaker dialogues to highlight important quotes or topics.
    • Possibly split each speaker’s segment, then unify in final text.
  3. Conference Keynote
    • Summaries merged into an “executive summary” with top-level takeaways.
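
To get a feel for whether a transcript fits in a single pass, a back-of-the-envelope check can help. The sketch below assumes roughly 1.3 tokens per English word, which is only a rule of thumb (the real count depends on the tokenizer), and reserves a hypothetical 2,000 tokens for the generated summary.

```python
TOKENS_PER_WORD = 1.3      # rough rule of thumb for English text (assumption)
CONTEXT_WINDOW = 128_000   # Phi 3.5 context window cited in this post


def estimate_tokens(word_count: int, tokens_per_word: float = TOKENS_PER_WORD) -> int:
    """Estimate the token count of a transcript from its word count."""
    return round(word_count * tokens_per_word)


def fits_in_context(word_count: int, reserved_for_output: int = 2_000) -> bool:
    """Check whether the transcript plus room for the summary fits the window."""
    return estimate_tokens(word_count) + reserved_for_output <= CONTEXT_WINDOW
```

Under this estimate, the 6,000-word meeting transcript above comes to about 7,800 tokens, far below the 128K window.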

Project Structure

LocalAudioTran-LLM-Summar/
├─ .dockerignore
├─ .env
├─ .gitignore
├─ README.md
├─ docker-compose.yml
├─ Dockerfile
├─ backend/
│  ├─ requirements.txt
│  └─ app/
│     ├─ main.py
│     ├─ services/
│     │  ├─ transcription.py
│     │  ├─ summarization.py
│     │  └─ __init__.py
│     ├─ utils/
│     │  └─ logger.py
│     ├─ models/
│     │  ├─ schemas.py
│     │  └─ __init__.py
│     └─ __init__.py
├─ frontend/
│  ├─ requirements.txt
│  └─ src/
│     └─ app.py
└─ logs/
  • transcription.py loads Whisper, handles file I/O.
  • summarization.py calls your Phi 3.5 LLM (Hugging Face Transformers, Ollama, etc.).
  • docker-compose.yml spins up both the app container and the optional GPU-based service.
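
The tree also lists utils/logger.py and a logs/ directory. A helper along the following lines is one way such a module could look; it is a hypothetical sketch using only the standard library, with made-up names (get_logger, log_duration), and it logs to the console rather than to files under logs/.

```python
import logging
import time
from contextlib import contextmanager


def get_logger(name: str = "audio_pipeline") -> logging.Logger:
    """Return a logger with one console handler attached."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


@contextmanager
def log_duration(logger: logging.Logger, label: str):
    """Log how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        logger.info("%s took %.2f s", label, time.perf_counter() - start)
```

Wrapping model loading and each inference call in log_duration gives the load times and inference timings mentioned under Logging & Error Handling.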

Conclusion

By combining OpenAI Whisper (or any speech-to-text engine) with a Large Language Model (Phi 3.5 mini) summarizer inside a Docker container, we’ve built a unified pipeline for turning raw audio into manageable transcripts and actionable summaries. Whether you’re automating meeting minutes or analyzing podcast content, this approach saves countless hours. Feel free to experiment with chunking, smaller models, or advanced summarization prompts. Let me know how it goes!

Happy transcribing and summarizing!
Aresh Sarkari

Helpful links:

  • Microsoft Phi 3.5 Model Page: microsoft/Phi-3.5-mini-instruct · Hugging Face
  • OpenAI Whisper Model: openai/whisper-medium · Hugging Face
  • Ollama Model Card Details: phi3.5:3.8b-mini-instruct-q4_K_M
  • NVIDIA Docker Images (Contains Container Engine): nvidia/cuda – Docker Image | Docker Hub
  • IDE Editor of my choice: Cursor – The AI Code Editor

Building an Image Captioning API with FastAPI and Hugging Face Transformers packaged with Docker

17 Apr

In this blog post, we’ll build an Image Captioning API using FastAPI and Hugging Face Transformers. Image captioning is a fascinating task that involves generating textual descriptions for given images. By combining deep learning and natural language processing, we can create a system that automatically understands the content of an image and generates human-like captions. In my example, I input an image of a rider on a bike in a garage, and the generated caption describes the image accurately.

Project Overview

👨‍💻 GitHub: https://github.com/askaresh/blip-image-captioning-api

The goal of this project is to develop a RESTful API that accepts an image as input and returns a generated caption describing the image. We’ll be using FastAPI, a modern and fast web framework for building APIs, along with Hugging Face Transformers, a popular library for natural language processing tasks.

The key components of our project include:

  1. FastAPI: A web framework for building efficient and scalable APIs in Python.
  2. Hugging Face Transformers: A library that provides state-of-the-art pre-trained models for NLP and vision-language tasks, including image captioning.
  3. Docker: A containerization platform that allows us to package our application and its dependencies into a portable and reproducible environment.

Implementation Details

To build our Image Captioning API, we started by setting up a FastAPI project and defining the necessary endpoints. The main endpoint accepts an image file and an optional text input for conditional image captioning.

We utilized the pre-trained BLIP (Bootstrapping Language-Image Pre-training) model from Hugging Face Transformers for image captioning. BLIP is a powerful model that has been trained on a large dataset of image-caption pairs and achieves impressive results in generating accurate and coherent captions.
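
A minimal sketch of loading BLIP and generating a caption with Hugging Face Transformers might look like the following. The checkpoint name Salesforce/blip-image-captioning-base is an assumption (the project reads its model name from a setting), and the function signatures are illustrative rather than the project's own.

```python
def load_model(model_name: str = "Salesforce/blip-image-captioning-base"):
    """Load the BLIP processor and model (checkpoint name is an assumption)."""
    # Imported lazily: transformers and its torch backend are heavy to import.
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_name)
    model = BlipForConditionalGeneration.from_pretrained(model_name)
    return processor, model


def generate_caption(processor, model, image, text=None, max_new_tokens=50):
    """Caption a PIL image; `text` optionally supplies a prefix for the caption."""
    # With text=None the processor encodes only the image (unconditional captioning).
    inputs = processor(images=image, text=text or None, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

With an image opened through PIL and converted to RGB, calling generate_caption(processor, model, image, "a photograph of") would condition the caption on that prefix, while omitting the text lets the model caption freely.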

To ensure a smooth development experience and the ability to run on any cloud, I containerized the application using Docker. This encapsulates all the dependencies, including the Python libraries and the pre-trained model, in a portable and reproducible environment.

HF-IT-DOCKER/
├── app/
│   ├── config.py
│   ├── main.py
│   ├── model.py
│   └── utils.py
├── .dockerignore
├── .gitignore
├── compose.yaml
├── Dockerfile
├── logging.conf
├── README.Docker.md
└── requirements.txt

Detailed description of each file:

  • app/config.py:
    • This file contains the configuration settings for the application.
    • It defines a Settings class using the pydantic_settings library to store and manage application-specific settings.
    • The blip_model_name setting specifies the name of the BLIP model to be used for image captioning.
  • app/main.py:
    • This is the main entry point of the FastAPI application.
    • It sets up the FastAPI app, loads the BLIP model, and configures logging.
    • It defines the API endpoints, including the root path (“/”) and the image captioning endpoint (“/caption”).
    • The “/caption” endpoint accepts an image file and an optional text input, processes the image, generates a caption using the BLIP model, and returns the generated caption.
  • app/model.py:
    • This file contains the functions related to loading and using the BLIP model for image captioning.
    • The load_model function loads the pre-trained BLIP model and processor based on the specified model name.
    • The generate_caption function takes an image and optional text input, preprocesses the inputs, and generates a caption using the loaded BLIP model.
  • app/utils.py:
    • This file contains utility functions used in the project.
    • The load_image_from_file function reads an image file and converts it to the appropriate format (RGB) using the PIL library.
  • .dockerignore:
    • This file specifies the files and directories that should be excluded when building the Docker image.
    • It helps to reduce the size of the Docker image by excluding unnecessary files and directories.
  • .gitignore:
    • This file specifies the files and directories that should be ignored by Git version control.
    • It helps to keep the repository clean by excluding files that are not necessary to track, such as generated files, cache files, and environment-specific files.
  • compose.yaml:
    • This file contains the configuration for Docker Compose, which is used to define and run multi-container Docker applications.
    • It defines the services, including the FastAPI server, and specifies the build context, ports, and any necessary dependencies.
  • Dockerfile:
    • This file contains the instructions for building the Docker image for the FastAPI application.
    • It specifies the base image, sets up the working directory, installs dependencies, copies the application code, and defines the entry point for running the application.
  • logging.conf:
    • This file contains the configuration for the Python logging system.
    • It defines the loggers, handlers, formatters, and their respective settings.
    • It specifies the log levels, log file paths, and log message formats.
  • README.Docker.md:
    • This file provides documentation and instructions specific to running the application using Docker.
    • It may include information on how to build the Docker image, run the container, and any other Docker-related details.
  • requirements.txt:
    • This file lists the Python dependencies required by the application.
    • It includes the necessary libraries and their versions, such as FastAPI, Hugging Face Transformers, PIL, and others.
    • It is used by pip to install the required packages when building the Docker image or setting up the development environment.
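
To make the logging.conf description more concrete, here is a small sketch of a configuration in the format read by Python's logging.config.fileConfig, loaded from a string so the example is self-contained. The contents are hypothetical and use a console handler; the project's actual file also specifies log file paths, which would use a file handler instead.

```python
import configparser
import logging
import logging.config

# Hypothetical logging.conf contents: one root logger, one handler, one formatter.
LOGGING_CONF = """
[loggers]
keys=root

[handlers]
keys=console

[formatters]
keys=simple

[logger_root]
level=INFO
handlers=console

[handler_console]
class=StreamHandler
level=INFO
formatter=simple
args=(sys.stderr,)

[formatter_simple]
format=%(asctime)s %(levelname)s %(name)s: %(message)s
"""


def configure_logging(conf_text: str = LOGGING_CONF) -> None:
    """Parse the INI-style text and apply it as the logging configuration."""
    parser = configparser.RawConfigParser()
    parser.read_string(conf_text)
    # fileConfig accepts a RawConfigParser instance as well as a file path.
    logging.config.fileConfig(parser, disable_existing_loggers=False)
```

In the application itself, the same call would typically take the path to logging.conf instead of a parser object.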

Lessons Learned and Debugging

Throughout the development process, I encountered several challenges and learned valuable lessons:

  1. Dependency Management: Managing dependencies can be tricky, especially when working with large pre-trained models. We learned the importance of properly specifying dependencies in our requirements file and using Docker to ensure consistent environments across different systems.
  2. Debugging Permission Issues: We encountered permission-related issues when running our application inside a Docker container. Through debugging, we learned the significance of properly setting file and directory permissions and running the container as a non-root user to enhance security.
  3. Logging Configuration: Proper logging is crucial for understanding the behavior of our application and troubleshooting issues. I learned how to configure logging using a configuration file and ensure that log files are written to directories with appropriate permissions.
  4. Testing and Error Handling: Comprehensive testing and error handling are essential for building a robust API. We implemented thorough error handling to provide meaningful error messages to API users and conducted extensive testing to ensure the reliability of our image captioning functionality.

Validation of the API

After the container is up and running, go to http://localhost:8004/docs, select the POST method, and click Try it out. Upload any image of your choice, optionally enter the text, and click Execute. The generated caption appears in the response below.

Conclusion

Building an Image Captioning API with FastAPI and Hugging Face Transformers has been an incredible learning experience. By leveraging the power of pre-trained models and containerization, I created a scalable and efficient solution for generating image captions automatically.

Through this project, I gained valuable insights into dependency management, debugging permission issues, logging configuration, and the importance of testing and error handling. These lessons will undoubtedly be applicable to future projects and contribute to our growth as developers.

I hope that this blog post has provided you with a comprehensive overview of our Image Captioning API project and inspired you to explore the fascinating world of image captioning and natural language processing. Feel free to reach out with any questions or suggestions, and happy captioning!

Thanks,
Aresh Sarkari