Archive | AI & Machine Learning

Offline Transcribing and Summarizing Audio with Whisper, Phi, FastAPI, Docker on NVIDIA GPU

9 Jan

In this blog post, we’ll dive into how I built a comprehensive offline audio transcription and summarization system using OpenAI Whisper (medium) for transcription, the Microsoft Phi 3.5 Large Language Model (LLM) for summarization, FastAPI for the REST API, and Docker for containerization. Audio content can be dense and long, so having an automated way to extract transcripts and high-level summaries can be a game-changer for meetings, interviews, podcasts, and beyond!

Github – askaresh/LocalAudioTran-LLM-Summar: Offline Audio Transcription (Whisper) and LLM based (Phi-3.5) Summarization

Why Use LLMs for Audio Summaries?

Traditional speech-to-text solutions focus on generating transcripts. However, reading pages of raw transcript text is time-consuming: conversational text that made sense during a live call is tedious to read afterwards. Bringing an LLM-based summarizer into the pipeline changes the entire perspective:

  • High-Level Summaries: Quickly get the core ideas or key actions from a meeting.
  • Contextual Understanding: LLMs handle nuance like speaker changes, main topics, and action items.
  • Reduced Human Effort: Saves time sifting through entire transcripts.

High-Level Architecture

  • Audio Ingestion: The user uploads an audio file (e.g., .mp3, .wav).
  • Transcription: OpenAI Whisper medium model transcribes the audio into text.
  • LLM Summarization: A large language model (e.g., Microsoft Phi 3.5) processes the transcript and produces a condensed summary.
  • RESTful API: Built with FastAPI, handling file uploads and returning structured JSON responses.
  • Docker: Containerizes the entire application for easy deployment anywhere with a GPU.

Design Decisions

Following is a list of the design decisions behind this project:

  • Offline Processing – All processing runs locally on my own hardware, a multi-GPU setup with an NVIDIA A4000 and an RTX 3090, which keeps the pipeline fast and fully self-contained.
  • Audio Transcription – Using OpenAI Whisper (medium) is an obvious choice, as the transcription output is quite accurate, and the model size is efficient for offline running. I tried the large model, but the output did not justify the increased GPU VRAM requirements.
  • Summarization – This aspect took the most time to refine. I initially experimented with FLAN-T5 and BERT-based models, but I found their summarization output subpar, which made the project feel not worth pursuing. While I believe these models could perform better with extensive fine-tuning, I prefer an out-of-the-box solution. Therefore, I chose Microsoft Phi 3.5 (phi3.5:3.8b-mini-instruct).
  • Context Window – I quickly learned that a model with a large context window is best for generating great summaries. I selected Phi 3.5 in part because of its 128K context window.
  • LLM Model Quantization – My NVIDIA A4000 has 16GB of VRAM. To effectively use the Phi-3.5 model, I opted for the quantized phi3.5:3.8b-mini-instruct-q4_K_M model, which balances performance and quality output. However, I noted that the KVCache still overflows and utilizes system RAM. I also experimented with Q8 LLaMA models, but I found Q4 to be the best fit.
    • Because I am using the quantised model, I ended up using the Ollama container to run the GGUF model, which has the most straightforward implementation.
  • API/Containers – All the code utilizes FastAPI for GET/POST requests. Of course, for modularity, everything operates within a container.
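
For reference, the transcription decision above boils down to a few lines of code. This is a minimal sketch using the open-source openai-whisper package; the input file name is illustrative and this is not the repository's exact code.

```python
# Minimal sketch: load Whisper medium on the GPU when available and transcribe a file.
import torch
import whisper

device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("medium", device=device)

result = model.transcribe("meeting.mp3")  # illustrative input file
print(result["text"])
```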

Implementation Details

  1. FastAPI for the REST Endpoints
    • /transcribe: Receives an audio file, calls Whisper for transcription, and returns the text.
    • /summarize: Takes the transcribed text, calls the LLM, and returns a summary.
    • Health checks (/health) keep the container orchestration informed of readiness.
  2. Whisper
    • We used a GPU version for speed (if torch.cuda.is_available()).
    • For smaller hardware or faster inference, you can opt for "tiny" or "small" model sizes.
  3. LLM Summarization
    • Could be an open-source LLM (like Llama 2, GPT-NeoX, etc.) or something hosted. We are using Microsoft Phi 3.5 (phi3.5:3.8b-mini-instruct-q4_K_M)
    • Direct Processing: The transcript is processed in a single pass using the Phi model. The biggest reason to choose a large context window is so the model can process the entire transcript without truncation, chunking, or overlapping sections, since quality deteriorates with chunking (see the sketch after this list).
    • Structured Output: Summary organized into clear sections:
      •    Overview
      •    Main Points
      •    Key Insights
      •    Action Items / Decisions
      •    Open Questions / Next Steps
      •    Conclusions
    • The system prompt does all the magic for summarisation; I highly recommend spending time on it and learning how it shapes the output.
  4. Docker
    • A Dockerfile that installs Python, PyTorch, Whisper, plus your LLM dependencies.
    • The container also runs Uvicorn for FastAPI.
    • For GPU acceleration, we used an NVIDIA CUDA base image (e.g., nvidia/cuda:12.1.0-runtime-ubuntu22.04) and passed --gpus all to docker run.
  5. Optional: Streamlit UI
    • If you want a friendly front-end, spin up a UI to upload audio, track progress, and view results in real-time.
    • Alternatively, you could just expose the endpoints in FastAPI and have your favorite front-end call them.
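
Putting the pieces above together, here is a hedged sketch of the three endpoints. It assumes the openai-whisper package and an Ollama container serving the quantized Phi-3.5 model; the service hostname, prompt wording, and endpoint bodies are illustrative rather than the repository's exact code.

```python
import os
import tempfile

import requests
import torch
import whisper
from fastapi import FastAPI, UploadFile
from pydantic import BaseModel

app = FastAPI()
device = "cuda" if torch.cuda.is_available() else "cpu"
whisper_model = whisper.load_model("medium", device=device)

OLLAMA_URL = "http://ollama:11434/api/generate"  # assumed service name and default Ollama port
SYSTEM_PROMPT = (
    "Summarize the transcript into: Overview, Main Points, Key Insights, "
    "Action Items / Decisions, Open Questions / Next Steps, Conclusions."
)

class SummarizeRequest(BaseModel):
    transcript: str

@app.get("/health")
def health():
    # Lets the orchestrator know the service is up and whether a GPU was found.
    return {"status": "ok", "gpu": torch.cuda.is_available()}

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    # Persist the upload to disk so Whisper/ffmpeg can read it.
    suffix = os.path.splitext(file.filename or "")[1] or ".wav"
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(await file.read())
        path = tmp.name
    try:
        result = whisper_model.transcribe(path)
    finally:
        os.remove(path)
    return {"transcript": result["text"]}

@app.post("/summarize")
def summarize(req: SummarizeRequest):
    # Single pass over the whole transcript; the 128K context window avoids chunking.
    payload = {
        "model": "phi3.5:3.8b-mini-instruct-q4_K_M",
        "prompt": f"{SYSTEM_PROMPT}\n\nTranscript:\n{req.transcript}",
        "stream": False,
    }
    response = requests.post(OLLAMA_URL, json=payload, timeout=600)
    response.raise_for_status()
    return {"summary": response.json()["response"]}
```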

Key Challenges and Lessons

  1. Timeouts for Large Audio
    • Whisper or LLM summarization might take a while for hour-long recordings.
    • We increased the request timeout or used asynchronous background tasks.
  2. GPU Memory Constraints
    • Large LLMs and Whisper can each use significant VRAM.
    • Consider smaller quantized (Q2, etc.) LLMs or chunk-based summarization.
  3. Accuracy vs. Speed
    • The "medium" or "large" Whisper model is more accurate but slower.
    • Summaries can get more coherent using bigger LLMs, but performance can suffer.
  4. Logging & Error Handling
    • Detailed logs ensure you catch issues early (e.g., partial transcripts, AI inference errors).
    • A robust system logs whether GPU is found, load times, and inference performance metrics.
  5. Security & Data Privacy
    • Audio recordings may contain sensitive information.
    • Ensure your container or environment has proper access controls.
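
As an example of the timeout mitigation in point 1, here is a hedged sketch of offloading a long-running job to a FastAPI background task with an in-memory job store; the job bookkeeping and the process_audio placeholder are illustrative, not the project's actual implementation.

```python
import uuid

from fastapi import BackgroundTasks, FastAPI, UploadFile

app = FastAPI()
jobs = {}  # job_id -> status/result; an in-memory store is fine for a single-container demo

def process_audio(job_id: str, audio_bytes: bytes) -> None:
    # Placeholder for the real pipeline: transcribe with Whisper, then summarize with the LLM.
    jobs[job_id] = {"status": "done", "result": {"transcript": "...", "summary": "..."}}

@app.post("/jobs")
async def submit(file: UploadFile, background_tasks: BackgroundTasks):
    # Return immediately with a job id instead of holding the HTTP request open.
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "processing"}
    background_tasks.add_task(process_audio, job_id, await file.read())
    return {"job_id": job_id}

@app.get("/jobs/{job_id}")
def status(job_id: str):
    # Clients poll this endpoint until the job is done.
    return jobs.get(job_id, {"status": "unknown"})
```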

Validation and Examples

  1. Meeting Transcription + Summary
    • A 30-minute internal meeting is processed, producing a transcript of 6,000 words.
    • LLM Summaries: A concise bullet-point list of decisions, tasks, and key insights.
  2. Podcast or Interview
    • Summarize multi-speaker dialogues to highlight important quotes or topics.
    • Possibly split each speaker’s segment, then unify in final text.
  3. Conference Keynote
    • Summaries merged into an “executive summary” with top-level takeaways.

Project Structure

LocalAudioTran-LLM-Summar/
├─ .dockerignore
├─ .env
├─ .gitignore
├─ README.md
├─ docker-compose.yml
├─ Dockerfile
├─ backend/
│  ├─ requirements.txt
│  └─ app/
│     ├─ main.py
│     ├─ services/
│     │  ├─ transcription.py
│     │  ├─ summarization.py
│     │  └─ __init__.py
│     ├─ utils/
│     │  └─ logger.py
│     ├─ models/
│     │  ├─ schemas.py
│     │  └─ __init__.py
│     └─ __init__.py
├─ frontend/
│  ├─ requirements.txt
│  └─ src/
│     └─ app.py
└─ logs/
  • transcription.py loads Whisper, handles file I/O.
  • summarization.py calls your Phi3.5 LLM (Hugging Face Transformers, Ollama, etc.).
  • docker-compose.yml spins up both the app container and the optional GPU-based service (see the compose sketch below).
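
For reference, GPU passthrough in docker-compose follows a standard pattern. This is a hedged sketch, not the repository's actual compose file; the service name, build context, and port are illustrative.

```yaml
# Minimal sketch of reserving NVIDIA GPUs for a compose service.
services:
  app:
    build: .
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```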

Conclusion

By combining OpenAI Whisper (or any speech-to-text engine) with a Large Language Model (Phi 3.5 mini) summarizer inside a Docker container, we’ve built a unified pipeline for turning raw audio into manageable transcripts and actionable summaries. Whether you’re automating meeting minutes or analyzing podcast content, this approach saves countless hours. Feel free to experiment with chunking, smaller models, or advanced summarization prompts. Let me know how it goes!

Happy transcribing and summarizing!
Aresh Sarkari

Following is a list of helpful links:

  • Microsoft Phi 3.5 Model Page – microsoft/Phi-3.5-mini-instruct · Hugging Face
  • OpenAI Whisper Model – openai/whisper-medium · Hugging Face
  • Ollama Model Card Details – phi3.5:3.8b-mini-instruct-q4_K_M
  • NVIDIA Docker Images (Container Engine) – nvidia/cuda – Docker Image | Docker Hub
  • IDE Editor of my choice – Cursor – The AI Code Editor

Building a Comprehensive Image Analysis API with Microsoft Florence-2-large, Chainlit and Docker

8 Jul

In this blog post, we’ll embark on an exciting journey of building a comprehensive Image Analysis API using Microsoft Florence-2-large, Chainlit, and Docker. Image analysis is a fascinating field that involves extracting meaningful information from images using advanced AI techniques. By leveraging the power of Microsoft’s Florence-2-large model, we can create a system that automatically understands the content of an image and performs various analysis tasks such as captioning, object detection, referring expression segmentation, and OCR.

My Florence2 Code Repository askaresh/MS-Florence2 (github.com)

Note – In the past, I have written a blog article on image captioning; you can read more here – Building an Image Captioning API with FastAPI and Hugging Face Transformers packaged with Docker | AskAresh

Model Overview

Hugging Face Link – microsoft/Florence-2-large · Hugging Face

The Microsoft Florence-2-large model is a powerful pre-trained model designed for various image analysis tasks. Developed by Microsoft, this model is part of the Florence family, which is known for its robust performance in computer vision applications. The Florence-2-large model leverages extensive training on a vast dataset of images, enabling it to excel in tasks such as image captioning, object detection, and optical character recognition (OCR).

Key Features of Florence-2-large

  • Multitask Capabilities: The model can perform a wide range of image analysis tasks, including generating captions, detecting objects, segmenting regions, and recognizing text within images.
  • High Accuracy: Trained on diverse and extensive datasets, the Florence-2-large model achieves high accuracy in understanding and analyzing image content.
  • Scalability: Its architecture is designed to scale effectively, making it suitable for integration into various applications and systems.

Why Florence-2-large?

We chose the Florence-2-large model for our Image Analysis API due to its versatility and performance. The model’s ability to handle multiple tasks with high precision makes it an ideal choice for building a comprehensive image analysis system. By leveraging this model, we can ensure that our API delivers accurate and reliable results across different types of image analysis tasks.

Implementation Details

To build our Image Analysis API, we started by setting up a Chainlit project and defining the necessary message handlers. The main handler accepts an image file and processes it through various analysis tasks.

We utilized the pre-trained Florence-2-large model from Hugging Face Transformers for image analysis. This powerful model has been trained on a vast dataset of images and can perform multiple tasks such as image captioning, object detection, and OCR.

To ensure a smooth development experience and ability to run on any cloud, we containerized our application using Docker. This allowed us to encapsulate all the dependencies, including Python libraries and the pre-trained model, into a portable and reproducible environment.

Choosing NVIDIA Docker Image

We specifically chose the NVIDIA CUDA-based Docker image (nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04) for our containerization. This choice was driven by the need to leverage GPU acceleration for the Florence-2-large model, which significantly enhances the performance of image processing tasks. The CUDA-based image ensures compatibility with GPU drivers and provides pre-installed libraries necessary for efficient model execution.

Our project structure looks like this:

MS-FLORENCE2/
│
├── app/
│   ├── __init__.py
│   ├── config.py
│   ├── model.py
│   └── utils.py
│
├── Dockerfile
├── docker-compose.yml
├── .env
├── .gitignore
├── chainlit_app.py
├── requirements.txt
└── logging_config.py

Let’s break down the key components:

  1. chainlit_app.py: This is the heart of our Chainlit application. It defines the message handler that processes uploaded images and generates responses using the Florence model.
  2. app/model.py: This file contains the ModelManager class, which is responsible for loading and managing the Florence-2-large model.
  3. app/utils.py: This file contains utility functions for drawing plot boxes, polygons, and OCR boxes on images.
  4. logging_config.py: This file configures the detailed logging used across the entire project and its modules.
  5. Dockerfile: This file defines how our application is containerized, ensuring all dependencies are properly installed and the environment is consistent. The use of the NVIDIA CUDA-based Docker image ensures compatibility and performance optimization.
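
To illustrate how chainlit_app.py might wire these pieces together, here is a hedged sketch of a Chainlit message handler. The ModelManager call and the task-selection logic are assumptions about the project layout, not the repository's exact code.

```python
import chainlit as cl
from PIL import Image

# from app.model import ModelManager  # hypothetical import matching the project layout
# model_manager = ModelManager()

@cl.on_message
async def on_message(message: cl.Message):
    # Pull the first image attached to the user's message.
    images = [e for e in message.elements if "image" in (e.mime or "")]
    if not images:
        await cl.Message(content="Please attach an image to analyze.").send()
        return

    image = Image.open(images[0].path).convert("RGB")
    task = message.content.strip() or "<CAPTION>"  # e.g. "<OD>", "<OCR>"

    # result = model_manager.run_task(task, image)  # hypothetical call into app/model.py
    result = f"(ran {task} on a {image.size[0]}x{image.size[1]} image)"
    await cl.Message(content=str(result)).send()
```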

Task Prompts and Their Functions

Let’s break down the task prompts used in the Florence-2-large model and explain what each of them does:

  • <CAPTION>
    • Purpose: Generates a simple, concise caption for the image.
    • Output: A brief description of the main elements in the image.
    • Example: “A credit card bill with a price tag on it”
  • <DETAILED_CAPTION>
    • Purpose: Provides a more detailed description of the image.
    • Output: A comprehensive description including more elements and details from the image.
    • Example: “The image shows a credit card bill with a black background. The bill is printed on a white sheet of paper with a blue border and a blue header. The header reads ‘Credit Card Bill’ in bold black font. The bottom of the bill has a space for the customer’s name, address, and contact information.”
  • <OD> Object Detection
    • Purpose: Detects and locates objects within the image.
    • Output: A list of detected objects with their bounding box coordinates and labels.
    • Example: [{‘bboxes’: [[x1, y1, x2, y2], …], ‘labels’: [‘credit card’, ‘price tag’, …]}]
  • <OCR>
    • Purpose: Performs Optical Character Recognition on the image.
    • Output: Extracted text from the image.
    • Example: “Credit Card Bill\nName: John Doe\nAddress: 123 Main St…”
  • <CAPTION_TO_PHRASE_GROUNDING>
    • Purpose: Locates specific phrases or objects mentioned in the caption within the image.
    • Input: Requires a caption (usually the output from <CAPTION>) as additional text input.
    • Output: Bounding boxes and labels for phrases/objects from the caption found in the image.
    • Example: [{‘bboxes’: [[x1, y1, x2, y2], …], ‘labels’: [‘credit card’, ‘price tag’, …]}]
  • <DENSE_REGION_CAPTION>
    • Purpose: Generates captions for specific regions within the image.
    • Output: A list of regions with their bounding boxes and corresponding captions.
    • Example: [{‘bboxes’: [[x1, y1, x2, y2], …], ‘labels’: [‘Header with Credit Card Bill text’, ‘Customer information section’, …]}]
  • <REGION_PROPOSAL>
    • Purpose: Suggests regions of interest within the image without labeling them.
    • Output: A list of bounding boxes for potentially important regions in the image.
    • Example: {‘bboxes’: [[x1, y1, x2, y2], …], ‘labels’: [”, ”, …]}
  • <MORE_DETAILED_CAPTION>
    • Purpose: Generates an even more comprehensive description of the image than <DETAILED_CAPTION>.
    • Output: A very detailed narrative of the image, often including subtle details and potential interpretations.
    • Example: “The image displays a credit card bill document against a stark black background. The bill itself is printed on crisp white paper, framed by a professional-looking blue border. At the top, a bold blue header prominently declares ‘Credit Card Bill’ in a large, easy-to-read font. Below this, the document is structured into clear sections, likely detailing transactions, fees, and payment information. At the bottom of the bill, there’s a designated area for customer details, including name, address, and possibly account information. The contrast between the white document and black background gives the image a formal, official appearance, emphasizing the importance of the financial information presented.”
  • <REFERRING_EXPRESSION_SEGMENTATION>
    • Purpose: Segments the image based on a textual description of a specific object or region.
    • Input: Requires a textual description as additional input.
    • Output: A segmentation mask for the described object or region.
  • <REGION_TO_SEGMENTATION>
    • Purpose: Generates a segmentation mask for a specified region in the image.
    • Input: Requires coordinates of the region of interest.
    • Output: A segmentation mask for the specified region.
  • <OPEN_VOCABULARY_DETECTION>
    • Purpose: Detects objects in the image based on user-specified categories.
    • Input: Can accept a list of categories to look for.
    • Output: Bounding boxes and labels for detected objects matching the specified categories.
  • <REGION_TO_CATEGORY>
    • Purpose: Classifies a specific region of the image into a category.
    • Input: Requires coordinates of the region of interest.
    • Output: A category label for the specified region.
  • <REGION_TO_DESCRIPTION>
    • Purpose: Generates a detailed description of a specific region in the image.
    • Input: Requires coordinates of the region of interest.
    • Output: A textual description of the contents of the specified region.
  • <OCR_WITH_REGION>
    • Purpose: Performs OCR on specific regions of the image.
    • Output: Extracted text along with the corresponding regions (bounding boxes) where the text was found.

These task prompts allow us to leverage the Florence-2-large model’s capabilities for various image analysis tasks. By combining these prompts, we can create a comprehensive analysis of an image, from basic captioning to detailed object detection and text recognition. Understanding and effectively utilizing these task prompts was crucial in maximizing the potential of the Florence-2-large model in our project.
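
For reference, running any of these task prompts follows the same pattern as the sample usage on the public Florence-2 model card. Here is a minimal sketch; the input file name is illustrative, and you may need to adapt dtype/device handling to your environment.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "microsoft/Florence-2-large"

model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("credit_card_bill.png").convert("RGB")  # illustrative file name
task = "<OD>"  # any of the task prompts listed above

inputs = processor(text=task, images=image, return_tensors="pt").to(device)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# post_process_generation parses the raw output into bboxes/labels/text for the given task.
parsed = processor.post_process_generation(
    generated_text, task=task, image_size=(image.width, image.height)
)
print(parsed)
```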

Lessons Learned and Debugging

Throughout the development of our Florence Image Analysis project, I encountered several challenges and learned valuable lessons:

  • Flash Attention Challenges: One of the most significant hurdles we faced was integrating flash-attn into our project. Initially, we encountered installation issues and compatibility problems with our CUDA setup. We learned that:
    • Flash-attn requires specific CUDA versions and can be sensitive to the exact configuration of the environment.
    • Note: we moved to an NVIDIA-based Docker image to take care of all the prerequisites specific to CUDA/flash-attention and the interoperability of versions, which helped tremendously.
    • Building flash-attn from source was often necessary to ensure compatibility with our specific environment.
    • Using the --no-build-isolation flag during installation helped resolve some dependency conflicts. Solution: We ended up creating a custom build process in our Dockerfile, ensuring all dependencies were correctly installed before attempting to install flash-attn.
  • Segmentation and OCR with Region Iterations: Implementing effective OCR, especially with region detection, proved to be an iterative process:
    • Initially, we tried using the Florence model for general OCR, but found it lacking in accuracy for structured documents.
    • We experimented with pre-processing steps to detect distinct regions in documents (headers, body, footer) before applying OCR.
    • Balancing between processing speed and accuracy was a constant challenge. Solution: We implemented a custom region detection algorithm that identifies potential text blocks before applying OCR. This improved both accuracy and processing speed.
  • Error Handling and Logging: As the project grew more complex, we realized the importance of robust error handling and comprehensive logging:
    • Initially, errors in model processing would crash the entire application.
    • Debugging was challenging without detailed logs. Solution: We implemented try-except blocks throughout the code, added detailed logging, and created a system to gracefully handle and report errors to users.
  • Optimizing for Different Document Types: We found that the performance of our system varied significantly depending on the type of document being processed:
    • Handwritten documents required different preprocessing than printed text.
    • Certain document layouts (e.g., tables, multi-column text) posed unique challenges. Solution: We implemented a document type detection step and adjusted our processing pipeline based on the detected type.
  • Balancing Between Flexibility and Specialization: While I aimed to create a general-purpose image analysis tool, we found that specializing for certain tasks greatly improved performance:
    • We created separate processing paths for tasks like receipt OCR, business card analysis, and general document processing. Solution: We implemented a modular architecture that allows for easy addition of specialized processing pipelines while maintaining a common core.

These lessons significantly improved the robustness and effectiveness of our Florence Image Analysis project.

Validation of the API with real Examples

After the container is up and running, users can access the Chainlit interface at http://localhost:8010. Here’s an example of how to use the API:

Example – <CAPTION>

Example – <MORE_DETAILED_CAPTION>

Example – <OCR>

Example – <OCR_WITH_REGION>

Model GPU – VRAM Consumption

Following is a list of helpful links:

  • Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks – Microsoft Research
  • Sample Google Colab Notebook (Florence-2) – sample_inference.ipynb · microsoft/Florence-2-large at main (huggingface.co)
  • Research paper (direct link) – [2311.06242] Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (arxiv.org)
  • Florence 2 Inference Chainlit – Mervin Praison

Conclusion

Building this comprehensive Image Analysis API with Florence-2-large, Chainlit, and Docker has been an incredible learning experience. I must have spent at least a week getting all the features and functionality working within a Docker image. By leveraging the power of advanced AI models and containerization, we created a scalable and efficient solution for performing various image analysis tasks automatically. Through this project, we gained valuable insights into model management, error handling, GPU utilization in containerized environments, and designing interactive UIs for AI applications.

I hope that this blog post has provided you with a comprehensive overview of our Image Analysis API project and inspired you to explore the fascinating world of computer vision. Feel free to check out our GitHub repository, try out the API, and let me know if you have any questions or suggestions!

Thanks,
Aresh Sarkari

Unlocking the Power of Multimodal AI: A Deep Dive into LLaVA and LLaMA 3 – Demo in LM Studio

23 May

In my earlier post we explored uncensored LLMs like Dolphin. Today, we shall look into the intersection of visual and language understanding: what happens when a vision model and an LLM are brought together. One such innovation is LLaVA (Large Language and Visual Assistant), an open-source generative AI model that combines the strengths of vision encoders and large language models to create a powerful tool for general-purpose visual and language understanding. In this blog post, we’ll delve into the details of LLaVA, its underlying models, and how you can harness its capabilities using LMStudio.

What is LLaVA?

🖼️ LLaVA is a novel, end-to-end trained large multimodal model that integrates a pre-trained CLIP ViT-L/14 visual encoder with the Vicuna large language model. The integration is achieved through a projection matrix, enabling seamless interaction between visual and language data. LLaVA is designed to excel in both daily user-oriented applications and specialized domains such as science, offering a versatile tool for multimodal reasoning and instruction-following tasks.

What is LLaMA 3?

🧠 LLaMA 3 is the third iteration of the Large Language Model from Meta AI, known for its remarkable language understanding and generation capabilities. LLaMA 3 builds upon its predecessors with improved architecture, enhanced training techniques, and a broader dataset, making it one of the most advanced language models available. In the context of LLaVA, LLaMA 3 serves as the foundation for the language model component, providing robust support for complex conversational and reasoning tasks.

How to Run the Model Locally Using LMStudio

💻 Running LLaVA locally using LMStudio is a straightforward process that allows you to leverage the model’s capabilities on your own hardware. Here’s a step-by-step guide to get you started:

  • Setup Your Environment
    • Install LM Studio: The software is available on Windows, Mac, and Linux. It allows you to manage and run local LLMs without having to set up Python, machine learning, or Transformers libraries yourself. Link to download the Windows bits – LM Studio – Discover, download, and run local LLMs
  • Download the Model and Dependencies
    • The best place to keep track of models is Hugging Face – Models – Hugging Face. You can follow model releases and updates there.
    • Copy the model name from Hugging Face – xtuner/llava-llama-3-8b-v1_1-gguf
    • Paste this name in LM Studio and it will list out all the quantized models
    • In my case, given my hardware configuration, I selected the int4 model. Please note: the lower the quantization, the less accurate the model.
    • Obtain the LLaVA model files, including the quantized GGUF version and MMProj files, from the official repository.
    • Download of the model will take time depending upon your internet connection.
  • Prepare the Model for Running:
    • Within LMStudio click on the Chat interface to configure model settings.
    • Select the model from the drop down list – llava llama 3 v int4 GGUF
    • You will be able to run it stock but I like to configure the Advanced Configurations
    • Adjust the model settings to match your hardware capabilities and specific requirements.
    • Based on your system, set the GPU offload to 50/50 or max; I set mine to max.
    • Click Reload model to apply the configuration.
  • Run Inference: Start the model and begin running inference tasks, whether for visual chat, science QA, or other applications.
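
If you prefer to script against the model rather than use the chat window, LM Studio can also expose an OpenAI-compatible local server (by default on port 1234). The snippet below is a hedged sketch only; the endpoint, model identifier, and vision payload format are assumptions about that server, not steps from this walkthrough.

```python
import base64

import requests

with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "xtuner/llava-llama-3-8b-v1_1-gguf",  # use the identifier shown in LM Studio
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
}
# Assumes the LM Studio local server is running on its default port.
response = requests.post("http://localhost:1234/v1/chat/completions", json=payload, timeout=120)
print(response.json()["choices"][0]["message"]["content"])
```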

Note – If there is enough interest, I can also do an extended blog post on a Dockerized version of this model. Leave comments down below.

What are MMProj Files?

📂 MMProj files are a key component in the LLaVA ecosystem, representing multimodal projection matrices that facilitate the alignment between visual and language features. These files are crucial for the seamless integration of visual encoders and language models, enabling LLaVA to effectively interpret and generate content that spans both modalities. MMProj files are fine-tuned during the model’s training process to ensure optimal performance in various applications.

What is the Quantized GGUF Version of LLaVA?

💾 The quantized GGUF (GPT-Generated Unified Format) version of LLaVA represents a compressed and optimized variant of the model, enabling efficient deployment on consumer-grade hardware. Quantization reduces the precision of the model’s weights, significantly decreasing the memory footprint and computational requirements while maintaining a high level of performance. This makes the quantized GGUF version ideal for applications where resource constraints are a concern.

Testing the Model

🧪 Testing showcases the beauty of the LLaVA model; look at the level of detail it provides in the example images.

Example 1

Example 2

Through rigorous testing and validation, LLaVA continues to demonstrate its potential as a versatile and powerful multimodal model.

Reference Links

Following is a list of helpful links:

  • LLaVA GitHub Page – LLaVA (llava-vl.github.io)
  • Microsoft Research Paper – LLaVA: Large Language and Vision Assistant – Microsoft Research
  • Hugging Face GGUF model – xtuner/llava-llama-3-8b-v1_1-gguf · Hugging Face
  • Visual Instruction Tuning (arXiv) – [2304.08485] Visual Instruction Tuning (arxiv.org)

🌐 LLaVA represents a significant advancement in the field of multimodal AI, combining powerful visual and language understanding capabilities in a single, efficient model. By leveraging the strengths of LLaMA 3 and innovative techniques like quantization and multimodal projection, LLaVA offers a robust tool for a wide range of applications. Whether you’re a researcher, developer, or enthusiast, exploring the potential of LLaVA can open up new possibilities in the realm of AI-driven interaction and understanding.

By following the steps outlined in this post, you can get started with LLaVA and begin harnessing its capabilities for your own projects. Please let me know if I’ve missed any steps or details, and I’ll be happy to update the post.

Thanks,
Aresh Sarkari

Building an Image Captioning API with FastAPI and Hugging Face Transformers packaged with Docker

17 Apr

In this blog post, we’ll embark on an exciting journey of building an Image Captioning API using FastAPI and Hugging Face Transformers. Image captioning is a fascinating task that involves generating textual descriptions for given images. By leveraging the power of deep learning and natural language processing, we can create a system that automatically understands the content of an image and generates human-like captions. In the example below, I input an image of a rider on a bike in a garage, and the generated caption captures the details of the image accurately.

Project Overview

👨‍💻 GitHub: https://github.com/askaresh/blip-image-captioning-api

The goal of this project is to develop a RESTful API that accepts an image as input and returns a generated caption describing the image. We’ll be using FastAPI, a modern and fast web framework for building APIs, along with Hugging Face Transformers, a popular library for natural language processing tasks.

The key components of our project include:

  1. FastAPI: A web framework for building efficient and scalable APIs in Python.
  2. Hugging Face Transformers: A library that provides state-of-the-art pre-trained models for various NLP tasks, including image captioning.
  3. Docker: A containerization platform that allows us to package our application and its dependencies into a portable and reproducible environment.

Implementation Details

To build our Image Captioning API, we started by setting up a FastAPI project and defining the necessary endpoints. The main endpoint accepts an image file and an optional text input for conditional image captioning.

We utilized the pre-trained BLIP (Bootstrapping Language-Image Pre-training) model from Hugging Face Transformers for image captioning. BLIP is a powerful model that has been trained on a large dataset of image-caption pairs and achieves impressive results in generating accurate and coherent captions.

To ensure a smooth development experience and the ability to run on any cloud, I containerized the application using Docker. This allowed us to encapsulate all the dependencies, including Python libraries and the pre-trained model, into a portable and reproducible environment.

HF-IT-DOCKER/
├── app/
│   ├── config.py
│   ├── main.py
│   ├── model.py
│   └── utils.py
├── .dockerignore
├── .gitignore
├── compose.yaml
├── Dockerfile
├── logging.conf
├── README.Docker.md
└── requirements.txt

Detailed description of each file:

  • app/config.py:
    • This file contains the configuration settings for the application.
    • It defines a Settings class using the pydantic_settings library to store and manage application-specific settings.
    • The blip_model_name setting specifies the name of the BLIP model to be used for image captioning.
  • app/main.py:
    • This is the main entry point of the FastAPI application.
    • It sets up the FastAPI app, loads the BLIP model, and configures logging.
    • It defines the API endpoints, including the root path (“/”) and the image captioning endpoint (“/caption”).
    • The “/caption” endpoint accepts an image file and an optional text input, processes the image, generates a caption using the BLIP model, and returns the generated caption.
  • app/model.py:
    • This file contains the functions related to loading and using the BLIP model for image captioning.
    • The load_model function loads the pre-trained BLIP model and processor based on the specified model name.
    • The generate_caption function takes an image and optional text input, preprocesses the inputs, and generates a caption using the loaded BLIP model.
  • app/utils.py:
    • This file contains utility functions used in the project.
    • The load_image_from_file function reads an image file and converts it to the appropriate format (RGB) using the PIL library.
  • .dockerignore:
    • This file specifies the files and directories that should be excluded when building the Docker image.
    • It helps to reduce the size of the Docker image by excluding unnecessary files and directories.
  • .gitignore:
    • This file specifies the files and directories that should be ignored by Git version control.
    • It helps to keep the repository clean by excluding files that are not necessary to track, such as generated files, cache files, and environment-specific files.
  • compose.yaml:
    • This file contains the configuration for Docker Compose, which is used to define and run multi-container Docker applications.
    • It defines the services, including the FastAPI server, and specifies the build context, ports, and any necessary dependencies.
  • Dockerfile:
    • This file contains the instructions for building the Docker image for the FastAPI application.
    • It specifies the base image, sets up the working directory, installs dependencies, copies the application code, and defines the entry point for running the application.
  • logging.conf:
    • This file contains the configuration for the Python logging system.
    • It defines the loggers, handlers, formatters, and their respective settings.
    • It specifies the log levels, log file paths, and log message formats.
  • README.Docker.md:
    • This file provides documentation and instructions specific to running the application using Docker.
    • It may include information on how to build the Docker image, run the container, and any other Docker-related details.
  • requirements.txt:
    • This file lists the Python dependencies required by the application.
    • It includes the necessary libraries and their versions, such as FastAPI, Hugging Face Transformers, PIL, and others.
    • It is used by pip to install the required packages when building the Docker image or setting up the development environment.
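
To illustrate the load/generate flow described for app/model.py, here is a hedged sketch based on the Hugging Face BLIP examples; the model name, function names, and signatures are assumptions, not the repository's exact code.

```python
from typing import Optional

from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

MODEL_NAME = "Salesforce/blip-image-captioning-base"  # assumed value of blip_model_name

processor = BlipProcessor.from_pretrained(MODEL_NAME)
model = BlipForConditionalGeneration.from_pretrained(MODEL_NAME)

def generate_caption(image: Image.Image, text: Optional[str] = None) -> str:
    # Optional `text` enables conditional captioning (the caption continues the given prefix).
    if text:
        inputs = processor(image, text, return_tensors="pt")
    else:
        inputs = processor(image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=50)
    return processor.decode(output_ids[0], skip_special_tokens=True)

if __name__ == "__main__":
    img = Image.open("rider_in_garage.jpg").convert("RGB")  # illustrative file name
    print(generate_caption(img))
```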

Lessons Learned and Debugging

Throughout the development process, I encountered several challenges and learned valuable lessons:

  1. Dependency Management: Managing dependencies can be tricky, especially when working with large pre-trained models. We learned the importance of properly specifying dependencies in our requirements file and using Docker to ensure consistent environments across different systems.
  2. Debugging Permission Issues: We encountered permission-related issues when running our application inside a Docker container. Through debugging, we learned the significance of properly setting file and directory permissions and running the container as a non-root user to enhance security.
  3. Logging Configuration: Proper logging is crucial for understanding the behavior of our application and troubleshooting issues. I learned how to configure logging using a configuration file and ensure that log files are written to directories with appropriate permissions.
  4. Testing and Error Handling: Comprehensive testing and error handling are essential for building a robust API. We implemented thorough error handling to provide meaningful error messages to API users and conducted extensive testing to ensure the reliability of our image captioning functionality.

Validation of the API

After the container is up and running, go to http://localhost:8004/docs, select the POST method, and click "Try it out". Upload any image of your choice, enter the optional text, and click Execute. The generated caption will appear below as the output.
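
If you would rather call the endpoint from code than from the Swagger UI, a request might look like this hedged example; the form field names ("file", "text") are assumptions about the API schema.

```python
import requests

with open("rider_in_garage.jpg", "rb") as f:
    response = requests.post(
        "http://localhost:8004/caption",
        files={"file": f},
        data={"text": "a photo of"},  # optional conditioning text
    )
print(response.json())
```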

Conclusion

Building an Image Captioning API with FastAPI and Hugging Face Transformers has been an incredible learning experience. By leveraging the power of pre-trained models and containerization, I created a scalable and efficient solution for generating image captions automatically.

Through this project, I gained valuable insights into dependency management, debugging permission issues, logging configuration, and the importance of testing and error handling. These lessons will undoubtedly be applicable to future projects and contribute to our growth as developers.

I hope that this blog post has provided you with a comprehensive overview of our Image Captioning API project and inspired you to explore the fascinating world of image captioning and natural language processing. Feel free to reach out with any questions or suggestions, and happy captioning!

Thanks,
Aresh Sarkari