
Building a Comprehensive Image Analysis API with Microsoft Florence-2-large, Chainlit and Docker

In this blog post, we’ll embark on an exciting journey of building a comprehensive Image Analysis API using Microsoft Florence-2-large, Chainlit, and Docker. Image analysis is a fascinating field that involves extracting meaningful information from images using advanced AI techniques. By leveraging the power of Microsoft’s Florence-2-large model, we can create a system that automatically understands the content of an image and performs various analysis tasks such as captioning, object detection, referring expression segmentation, and OCR.

My Florence2 Code Repository askaresh/MS-Florence2 (github.com)

Note – In the past I have written a blog article on Image Captioning; you can read more here – Building an Image Captioning API with FastAPI and Hugging Face Transformers packaged with Docker | AskAresh

Model Overview

Hugging Face Link – microsoft/Florence-2-large · Hugging Face

The Microsoft Florence-2-large model is a powerful pre-trained model designed for various image analysis tasks. Developed by Microsoft, this model is part of the Florence family, which is known for its robust performance in computer vision applications. The Florence-2-large model leverages extensive training on a vast dataset of images, enabling it to excel in tasks such as image captioning, object detection, and optical character recognition (OCR).

Key Features of Florence-2-large

Why Florence-2-large?

We chose the Florence-2-large model for our Image Analysis API due to its versatility and performance. The model’s ability to handle multiple tasks with high precision makes it an ideal choice for building a comprehensive image analysis system. By leveraging this model, we can ensure that our API delivers accurate and reliable results across different types of image analysis tasks.

Implementation Details

To build our Image Analysis API, we started by setting up a Chainlit project and defining the necessary message handlers. The main handler accepts an image file and processes it through various analysis tasks.

We utilized the pre-trained Florence-2-large model from Hugging Face Transformers for image analysis. This powerful model has been trained on a vast dataset of images and can perform multiple tasks such as image captioning, object detection, and OCR.
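As a rough sketch of what loading the model from Hugging Face looks like (the model ID and `trust_remote_code=True` follow the model card; the dtype and device choices here are my assumptions, and the repository’s ModelManager may differ):

```python
MODEL_ID = "microsoft/Florence-2-large"


def load_florence(device: str = "cuda"):
    """Load the Florence-2-large model and its processor from Hugging Face.

    trust_remote_code=True is required because Florence-2 ships custom
    modeling code alongside its checkpoint.
    """
    # Imports kept inside the function so this sketch stays importable
    # without torch/transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    dtype = torch.float16 if device == "cuda" else torch.float32
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=dtype, trust_remote_code=True
    ).to(device)
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    return model, processor
```

The processor handles both the image and the task-prompt text, so a single pair of objects serves every analysis task.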

To ensure a smooth development experience and the ability to run on any cloud, we containerized our application using Docker. This allowed us to encapsulate all the dependencies, including Python libraries and the pre-trained model, into a portable and reproducible environment.

Choosing NVIDIA Docker Image

We specifically chose the NVIDIA CUDA-based Docker image (nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04) for our containerization. This choice was driven by the need to leverage GPU acceleration for the Florence-2-large model, which significantly enhances the performance of image processing tasks. The CUDA-based image ensures compatibility with GPU drivers and provides pre-installed libraries necessary for efficient model execution.
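To illustrate, a minimal Dockerfile along these lines would work on top of that base image. The `FROM` line matches the image named above; everything else (package choices, port, launch command) is a sketch of a typical Chainlit setup, not a copy of the repository’s Dockerfile:

```dockerfile
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04

# Python and pip are not bundled with the CUDA base image
RUN apt-get update && apt-get install -y python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY . .

# Chainlit serves its UI on this port (matches the localhost:8010 URL used later)
EXPOSE 8010
CMD ["chainlit", "run", "chainlit_app.py", "--host", "0.0.0.0", "--port", "8010"]
```

At runtime the container also needs GPU access, e.g. `--gpus all` with `docker run` or the equivalent `deploy` settings in docker-compose.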

Our project structure looks like this:

MS-FLORENCE2/
│
├── app/
│   ├── __init__.py
│   ├── config.py
│   ├── model.py
│   └── utils.py
│
├── Dockerfile
├── docker-compose.yml
├── .env
├── .gitignore
├── chainlit_app.py
├── requirements.txt
└── logging_config.py

Let’s break down the key components:

  1. chainlit_app.py: This is the heart of our Chainlit application. It defines the message handler that processes uploaded images and generates responses using the Florence model.
  2. app/model.py: This file contains the ModelManager class, which is responsible for loading and managing the Florence-2-large model.
  3. app/utils.py: This file contains utility functions for drawing plot boxes, polygons, and OCR boxes on images.
  4. logging_config.py: This file configures detailed logging for the entire project and its modules.
  5. Dockerfile: This file defines how our application is containerized, ensuring all dependencies are properly installed and the environment is consistent. The use of the NVIDIA CUDA-based Docker image ensures compatibility and performance optimization.
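The ModelManager mentioned above essentially exists so the heavy model load happens once and is reused across requests. Here is a minimal sketch of that lazy-loading pattern; the loader is injected as a stand-in (the real class would call `AutoModelForCausalLM.from_pretrained(...)` instead), so this is illustrative rather than the repository’s code:

```python
class ModelManager:
    """Load the model once and reuse it across requests."""

    def __init__(self, loader):
        self._loader = loader   # stand-in for the expensive transformers load
        self._model = None

    def get_model(self):
        # Lazy initialisation: the expensive load runs on first access only.
        if self._model is None:
            self._model = self._loader()
        return self._model


# Usage: the loader runs exactly once, however many times the model is fetched.
calls = []
manager = ModelManager(lambda: calls.append(1) or "florence-2-large")
first = manager.get_model()
second = manager.get_model()
print(first, len(calls))  # → florence-2-large 1
```

Keeping a single loaded instance matters here because Florence-2-large occupies several gigabytes of VRAM; reloading it per request would be prohibitively slow.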

Task Prompts and Their Functions

Let’s break down the task prompts used in the Florence-2-large model and explain what each of them does:

These task prompts allow us to leverage the Florence-2-large model’s capabilities for various image analysis tasks. By combining these prompts, we can create a comprehensive analysis of an image, from basic captioning to detailed object detection and text recognition. Understanding and effectively utilizing these task prompts was crucial in maximizing the potential of the Florence-2-large model in our project.
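As a quick reference, the task tokens can be kept in a small lookup, with a helper that composes the text input passed to the processor. The tokens below are the ones Florence-2’s model card documents; the dictionary descriptions and the `build_prompt` helper are my own illustrative sketch, not code from the repository:

```python
# Task tokens recognised by Florence-2; the token itself is the prompt.
TASK_PROMPTS = {
    "<CAPTION>": "Short one-line description of the image",
    "<DETAILED_CAPTION>": "Longer, more descriptive caption",
    "<MORE_DETAILED_CAPTION>": "Most verbose caption variant",
    "<OD>": "Object detection: labels plus bounding boxes",
    "<OCR>": "Plain-text transcription of text in the image",
    "<OCR_WITH_REGION>": "OCR plus a bounding box for each text region",
    "<REFERRING_EXPRESSION_SEGMENTATION>": "Segment the region a phrase refers to",
}


def build_prompt(task: str, text: str = "") -> str:
    """Compose the text input for the processor.

    Tasks such as <REFERRING_EXPRESSION_SEGMENTATION> expect extra text
    appended directly after the task token; most tasks use the token alone.
    """
    if task not in TASK_PROMPTS:
        raise ValueError(f"Unknown task token: {task}")
    return task + text


print(build_prompt("<OCR>"))  # → <OCR>
print(build_prompt("<REFERRING_EXPRESSION_SEGMENTATION>", "a red car"))
```

The same task token is then passed again to the processor’s post-processing step so the raw generation is decoded into the task-specific structure (caption text, boxes, or polygons).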

Lessons Learned and Debugging

Throughout the development of our Florence Image Analysis project, I encountered several challenges and learned valuable lessons:

These lessons significantly improved the robustness and effectiveness of our Florence Image Analysis project.

Validation of the API with real Examples

After the container is up and running, users can access the Chainlit interface at http://localhost:8010. Here’s an example of how to use the API:

Example – <Caption>

Example – <MORE_DETAILED_CAPTION>

Example – <OCR>

Example – <OCR_WITH_REGION>

Model GPU – VRAM Consumption

The following is a list of helpful links:

  * Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks – Microsoft Research
  * Sample Google Colab Notebook for Florence-2 – sample_inference.ipynb · microsoft/Florence-2-large at main (huggingface.co)
  * Research paper (direct link) – [2311.06242] Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (arxiv.org)
  * Florence 2 Inference Chainlit – Mervin Praison

Conclusion

Building this comprehensive Image Analysis API with Florence-2-large, Chainlit, and Docker has been an incredible learning experience. I must have spent at least a week getting to this point, with all the features and functionality working within a Docker image. By leveraging the power of advanced AI models and containerization, we created a scalable and efficient solution for performing various image analysis tasks automatically. Through this project, we gained valuable insights into model management, error handling, GPU utilization in containerized environments, and designing interactive UIs for AI applications.

I hope that this blog post has provided you with a comprehensive overview of our Image Analysis API project and inspired you to explore the fascinating world of computer vision. Feel free to check out our GitHub repository, try out the API, and let me know if you have any questions or suggestions!

Thanks,
Aresh Sarkari
