
Building a Comprehensive Image Analysis API with Microsoft Florence-2-large, Chainlit and Docker

In this blog post, we’ll embark on an exciting journey of building a comprehensive Image Analysis API using Microsoft Florence-2-large, Chainlit, and Docker. Image analysis is a fascinating field that involves extracting meaningful information from images using advanced AI techniques. By leveraging the power of Microsoft’s Florence-2-large model, we can create a system that automatically understands the content of an image and performs various analysis tasks such as captioning, object detection, referring expression segmentation, and OCR.

My Florence2 Code Repository askaresh/MS-Florence2 (github.com)

Note – In the past I have written a blog article on Image Captioning; you can read more here – Building an Image Captioning API with FastAPI and Hugging Face Transformers packaged with Docker | AskAresh

Model Overview

Hugging Face Link – microsoft/Florence-2-large · Hugging Face

The Microsoft Florence-2-large model is a powerful pre-trained model designed for various image analysis tasks. Developed by Microsoft, this model is part of the Florence family, which is known for its robust performance in computer vision applications. The Florence-2-large model leverages extensive training on a vast dataset of images, enabling it to excel in tasks such as image captioning, object detection, and optical character recognition (OCR).

Key Features of Florence-2-large

Why Florence-2-large?

We chose the Florence-2-large model for our Image Analysis API due to its versatility and performance. The model’s ability to handle multiple tasks with high precision makes it an ideal choice for building a comprehensive image analysis system. By leveraging this model, we can ensure that our API delivers accurate and reliable results across different types of image analysis tasks.

Implementation Details

To build our Image Analysis API, we started by setting up a Chainlit project and defining the necessary message handlers. The main handler accepts an image file and processes it through various analysis tasks.

We utilized the pre-trained Florence-2-large model from Hugging Face Transformers for image analysis. This powerful model has been trained on a vast dataset of images and can perform multiple tasks such as image captioning, object detection, and OCR.
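As a rough sketch of what loading the model from Hugging Face looks like (the model ID and `trust_remote_code=True` follow the model card; the dtype and device choices here are my assumptions, and the repository’s ModelManager may differ):

```python
MODEL_ID = "microsoft/Florence-2-large"


def load_florence(device: str = "cuda"):
    """Load the Florence-2-large model and its processor from Hugging Face.

    trust_remote_code=True is required because Florence-2 ships custom
    modeling code alongside its checkpoint.
    """
    # Imports kept inside the function so this sketch stays importable
    # without torch/transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    dtype = torch.float16 if device == "cuda" else torch.float32
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=dtype, trust_remote_code=True
    ).to(device)
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    return model, processor
```

The processor handles both the image and the task-prompt text, so a single pair of objects serves every analysis task.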

To ensure a smooth development experience and the ability to run on any cloud, we containerized our application using Docker. This allowed us to encapsulate all the dependencies, including Python libraries and the pre-trained model, into a portable and reproducible environment.

Choosing NVIDIA Docker Image

We specifically chose the NVIDIA CUDA-based Docker image (nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04) for our containerization. This choice was driven by the need to leverage GPU acceleration for the Florence-2-large model, which significantly enhances the performance of image processing tasks. The CUDA-based image ensures compatibility with GPU drivers and provides pre-installed libraries necessary for efficient model execution.
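To illustrate, a minimal Dockerfile along these lines would work on top of that base image. The `FROM` line matches the image named above; everything else (package choices, port, launch command) is a sketch of a typical Chainlit setup, not a copy of the repository’s Dockerfile:

```dockerfile
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04

# Python and pip are not bundled with the CUDA base image
RUN apt-get update && apt-get install -y python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY . .

# Chainlit serves its UI on this port (matches the localhost:8010 URL used later)
EXPOSE 8010
CMD ["chainlit", "run", "chainlit_app.py", "--host", "0.0.0.0", "--port", "8010"]
```

At runtime the container also needs GPU access, e.g. `--gpus all` with `docker run` or the equivalent `deploy` settings in docker-compose.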

Our project structure looks like this:

MS-FLORENCE2/
│
├── app/
│   ├── __init__.py
│   ├── config.py
│   ├── model.py
│   └── utils.py
│
├── Dockerfile
├── docker-compose.yml
├── .env
├── .gitignore
├── chainlit_app.py
├── requirements.txt
└── logging_config.py

Let’s break down the key components:

  1. chainlit_app.py: This is the heart of our Chainlit application. It defines the message handler that processes uploaded images and generates responses using the Florence model.
  2. app/model.py: This file contains the ModelManager class, which is responsible for loading and managing the Florence-2-large model.
  3. app/utils.py: This file contains utility functions for drawing plot boxes, polygons, and OCR boxes on images.
  4. logging_config.py: This file configures detailed logging for the entire project and its modules.
  5. Dockerfile: This file defines how our application is containerized, ensuring all dependencies are properly installed and the environment is consistent. The use of the NVIDIA CUDA-based Docker image ensures compatibility and performance optimization.
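The ModelManager mentioned above essentially exists so the heavy model load happens once and is reused across requests. Here is a minimal sketch of that lazy-loading pattern; the loader is injected as a stand-in (the real class would call `AutoModelForCausalLM.from_pretrained(...)` instead), so this is illustrative rather than the repository’s code:

```python
class ModelManager:
    """Load the model once and reuse it across requests."""

    def __init__(self, loader):
        self._loader = loader   # stand-in for the expensive transformers load
        self._model = None

    def get_model(self):
        # Lazy initialisation: the expensive load runs on first access only.
        if self._model is None:
            self._model = self._loader()
        return self._model


# Usage: the loader runs exactly once, however many times the model is fetched.
calls = []
manager = ModelManager(lambda: calls.append(1) or "florence-2-large")
first = manager.get_model()
second = manager.get_model()
print(first, len(calls))  # → florence-2-large 1
```

Keeping a single loaded instance matters here because Florence-2-large occupies several gigabytes of VRAM; reloading it per request would be prohibitively slow.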

Task Prompts and Their Functions

Let’s break down the task prompts used in the Florence-2-large model and explain what each of them does:

These task prompts allow us to leverage the Florence-2-large model’s capabilities for various image analysis tasks. By combining these prompts, we can create a comprehensive analysis of an image, from basic captioning to detailed object detection and text recognition. Understanding and effectively utilizing these task prompts was crucial in maximizing the potential of the Florence-2-large model in our project.
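As a quick reference, the task tokens can be kept in a small lookup, with a helper that composes the text input passed to the processor. The tokens below are the ones Florence-2’s model card documents; the dictionary descriptions and the `build_prompt` helper are my own illustrative sketch, not code from the repository:

```python
# Task tokens recognised by Florence-2; the token itself is the prompt.
TASK_PROMPTS = {
    "<CAPTION>": "Short one-line description of the image",
    "<DETAILED_CAPTION>": "Longer, more descriptive caption",
    "<MORE_DETAILED_CAPTION>": "Most verbose caption variant",
    "<OD>": "Object detection: labels plus bounding boxes",
    "<OCR>": "Plain-text transcription of text in the image",
    "<OCR_WITH_REGION>": "OCR plus a bounding box for each text region",
    "<REFERRING_EXPRESSION_SEGMENTATION>": "Segment the region a phrase refers to",
}


def build_prompt(task: str, text: str = "") -> str:
    """Compose the text input for the processor.

    Tasks such as <REFERRING_EXPRESSION_SEGMENTATION> expect extra text
    appended directly after the task token; most tasks use the token alone.
    """
    if task not in TASK_PROMPTS:
        raise ValueError(f"Unknown task token: {task}")
    return task + text


print(build_prompt("<OCR>"))  # → <OCR>
print(build_prompt("<REFERRING_EXPRESSION_SEGMENTATION>", "a red car"))
```

The same task token is then passed again to the processor’s post-processing step so the raw generation is decoded into the task-specific structure (caption text, boxes, or polygons).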

Lessons Learned and Debugging

Throughout the development of our Florence Image Analysis project, I encountered several challenges and learned valuable lessons:

These lessons significantly improved the robustness and effectiveness of our Florence Image Analysis project.

Validation of the API with real Examples

After the container is up and running, users can access the Chainlit interface at http://localhost:8010. Here’s an example of how to use the API:

Example – <Caption>

Example – <MORE_DETAILED_CAPTION>

Example – <OCR>

Example – <OCR_WITH_REGION>

Model GPU – VRAM Consumption

The following is a list of helpful links:

  * Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks – Microsoft Research
  * Sample Google Colab Notebook for Florence-2 – sample_inference.ipynb · microsoft/Florence-2-large at main (huggingface.co)
  * Research paper (direct link) – [2311.06242] Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (arxiv.org)
  * Florence 2 Inference Chainlit – Mervin Praison

Conclusion

Building this comprehensive Image Analysis API with Florence-2-large, Chainlit, and Docker has been an incredible learning experience. I must have spent at least a week getting to this point, with all the features and functionality working within a Docker image. By leveraging the power of advanced AI models and containerization, we created a scalable and efficient solution for performing various image analysis tasks automatically. Through this project, we gained valuable insights into model management, error handling, GPU utilization in containerized environments, and designing interactive UIs for AI applications.

I hope that this blog post has provided you with a comprehensive overview of our Image Analysis API project and inspired you to explore the fascinating world of computer vision. Feel free to check out our GitHub repository, try out the API, and let me know if you have any questions or suggestions!

Thanks,
Aresh Sarkari
