Tag Archives: AI

Unlocking the Power of Multimodal AI: A Deep Dive into LLaVA and LLaMA 3 – Demo in LM Studio

23 May

In my earlier post we explored uncensored LLMs like Dolphin. Today, we will look at the intersection of visual and language understanding: what happens when vision models and LLMs are brought together. One such innovation is LLaVA (Large Language and Vision Assistant), an open-source generative AI model that combines the strengths of vision encoders and large language models to create a powerful tool for general-purpose visual and language understanding. In this blog post, we’ll delve into the details of LLaVA, its underlying models, and how you can harness its capabilities using LM Studio.

What is LLaVA?

🖼️ LLaVA is a novel, end-to-end trained large multimodal model that integrates a pre-trained CLIP ViT-L/14 visual encoder with the Vicuna large language model. The integration is achieved through a projection matrix, enabling seamless interaction between visual and language data. LLaVA is designed to excel in both daily user-oriented applications and specialized domains such as science, offering a versatile tool for multimodal reasoning and instruction-following tasks.
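
To make the idea of a projection matrix concrete, here is a minimal PyTorch sketch of how projected visual features become "tokens" for the language model. The dimensions are illustrative assumptions (CLIP ViT-L/14 patch features are commonly 1024-dimensional and an 8B-class LLM hidden size is commonly 4096), not values read out of the released LLaVA weights.

```python
# Minimal sketch of LLaVA-style feature projection (illustrative dimensions,
# not values taken from the released LLaVA weights).
import torch
import torch.nn as nn

VISION_DIM = 1024   # typical CLIP ViT-L/14 patch-feature size
LLM_DIM = 4096      # typical hidden size for a 7B/8B language model

# The "projection matrix": a learned linear map from the vision encoder's
# feature space into the language model's token-embedding space.
projector = nn.Linear(VISION_DIM, LLM_DIM)

# A fake batch of patch embeddings standing in for the CLIP encoder output.
patch_features = torch.randn(1, 576, VISION_DIM)

# The projected features are concatenated with the text token embeddings and
# consumed by the language model as if they were ordinary tokens.
visual_tokens = projector(patch_features)
print(visual_tokens.shape)  # torch.Size([1, 576, 4096])
```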

What is LLaMA 3?

🧠 LLaMA 3 is the third iteration of the Large Language Model from Meta AI, known for its remarkable language understanding and generation capabilities. LLaMA 3 builds upon its predecessors with improved architecture, enhanced training techniques, and a broader dataset, making it one of the most advanced language models available. In the context of LLaVA, LLaMA 3 serves as the foundation for the language model component, providing robust support for complex conversational and reasoning tasks.

How to Run the Model Locally Using LMStudio

💻 Running LLaVA locally using LM Studio is a straightforward process that allows you to leverage the model’s capabilities on your own hardware. Here’s a step-by-step guide to get you started:

  • Setup Your Environment:
    • Install LM Studio: the software is available on Windows, Mac & Linux. It allows you to manage and deploy local LLMs without having to set up Python, machine-learning, or Transformers libraries yourself. Link to download the Windows bits – LM Studio – Discover, download, and run local LLMs
  • Download the Model and Dependencies
    • The best place to keep track of models is Hugging Face – Models – Hugging Face. You can follow model releases and updates there.
    • Copy the model name from Hugging Face – xtuner/llava-llama-3-8b-v1_1-gguf
    • Paste this name into LM Studio and it will list all the available quantized versions
    • Given my hardware configuration, I selected the int4 model. Note that the lower the quantization, the less accurate the model becomes.
    • Obtain the LLaVA model files, including the quantized GGUF version and MMProj files, from the official repository.
    • Downloading the model will take some time, depending on your internet connection.
  • Prepare the Model for Running:
    • Within LM Studio, click on the Chat interface to configure the model settings.
    • Select the model from the drop-down list – llava llama 3 v int4 GGUF
    • It will run with stock settings, but I like to tweak the Advanced Configuration options.
    • Adjust the model settings to match your hardware capabilities and specific requirements.
    • Based on your system, set the GPU offload to 50/50 or max; I set it to max.
    • Click Reload model to apply the configuration.
  • Run Inference: Start the model and begin running inference tasks, whether for visual chat, science QA, or other applications (a scripted example using LM Studio’s local server follows this list).
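
If you prefer to script your inference runs instead of typing into the chat window, LM Studio can expose the loaded model through its OpenAI-compatible local server. The sketch below assumes the local server is enabled on the default port 1234 and that the loaded LLaVA model accepts images in the OpenAI-style message format:

```python
# Sketch: query the vision model loaded in LM Studio through its
# OpenAI-compatible local server (assumed to be running on the default
# http://localhost:1234). Requires: pip install openai
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

with open("test.jpg", "rb") as f:  # any local test image
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="local-model",  # LM Studio serves whichever model is currently loaded
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```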

Note – If there is enough interest, I can also do an extended blog post on a Dockerized version of this model. Leave a comment down below.

What are MMProj Files?

📂 MMProj files are a key component in the LLaVA ecosystem, representing multimodal projection matrices that facilitate the alignment between visual and language features. These files are crucial for the seamless integration of visual encoders and language models, enabling LLaVA to effectively interpret and generate content that spans both modalities. MMProj files are fine-tuned during the model’s training process to ensure optimal performance in various applications.

What is the Quantized GGUF Version of LLaVA?

💾 The quantized GGUF (GPT-Generated Unified Format) version of LLaVA represents a compressed and optimized variant of the model, enabling efficient deployment on consumer-grade hardware. Quantization reduces the precision of the model’s weights, significantly decreasing the memory footprint and computational requirements while maintaining a high level of performance. This makes the quantized GGUF version ideal for applications where resource constraints are a concern.
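
Outside of LM Studio, the same pair of files (quantized GGUF weights plus the MMProj projector) can be loaded programmatically. A minimal sketch using llama-cpp-python is below; the file names are placeholders for whatever you downloaded, and the LLaVA-1.5 chat handler is an assumption that may need swapping for a Llama-3-specific handler depending on your library version.

```python
# Sketch: load the quantized GGUF weights plus the MMProj projector with
# llama-cpp-python (pip install llama-cpp-python). File names are placeholders
# for the files you downloaded; the LLaVA-1.5 chat handler is an assumption.
import base64

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler


def image_to_data_uri(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


chat_handler = Llava15ChatHandler(clip_model_path="llava-llama-3-mmproj.gguf")
llm = Llama(
    model_path="llava-llama-3-8b-int4.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,        # leave room for the image tokens plus the conversation
    n_gpu_layers=-1,   # offload all layers to the GPU if they fit in VRAM
)

result = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_to_data_uri("test.jpg")}},
                {"type": "text", "text": "What is shown in this picture?"},
            ],
        }
    ],
)
print(result["choices"][0]["message"]["content"])
```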

Testing the Model

🧪 Testing showcases the strength of the LLaVA model – look at the level of detail it provides for the example images below.

Example 1

Example 2

Through rigorous testing and validation, LLaVA continues to demonstrate its potential as a versatile and powerful multimodal model.

Reference Links

Following is a list of helpful links:

Description | Link
LLaVA GitHub Page | LLaVA (llava-vl.github.io)
Microsoft Research Paper | LLaVA: Large Language and Vision Assistant – Microsoft Research
Hugging Face GGUF model | xtuner/llava-llama-3-8b-v1_1-gguf · Hugging Face
Visual Instruction Tuning (arXiv) | [2304.08485] Visual Instruction Tuning (arxiv.org)

🌐 LLaVA represents a significant advancement in the field of multimodal AI, combining powerful visual and language understanding capabilities in a single, efficient model. By leveraging the strengths of LLaMA 3 and innovative techniques like quantization and multimodal projection, LLaVA offers a robust tool for a wide range of applications. Whether you’re a researcher, developer, or enthusiast, exploring the potential of LLaVA can open up new possibilities in the realm of AI-driven interaction and understanding.

By following the steps outlined in this post, you can get started with LLaVA and begin harnessing its capabilities for your own projects. Please let me know if I’ve missed any steps or details, and I’ll be happy to update the post.

Thanks,
Aresh Sarkari

Exploring Uncensored LLM Model – Dolphin 2.9 on Llama-3-8b

2 May

I’ve been diving deep into the world of Large Language Models (LLMs) like ChatGPT, Gemini, Claude, and LLAMA. But recently, I stumbled upon something that completely blew my mind: uncensored LLMs! 🤯

As someone who loves pushing the boundaries of AI and exploring new frontiers, I couldn’t resist the temptation to try out an uncensored LLM for myself. And let me tell you, the experience was nothing short of mind-blowing! 🎆 After setting up and running an uncensored LLM locally for the first time, I was amazed by the raw, unfiltered outputs it generated. It gave me a whole new perspective on the potential of such LLMs and why having an uncensored variant matters for certain use cases and for society in general.

In this blog post, I’ll be sharing my journey with uncensored LLMs, diving into the nitty-gritty details of what they are, how they differ from regular LLMs, and why they exist. I’ll also be sharing my hands-on experience with setting up and running an uncensored LLM locally, so you can try it out for yourself! 💻

🤖 Introduction: Uncensored LLM vs Regular LLM

Large Language Models (LLMs) are AI systems trained on vast amounts of text data to understand and generate human-like text based on input prompts. There are two main types of LLMs: regular and uncensored.

Regular LLMs, such as those created by major organizations like OpenAI, Anthropic, Google, etc., are designed with specific safety and ethical guidelines, often reflecting societal norms and legal standards. These models avoid generating harmful or inappropriate content. (Click on each link to read their AI Principles.)

Uncensored LLMs, on the other hand, are models that do not have these built-in restrictions. They are designed to generate outputs based on the input without ethical filtering, which can be useful for certain applications but also pose risks.

📊 Table of Comparison

Feature | Regular LLM | Uncensored LLM
Content Filtering | Yes (aligned to avoid harmful content) | No (generates responses as is)
Use Cases | General purpose, safer for public use | Specialized tasks needing raw output
Cultural Alignment | Often aligned with Western norms | No specific alignment
Risk of Harmful Output | Lower | Higher
Flexibility | Restricted by ethical guidelines | Higher flexibility in responses

🐬 What is the Dolphin 2.9 Latest Model?

🐬 Dolphin 2.9 is a project by Eric Hartford @ Cognitive Computations aimed at creating an open-source, uncensored, and commercially licensed dataset and series of instruct-tuned language models. This initiative is based on Microsoft’s Orca paper and seeks to provide a foundation for building customized models without the typical content restrictions found in conventional LLMs. The model uses a dataset that removes biases, alignment, or any form of censorship, aiming to create a purely instructional tool that can be layered with user-specific alignments.

🐬 The Dolphin 2.9 Dataset

Following are the details of the datasets used to train the Dolphin model (note the base model is Llama-3-8b):

Dataset Details | Link
cognitivecomputations/dolphin – This dataset is an attempt to replicate the results of Microsoft’s Orca | cognitivecomputations/dolphin · Datasets at Hugging Face
HuggingFaceH4/ultrachat_200k | HuggingFaceH4/ultrachat_200k · Datasets at Hugging Face
teknium/OpenHermes-2.5 – This is the dataset that made the OpenHermes 2.5 and Nous Hermes 2 series of models | teknium/OpenHermes-2.5 · Datasets at Hugging Face
microsoft/orca-math-word-problems-200k – This dataset contains ~200K grade school math word problems | microsoft/orca-math-word-problems-200k · Datasets at Hugging Face

💻 How to Run the Model Locally Using LM Studio

To run Dolphin or any similar uncensored model locally, you typically need to follow these steps, assuming you are using a system like LMStudio for managing your AI models:

  • Setup Your Environment:
    • Install LM Studio – the software is available on Windows, Mac & Linux. It allows you to manage and deploy local LLMs without having to set up Python, machine-learning, or Transformers libraries yourself.
    • Link to download the Windows bits – LM Studio – Discover, download, and run local LLMs
    • My laptop configuration: 11th Gen Intel processor, 64 GB RAM, Nvidia RTX 3080 with 8 GB VRAM, and 3 TB storage.
  • Download the Model and Dependencies:
    • The best place to keep track of models is Hugging Face – Models – Hugging Face. You can follow model releases and updates there.
    • Copy the model name from Hugging Face – cognitivecomputations/dolphin-2.9-llama3-8b
    • Paste this name into LM Studio and it will list all the available quantized versions
    • Given my hardware configuration, I selected the 8-bit model. Note that the lower the quantization, the less accurate the model becomes.
    • Downloading the model will take some time, depending on your internet connection.
  • Prepare the Model for Running:
    • Within LM Studio, click on the Chat interface to configure the model settings.
    • Select the model from the drop-down list – dolphin 2.9 llama3
    • It will run with stock settings, but I like to tweak the Advanced Configuration options.
    • Based on your system, set the GPU offload to 50/50 or max; I set it to max.
    • Click Reload model to apply the configuration.
  • Run the Model:
    • Use LM Studio to load and run the model.
    • In the user prompt, enter what you want to ask the Dolphin model.
    • Monitor the model’s performance and adjust settings as needed.
  • Testing and Usage:
    • Once the model is running, you can begin to input prompts and receive outputs (a minimal scripted example using LM Studio’s local server follows this list).
    • Test the model with various inputs to ensure it functions as expected and adjust configurations as needed.
    • Note: below is a fun test prompt run across ChatGPT, Claude & Dolphin. You can clearly see that Dolphin is the winner 🤗
  • Eject and Closing the model:
    • Once you are done with the session, select Eject Model.
    • This releases the VRAM/RAM and brings CPU utilization back to normal.
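
For repeatable testing, you can also drive the Dolphin model from a script through LM Studio’s OpenAI-compatible local server. This is a minimal sketch under the assumption that the server is enabled on the default port 1234; with an uncensored model, the system prompt is where you layer in your own alignment rules for the session:

```python
# Sketch: drive the Dolphin model loaded in LM Studio through its
# OpenAI-compatible local server (assumed default: http://localhost:1234).
# Requires: pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

completion = client.chat.completions.create(
    model="local-model",  # LM Studio serves whichever model is currently loaded
    messages=[
        # Dolphin ships without built-in alignment, so the system prompt is
        # where you add your own rules for the session.
        {"role": "system", "content": "You are Dolphin, a helpful assistant."},
        {"role": "user", "content": "Write a limerick about running LLMs locally."},
    ],
    temperature=0.7,
)
print(completion.choices[0].message.content)
```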

💻 Quantized & GGUF Model

Home systems usually won’t have the GPU needed to run LLM models natively at full precision, which is where quantization comes in for consumer-grade hardware. A quantized model is a compressed version of a neural network where the weights and activations are represented with lower-precision data types, such as int8 or uint8, instead of the typical float32. This reduces the model’s size and computational requirements while maintaining acceptable performance.
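
As a toy illustration of what quantization buys you (a sketch only, not the actual GGUF k-quant scheme), the snippet below maps a float32 weight matrix to int8 with a single per-tensor scale and prints the resulting memory saving:

```python
# Toy illustration of weight quantization (not the actual GGUF k-quant scheme):
# map float32 weights to int8 with a single per-tensor scale.
import numpy as np

weights = np.random.randn(4096, 4096).astype(np.float32)

scale = np.abs(weights).max() / 127.0            # one scale for the whole tensor
q_weights = np.round(weights / scale).astype(np.int8)

# Dequantize on the fly when the values are needed at inference time.
dequantized = q_weights.astype(np.float32) * scale

print(f"float32 size: {weights.nbytes / 1e6:.1f} MB")     # ~67.1 MB
print(f"int8 size:    {q_weights.nbytes / 1e6:.1f} MB")   # ~16.8 MB
print(f"max abs error: {np.abs(weights - dequantized).max():.4f}")
```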

GGUF stands for “GPT-Generated Unified Format”. It is a file format for packaging large language models so they can be run locally for a wide range of natural language processing tasks without requiring expensive GPU hardware for inference.

The Dolphin 2.9 GGUF models are:

Model Name | Quantization | Model Size | CPU | GPU | VRAM | RAM
dolphin-2.9-llama3-8b-q3_K_M.gguf | 3-bit (q3) | 4.02 GB | Compatible with most CPUs | Not required for inference | Not required for inference | ~4.02 GB
dolphin-2.9-llama3-8b-q4_K_M.gguf | 4-bit (q4) | 4.92 GB | Compatible with most CPUs | Not required for inference | Not required for inference | ~4.92 GB
dolphin-2.9-llama3-8b-q5_K_M.gguf | 5-bit (q5) | 5.73 GB | Compatible with most CPUs | Not required for inference | Not required for inference | ~5.73 GB
dolphin-2.9-llama3-8b-q6_K.gguf | 6-bit (q6) | 6.6 GB | Compatible with most CPUs | Not required for inference | Not required for inference | ~6.6 GB
dolphin-2.9-llama3-8b-q8_0.gguf | 8-bit (q8) | 8.54 GB | Compatible with most CPUs | Not required for inference | Not required for inference | ~8.54 GB
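
The sizes in the table roughly follow "parameter count × bits per weight". Here is a quick back-of-the-envelope estimator, assuming roughly 8 billion parameters; the real files are somewhat larger because K-quants store per-block scales and keep some tensors at higher precision:

```python
# Back-of-the-envelope GGUF size estimate: parameters * bits per weight / 8.
# Real files (see the table above) come out larger because K-quants store
# per-block scales and keep some tensors at higher precision.
N_PARAMS = 8.0e9  # roughly, for a Llama-3-8B model

for name, nominal_bits in [("q3_K_M", 3), ("q4_K_M", 4), ("q5_K_M", 5),
                           ("q6_K", 6), ("q8_0", 8)]:
    size_gb = N_PARAMS * nominal_bits / 8 / 1e9
    print(f"{name}: at least ~{size_gb:.1f} GB")
```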

Reference Links

Following is a list of helpful links:

Description | Link
Details and background about the Dolphin Model | Dolphin 🐬 (erichartford.com)
What are uncensored models? | Uncensored Models (erichartford.com)
Various Dolphin Models on various base LLMs | cognitivecomputations (Cognitive Computations) (huggingface.co)
Dolphin Llama 3 8B GGUF model I used in LM Studio | cognitivecomputations/dolphin-2.9-llama3-8b-gguf · Hugging Face
LM Studio | LM Studio – Discover, download, and run local LLMs
Model Memory Estimator Utility | Model Memory Utility – a Hugging Face Space by hf-accelerate

By following these steps, you can deploy and utilize an uncensored LLM like Dolphin 2.9 for research, development, or any specialized application where conventional content restrictions are not desirable. I hope you’ll find this insightful on your LLM journey. Please let me know if I’ve missed any steps or details, and I’ll be happy to update the post.

Thanks,
Aresh Sarkari

Building an Image Captioning API with FastAPI and Hugging Face Transformers packaged with Docker

17 Apr

In this blog post, we’ll embark on an exciting journey of building an Image Captioning API using FastAPI and Hugging Face Transformers. Image captioning is a fascinating task that involves generating textual descriptions for given images. By leveraging the power of deep learning and natural language processing, we can create a system that automatically understands the content of an image and generates human-like captions. In the example below, I input an image of a rider on a bike in a garage, and the generated caption describes the image accurately.

Project Overview

👨‍💻 GitHub: https://github.com/askaresh/blip-image-captioning-api

The goal of this project is to develop a RESTful API that accepts an image as input and returns a generated caption describing the image. We’ll be using FastAPI, a modern and fast web framework for building APIs, along with Hugging Face Transformers, a popular library for natural language processing tasks.

The key components of our project include:

  1. FastAPI: A web framework for building efficient and scalable APIs in Python.
  2. Hugging Face Transformers: A library that provides state-of-the-art pre-trained models for various NLP tasks, including image captioning.
  3. Docker: A containerization platform that allows us to package our application and its dependencies into a portable and reproducible environment.

Implementation Details

To build our Image Captioning API, we started by setting up a FastAPI project and defining the necessary endpoints. The main endpoint accepts an image file and an optional text input for conditional image captioning.
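
A stripped-down sketch of that endpoint is shown below. It is a simplified illustration rather than the exact code from the repository, and the parameter names are assumptions:

```python
# Simplified sketch of the captioning endpoint (parameter names are
# illustrative, not copied from the repository). Requires fastapi, uvicorn
# and python-multipart for form/file uploads.
from typing import Optional

from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI(title="Image Captioning API")


def generate_caption_stub(image_bytes: bytes, text: Optional[str]) -> str:
    # Placeholder so the sketch runs standalone; in the real app this work
    # is done by the BLIP model (see the next sketch).
    return "a placeholder caption"


@app.get("/")
async def root():
    return {"message": "Image Captioning API is running"}


@app.post("/caption")
async def caption_image(
    file: UploadFile = File(...),        # the image to describe
    text: Optional[str] = Form(None),    # optional prompt for conditional captioning
):
    image_bytes = await file.read()
    caption = generate_caption_stub(image_bytes, text)
    return {"caption": caption}
```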

We utilized the pre-trained BLIP (Bootstrapping Language-Image Pre-training) model from Hugging Face Transformers for image captioning. BLIP is a powerful model that has been trained on a large dataset of image-caption pairs and achieves impressive results in generating accurate and coherent captions.
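
For the model side, the standard Hugging Face Transformers usage pattern for BLIP captioning looks like this. The checkpoint name is an assumption; the repository’s configuration may point at a different BLIP variant:

```python
# Sketch of BLIP captioning with Hugging Face Transformers. The checkpoint
# name is an assumption; the repository's config may point at a different one.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

MODEL_NAME = "Salesforce/blip-image-captioning-base"

processor = BlipProcessor.from_pretrained(MODEL_NAME)
model = BlipForConditionalGeneration.from_pretrained(MODEL_NAME)

image = Image.open("rider_in_garage.jpg").convert("RGB")  # your test image

# Unconditional captioning: just the image.
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output_ids[0], skip_special_tokens=True))

# Conditional captioning: seed the caption with a text prefix.
inputs = processor(images=image, text="a photo of", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```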

To ensure a smooth development experience and the ability to run on any cloud, I containerized the application using Docker. This allowed us to encapsulate all the dependencies, including Python libraries and the pre-trained model, into a portable and reproducible environment.

HF-IT-DOCKER/
│
├── app/
│   ├── config.py
│   ├── main.py
│   ├── model.py
│   └── utils.py
│
├── .dockerignore
├── .gitignore
├── compose.yaml
├── Dockerfile
├── logging.conf
├── README.Docker.md
└── requirements.txt

Detailed description of each file (a minimal sketch of the configuration and utility pieces follows this list):

  • app/config.py:
    • This file contains the configuration settings for the application.
    • It defines a Settings class using the pydantic_settings library to store and manage application-specific settings.
    • The blip_model_name setting specifies the name of the BLIP model to be used for image captioning.
  • app/main.py:
    • This is the main entry point of the FastAPI application.
    • It sets up the FastAPI app, loads the BLIP model, and configures logging.
    • It defines the API endpoints, including the root path (“/”) and the image captioning endpoint (“/caption”).
    • The “/caption” endpoint accepts an image file and an optional text input, processes the image, generates a caption using the BLIP model, and returns the generated caption.
  • app/model.py:
    • This file contains the functions related to loading and using the BLIP model for image captioning.
    • The load_model function loads the pre-trained BLIP model and processor based on the specified model name.
    • The generate_caption function takes an image and optional text input, preprocesses the inputs, and generates a caption using the loaded BLIP model.
  • app/utils.py:
    • This file contains utility functions used in the project.
    • The load_image_from_file function reads an image file and converts it to the appropriate format (RGB) using the PIL library.
  • .dockerignore:
    • This file specifies the files and directories that should be excluded when building the Docker image.
    • It helps to reduce the size of the Docker image by excluding unnecessary files and directories.
  • .gitignore:
    • This file specifies the files and directories that should be ignored by Git version control.
    • It helps to keep the repository clean by excluding files that are not necessary to track, such as generated files, cache files, and environment-specific files.
  • compose.yaml:
    • This file contains the configuration for Docker Compose, which is used to define and run multi-container Docker applications.
    • It defines the services, including the FastAPI server, and specifies the build context, ports, and any necessary dependencies.
  • Dockerfile:
    • This file contains the instructions for building the Docker image for the FastAPI application.
    • It specifies the base image, sets up the working directory, installs dependencies, copies the application code, and defines the entry point for running the application.
  • logging.conf:
    • This file contains the configuration for the Python logging system.
    • It defines the loggers, handlers, formatters, and their respective settings.
    • It specifies the log levels, log file paths, and log message formats.
  • README.Docker.md:
    • This file provides documentation and instructions specific to running the application using Docker.
    • It may include information on how to build the Docker image, run the container, and any other Docker-related details.
  • requirements.txt:
    • This file lists the Python dependencies required by the application.
    • It includes the necessary libraries and their versions, such as FastAPI, Hugging Face Transformers, PIL, and others.
    • It is used by pip to install the required packages when building the Docker image or setting up the development environment.
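
To make the configuration and utility pieces concrete, here is a minimal sketch of what app/config.py and app/utils.py can look like. The field and function names follow the descriptions above; the default checkpoint and the implementation details are assumptions:

```python
# Minimal sketch of app/config.py and app/utils.py based on the descriptions
# above (the default checkpoint and implementation details are assumptions).
import io

from PIL import Image
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    # Name of the BLIP checkpoint to load; can be overridden via environment.
    blip_model_name: str = "Salesforce/blip-image-captioning-base"


settings = Settings()


def load_image_from_file(data: bytes) -> Image.Image:
    """Decode uploaded bytes into an RGB PIL image."""
    return Image.open(io.BytesIO(data)).convert("RGB")
```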

Lessons Learned and Debugging

Throughout the development process, I encountered several challenges and learned valuable lessons:

  1. Dependency Management: Managing dependencies can be tricky, especially when working with large pre-trained models. We learned the importance of properly specifying dependencies in our requirements file and using Docker to ensure consistent environments across different systems.
  2. Debugging Permission Issues: We encountered permission-related issues when running our application inside a Docker container. Through debugging, we learned the significance of properly setting file and directory permissions and running the container as a non-root user to enhance security.
  3. Logging Configuration: Proper logging is crucial for understanding the behavior of our application and troubleshooting issues. I learned how to configure logging using a configuration file and ensure that log files are written to directories with appropriate permissions.
  4. Testing and Error Handling: Comprehensive testing and error handling are essential for building a robust API. We implemented thorough error handling to provide meaningful error messages to API users and conducted extensive testing to ensure the reliability of our image captioning functionality.

Validation of the API

After the container is up and running, go to http://localhost:8004/docs, select the POST method, and click Try it out. Upload any image of your choice, enter the optional text, and click Execute. The generated caption will appear in the response output below.
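
Beyond the Swagger UI, you can also exercise the endpoint from a script. This is a sketch; the form field names mirror the illustrative endpoint shown earlier and may differ from the repository:

```python
# Sketch: call the captioning endpoint directly with the requests library.
# The form field names mirror the illustrative endpoint sketch above and may
# not match the repository exactly. Requires: pip install requests
import requests

URL = "http://localhost:8004/caption"

with open("rider_in_garage.jpg", "rb") as f:
    response = requests.post(
        URL,
        files={"file": ("rider_in_garage.jpg", f, "image/jpeg")},
        data={"text": "a photo of"},  # optional conditional prefix
    )

response.raise_for_status()
print(response.json()["caption"])
```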

Conclusion

Building an Image Captioning API with FastAPI and Hugging Face Transformers has been an incredible learning experience. By leveraging the power of pre-trained models and containerization, I created a scalable and efficient solution for generating image captions automatically.

Through this project, I gained valuable insights into dependency management, debugging permission issues, logging configuration, and the importance of testing and error handling. These lessons will undoubtedly be applicable to future projects and contribute to our growth as developers.

I hope that this blog post has provided you with a comprehensive overview of our Image Captioning API project and inspired you to explore the fascinating world of image captioning and natural language processing. Feel free to reach out with any questions or suggestions, and happy captioning!

Thanks,
Aresh Sarkari