
Offline Transcribing and Summarizing Audio with Whisper, Phi, FastAPI, Docker on NVIDIA GPU

9 Jan

In this blog post, we'll dive into how I built a comprehensive offline audio transcription and summarization system using OpenAI Whisper (medium) for transcription, the Microsoft Phi 3.5 large language model (LLM) for summarization, FastAPI for the REST API, and Docker for containerization. Audio content can be dense and long, so having an automated way to extract transcripts and high-level summaries can be a game-changer for meetings, interviews, podcasts, and beyond!

GitHub – askaresh/LocalAudioTran-LLM-Summar: Offline Audio Transcription (Whisper) and LLM-based (Phi-3.5) Summarization

Why Use LLMs for Audio Summaries?

Traditional speech-to-text solutions focus on generating transcripts. However, reading pages of raw transcript text is time-consuming: conversational text that made sense during a live call is often tedious on the page. Bringing an LLM-based summarizer into the pipeline changes the entire perspective:

  • High-Level Summaries: Quickly get the core ideas or key actions from a meeting.
  • Contextual Understanding: LLMs handle nuance like speaker changes, main topics, and action items.
  • Reduced Human Effort: Saves time sifting through entire transcripts.

High-Level Architecture

  • Audio Ingestion: The user uploads an audio file (e.g., .mp3, .wav).
  • Transcription: OpenAI Whisper medium model transcribes the audio into text.
  • LLM Summarization: A large language model (e.g., Microsoft Phi 3.5) processes the transcript and produces a condensed summary.
  • RESTful API: Built with FastAPI, handling file uploads and returning structured JSON responses.
  • Docker: Containerizes the entire application for easy deployment anywhere with a GPU.

Design Decisions

The following design decisions shaped this project:

  • Offline Processing – All processing is conducted locally; nothing leaves the machine. The workload runs on NVIDIA GPUs (an A4000 and an RTX 3090), which provide ample performance and reliability for this pipeline.
  • Audio Transcription – Using OpenAI Whisper (medium) is an obvious choice, as the transcription output is quite accurate, and the model size is efficient for offline running. I tried the large model, but the output did not justify the increased GPU VRAM requirements.
  • Summarization – This aspect took the most time to refine. I initially experimented with FLAN-T5 and BERT-style models, but I found their summarization outputs subpar, which made the results feel underwhelming. While I believe these models could perform better with extensive fine-tuning, I preferred an out-of-the-box solution, so I chose Microsoft Phi 3.5 (phi3.5:3.8b-mini-instruct) as my model of choice.
  • Context Window – I quickly learned that a model with a large context window is best for generating great summaries. I selected Phi 3.5 due to its 128K context window.
  • LLM Model Quantization – My NVIDIA A4000 has 16GB of VRAM. To use the Phi-3.5 model effectively, I opted for the quantized phi3.5:3.8b-mini-instruct-q4_K_M model, which balances performance and output quality. However, I noted that the KV cache still overflows and spills into system RAM. I also experimented with Q8 LLaMA models, but I found Q4 to be the best fit.
    • Because I am using a quantized model, I ended up using the Ollama container to run the GGUF model, which has the most straightforward implementation.
  • API/Containers – All the code utilizes FastAPI for GET/POST requests. Of course, for modularity, everything operates within a container.
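
Since the quantized model runs behind Ollama, summarization boils down to a single POST against Ollama's local REST API. Below is a minimal sketch, assuming Ollama is listening on its default port 11434 and the phi3.5:3.8b-mini-instruct-q4_K_M model has been pulled; the prompt text here is illustrative, not the project's actual system prompt:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default generate endpoint
MODEL = "phi3.5:3.8b-mini-instruct-q4_K_M"


def build_payload(transcript: str) -> dict:
    """Build a non-streaming generate request for Ollama."""
    prompt = (
        "Summarize the following transcript into Overview, Main Points, "
        "Key Insights, Action Items / Decisions, Open Questions / Next Steps, "
        "and Conclusions sections:\n\n" + transcript
    )
    return {"model": MODEL, "prompt": prompt, "stream": False}


def summarize(transcript: str) -> str:
    """Send the transcript to the local Ollama server and return the summary text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(transcript)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With Ollama running, `summarize(transcript)` returns the structured summary in one call; the large context window is what lets the whole transcript go in one request.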

Implementation Details

  1. FastAPI for the REST Endpoints
    • /transcribe: Receives an audio file, calls Whisper for transcription, and returns the text.
    • /summarize: Takes the transcribed text, calls the LLM, and returns a summary.
    • Health checks (/health) keep the container orchestration informed of readiness.
  2. Whisper
    • We used a GPU version for speed (if torch.cuda.is_available()).
    • For smaller hardware or faster inference, you can opt for "tiny" or "small" model sizes.
  3. LLM Summarization
    • Could be an open-source LLM (like Llama 2, GPT-NeoX, etc.) or something hosted. We are using Microsoft Phi 3.5 (phi3.5:3.8b-mini-instruct-q4_K_M)
    • Direct Processing: The transcript is processed in a single pass using the Phi model. The biggest reason to choose a large context window is to ensure the model can process the entire transcript without truncation, chunking, or overlapping sections, since summary quality deteriorates with chunking.
    • Structured Output: Summary organized into clear sections:
      • Overview
      • Main Points
      • Key Insights
      • Action Items / Decisions
      • Open Questions / Next Steps
      • Conclusions
    • The system prompt does most of the heavy lifting for summarization; I highly recommend spending time crafting and refining it.
  4. Docker
    • A Dockerfile that installs Python, PyTorch, Whisper, plus your LLM dependencies.
    • The container also runs Uvicorn for FastAPI.
    • If GPU acceleration is needed, we used an NVIDIA CUDA base image (e.g., nvidia/cuda:12.1.0-runtime-ubuntu22.04) and pass --gpus all to docker run.
  5. Optional: Streamlit UI
    • If you want a friendly front-end, spin up a UI to upload audio, track progress, and view results in real-time.
    • Alternatively, you could just expose the endpoints in FastAPI and have your favorite front-end call them.

Key Challenges and Lessons

  1. Timeouts for Large Audio
    • Whisper or LLM summarization might take a while for hour-long recordings.
    • We increased the request timeout or used asynchronous background tasks.
  2. GPU Memory Constraints
    • Large LLMs and Whisper can each use significant VRAM.
    • Consider smaller quantized (Q2, etc.) LLMs or chunk-based summarization.
  3. Accuracy vs. Speed
    • The "medium" or "large" Whisper model is more accurate but slower.
    • Summaries can get more coherent using bigger LLMs, but performance can suffer.
  4. Logging & Error Handling
    • Detailed logs ensure you catch issues early (e.g., partial transcripts, AI inference errors).
    • A robust system logs whether GPU is found, load times, and inference performance metrics.
  5. Security & Data Privacy
    • Audio recordings may contain sensitive information.
    • Ensure your container or environment has proper access controls.
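
As a concrete example of the chunk-based fallback mentioned above: when a transcript will not fit in the context window, it can be split into overlapping word chunks and summarized piecewise, accepting the quality trade-off already noted. A small sketch (chunk and overlap sizes are illustrative):

```python
def chunk_text(text: str, chunk_words: int = 1500, overlap_words: int = 150):
    """Split text into overlapping word chunks so context carries across boundaries."""
    words = text.split()
    if not words:
        return []
    step = chunk_words - overlap_words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break
    return chunks
```

Each chunk is summarized separately and the partial summaries are merged in a final pass; the overlap reduces the chance of losing a point that straddles a chunk boundary.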

Validation and Examples

  1. Meeting Transcription + Summary
    • A 30-minute internal meeting is processed, producing a transcript of 6,000 words.
    • LLM Summaries: A concise bullet-point list of decisions, tasks, and key insights.
  2. Podcast or Interview
    • Summarize multi-speaker dialogues to highlight important quotes or topics.
    • Possibly split each speaker's segment, then unify in the final text.
  3. Conference Keynote
    • Summaries merged into an "executive summary" with top-level takeaways.

Project Structure

LocalAudioTran-LLM-Summar/
├─ .dockerignore
├─ .env
├─ .gitignore
├─ README.md
├─ docker-compose.yml
├─ Dockerfile
├─ backend/
│  ├─ requirements.txt
│  └─ app/
│     ├─ main.py
│     ├─ services/
│     │  ├─ transcription.py
│     │  ├─ summarization.py
│     │  └─ __init__.py
│     ├─ utils/
│     │  └─ logger.py
│     ├─ models/
│     │  ├─ schemas.py
│     │  └─ __init__.py
│     └─ __init__.py
├─ frontend/
│  ├─ requirements.txt
│  └─ src/
│     └─ app.py
└─ logs/
  • transcription.py loads Whisper, handles file I/O.
  • summarization.py calls your Phi3.5 LLM (Hugging Face Transformers, Ollama, etc.).
  • docker-compose.yml to spin up both the app container and optional GPU-based service.

Conclusion

By combining OpenAI Whisper (or any speech-to-text engine) with a Large Language Model (Phi 3.5 mini) summarizer inside a Docker container, we've built a unified pipeline for turning raw audio into manageable transcripts and actionable summaries. Whether you're automating meeting minutes or analyzing podcast content, this approach saves countless hours. Feel free to experiment with chunking, smaller models, or advanced summarization prompts. Let me know how it goes!

Happy transcribing and summarizing!
Aresh Sarkari

Following is a list of helpful links:

  • Microsoft Phi 3.5 Model Page – microsoft/Phi-3.5-mini-instruct · Hugging Face
  • OpenAI Whisper Model – openai/whisper-medium · Hugging Face
  • Ollama Model Card Details – phi3.5:3.8b-mini-instruct-q4_K_M
  • NVIDIA Docker Images (Contains Container Engine) – nvidia/cuda – Docker Image | Docker Hub
  • IDE Editor of my choice – Cursor – The AI Code Editor

Building a Comprehensive Image Analysis API with Microsoft Florence-2-large, Chainlit and Docker

8 Jul

In this blog post, we’ll embark on an exciting journey of building a comprehensive Image Analysis API using Microsoft Florence-2-large, Chainlit, and Docker. Image analysis is a fascinating field that involves extracting meaningful information from images using advanced AI techniques. By leveraging the power of Microsoft’s Florence-2-large model, we can create a system that automatically understands the content of an image and performs various analysis tasks such as captioning, object detection, expression segmentation, OCR, etc.

My Florence2 Code Repository – askaresh/MS-Florence2 (github.com)

Note – In the past I have written a blog article on Image Captioning; you can read more here – Building an Image Captioning API with FastAPI and Hugging Face Transformers packaged with Docker | AskAresh

Model Overview

Hugging Face Link – microsoft/Florence-2-large · Hugging Face

The Microsoft Florence-2-large model is a powerful pre-trained model designed for various image analysis tasks. Developed by Microsoft, this model is part of the Florence family, which is known for its robust performance in computer vision applications. The Florence-2-large model leverages extensive training on a vast dataset of images, enabling it to excel in tasks such as image captioning, object detection, and optical character recognition (OCR).

Key Features of Florence-2-large

  • Multitask Capabilities: The model can perform a wide range of image analysis tasks, including generating captions, detecting objects, segmenting regions, and recognizing text within images.
  • High Accuracy: Trained on diverse and extensive datasets, the Florence-2-large model achieves high accuracy in understanding and analyzing image content.
  • Scalability: Its architecture is designed to scale effectively, making it suitable for integration into various applications and systems.

Why Florence-2-large?

We chose the Florence-2-large model for our Image Analysis API due to its versatility and performance. The model’s ability to handle multiple tasks with high precision makes it an ideal choice for building a comprehensive image analysis system. By leveraging this model, we can ensure that our API delivers accurate and reliable results across different types of image analysis tasks.

Implementation Details

To build our Image Analysis API, we started by setting up a Chainlit project and defining the necessary message handlers. The main handler accepts an image file and processes it through various analysis tasks.

We utilized the pre-trained Florence-2-large model from Hugging Face Transformers for image analysis. This powerful model has been trained on a vast dataset of images and can perform multiple tasks such as image captioning, object detection, and OCR.

To ensure a smooth development experience and ability to run on any cloud, we containerized our application using Docker. This allowed us to encapsulate all the dependencies, including Python libraries and the pre-trained model, into a portable and reproducible environment.

Choosing NVIDIA Docker Image

We specifically chose the NVIDIA CUDA-based Docker image (nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04) for our containerization. This choice was driven by the need to leverage GPU acceleration for the Florence-2-large model, which significantly enhances the performance of image processing tasks. The CUDA-based image ensures compatibility with GPU drivers and provides pre-installed libraries necessary for efficient model execution.

Our project structure looks like this:

MS-FLORENCE2/
│
├── app/
│   ├── __init__.py
│   ├── config.py
│   ├── model.py
│   └── utils.py
│
├── Dockerfile
├── docker-compose.yml
├── .env
├── .gitignore
├── chainlit_app.py
├── requirements.txt
└── logging_config.py

Let’s break down the key components:

  1. chainlit_app.py: This is the heart of our Chainlit application. It defines the message handler that processes uploaded images and generates responses using the Florence model.
  2. app/model.py: This file contains the ModelManager class, which is responsible for loading and managing the Florence-2-large model.
  3. app/utils.py: This file contains utility functions for drawing bounding boxes, polygons, and OCR regions on images.
  4. logging_config.py: This file configures detailed logging for the entire project and its various modules.
  5. Dockerfile: This file defines how our application is containerized, ensuring all dependencies are properly installed and the environment is consistent. The use of the NVIDIA CUDA-based Docker image ensures compatibility and performance optimization.
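
The ModelManager pattern in app/model.py boils down to loading the model and processor once and sharing them across requests. Here is a simplified sketch of the idea (the real class differs); note that Florence-2 ships custom modeling code, so trust_remote_code=True is required, and the first load downloads several gigabytes:

```python
class ModelManager:
    """Load microsoft/Florence-2-large once and reuse it for every request."""

    def __init__(self, model_id: str = "microsoft/Florence-2-large"):
        self.model_id = model_id
        self.model = None
        self.processor = None

    def load(self):
        """Idempotent load; heavy imports are deferred until actually needed."""
        if self.model is None:
            import torch
            from transformers import AutoModelForCausalLM, AutoProcessor

            device = "cuda" if torch.cuda.is_available() else "cpu"
            # Florence-2 ships custom code, hence trust_remote_code=True.
            self.processor = AutoProcessor.from_pretrained(
                self.model_id, trust_remote_code=True
            )
            self.model = AutoModelForCausalLM.from_pretrained(
                self.model_id, trust_remote_code=True
            ).to(device)
        return self.model, self.processor
```

Deferring the load keeps the Chainlit app responsive at startup while the model is still being fetched or moved onto the GPU.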

Task Prompts and Their Functions

Let’s break down the task prompts used in the Florence-2-large model and explain what each of them does:

  • <CAPTION>
    • Purpose: Generates a simple, concise caption for the image.
    • Output: A brief description of the main elements in the image.
    • Example: “A credit card bill with a price tag on it”
  • <DETAILED_CAPTION>
    • Purpose: Provides a more detailed description of the image.
    • Output: A comprehensive description including more elements and details from the image.
    • Example: “The image shows a credit card bill with a black background. The bill is printed on a white sheet of paper with a blue border and a blue header. The header reads ‘Credit Card Bill’ in bold black font. The bottom of the bill has a space for the customer’s name, address, and contact information.”
  • <OD> Object Detection
    • Purpose: Detects and locates objects within the image.
    • Output: A list of detected objects with their bounding box coordinates and labels.
    • Example: [{‘bboxes’: [[x1, y1, x2, y2], …], ‘labels’: [‘credit card’, ‘price tag’, …]}]
  • <OCR>
    • Purpose: Performs Optical Character Recognition on the image.
    • Output: Extracted text from the image.
    • Example: “Credit Card Bill\nName: John Doe\nAddress: 123 Main St…”
  • <CAPTION_TO_PHRASE_GROUNDING>
    • Purpose: Locates specific phrases or objects mentioned in the caption within the image.
    • Input: Requires a caption (usually the output from <CAPTION>) as additional text input.
    • Output: Bounding boxes and labels for phrases/objects from the caption found in the image.
    • Example: [{‘bboxes’: [[x1, y1, x2, y2], …], ‘labels’: [‘credit card’, ‘price tag’, …]}]
  • <DENSE_REGION_CAPTION>
    • Purpose: Generates captions for specific regions within the image.
    • Output: A list of regions with their bounding boxes and corresponding captions.
    • Example: [{‘bboxes’: [[x1, y1, x2, y2], …], ‘labels’: [‘Header with Credit Card Bill text’, ‘Customer information section’, …]}]
  • <REGION_PROPOSAL>
    • Purpose: Suggests regions of interest within the image without labeling them.
    • Output: A list of bounding boxes for potentially important regions in the image.
    • Example: {‘bboxes’: [[x1, y1, x2, y2], …], ‘labels’: [”, ”, …]}
  • <MORE_DETAILED_CAPTION>
    • Purpose: Generates an even more comprehensive description of the image than <DETAILED_CAPTION>.
    • Output: A very detailed narrative of the image, often including subtle details and potential interpretations.
    • Example: “The image displays a credit card bill document against a stark black background. The bill itself is printed on crisp white paper, framed by a professional-looking blue border. At the top, a bold blue header prominently declares ‘Credit Card Bill’ in a large, easy-to-read font. Below this, the document is structured into clear sections, likely detailing transactions, fees, and payment information. At the bottom of the bill, there’s a designated area for customer details, including name, address, and possibly account information. The contrast between the white document and black background gives the image a formal, official appearance, emphasizing the importance of the financial information presented.”
  • <REFERRING_EXPRESSION_SEGMENTATION>
    • Purpose: Segments the image based on a textual description of a specific object or region.
    • Input: Requires a textual description as additional input.
    • Output: A segmentation mask for the described object or region.
  • <REGION_TO_SEGMENTATION>
    • Purpose: Generates a segmentation mask for a specified region in the image.
    • Input: Requires coordinates of the region of interest.
    • Output: A segmentation mask for the specified region.
  • <OPEN_VOCABULARY_DETECTION>
    • Purpose: Detects objects in the image based on user-specified categories.
    • Input: Can accept a list of categories to look for.
    • Output: Bounding boxes and labels for detected objects matching the specified categories.
  • <REGION_TO_CATEGORY>
    • Purpose: Classifies a specific region of the image into a category.
    • Input: Requires coordinates of the region of interest.
    • Output: A category label for the specified region.
  • <REGION_TO_DESCRIPTION>
    • Purpose: Generates a detailed description of a specific region in the image.
    • Input: Requires coordinates of the region of interest.
    • Output: A textual description of the contents of the specified region.
  • <OCR_WITH_REGION>
    • Purpose: Performs OCR on specific regions of the image.
    • Output: Extracted text along with the corresponding regions (bounding boxes) where the text was found.

These task prompts allow us to leverage the Florence-2-large model’s capabilities for various image analysis tasks. By combining these prompts, we can create a comprehensive analysis of an image, from basic captioning to detailed object detection and text recognition. Understanding and effectively utilizing these task prompts was crucial in maximizing the potential of the Florence-2-large model in our project.

Lessons Learned and Debugging

Throughout the development of our Florence Image Analysis project, I encountered several challenges and learned valuable lessons:

  • Flash Attention Challenges: One of the most significant hurdles we faced was integrating flash-attn into our project. Initially, we encountered installation issues and compatibility problems with our CUDA setup. We learned that:
    • Flash-attn requires specific CUDA versions and can be sensitive to the exact configuration of the environment.
    • Note: we moved to an NVIDIA-based Docker image to take care of all the prerequisites specific to CUDA/flash-attention and the interoperability of versions, which helped tremendously.
    • Building flash-attn from source was often necessary to ensure compatibility with our specific environment.
    • Using the --no-build-isolation flag during installation helped resolve some dependency conflicts. Solution: We ended up creating a custom build process in our Dockerfile, ensuring all dependencies were correctly installed before attempting to install flash-attn.
  • Segmentation and OCR with Region Iterations: Implementing effective OCR, especially with region detection, proved to be an iterative process:
    • Initially, we tried using the Florence model for general OCR, but found it lacking in accuracy for structured documents.
    • We experimented with pre-processing steps to detect distinct regions in documents (headers, body, footer) before applying OCR.
    • Balancing between processing speed and accuracy was a constant challenge. Solution: We implemented a custom region detection algorithm that identifies potential text blocks before applying OCR. This improved both accuracy and processing speed.
  • Error Handling and Logging: As the project grew more complex, we realized the importance of robust error handling and comprehensive logging:
    • Initially, errors in model processing would crash the entire application.
    • Debugging was challenging without detailed logs. Solution: We implemented try-except blocks throughout the code, added detailed logging, and created a system to gracefully handle and report errors to users.
  • Optimizing for Different Document Types: We found that the performance of our system varied significantly depending on the type of document being processed:
    • Handwritten documents required different preprocessing than printed text.
    • Certain document layouts (e.g., tables, multi-column text) posed unique challenges. Solution: We implemented a document type detection step and adjusted our processing pipeline based on the detected type.
  • Balancing Between Flexibility and Specialization: While I aimed to create a general-purpose image analysis tool, we found that specializing for certain tasks greatly improved performance:
    • We created separate processing paths for tasks like receipt OCR, business card analysis, and general document processing. Solution: We implemented a modular architecture that allows for easy addition of specialized processing pipelines while maintaining a common core.

These lessons significantly improved the robustness and effectiveness of our Florence Image Analysis project.

Validation of the API with real Examples

After the container is up and running, users can access the Chainlit interface at http://localhost:8010. Here’s an example of how to use the API:

Example – <Caption>

Example – <MORE_DETAILED_CAPTION>

Example – <OCR>

Example – <OCR_WITH_REGION>

Model GPU – VRAM Consumption

Following is a list of helpful links:

  • Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks – Microsoft Research
  • Sample Google Colab Notebook Florence2 – sample_inference.ipynb · microsoft/Florence-2-large at main (huggingface.co)
  • Research paper (direct link) – [2311.06242] Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (arxiv.org)
  • Florence 2 Inference Chainlit – Mervin Praison

Conclusion

Building this comprehensive Image Analysis API with Florence-2-large, Chainlit, and Docker has been an incredible learning experience. I must have spent at least a week getting all the features and functionality working within a Docker image. By leveraging the power of advanced AI models and containerization, we created a scalable and efficient solution for performing various image analysis tasks automatically. Through this project, we gained valuable insights into model management, error handling, GPU utilization in containerized environments, and designing interactive UIs for AI applications.

I hope this blog post has provided you with a comprehensive overview of our Image Analysis API project and inspired you to explore the fascinating world of computer vision. Feel free to check out our GitHub repository, try out the API, and let me know if you have any questions or suggestions!

Thanks,
Aresh Sarkari

Exploring Uncensored LLM Model – Dolphin 2.9 on Llama-3-8b

2 May

I’ve been diving deep into the world of Large Language Models (LLMs) like ChatGPT, Gemini, Claude, and Llama. But recently, I stumbled upon something that completely blew my mind: uncensored LLMs! 🤯

As someone who loves pushing the boundaries of AI and exploring new frontiers, I couldn’t resist the temptation to try out an uncensored LLM for myself. And let me tell you, the experience was nothing short of mind-blowing! 🎆 After setting up and running an uncensored LLM locally for the first time, I was amazed by the raw, unfiltered outputs it generated. It gave me a whole new perspective on the potential of such LLMs and why having an uncensored variant matters for certain use cases and for society in general.

In this blog post, I’ll be sharing my journey with uncensored LLMs, diving into the nitty-gritty details of what they are, how they differ from regular LLMs, and why they exist. I’ll also be sharing my hands-on experience with setting up and running an uncensored LLM locally, so you can try it out for yourself! 💻

🤖 Introduction: Uncensored LLM vs Regular LLM

Large Language Models (LLMs) are AI systems trained on vast amounts of text data to understand and generate human-like text based on input prompts. There are two main types of LLMs: regular and uncensored.

Regular LLMs, such as those created by major organizations like OpenAI, Anthropic, and Google, are designed with specific safety and ethical guidelines, often reflecting societal norms and legal standards. These models avoid generating harmful or inappropriate content. (Click on each link to read their AI principles.)

Uncensored LLMs, on the other hand, are models that do not have these built-in restrictions. They are designed to generate outputs based on the input without ethical filtering, which can be useful for certain applications but also pose risks.

📊 Table of Comparison

| Feature | Regular LLM | Uncensored LLM |
| --- | --- | --- |
| Content Filtering | Yes (aligned to avoid harmful content) | No (generates responses as is) |
| Use Cases | General purpose, safer for public use | Specialized tasks needing raw output |
| Cultural Alignment | Often aligned with Western norms | No specific alignment |
| Risk of Harmful Output | Lower | Higher |
| Flexibility | Restricted by ethical guidelines | Higher flexibility in responses |

๐Ÿฌ What is the Dolphin 2.9 Latest Model?

๐ŸฌDolphin 2.9 is a project by Eric Hartford @ Cognitive Computations aimed at creating an open-source, uncensored, and commercially licensed dataset and series of instruct-tuned language models. This initiative is based on Microsoft’s Orca paper and seeks to provide a foundation for building customized models without the typical content restrictions found in conventional LLMs. The model uses a dataset that removes biases, alignment, or any form of censorship, aiming to create a purely instructional tool that can be layered with user-specific alignments.

๐Ÿฌ The Dolphin 2.9 Dataset

Following are the details of the dataset used to train the Dolphin Model: (Note the base model is Llama-3-8b)

  • cognitivecomputations/dolphin – An attempt to replicate the results of Microsoft’s Orca – cognitivecomputations/dolphin · Datasets at Hugging Face
  • HuggingFaceH4/ultrachat_200k – HuggingFaceH4/ultrachat_200k · Datasets at Hugging Face
  • teknium/OpenHermes-2.5 – The dataset behind the OpenHermes 2.5 and Nous Hermes 2 series of models – teknium/OpenHermes-2.5 · Datasets at Hugging Face
  • microsoft/orca-math-word-problems-200k – Contains ~200K grade school math word problems – microsoft/orca-math-word-problems-200k · Datasets at Hugging Face

💻 How to Run the Model Locally Using LMStudio

To run Dolphin or any similar uncensored model locally, you typically need to follow these steps, assuming you are using a system like LMStudio for managing your AI models:

  • Setup Your Environment:
    • Install the LM Studio software (available on Windows, Mac & Linux). It allows you to manage and run local LLMs without having to set up Python, machine learning, or Transformers libraries yourself.
    • Link to Download the Windows Bits – LM Studio – Discover, download, and run local LLMs
    • My laptop config: 11th Gen Intel processor, 64 GB RAM, NVIDIA RTX 3080 with 8 GB VRAM, and 3 TB storage.
  • Download the Model and Dependencies:
    • The best place to keep track of models is Hugging Face (Models – Hugging Face), where you can follow model releases and updates.
    • Copy the model name from Hugging Face – cognitivecomputations/dolphin-2.9-llama3-8b
    • Paste this name into LM Studio and it will list all the available quantized variants
    • In my case, given my hardware, I selected the 8-bit model. Please note: the lower the quantization level, the less accurate the model.
    • Downloading the model will take time, depending on your internet connection.
  • Prepare the Model for Running:
    • Within LMStudio click on the Chat interface to configure model settings.
    • Select the model from the drop down list – dolphin 2.9 llama3
    • You can run it with stock settings, but I like to configure the Advanced Configurations
    • Based on your system, set the GPU offload to 50/50 or max. I have set it up for max
    • Click Reload model to apply the configuration
  • Run the Model:
    • Use LMStudio to load and run the model.
    • Within the User Prompt enter what you want to ask the Dolphin model
    • Monitor the model’s performance and adjust settings as needed.
  • Testing and Usage:
    • Once the model is running, you can begin to input prompts and receive outputs.
    • Test the model with various inputs to ensure it functions as expected and adjust configurations as needed.
    • Below is a fun test prompt run across ChatGPT, Claude & Dolphin. You can clearly see the winner being Dolphin 🤗
  • Eject and Closing the model:
    • Once you are done with the session, you can select Eject Model
    • This will release the VRAM/RAM and return CPU utilization to normal
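
Beyond the chat UI, LM Studio can also expose the loaded model through its local server feature (an OpenAI-compatible endpoint, defaulting to port 1234). Here is a minimal sketch of querying it from Python, assuming the Dolphin model is loaded and the local server has been started; the model name string may differ on your install:

```python
import json
import urllib.request

LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"  # LM Studio's default local server


def build_request(user_prompt: str) -> dict:
    """Build an OpenAI-style chat completion request for the local server."""
    return {
        "model": "dolphin-2.9-llama3-8b",  # name as shown in LM Studio (may differ)
        "messages": [{"role": "user", "content": user_prompt}],
        "temperature": 0.7,
    }


def ask_dolphin(user_prompt: str) -> str:
    """POST the prompt to LM Studio and return the model's reply text."""
    req = urllib.request.Request(
        LMSTUDIO_URL,
        data=json.dumps(build_request(user_prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```

This makes the locally running uncensored model scriptable, so you can batch prompts or plug it into other tools instead of typing into the chat window.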

💻 Quantized & GGUF Model

Home systems usually won’t have the GPU needed to run LLM models natively on consumer-grade hardware. A quantized model is a compressed version of a neural network where the weights and activations are represented with lower-precision data types, such as int8 or uint8, instead of the typical float32. This reduces the model’s size and computational requirements while maintaining acceptable performance.

GGUF stands for “GPT-Generated Unified Format”. It is a file format for large language models designed to be versatile and to support a wide range of natural language processing tasks without requiring expensive GPU hardware for inference.
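
The file sizes in the table below follow roughly from arithmetic: parameters × bits-per-weight ÷ 8 gives the bytes of weights, plus some overhead for embeddings and metadata. A back-of-the-envelope estimator (the 10% overhead factor is an assumption on my part, and real K-quant GGUF files mix precisions per layer, so actual sizes differ somewhat):

```python
def estimate_gguf_size_gb(n_params: float, bits_per_weight: float,
                          overhead: float = 1.1) -> float:
    """Rough GGUF file size: weight bytes plus ~10% overhead (assumed factor)."""
    return n_params * bits_per_weight / 8 / 1e9 * overhead


# Estimates for an 8B-parameter model at common quantization levels:
for bits in (3, 4, 5, 6, 8):
    print(f"q{bits}: ~{estimate_gguf_size_gb(8e9, bits):.1f} GB")
```

Running this gives figures in the same ballpark as the table below, which is a handy sanity check before downloading a multi-gigabyte file.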

The Dolphin 2.9 GGUF models are:

| Model Name | Quantization | Model Size | CPU | GPU | VRAM | RAM |
| --- | --- | --- | --- | --- | --- | --- |
| dolphin-2.9-llama3-8b-q3_K_M.gguf | 3-bit (q3) | 4.02 GB | Compatible with most CPUs | Not required for inference | Not required for inference | ~4.02 GB |
| dolphin-2.9-llama3-8b-q4_K_M.gguf | 4-bit (q4) | 4.92 GB | Compatible with most CPUs | Not required for inference | Not required for inference | ~4.92 GB |
| dolphin-2.9-llama3-8b-q5_K_M.gguf | 5-bit (q5) | 5.73 GB | Compatible with most CPUs | Not required for inference | Not required for inference | ~5.73 GB |
| dolphin-2.9-llama3-8b-q6_K.gguf | 6-bit (q6) | 6.6 GB | Compatible with most CPUs | Not required for inference | Not required for inference | ~6.6 GB |
| dolphin-2.9-llama3-8b-q8_0.gguf | 8-bit (q8) | 8.54 GB | Compatible with most CPUs | Not required for inference | Not required for inference | ~8.54 GB |

Reference Links

Following is a list of helpful links:

  • Details and background about the Dolphin Model – Dolphin 🐬 (erichartford.com)
  • What are uncensored models? – Uncensored Models (erichartford.com)
  • Various Dolphin Models on various base LLMs – cognitivecomputations (Cognitive Computations) (huggingface.co)
  • Dolphin Llama 3 8B GGUF model I used on LMStudio – cognitivecomputations/dolphin-2.9-llama3-8b-gguf · Hugging Face
  • LM Studio – LM Studio – Discover, download, and run local LLMs
  • Model Memory Estimator Utility – Model Memory Utility – a Hugging Face Space by hf-accelerate

By following these steps, you can deploy and utilize an uncensored LLM like Dolphin 2.9 for research, development, or any specialized application where conventional content restrictions are not desirable. I hope you’ll find this insightful on your journey with LLMs. Please let me know if I’ve missed any steps or details, and I’ll be happy to update the post.

Thanks,
Aresh Sarkari