LLM | AskAresh

Archive | LLM RSS feed for this section

Offline Transcribing and Summarizing Audio with Whisper, Phi, FastAPI, Docker on NVIDIA GPU

In this blog post, we’ll dive into how I built an offline comprehensive audio transcription and summarization system using OpenAI Whisper (medium) for transcription, a Microsoft Phi 3.5 Large Language Model (LLM) for summarizing, FastAPI for the REST API, and Docker for containerization. Audio content can be dense and long, so having an automated way to extract transcripts and high-level summaries can be a game-changer for meetings, interviews, podcasts, and beyond!

Github – askaresh/LocalAudioTran-LLM-Summar: Offline Audio Transcription (Whisper) and LLM based (Phi-3.5) Summarization

Why Use LLMs for Audio Summaries?

Traditional speech-to-text solutions focus on generating transcripts. However, reading pages of raw transcript text can be time-consuming. Conversational text is quite boring and often makes sense during an audio call by bringing an LLM-based summarizer into the pipeline changes the entire perspective:

High-Level Summaries: Quickly get the core ideas or key actions from a meeting.
Contextual Understanding: LLMs handle nuance like speaker changes, main topics, and action items.
Reduced Human Effort: Saves time sifting through entire transcripts.

High-Level Architecture

Audio Ingestion: The user uploads an audio file (e.g., .mp3, .wav).
Transcription: OpenAI Whisper medium model transcribes the audio into text.
LLM Summarization: A large language model (e.g., Microsoft Phi 3.5) processes the transcript and produces a condensed summary.
RESTful API: Built with FastAPI, handling file uploads and returning structured JSON responses.
Docker: Containerizes the entire application for easy deployment anywhere with a GPU.

Design Decisions

Following is the list of design decisions around this project:

Offline Processing – All processing is conducted locally to maximize efficiency. Utilizing a robust setup with multiple GPUs, specifically the cutting-edge NVIDIA graphics cards (A4000 and RTX 3090), ensures unparalleled performance and reliability.
Audio Transcription – Using OpenAI Whisper (medium) is an obvious choice, as the transcription output is quite accurate, and the model size is efficient for offline running. I tried the large model, but the output did not justify the increased GPU VRAM requirements.
Summarization – This aspect took the most time to refine. I initially experimented with FLAN5 and BERT models, but I found their summarization outputs to be subpar, which made the project feel unworthy. While I believe these models could perform better with extensive training, I prefer an out-of-the-box solution. Therefore, I chose Microsoft Phi 3.5 (phi3.5:3.8b-mini-instruct) as my model of choice.
Context Window – I quickly learned that a large content window-based model is best for generating great summaries. I selected Phi 3.5 due to its 128K context window.
LLM Model Quantization – My NVIDIA A4000 has 16GB of VRAM. To effectively use the Phi-3.5 model, I opted for the quantized phi3.5:3.8b-mini-instruct-q4_K_M model, which balances performance and quality output. However, I noted that the KVCache still overflows and utilizes system RAM. I also experimented with Q8 LLaMA models, but I found Q4 to be the best fit.
- Because I am using the quantised model, I ended up using the Ollama container to run the GGUF model, which has the most straightforward implementation.
API/Containers – All the code utilizes FastAPI for GET/POST requests. Of course, for modularity, everything operates within a container.

Implementation Details

FastAPI for the REST Endpoints
- /transcribe: Receives an audio file, calls Whisper for transcription, and returns the text.
- /summarize: Takes the transcribed text, calls the LLM, and returns a summary.
- Health checks (/health) keep the container orchestration informed of readiness.
Whisper
- We used a GPU version for speed (if torch.cuda.is_available()).
- For smaller hardware or faster inference, you can opt for "tiny" or "small" model sizes.
LLM Summarization
- Could be an open-source LLM (like Llama 2, GPT-NeoX, etc.) or something hosted. We are using Microsoft Phi 3.5 (phi3.5:3.8b-mini-instruct-q4_K_M)
- Direct Processing: Transcript processed in a single pass using Phi model. The biggest reason to choose a large context window is to ensure the model can process the entire transcript without truncation, chunking, overlapping sections, etc as the quality gets deteriorated with chunking
- Structured Output: Summary organized into clear sections:
  - Overview
  - Main Points
  - Key Insights
  - Action Items / Decisions
  - Open Questions / Next Steps
  - Conclusions
- System Prompt does all the magic for summarisation. I highly recommend spending time and learning the System Prompt
Docker
- A Dockerfile that installs Python, PyTorch, Whisper, plus your LLM dependencies.
- The container also runs Uvicorn for FastAPI.
- If GPU acceleration is needed, we used an NVIDIA CUDA base image (e.g., nvidia/cuda:12.1.0-runtime-ubuntu22.04) and pass --gpus all to docker run.
Optional: Streamlit UI
- If you want a friendly front-end, spin up a UI to upload audio, track progress, and view results in real-time.
- Alternatively, you could just expose the endpoints in FastAPI and have your favorite front-end call them.

Key Challenges and Lessons

Timeouts for Large Audio
- Whisper or LLM summarization might take a while for hour-long recordings.
- We increased the request timeout or used asynchronous background tasks.
GPU Memory Constraints
- Large LLMs and Whisper can each use significant VRAM.
- Consider smaller quantized (Q2, etc.) LLMs or chunk-based summarization.
Accuracy vs. Speed
- The "medium" or "large" Whisper model is more accurate but slower.
- Summaries can get more coherent using bigger LLMs, but performance can suffer.
Logging & Error Handling
- Detailed logs ensure you catch issues early (e.g., partial transcripts, AI inference errors).
- A robust system logs whether GPU is found, load times, and inference performance metrics.
Security & Data Privacy
- Audio recordings may contain sensitive information.
- Ensure your container or environment has proper access controls.

Validation and Examples

Meeting Transcription + Summary
- A 30-minute internal meeting is processed, producing a transcript of 6,000 words.
- LLM Summaries: A concise bullet-point list of decisions, tasks, and key insights.
Podcast or Interview
- Summarize multi-speaker dialogues to highlight important quotes or topics.
- Possibly split each speaker’s segment, then unify in final text.
Conference Keynote
- Summaries merged into an “executive summary” with top-level takeaways.

Project Structure

LocalAudioTran-LLM-Summar/
├─ .dockerignore
├─ .env
├─ .gitignore
├─ README.md
├─ docker-compose.yml
├─ Dockerfile
├─ backend/
│  ├─ requirements.txt
│  └─ app/
│     ├─ main.py
│     ├─ services/
│     │  ├─ transcription.py
│     │  ├─ summarization.py
│     │  └─ __init__.py
│     ├─ utils/
│     │  └─ logger.py
│     ├─ models/
│     │  ├─ schemas.py
│     │  └─ __init__.py
│     └─ __init__.py
├─ frontend/
│  ├─ requirements.txt
│  └─ src/
│     └─ app.py
└─ logs/

transcription.py loads Whisper, handles file I/O.
summarization.py calls your Phi3.5 LLM (Hugging Face Transformers, Ollama, etc.).
docker-compose.yml to spin up both the app container and optional GPU-based service.

Conclusion

By combining OpenAI Whisper (or any speech-to-text engine) with a Large Language Model (Phi 3.5 mini) summarizer inside a Docker container, we’ve built a unified pipeline for turning raw audio into manageable transcripts and actionable summaries. Whether you’re automating meeting minutes or analyzing podcast content, this approach saves countless hours. Feel free to experiment with chunking, smaller models, or advanced summarization prompts. Let me know how it goes!

Happy transcribing and summarizing!
Aresh Sarkari

Reference Links

Following are the list of helpful Links:

Description	Link
Microsoft Phi 3.5 Model Page	microsoft/Phi-3.5-mini-instruct · Hugging Face
OpenAI Whisper Model	openai/whisper-medium · Hugging Face
Ollama Model Card Details	phi3.5:3.8b-mini-instruct-q4_K_M
NVIDIA Docker Images (Contains Container Engine)	nvidia/cuda – Docker Image \| Docker Hub
IDE Editor of my choice	Cursor – The AI Code Editor

Tags: Docker, FastAPI, LLM, Microsoft, NVIDIA, OpenAI Whisper, Phi 3.5, RTX

Comments Leave a Comment
Categories AI, AI & MachineLearning, LLM

Unlocking the Power of Multimodal AI: A Deep Dive into LLaVA and LLaMA 3 – Demo in LM Studio

23 May

In my earlier post we explored uncensored LLM like Dolphin. Today, we shall look into the intersection of visual and language understanding what happens when a marriage takes place between Vision & LLM. One such innovation is LLaVA (Large Language and Visual Assistant), an open-source generative AI model that combines the strengths of vision encoders and large language models to create a powerful tool for general-purpose visual and language understanding. In this blog post, we’ll delve into the details of LLaVA, its underlying models, and how you can harness its capabilities using LMStudio.

What is LLaVA?

🖼️ LLaVA is a novel, end-to-end trained large multimodal model that integrates a pre-trained CLIP ViT-L/14 visual encoder with the Vicuna large language model. The integration is achieved through a projection matrix, enabling seamless interaction between visual and language data. LLaVA is designed to excel in both daily user-oriented applications and specialized domains such as science, offering a versatile tool for multimodal reasoning and instruction-following tasks.

What is LLaMA 3?

🧠 LLaMA 3 is the third iteration of the Large Language Model from Meta AI, known for its remarkable language understanding and generation capabilities. LLaMA 3 builds upon its predecessors with improved architecture, enhanced training techniques, and a broader dataset, making it one of the most advanced language models available. In the context of LLaVA, LLaMA 3 serves as the foundation for the language model component, providing robust support for complex conversational and reasoning tasks.

How to Run the Model Locally Using LMStudio

💻 Running LLaVA locally using LMStudio is a straightforward process that allows you to leverage the model’s capabilities on your own hardware. Here’s a step-by-step guide to get you started:

Setup Your Environment
- Install LMStudio: The software its available on (Windows, Mac & Linux). This software allows you to manage and deploy local LLMs without you having to setup Python, Machine Learning, Transformers etc. libraries. Link to Download the Windows Bits – LM Studio – Discover, download, and run local LLMs

Download the Model and Dependencies
- The best space to keep a track on models is Hugging Face – Models – Hugging Face. You can keep a track of the model releases and updates here.
- Copy the model name from Hugging Face – xtuner/llava-llama-3-8b-v1_1-gguf
- Paste this name in LM Studio and it will list out all the quantized models
- In my case due to the configurations I selected int4 model. Please note lower the quantized version less accurate the model is.
- Obtain the LLaVA model files, including the quantized GGUF version and MMProj files, from the official repository.
- Download of the model will take time depending upon your internet connection.

Prepare the Model for Running:
- Within LMStudio click on the Chat interface to configure model settings.
- Select the model from the drop down list – llava llama 3 v int4 GGUF
- You will be able to run it stock but I like to configure the Advanced Configurations
- Adjust the model settings to match your hardware capabilities and specific requirements.
- Based on your system set the GPU to 50/50 or max. I have setup for max
- Click Relod model to apply configuration

Run Inference: Start the model and begin running inference tasks, whether for visual chat, science QA, or other applications.

Note – If there is enough interest, I can also do a extended blogpost on Dockerized version of this model. Leave comments down below.

What are MMProj Files?

📂 MMProj files are a key component in the LLaVA ecosystem, representing multimodal projection matrices that facilitate the alignment between visual and language features. These files are crucial for the seamless integration of visual encoders and language models, enabling LLaVA to effectively interpret and generate content that spans both modalities. MMProj files are fine-tuned during the model’s training process to ensure optimal performance in various applications.

What is the Quantized GGUF Version of LLaVA?

💾 The quantized GGUF (GPT-Generated Unified Format) version of LLaVA represents a compressed and optimized variant of the model, enabling efficient deployment on consumer-grade hardware. Quantization reduces the precision of the model’s weights, significantly decreasing the memory footprint and computational requirements while maintaining a high level of performance. This makes the quantized GGUF version ideal for applications where resource constraints are a concern.

Testing the Model

🧪 Testing showcases the beauty of the LLaVA model look at the details its providing in the example images.

Example 1

Example 2

Through rigorous testing and validation, LLaVA continues to demonstrate its potential as a versatile and powerful multimodal model.

Reference Links

Following are the list of helpful links:

Description	Link
LLaVA Github Page	LLaVA (llava-vl.github.io)
Microsoft Research Paper	LLaVA: Large Language and Vision Assistant – Microsoft Research
Hugging Face GGUF model	xtuner/llava-llama-3-8b-v1_1-gguf · Hugging Face
Visual Instruction Tuning (arxiv)	[2304.08485] Visual Instruction Tuning (arxiv.org)

🌐 LLaVA represents a significant advancement in the field of multimodal AI, combining powerful visual and language understanding capabilities in a single, efficient model. By leveraging the strengths of LLaMA 3 and innovative techniques like quantization and multimodal projection, LLaVA offers a robust tool for a wide range of applications. Whether you’re a researcher, developer, or enthusiast, exploring the potential of LLaVA can open up new possibilities in the realm of AI-driven interaction and understanding.

By following the steps outlined in this post, you can get started with LLaVA and begin harnessing its capabilities for your own projects. Please let me know if I’ve missed any steps or details, and I’ll be happy to update the post.

Thanks,
Aresh Sarkari

Tags: AI, LLaMA3, LLaVA, LLM, LM Studio, Microsoft

Comments Leave a Comment
Categories AI & MachineLearning, LLM, Machine Learning

Exploring Uncensored LLM Model – Dolphin 2.9 on Llama-3-8b

2 May

I’ve been diving deep into the world of Large Language Models (LLMs) like ChatGPT, Gemini, Claude, and LLAMA. But recently, I stumbled upon something that completely blew my mind: uncensored LLMs! 🤯

As someone who loves pushing the boundaries of AI and exploring new frontiers, I couldn’t resist the temptation to try out an uncensored LLM for myself. And let me tell you, the experience was nothing short of mind-blowing! 🎆 After setting up and running an uncensored LLM locally for the first time, I was amazed by the raw, unfiltered outputs it generated. It gave me a whole new perspective on the potential of such LLMs and why having an uncensored variant is so important for certain perspectives and society in general.

In this blog post, I’ll be sharing my journey with uncensored LLMs, diving into the nitty-gritty details of what they are, how they differ from regular LLMs, and why they exist. I’ll also be sharing my hands-on experience with setting up and running an uncensored LLM locally, so you can try it out for yourself! 💻

🤖 Introduction: Uncensored LLM vs Regular LLM

Large Language Models (LLMs) are AI systems trained on vast amounts of text data to understand and generate human-like text based on input prompts. There are two main types of LLMs: regular and uncensored.

Regular LLMs, such as those created by major organizations like OpenAI, Anthropic, Google, etc. are designed with specific safety and ethical guidelines, often reflecting societal norms and legal standards. These models avoid generating harmful or inappropriate content. (Click on each link to read their AI Principles)

Uncensored LLMs, on the other hand, are models that do not have these built-in restrictions. They are designed to generate outputs based on the input without ethical filtering, which can be useful for certain applications but also pose risks.

📊 Table of Comparison

Feature	Regular LLM	Uncensored LLM
Content Filtering	Yes (aligned to avoid harmful content)	No (generates responses as is)
Use Cases	General purpose, safer for public use	Specialized tasks needing raw output
Cultural Alignment	Often aligned with Western norms	No specific alignment
Risk of Harmful Output	Lower	Higher
Flexibility	Restricted by ethical guidelines	Higher flexibility in responses

🐬 What is the Dolphin 2.9 Latest Model?

🐬Dolphin 2.9 is a project by Eric Hartford @ Cognitive Computations aimed at creating an open-source, uncensored, and commercially licensed dataset and series of instruct-tuned language models. This initiative is based on Microsoft’s Orca paper and seeks to provide a foundation for building customized models without the typical content restrictions found in conventional LLMs. The model uses a dataset that removes biases, alignment, or any form of censorship, aiming to create a purely instructional tool that can be layered with user-specific alignments.

🐬 The Dolphin 2.9 Dataset

Following are the details of the dataset used to train the Dolphin Model: (Note the base model is Llama-3-8b)

Dataset Details	Links
cognitivecomputations/dolphin – This dataset is an attempt to replicate the results of Microsoft’s Orca	cognitivecomputations/dolphin · Datasets at Hugging Face
HuggingFaceH4/ultrachat_200k –	HuggingFaceH4/ultrachat_200k · Datasets at Hugging Face
teknium/OpenHermes-2.5 – This is the dataset that made OpenHermes 2.5 and Nous Hermes 2 series of models.	teknium/OpenHermes-2.5 · Datasets at Hugging Face
microsoft/orca-math-word-problems-200k – This dataset contains ~200K grade school math word problems.	microsoft/orca-math-word-problems-200k · Datasets at Hugging Face

💻 How to Run the Model Locally Using LMStudio

To run Dolphin or any similar uncensored model locally, you typically need to follow these steps, assuming you are using a system like LMStudio for managing your AI models:

Setup Your Environment:
- Install LMStudio software its available on (Windows, Mac & Linux). This software allows you to manage and deploy local LLMs without you having to setup Python, Machine Learning, Transformers etc. libraries.
- Link to Download the Windows Bits – LM Studio – Discover, download, and run local LLMs
- My laptop config has 11th Gen Intel processor, 64 GB RAM & Nvdia RTX 3080 8 GB VRAM, 3 TB Storage.

Download the Model and Dependencies:
- The best space to keep a track on models is Hugging Face – Models – Hugging Face. You can keep a track of the model releases and updates here.
- Copy the model name from Hugging Face – cognitivecomputations/dolphin-2.9-llama3-8b
- Paste this name in LM Studio and it will list out all the quantized models
- In my case due to the configurations I selected 8Bit model. Please note lower the quantized version less accurate the model is.
- Download of the model will take time depending upon your internet connection.

Prepare the Model for Running:
- Within LMStudio click on the Chat interface to configure model settings.
- Select the model from the drop down list – dolphin 2.9 llama3
- You will be able to run it stock but I like to configure the Advanced Configurations
- Based on your system set the GPU to 50/50 or max. I have setup for max
- Click Relod model to apply configuration

Run the Model:
- Use LMStudio to load and run the model.
- Within the User Prompt enter what you want to ask the Dolphin model
- Monitor the model’s performance and adjust settings as needed.

Testing and Usage:
- Once the model is running, you can begin to input prompts and receive outputs.
- Test the model with various inputs to ensure it functions as expected and adjust configurations as needed.
- Note below was a test fun prompt across ChatGPT, Claude & Dolphin. You can clearly see the winner being Dolphin 🤗

Eject and Closing the model:
- Once you done with the session you can select Eject Model
- This will release the VRAM/RAM and CPU utlization back to normal

💻 Quantized & GGUF Model

As home systems usually wont have the necessary GPU to run LLM models natively on consumer grade hardware. A quantized model is a compressed version of a neural network where the weights and activations are represented with lower-precision data types, such as int8 or uint8, instead of the typical float32. This reduces the model’s size and computational requirements while maintaining acceptable performance.

GGUF stands for “GPT-Generated Unified Format“. It refers to a type of large language model that is designed to be versatile and capable of performing a wide range of natural language processing tasks without requiring expensive GPU hardware for inference.

The Dolphin 2.9 GGUF models are:

Model Name	Quantization	Model Size	CPU	GPU	VRAM	RAM
dolphin-2.9-llama3-8b-q3_K_M.gguf	3-bit (q3)	4.02 GB	Compatible with most CPUs	Not required for inference	Not required for inference	~4.02 GB
dolphin-2.9-llama3-8b-q4_K_M.gguf	4-bit (q4)	4.92 GB	Compatible with most CPUs	Not required for inference	Not required for inference	~4.92 GB
dolphin-2.9-llama3-8b-q5_K_M.gguf	5-bit (q5)	5.73 GB	Compatible with most CPUs	Not required for inference	Not required for inference	~5.73 GB
dolphin-2.9-llama3-8b-q6_K.gguf	6-bit (q6)	6.6 GB	Compatible with most CPUs	Not required for inference	Not required for inference	~6.6 GB
dolphin-2.9-llama3-8b-q8_0.gguf	8-bit (q8)	8.54 GB	Compatible with most CPUs	Not required for inference	Not required for inference	~8.54 GB

Reference Links

Following are the list of helpful links:

Description	Link
Details and background about the Dolphin Model	Dolphin 🐬 (erichartford.com)
What are uncensored models?	Uncensored Models (erichartford.com)
Various Dolphin Models on various base LLMs	cognitivecomputations (Cognitive Computations) (huggingface.co)
Dolphin Llama 3 8B GGUF model I used on LMStudio	cognitivecomputations/dolphin-2.9-llama3-8b-gguf · Hugging Face
LM Studio	LM Studio – Discover, download, and run local LLMs
Model Memory Estimator Utility	Model Memory Utility – a Hugging Face Space by hf-accelerate

By following these steps, you can deploy and utilize an uncensored LLM like Dolphin 2.9 for research, development, or any specialized application where conventional content restrictions are not desirable. I hope you’ll find this insightful on your joruney of LLMs. Please let me know if I’ve missed any steps or details, and I’ll be happy to update the post.

Thanks,
Aresh Sarkari

Tags: AI, Datasets, Dolphin, GGUF, Hugging Face, LLM, LM Studio, Microsoft, Quantized, Windows

Comments 3 Comments
Categories AI, LLM, Machine Learning

	askaresh on PowerGUI, a graphical user int…
	will on PowerGUI, a graphical user int…
	Weekly Newsletter –… on Windows 365 Cloud Apps — Publi…
	Windows 365 Cloud Ap… on Windows 365 Cloud Apps –…
	Weekly Newsletter –… on Windows 365 Cloud Apps –…

Search

AskAresh

Offline Transcribing and Summarizing Audio with Whisper, Phi, FastAPI, Docker on NVIDIA GPU

Why Use LLMs for Audio Summaries?

High-Level Architecture

Design Decisions

Implementation Details

Key Challenges and Lessons

Validation and Examples

Project Structure

Conclusion

Reference Links

Unlocking the Power of Multimodal AI: A Deep Dive into LLaVA and LLaMA 3 – Demo in LM Studio

What is LLaVA?

What is LLaMA 3?

How to Run the Model Locally Using LMStudio

What are MMProj Files?

What is the Quantized GGUF Version of LLaVA?

Testing the Model

Reference Links

Exploring Uncensored LLM Model – Dolphin 2.9 on Llama-3-8b

🤖 Introduction: Uncensored LLM vs Regular LLM

📊 Table of Comparison

🐬 What is the Dolphin 2.9 Latest Model?

🐬 The Dolphin 2.9 Dataset

💻 How to Run the Model Locally Using LMStudio

💻 Quantized & GGUF Model

Reference Links

Recent Posts

Follow me on Twitter

Categories

Archives

Recent Comments

Meta