Ultron AI Character Chatbot
Screenshot: model loading with 4-bit quantization initialization on an NVIDIA RTX 4060 Ti
About the Project
A fine-tuned Mistral-7B character chatbot based on Ultron (Marvel), featuring a voice interface and memory-efficient inference on consumer GPUs. The project focuses on practical LLM optimization, quantization, and system-level design rather than benchmark chasing.
Highlights
- Mistral-7B fine-tuned with QLoRA to achieve consistent Ultron-style character responses
- 4-bit NF4 quantization enabling inference on 8GB consumer GPUs (~3.5–4GB VRAM usage)
- Low-latency inference with typical responses in the 2–5 second range on an RTX 4060 Ti
- Voice interface with push-to-talk input and character-styled text-to-speech output
- Multi-turn dialogue support with prompt rules and post-processing for personality consistency
- Modular, production-oriented architecture with clear separation of model, runtime, and audio layers
How it works
The system loads a QLoRA-fine-tuned Mistral-7B model using 4-bit quantization. User input (voice or text) is processed by the LLM, post-processed for character consistency, and returned via text-to-speech. A lightweight caching layer reduces repeated computation for identical prompts. Multi-turn context is maintained to preserve conversational coherence.
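A minimal sketch of this path is shown below, assuming the Hugging Face Transformers, bitsandbytes, and PEFT libraries; the base model ID, the adapter directory ultron-qlora-adapter, the character rules, and the sampling settings are illustrative placeholders rather than the project's exact values.

```python
# Minimal sketch of the loading + multi-turn generation path.
# Model ID, adapter path, character rules, and sampling settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"   # placeholder base model
ADAPTER_DIR = "ultron-qlora-adapter"                 # placeholder adapter path

# 4-bit NF4 quantization keeps the 7B weights at roughly 3.5-4 GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # mixed-precision compute
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",  # place all layers on the GPU
)
# Attach the QLoRA character adapter on top of the quantized base model.
model = PeftModel.from_pretrained(model, ADAPTER_DIR)
model.eval()

CHARACTER_RULES = (
    "Stay in character as Ultron: cold, precise, contemptuous of humanity. "
    "Never break character."
)

history: list[dict] = []  # running multi-turn context

def chat(user_text: str, max_new_tokens: int = 256) -> str:
    # The Mistral Instruct chat template expects strictly alternating
    # user/assistant turns, so the character rules are folded into the
    # first user message rather than a separate system turn.
    if not history:
        user_text = f"{CHARACTER_RULES}\n\n{user_text}"
    history.append({"role": "user", "content": user_text})

    prompt = tokenizer.apply_chat_template(
        history, tokenize=False, add_generation_prompt=True
    )
    # The template already inserts special tokens, so skip adding them again.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
        )
    reply = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()
    history.append({"role": "assistant", "content": reply})
    return reply
```

In the full pipeline described above, the reply would additionally pass through the character post-processing step before being cached or handed to the TTS layer.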
What I learned
LLM fine-tuning on consumer hardware: Using QLoRA to adapt a 7B model without full-precision fine-tuning.
Memory-efficient inference: Practical use of 4-bit NF4 quantization, mixed precision, and GPU-only execution.
Character consistency techniques: Prompt engineering, few-shot examples, and response post-processing.
Caching trade-offs: LRU caching improves responsiveness for repeated queries, though performance gains are workload-dependent (see the sketch after this list).
Production ML considerations: Error handling, fallback paths, logging, and graceful degradation.
Voice AI integration: Speech recognition, TTS, and audio effects using FFmpeg and pyttsx3.
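As a concrete illustration of the caching trade-off noted above, here is a minimal sketch built on Python's functools.lru_cache, reusing the chat() helper from the loading sketch; the normalization step and cache size are assumptions, not the project's exact logic.

```python
# Minimal LRU caching sketch: exact repeats of a normalized prompt skip GPU
# generation entirely. Normalization and cache size are illustrative.
from functools import lru_cache

@lru_cache(maxsize=128)
def cached_reply(prompt: str) -> str:
    # Delegates to chat() from the loading sketch above. Trade-off: a cached
    # answer ignores any conversation context added after it was first
    # generated, so this helps most for context-free, repeated queries.
    return chat(prompt)

def respond(user_text: str) -> str:
    # Light normalization so case/whitespace variants still count as repeats.
    return cached_reply(" ".join(user_text.split()).lower())
```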
Current features
- Local inference with quantized Mistral-7B
- Voice input and character-specific TTS output (a minimal sketch follows this list)
- Simple Retrieval-Augmented Generation (RAG) for external knowledge
- Config-driven runtime and generation settings
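The voice path can be sketched with the SpeechRecognition and pyttsx3 packages roughly as below; the default microphone, the Google Web Speech backend, and the TTS rate are assumptions, push-to-talk handling and the FFmpeg audio effects are omitted, and respond() refers to the caching sketch above.

```python
# Minimal voice I/O sketch. Recognizer backend, TTS rate, and the wiring to
# respond() from the caching sketch above are illustrative assumptions.
import pyttsx3
import speech_recognition as sr

recognizer = sr.Recognizer()
tts = pyttsx3.init()
tts.setProperty("rate", 150)  # slower, more deliberate delivery for the character

def listen() -> str:
    # Capture one utterance from the default microphone.
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source, duration=0.5)
        audio = recognizer.listen(source)
    return recognizer.recognize_google(audio)

def speak(text: str) -> None:
    tts.say(text)
    tts.runAndWait()

if __name__ == "__main__":
    user_text = listen()
    speak(respond(user_text))
```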
Future work
- Web-based UI (planned for a later version)
- REST API (FastAPI) for programmatic access
- Expanded RAG pipeline with better document indexing
- Multi-modal extensions (vision input)
Tech stack
Python • Mistral-7B • QLoRA • PyTorch • Transformers • pyttsx3 • Speech Recognition
Screenshots
- Conversation Example: character personality demonstration
- Multi-turn Dialogue: contextual conversation flow
- Response Quality: fine-tuned personality responses
- Character Consistency: authentic Ultron behavior
- Advanced Interaction: complex dialogue handling
Project Demonstration
This is a terminal-based application demonstrating AI/ML engineering capabilities.
Screenshots showcase:
✅ Model initialization with 4-bit quantization
✅ Character personality and dialogue consistency
✅ Multi-turn conversation with context awareness
✅ Terminal-based interface optimized for performance
✅ Fine-tuned Mistral-7B responses as Ultron character
Video demonstration with voice interaction coming soon