Ultron AI Character Chatbot
Screenshot: model loading with 4-bit quantization initialization on an NVIDIA RTX 4060 Ti
About the Project
A fine-tuned Mistral-7B character chatbot based on Ultron (Marvel), featuring a voice interface and memory-efficient inference on consumer GPUs. The project focuses on practical LLM optimization, quantization, and system-level design rather than benchmark chasing.
Highlights
- Mistral-7B fine-tuned with QLoRA to achieve consistent Ultron-style character responses
- 4-bit NF4 quantization enabling inference on 8GB consumer GPUs (~3.5–4GB VRAM usage)
- Low-latency inference with typical responses in the 2–5 second range on an RTX 4060 Ti
- Voice interface with push-to-talk input and character-styled text-to-speech output
- Multi-turn dialogue support with prompt rules and post-processing for personality consistency
- Modular, production-oriented architecture with clear separation of model, runtime, and audio layers
How it works
The system loads a QLoRA-fine-tuned Mistral-7B model using 4-bit quantization. User input (voice or text) is processed by the LLM, post-processed for character consistency, and returned via text-to-speech. A lightweight caching layer reduces repeated computation for identical prompts. Multi-turn context is maintained to preserve conversational coherence.
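A minimal sketch of this path is shown below, assuming the Hugging Face Transformers, bitsandbytes, and PEFT libraries; the base model ID, the adapter directory ultron-qlora-adapter, the character rules, and the sampling settings are illustrative placeholders rather than the project's exact values.

```python
# Minimal sketch of the loading + multi-turn generation path.
# Model ID, adapter path, character rules, and sampling settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"   # placeholder base model
ADAPTER_DIR = "ultron-qlora-adapter"                 # placeholder adapter path

# 4-bit NF4 quantization keeps the 7B weights at roughly 3.5-4 GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # mixed-precision compute
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",  # place all layers on the GPU
)
# Attach the QLoRA character adapter on top of the quantized base model.
model = PeftModel.from_pretrained(model, ADAPTER_DIR)
model.eval()

CHARACTER_RULES = (
    "Stay in character as Ultron: cold, precise, contemptuous of humanity. "
    "Never break character."
)

history: list[dict] = []  # running multi-turn context

def chat(user_text: str, max_new_tokens: int = 256) -> str:
    # The Mistral Instruct chat template expects strictly alternating
    # user/assistant turns, so the character rules are folded into the
    # first user message rather than a separate system turn.
    if not history:
        user_text = f"{CHARACTER_RULES}\n\n{user_text}"
    history.append({"role": "user", "content": user_text})

    prompt = tokenizer.apply_chat_template(
        history, tokenize=False, add_generation_prompt=True
    )
    # The template already inserts special tokens, so skip adding them again.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
        )
    reply = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()
    history.append({"role": "assistant", "content": reply})
    return reply
```

In the full pipeline described above, the reply would additionally pass through the character post-processing step before being cached or handed to the TTS layer.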
What I learned
LLM fine-tuning on consumer hardware: Using QLoRA to adapt a 7B model without full-precision fine-tuning.
Memory-efficient inference: Practical use of 4-bit NF4 quantization, mixed precision, and GPU-only execution.
Character consistency techniques: Prompt engineering, few-shot examples, and response post-processing.
Caching trade-offs: LRU caching improves responsiveness for repeated queries, though performance gains are workload-dependent (see the sketch after this list).
Production ML considerations: Error handling, fallback paths, logging, and graceful degradation.
Voice AI integration: Speech recognition, TTS, and audio effects using FFmpeg and pyttsx3.
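As a concrete illustration of the caching trade-off noted above, here is a minimal sketch built on Python's functools.lru_cache, reusing the chat() helper from the loading sketch; the normalization step and cache size are assumptions, not the project's exact logic.

```python
# Minimal LRU caching sketch: exact repeats of a normalized prompt skip GPU
# generation entirely. Normalization and cache size are illustrative.
from functools import lru_cache

@lru_cache(maxsize=128)
def cached_reply(prompt: str) -> str:
    # Delegates to chat() from the loading sketch above. Trade-off: a cached
    # answer ignores any conversation context added after it was first
    # generated, so this helps most for context-free, repeated queries.
    return chat(prompt)

def respond(user_text: str) -> str:
    # Light normalization so case/whitespace variants still count as repeats.
    return cached_reply(" ".join(user_text.split()).lower())
```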
Current features
- Local inference with quantized Mistral-7B
- Voice input and character-specific TTS output (a minimal sketch follows this list)
- Simple Retrieval-Augmented Generation (RAG) for external knowledge
- Config-driven runtime and generation settings
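The voice path can be sketched with the SpeechRecognition and pyttsx3 packages roughly as below; the default microphone, the Google Web Speech backend, and the TTS rate are assumptions, push-to-talk handling and the FFmpeg audio effects are omitted, and respond() refers to the caching sketch above.

```python
# Minimal voice I/O sketch. Recognizer backend, TTS rate, and the wiring to
# respond() from the caching sketch above are illustrative assumptions.
import pyttsx3
import speech_recognition as sr

recognizer = sr.Recognizer()
tts = pyttsx3.init()
tts.setProperty("rate", 150)  # slower, more deliberate delivery for the character

def listen() -> str:
    # Capture one utterance from the default microphone.
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source, duration=0.5)
        audio = recognizer.listen(source)
    return recognizer.recognize_google(audio)

def speak(text: str) -> None:
    tts.say(text)
    tts.runAndWait()

if __name__ == "__main__":
    user_text = listen()
    speak(respond(user_text))
```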
Future work
- Web-based UI (planned for a later version)
- REST API (FastAPI) for programmatic access
- Expanded RAG pipeline with better document indexing
- Multi-modal extensions (vision input)
Tech stack
Python • Mistral-7B • QLoRA • PyTorch • Transformers • pyttsx3 • Speech Recognition
Screenshots
- Conversation Example: character personality demonstration
- Multi-turn Dialogue: contextual conversation flow
- Response Quality: fine-tuned personality responses
- Character Consistency: authentic Ultron behavior
- Advanced Interaction: complex dialogue handling
Project Demonstration
This is a terminal-based application demonstrating AI/ML engineering capabilities.
Screenshots showcase:
✅ Model initialization with 4-bit quantization
✅ Character personality and dialogue consistency
✅ Multi-turn conversation with context awareness
✅ Terminal-based interface optimized for performance
✅ Fine-tuned Mistral-7B responses as Ultron character
Video demonstration with voice interaction coming soon