Introduction
Is it feasible to run a commercial-grade AI assistant on modest hardware? This project, which I call the "LLM Council," answers with a yes, but with significant caveats. I set out to engineer a system that intelligently delegates tasks to specialized Language Models (LLMs), performs real-time web searches, and maintains conversation context—all on a 2020 Dell Inspiron with just 8GB of RAM and no dedicated GPU.
The result is a locally hosted assistant that routes queries to domain experts (Math, Coding, Vision, Knowledge, and Research). It adapts to user needs on the fly, handling multi-turn workflows like "solve this integral" followed by "now find the derivative," though the hardware limits mean patience is required.
The Hardware Reality Check
The primary hurdle was the hardware: an Intel Core i5-1135G7 laptop with integrated graphics and only 8GB of memory. A single modern 7B parameter model typically demands 14GB of RAM in FP16 precision. Even with 4-bit quantization, running multiple models simultaneously was mathematically impossible.
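The back-of-the-envelope arithmetic (my own rough figures, ignoring KV cache and runtime overhead) makes the constraint obvious:

# Approximate weight memory for a 7B-parameter model
params = 7_000_000_000
fp16_gb = params * 2 / 1e9    # 2 bytes per weight  -> ~14 GB
q4_gb = params * 0.5 / 1e9    # 4 bits per weight   -> ~3.5 GB
print(f"FP16: {fp16_gb:.0f} GB, 4-bit: {q4_gb:.1f} GB")
# Two quantized 7B models plus the OS and a browser already overflow 8 GB.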
This constraint forced a strategic approach:
- Dynamic Loading: Only one expert is active in memory at a time (see the sketch after this list).
- Optimized Models: Choosing smaller, specialized models over larger generalists.
- Aggressive Context Management: Keeping conversation history, and therefore memory usage, in check.
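In practice, "one expert at a time" means evicting the previous model before loading the next. Here is a minimal sketch of that policy using the ollama Python package (the keep_alive option comes from Ollama's API; the wrapper and its names are illustrative, not the project's actual code):

import ollama

_active_model = None  # name of the expert currently resident in RAM

def ask_expert(model: str, prompt: str) -> str:
    """Send a prompt to one expert, keeping at most one model loaded."""
    global _active_model
    if _active_model and _active_model != model:
        # keep_alive=0 tells Ollama to unload the previous model immediately.
        ollama.generate(model=_active_model, prompt="", keep_alive=0)
    _active_model = model
    reply = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]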
The Performance Trade-off
Running LLMs on the CPU alone comes at a real cost. Without a dedicated GPU, inference is significantly slower: the first query typically takes about a minute to process while the system warms up. Switching between experts (e.g., from Math to Coding) adds a further noticeable delay, because the system must unload one model from RAM and load the next from disk.
Solving the Routing Chaos
The core of the system is determining which expert should handle a user's request. My journey involved four distinct attempts:
1. The Failed Attempts
Simple keyword matching was too rigid (confusing "integral" in code vs. math). Using an LLM as a router was accurate (94.7%) but painfully slow (latency over 10 seconds). Semantic similarity embeddings were fast but lacked the nuance to distinguish between "NVIDIA stock price" (Research) and "NVIDIA architecture" (Knowledge).
2. The Winner: Hybrid Weighted Routing
The final solution combines the best of both worlds. It uses a Weighted Keyword System for speed and precision, falling back to Semantic Similarity only when necessary.
weighted_keywords = {
    "MATH": [
        (r'\b(integral|integrate|derivative)\b', 10.0),  # Strong signal
        (r'\b(limit|series|theorem)\b', 8.0),            # Medium signal
    ],
    "CODING": [
        (r'\b(implement|write|code|debug)\b', 6.0),
        (r'\b(python|java|c\+\+)\b', 9.0),
    ],
}

This hybrid approach achieves 96.2% accuracy with a lightning-fast 90-120ms latency, ensuring the system feels responsive (even if the model generation takes longer!).
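To show how the two stages can fit together, here is a minimal routing sketch built on the weighted_keywords dict above. The function and threshold names are illustrative rather than the project's actual API, and the fallback assumes the sentence-transformers all-MiniLM-L6-v2 model:

import re
from sentence_transformers import SentenceTransformer, util

# Assumed one-line descriptions per expert, used only by the semantic fallback.
expert_descriptions = {
    "MATH": "calculus, algebra, equations, proofs and symbolic mathematics",
    "CODING": "writing, debugging and explaining source code",
}

_embedder = SentenceTransformer("all-MiniLM-L6-v2")
_desc_embeddings = {name: _embedder.encode(text) for name, text in expert_descriptions.items()}

def route(query: str, threshold: float = 8.0) -> str:
    """Score weighted keyword hits first; fall back to semantic similarity if no clear winner."""
    scores = {
        expert: sum(w for pattern, w in rules if re.search(pattern, query, re.IGNORECASE))
        for expert, rules in weighted_keywords.items()
    }
    best_expert, best_score = max(scores.items(), key=lambda kv: kv[1])
    if best_score >= threshold:
        return best_expert
    # Semantic fallback: pick the expert whose description best matches the query.
    query_emb = _embedder.encode(query)
    return max(_desc_embeddings, key=lambda name: util.cos_sim(query_emb, _desc_embeddings[name]).item())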
Meet the Council
I curated a team of specialized models that could fit within the memory constraints while delivering top-tier performance in their domains.
The Math Expert
Model: qwen2-math:7b-instruct
Selected for its superior reasoning capabilities. In tests, it solved 18/20 calculus problems correctly, significantly outperforming generalist models like Llama3.

Math expert correctly solving a calculus problem with step-by-step reasoning.
The Coding Expert
Model: deepseek-coder:6.7b-instruct
Chosen for its idiomatic code generation. It produced syntactically correct and Pythonic code in 14/15 algorithm tasks, beating out CodeLlama.

Coding expert generating idiomatic C++ code for an algorithm.
The Vision Expert
Model: llava:7b-v1.6
Vision tasks were the most memory-intensive. I initially tried Qwen2.5-VL but hit immediate OOM errors. Llava:7b proved to be the reliable "good enough" compromise for extracting text and describing images within 8GB RAM.

Vision expert successfully extracting text and describing a screenshot.
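Image queries go through the same Ollama interface as text. Here is a minimal sketch of what a vision call can look like with the ollama Python package (the prompt and file name are placeholders, not taken from the project):

import ollama

reply = ollama.chat(
    model="llava:7b-v1.6",
    messages=[{
        "role": "user",
        "content": "Extract any visible text, then briefly describe this screenshot.",
        "images": ["screenshot.png"],  # local path; the client handles encoding
    }],
)
print(reply["message"]["content"])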
Knowledge & Research: Beyond Training Data
The Knowledge Expert (RAG)
To handle specialized technical queries, I built a Retrieval-Augmented Generation (RAG) system. It ingests over 2,600 pages from standard computer science and statistics textbooks (Tanenbaum, Silberschatz, etc.).
When you ask "Explain TCP handshakes," the system retrieves the relevant chunks from these textbooks and uses qwen3:8b to synthesize an accurate answer. This drastically reduces hallucinations on technical topics.

RAG system retrieving context from textbooks to answer technical questions.
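Here is a minimal sketch of the retrieve-then-synthesize step, assuming ChromaDB's persistent client with its default embedding function and the ollama package (the collection name, index path, and prompt wording are illustrative):

import chromadb
import ollama

client = chromadb.PersistentClient(path="./textbook_index")
collection = client.get_or_create_collection("textbooks")

def answer_with_rag(question: str, k: int = 4) -> str:
    # Fetch the k most relevant textbook chunks for the question.
    results = collection.query(query_texts=[question], n_results=k)
    context = "\n\n".join(results["documents"][0])
    prompt = (
        "Answer the question using only the textbook excerpts below.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {question}"
    )
    reply = ollama.chat(model="qwen3:8b", messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]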
The Research Expert (Web Search)
For real-time information, the system acts like a local Perplexity AI. It uses the ddgs library to perform anonymous DuckDuckGo searches, then synthesizes the results into a concise answer with citations.

Research expert providing up-to-date information with cited sources.
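Here is a minimal sketch of that search-then-synthesize loop with the ddgs package (the result fields follow the library's text-search output; reusing qwen3:8b for synthesis and the prompt wording are my own assumptions):

from ddgs import DDGS
import ollama

def research(query: str, max_results: int = 5) -> str:
    # Anonymous DuckDuckGo text search; each hit has a title, href, and body snippet.
    hits = DDGS().text(query, max_results=max_results)
    sources = "\n".join(f"[{i + 1}] {h['title']} ({h['href']}): {h['body']}" for i, h in enumerate(hits))
    prompt = f"Using the numbered sources below, answer '{query}' concisely and cite sources by number.\n{sources}"
    reply = ollama.chat(model="qwen3:8b", messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]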
From Terminal to "Remote Dashboard"
To make the system truly useful, it needed to be accessible away from the keyboard. I built a Gradio web interface and bound it to my local network IP.
This simple change turned my laptop into a home AI server. I can now query the "LLM Council" from my phone on the couch or my tablet in the kitchen, with full support for conversation history.

Accessing the Gradio web interface from a mobile device on the local network.
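Exposing Gradio on the LAN only takes the server_name argument. A minimal sketch (the chat function here is a stand-in for the council's routing pipeline):

import gradio as gr

def council_chat(message, history):
    # Placeholder: the real system routes the message to the appropriate expert.
    return f"(answer to: {message})"

demo = gr.ChatInterface(council_chat, title="LLM Council")
# server_name="0.0.0.0" makes the UI reachable from other devices on the local network.
demo.launch(server_name="0.0.0.0", server_port=7860)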
Technologies Used
- Ollama: Local LLM inference
- ChromaDB: Vector database for RAG
- Sentence Transformers: Semantic search embeddings
- Gradio: Web interface
- DuckDuckGo: Real-time web search
- Python: Core logic & orchestration
Lessons Learned
1. Hardware Constraints Drive Design
Every choice in this project was shaped by having 8GB of RAM. I couldn't load multiple models, so I needed fast routing. I couldn't use the best vision model, so I compromised. Constraints forced creativity—the weighted keyword router is faster AND more accurate than my original LLM-based approach.
2. LLMs Are Terrible Routers
Using an LLM to classify queries is intuitive but impractical. The latency kills user experience, and non-determinism creates confusion. Rule-based systems (with semantic fallback) are underrated for reliability.
3. RAG Is Essential But Tricky
Without retrieval, language models hallucinate domain-specific knowledge. But RAG introduces new problems: chunking strategy, context window management, and synthesis quality. I spent more time debugging RAG than building it.
Conclusion
Building the LLM Council proved that you don't need enterprise hardware to run a sophisticated multi-expert AI system. By embracing constraints, designing efficient routing logic, and making strategic model choices, I created a versatile assistant that runs entirely on a 5-year-old laptop, going online only for the Research expert's web searches.
The system is open source and available for anyone to try or improve upon.