Building Local AI Systems: Qwen3.6 + MCPs

This article introduces how to build local AI systems using the Qwen3.6-35B-A3B model and the Model Context Protocol (MCP), covering model architecture, hardware requirements, deployment, and a practical GitHub developer assistant example.

Source: KDnuggets. Author: Shittu Olumide

Introducing MCP

Every developer building with local AI hits the same wall eventually. The model works. It reasons well, writes solid code, and answers complex questions. But it cannot do everything. It cannot query your database, open a GitHub issue, or call your internal API. You are left writing custom Python wrappers for every tool you need, hardcoding the glue between model output and tool execution, and maintaining those wrappers every time an API changes.

The Model Context Protocol (MCP) was designed to solve exactly this. It is an open standard by Anthropic: a universal, pluggable protocol for AI tool connectivity. Define a tool once as an MCP server. Any MCP-compatible client, any model, any framework, can discover and call it with zero custom integration code per model.

Qwen3.6-35B-A3B is one of the most capable local models for this kind of work right now. It has a 262,144-token context window and a Mixture of Experts (MoE) architecture that activates only 3B of its 35B parameters per forward pass. All 35B parameters still have to be stored, but per-token compute is close to that of a 3B model, which is why quantized and CPU-offloaded deployments stay usable on consumer hardware. It was also explicitly trained and evaluated on MCP-based agentic tasks.

This article builds a local GitHub developer assistant: an agent that reads a repository's open issues, searches the relevant code, drafts a fix, and creates a pull request. The whole thing runs on your hardware, through MCP servers, with no cloud dependency.

Understanding Qwen3.6-35B-A3B

Understanding the architecture matters here because it directly explains what hardware you need and why the model performs the way it does on agentic tasks.

The name encodes the key fact: 35B total parameters, with A3B meaning 3B activated per forward pass. It is an MoE model with 256 experts per layer, and each token is routed to 8 experts plus 1 shared expert. You get the knowledge capacity of a 35B model at roughly the inference compute cost of a 3B model. The memory footprint is still that of a 35B model, since every expert must be resident, but the low per-token compute is what keeps quantized and CPU-offloaded setups fast enough to use where a dense 35B would crawl.
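The routing idea can be sketched in a few lines of plain Python. This is a toy illustration, not the model's actual router: the expert counts come from the paragraph above, while the random gate logits and the choice to apply softmax over only the selected experts are assumptions made for the example.

```python
import math
import random

NUM_EXPERTS = 256   # routed experts per layer (from the text above)
TOP_K = 8           # routed experts activated per token

def route_token(gate_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and normalize their weights."""
    # Indices of the top_k largest gate logits.
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:top_k]
    # Softmax over the selected logits only, so the mixing weights sum to 1.
    # (Assumption: real routers may normalize differently.)
    peak = max(gate_logits[i] for i in top)
    exps = {i: math.exp(gate_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in gate scores
weights = route_token(logits)

# The shared expert always runs, so 9 of 257 expert networks do work per token.
active_fraction = (len(weights) + 1) / (NUM_EXPERTS + 1)
print(f"routed experts: {sorted(weights)}")
print(f"fraction of experts active per token: {active_fraction:.3f}")
```

Only the selected experts run for a given token, but which experts are selected changes from token to token, so none of them can be dropped from memory.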

The hidden layout is where Qwen3.6 diverges most from other MoE models. The 40-layer stack is organized into repeating blocks with a 3:1 ratio of Gated DeltaNet layers to Gated Attention layers, which works out to 30 DeltaNet layers and 10 full-attention layers. DeltaNet is a linear attention mechanism; it processes sequences more efficiently than full quadratic attention, especially at long context lengths. The interleaved full Gated Attention layers provide the deep relational reasoning that linear attention alone misses. For an agent working through a 500-file repository, that combination matters: efficient processing at length combined with precise reasoning on the relevant sections.
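A quick way to picture the layout is to enumerate it. The sketch below assumes each group of four layers places the three Gated DeltaNet layers first and the Gated Attention layer last; the text only gives the ratio, so that ordering and the layer names are illustrative.

```python
NUM_LAYERS = 40
# Assumed ordering inside one repeating block: three linear-attention layers, then one full-attention layer.
BLOCK = ["gated_deltanet"] * 3 + ["gated_attention"]

layout = [BLOCK[i % len(BLOCK)] for i in range(NUM_LAYERS)]
counts = {kind: layout.count(kind) for kind in BLOCK}

print(counts)
print(layout[:8])
```

Under that assumption, full attention appears once every four layers, so the quadratic cost is paid on a quarter of the stack.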

The context window is 262,144 tokens natively, extensible to 1,010,000 with YaRN scaling. For agent work, context length is not a comfort feature; it is an operational constraint. An agent reading source files, maintaining tool call history, tracking a multi-step plan, and injecting tool results back into context needs real headroom. Many older 7B and 13B models cap at 8k or 32k tokens. Running out of context mid-task means the agent loses its own history and starts hallucinating tool results.
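To make the headroom concern concrete, here is a rough budget check an agent loop could run before injecting another tool result. The four-characters-per-token ratio and the output reserve are assumptions for illustration; a real implementation would count tokens with the model's tokenizer.

```python
CONTEXT_WINDOW = 262_144   # native window from the text above
CHARS_PER_TOKEN = 4        # rough heuristic (assumption), not the real tokenizer

def estimate_tokens(text: str) -> int:
    """Crude token estimate based on character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def remaining_budget(messages, reserve_for_output=8_192):
    """Tokens left for new input after history and a reserved output allowance."""
    used = sum(estimate_tokens(m["content"]) for m in messages)
    return CONTEXT_WINDOW - reserve_for_output - used

history = [
    {"role": "system", "content": "You are a repository assistant."},
    {"role": "tool", "content": "x" * 400_000},   # stand-in for a large file dump
]
left = remaining_budget(history)
print(f"estimated tokens left: {left}")
```

When the remaining budget gets small, the loop can summarize or drop old tool output instead of silently truncating history.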

Qwen3.6 was explicitly trained and evaluated on MCP-based agentic benchmarks. Two headline features came out of that training:

Agentic Coding. Frontend workflows and repository-level reasoning — the model handles multi-file refactoring tasks with coherent reasoning across files, not just single-file edits in isolation.

Thinking Preservation. A preserve_thinking flag that retains reasoning traces from prior turns in a multi-turn conversation. When an agent reasons through a plan in turn one and then executes tool calls in turns two through five, preserve_thinking=True keeps the turn-one reasoning in the conversation history, where the serving layer's prefix cache can reuse its KV entries. Each subsequent turn benefits from that prior reasoning without paying the cost of re-deriving it.

System Requirements

There are three realistic deployment paths, and which one you use depends entirely on your hardware.

GPU inference (recommended for production agent workloads). Qwen3.6-35B-A3B in bfloat16 requires approximately 70 GB VRAM. In Q4 quantization, it fits in approximately 20–24 GB. A single RTX 4090 (24 GB) handles Q4. Two RTX 3090s with tensor parallelism handle Q4 as well. An A100 80 GB handles the full bfloat16 model.
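The VRAM figures follow from simple arithmetic on the parameter count. The sketch below computes the weight storage alone, assuming exactly 16 and 4 bits per parameter; it ignores KV cache, activations, and quantization metadata.

```python
TOTAL_PARAMS = 35e9   # every expert must be resident, not just the 3B active ones

def weight_gb(params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return params * bits_per_param / 8 / 1e9

for label, bits in [("bfloat16", 16), ("Q4", 4)]:
    print(f"{label}: ~{weight_gb(TOTAL_PARAMS, bits):.1f} GB for weights")
```

This gives about 70 GB for bfloat16 and about 17.5 GB at 4 bits. The gap between 17.5 GB and the 20–24 GB quoted above is plausibly the KV cache, activations, and per-block quantization overhead, which grow with context length.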

CPU/Hybrid via KTransformers. KTransformers is the accessible path for developers without a 24 GB GPU. It offloads compute-heavy layers to GPU when available and runs the rest on CPU. With 64 GB system RAM, you can run Qwen3.6-35B-A3B in a usable (if slower) configuration. Response latency will be 30–120 seconds per turn depending on your CPU, which is workable for an agent doing background repository analysis but not for interactive coding sessions.

Smaller models for tutorial testing. The entire MCP integration pattern in this article is identical regardless of model size. If you want to follow along without the hardware for the full 35B model, use Qwen/Qwen2.5-7B-Instruct via Ollama (ollama pull qwen2.5:7b) or the Qwen3-8B model. The serving API is the same, the code is identical, and you can swap in the 35B model when hardware permits.

Software requirements:

# Python 3.11+ required
python --version

python -m venv qwen-mcp-env
source qwen-mcp-env/bin/activate    # macOS / Linux
qwen-mcp-env\Scripts\activate       # Windows

# Core packages
pip install \
  "openai>=1.30.0" \
  "qwen-agent>=0.0.10" \
  "mcp>=1.0.0" \
  "httpx>=0.27.0"

# Serving framework -- choose one
pip install "vllm>=0.19.0"      # NVIDIA GPU
pip install "sglang>=0.5.10"    # NVIDIA GPU (faster prefill for long context)
pip install "ktransformers"     # CPU/hybrid

# Node.js 18+ is required for pre-built MCP servers installed via npx
node --version

Serving Qwen3.6 Locally with an OpenAI-Compatible API

Before wiring in any MCP servers, you need a running inference server. Both SGLang and vLLM expose an OpenAI-compatible API that the MCP integration layer talks to — the same API surface, just pointed at localhost instead of api.openai.com.

// SGLang (Recommended for Long-Context Agent Workloads)

# Install SGLang with full dependencies
pip install "sglang[all]>=0.5.10"

# Serve Qwen3.6-35B-A3B with reasoning and tool-call parsers enabled.
# --reasoning-parser qwen3 correctly handles the <think>...</think> blocks.
# --tool-call-parser qwen3_coder routes tool call outputs to the right format.
# Prefix caching (RadixAttention) is on by default in SGLang, so no flag is needed.
# It is critical for agent workloads: it enables KV cache reuse across turns,
# which is what makes preserve_thinking efficient in practice.
# --tp 2 runs tensor parallel across 2 GPUs; remove it if using a single GPU.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-35B-A3B \
  --host 0.0.0.0 \
  --port 30000 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --tp 2

// vLLM

pip install "vllm>=0.19.0"

# vLLM equivalent with the same critical flags.
# --enable-auto-tool-choice is required for the tool-call parser to take effect.
vllm serve Qwen/Qwen3.6-35B-A3B \
  --host 0.0.0.0 \
  --port 8000 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --tensor-parallel-size 2

// Smaller Model via Ollama

ollama pull qwen2.5:7b
ollama serve

# Ollama's API is OpenAI-compatible at http://localhost:11434/v1

Once the server is running, verify it before going any further:

# Health check -- should return {"status": "ok"} or similar
curl http://localhost:30000/health

# Test the chat completions endpoint with a simple query
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Reply with: ready"}],
    "max_tokens": 10
  }'

If you get a JSON response with a choices array, the server is ready. Do not proceed to MCP setup until this works. Every integration failure you will encounter later is easier to debug when you know the serving layer is solid.
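The same readiness check can be scripted so the agent refuses to start against a dead endpoint. This is a minimal standard-library sketch; the helper names (build_probe, is_ready, probe_server) are hypothetical, and the default URL assumes the SGLang port used above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:30000/v1"   # SGLang port from above; adjust for vLLM or Ollama

def build_probe(model: str) -> dict:
    """Smallest useful chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "Reply with: ready"}],
        "max_tokens": 10,
    }

def is_ready(response: dict) -> bool:
    """The server counts as ready if the reply carries a non-empty choices array."""
    choices = response.get("choices")
    return isinstance(choices, list) and len(choices) > 0

def probe_server(model: str, base_url: str = BASE_URL, timeout: float = 30.0) -> bool:
    """POST the probe and report readiness; connection or parse errors mean not ready."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_probe(model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as reply:
            return is_ready(json.load(reply))
    except (OSError, ValueError):
        return False
```

Calling probe_server("Qwen/Qwen3.6-35B-A3B") at the top of an agent script turns a confusing mid-run failure into an immediate, obvious one.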

Understanding MCP and Why It Changes the Agent Architecture

Before writing any agent code, it helps to understand what MCP actually does at the protocol level, because that understanding prevents a category of bugs that come from treating MCP as just a fancier function-calling API.

MCP is a JSON-RPC 2.0 protocol running over stdio or HTTP transport. When an MCP client connects to a server, it first completes an initialize handshake, then calls tools/list to discover what tools the server exposes. Each tool comes back with a name, a description, and an input schema defined in JSON Schema. The model reads this schema. It is the model's contract with the tool.
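On the wire, discovery and invocation are ordinary JSON-RPC messages. The sketch below builds a tools/list request, shows the shape of a reply, and builds a matching tools/call request. The read_file tool and its schema are hypothetical; only the method names and the name, description, and inputSchema fields come from the protocol.

```python
import json
from typing import Optional

def jsonrpc_request(request_id: int, method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request object."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return message

list_request = jsonrpc_request(1, "tools/list")

# Hypothetical server reply: one tool with a JSON Schema input contract.
list_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "read_file",
                "description": "Read a UTF-8 text file and return its contents.",
                "inputSchema": {
                    "type": "object",
                    "properties": {"path": {"type": "string"}},
                    "required": ["path"],
                },
            }
        ]
    },
}

# The arguments object must satisfy the tool's inputSchema.
call_request = jsonrpc_request(
    2, "tools/call", {"name": "read_file", "arguments": {"path": "README.md"}}
)
print(json.dumps(call_request, indent=2))
```

The description and schema are the only things the model sees about a tool, so vague descriptions translate directly into wrong or malformed calls.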

When the model wants to call a tool, it emits a structured tool call object. The MCP client — not the model — actually executes the call by sending a tools/call request to the server. The server handles execution and returns a result. The client injects that result back into the conversation as a tool role message. The model reads the result and decides the next step.

This separation is important. The model decides what to call and with what arguments. The client handles execution. The server handles the actual work. Your code never hardwires a tool to a model; you just tell the client which servers are available.
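The division of labor is easier to see in a toy loop with no real model or server. Everything here is a stand-in: FakeServer, fake_model, and the count_open_issues tool are hypothetical, and the message shapes follow the OpenAI-style chat format as an assumption.

```python
import json

class FakeServer:
    """Stand-in for an MCP server: owns the tool implementations."""
    def call_tool(self, name, arguments):
        if name == "count_open_issues":
            return {"content": [{"type": "text", "text": "3"}]}
        raise ValueError(f"unknown tool: {name}")

def fake_model(messages):
    """Stand-in for the model: decides what to call, never executes anything."""
    if messages[-1]["role"] == "tool":
        return {"role": "assistant", "content": f"There are {messages[-1]['content']} open issues."}
    return {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {
                "name": "count_open_issues",
                "arguments": json.dumps({"repo": "example/repo"}),
            },
        }],
    }

def run_agent(user_prompt, server, max_steps=5):
    """The client: runs the loop and is the only component that talks to the server."""
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):
        reply = fake_model(messages)
        messages.append(reply)
        if not reply.get("tool_calls"):
            return messages   # plain answer, so the task is done
        for call in reply["tool_calls"]:
            args = json.loads(call["function"]["arguments"])
            result = server.call_tool(call["function"]["name"], args)
            text = "".join(p["text"] for p in result["content"] if p["type"] == "text")
            # Inject the result back into the conversation as a tool-role message.
            messages.append({"role": "tool", "tool_call_id": call["id"], "content": text})
    return messages

transcript = run_agent("How many issues are open?", FakeServer())
print(transcript[-1]["content"])
```

Swapping FakeServer for a real MCP session or fake_model for a call to the local endpoint changes one component without touching the other two.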

There are two ways to use MCP with Qwen3.6:

Via Qwen-Agent: the official qwen_agent library handles tool discovery, call parsing, result injection, and multi-turn conversation management automatically. Less code, less control. Right for most use cases.

Via the MCP Python SDK directly: you handle the agentic loop yourself using mcp.ClientSession. More code, full visibility into every message, complete control over error handling and retry logic. Right for production systems where you need to monitor every step.

This article covers both, starting with Qwen-Agent.
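For the direct SDK route, the first step of a hand-written loop is usually discovering tools and translating them into the function-tool format that an OpenAI-compatible endpoint accepts. The sketch below assumes the mcp Python SDK's stdio client; to_openai_tool and discover_tools are hypothetical helper names, and the demo descriptor at the bottom is invented so the converter can be exercised offline.

```python
import asyncio
from types import SimpleNamespace

def to_openai_tool(tool) -> dict:
    """Convert an MCP tool descriptor into an OpenAI-style function tool entry."""
    return {
        "type": "function",
        "function": {
            "name": tool.name,
            "description": tool.description or "",
            "parameters": tool.inputSchema,
        },
    }

async def discover_tools(command: str, args: list) -> list:
    """Launch a stdio MCP server and return its tools in OpenAI format."""
    # Imported here so the converter above works without the mcp package installed.
    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    params = StdioServerParameters(command=command, args=args)
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()            # handshake before any other request
            listing = await session.list_tools()  # sends tools/list
            return [to_openai_tool(tool) for tool in listing.tools]

# Offline demonstration with a hypothetical tool descriptor.
demo = SimpleNamespace(
    name="read_file",
    description="Read a text file.",
    inputSchema={"type": "object", "properties": {"path": {"type": "string"}}, "required": ["path"]},
)
print(to_openai_tool(demo)["function"]["name"])

# With a real server (package name left as a placeholder):
# tools = asyncio.run(discover_tools("npx", ["-y", "<server-package>"]))
```

The resulting list can be passed as the tools parameter of a chat completion request, which is the point where the model first sees each tool's schema.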

Building the Local GitHub Developer Assistant

The agent does four things in sequence: reads open issues from a GitHub repository, finds the relevant code, drafts a fix, and opens a pull request. All locally, all through MCP.

// Part 1: Environment and MCP Server Setup

# Set your GitHub personal access token
# Required by the GitHub MCP server for API calls
export GITHUB_TOKEN=ghp_your_token_here

# Pre-built MCP servers install via npx -- no separate install step
# npx handles this on first use when the agent starts the servers
# Verify npx is available:
npx --version

# Create a project directory:
mkdir qwen-github-agent
cd qwen-github-agent
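These prerequisites can also be checked from Python before the agent launches any MCP server. The preflight helper below is a hypothetical convenience, not part of Qwen-Agent or the MCP SDK; it only checks that the token variable is non-empty and that node and npx resolve on PATH.

```python
import os
import shutil

def preflight(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("GITHUB_TOKEN"):
        problems.append("GITHUB_TOKEN is not set")
    for tool in ("node", "npx"):
        if which(tool) is None:
            problems.append(f"{tool} was not found on PATH")
    return problems

if __name__ == "__main__":
    for problem in preflight():
        print(f"missing: {problem}")
```

Passing env and which as parameters keeps the check testable without touching the real environment.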

// Part 2: Qwen-Agent Implementation

The fastest path to a working agent. Qwen-Agent handles the full loop automatically.

# github_agent_qwenagent.py
# Prerequisites: pip install qwen-agent openai
#   npm / npx must be installed for the MCP servers
#   GITHUB_TOKEN env var must be set
#   Local serving endpoint must be running (see previous section)
#
# How to run:
#   python github_agent_qwenagent.py

from qwen_agent.agents import Assistant

# ── Server configuration ──────────────────────────────────────────────────────
# Point at your local serving endpoint.
# Change model_server to match whichever server you started:
#   SGLang: http://localhost:30000/v1
#   vLLM:   http://localhost:8000/v1
#   Ollama: http://localhost:11434/v1
LLM_CONFIG = {
    "model": "Qwen/Qwen3.6-35B-A3B",
    "model_server": "http://localhost:30000/v1",
    "api_key": "EMPTY",  # Local servers do not require a real key
    # Thinking mode sampling params (from the official model card best practices)
    "generate_cfg": {
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "min_p": 0.0,
        "thought_in_history": True,  # This is the preserve_thinking flag in Qwen-Agent
    },
}

# ── MCP server configuration ──────────────────────────────────────────────────
# Each server key names the server; the value is the stdio launch command.
# Qwen-Agent starts each server as a subprocess and manages the MCP sessions.
MCP_SERVERS = {
    "mcpServers": {
        "filesystem": {
            "command": "npx",
            "args": [
                "-y",

[truncated for AI cost control]