2026-06-23 06:35 UTC | In-site rewrite | 5 min read | Updated: 2026-06-23 06:35 UTC

GLM-5.2 OpenAI-Compatible API: A Hands-On Guide to Reasoning Effort, Function Calling, and Long-Context Retrieval

This tutorial provides a practical walkthrough for using GLM-5.2 through its OpenAI-compatible API, covering key features such as reasoning-effort control, streaming, function calling, tool-using agents, structured JSON output, long-context retrieval, and cost estimation.

Source: MarkTechPost | Author: Sana Hassan

In this tutorial, we work with GLM-5.2 and use its hosted, OpenAI-compatible API instead of running the full model locally. We begin by setting up multiple provider options, securely loading the API key, and creating a reusable chat wrapper that supports normal chat, thinking mode, streaming, tool calling, and token tracking. Then we move beyond a simple chatbot example and test the model in more practical situations, including reasoning-effort control, streamed reasoning and answers, function calling, a small tool-using agent, structured JSON output, long-context retrieval, and cost estimation.

Setting Up the GLM-5.2 OpenAI-Compatible Client and Reusable Chat Wrapper


import sys, subprocess
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-U", "openai"], check=False)

import os, re, json, time, getpass
from openai import OpenAI

# Hosted endpoints that expose GLM-5.2 through an OpenAI-compatible API.
PROVIDERS = {
    "zai":         {"base_url": "https://api.z.ai/api/paas/v4/",    "model": "glm-5.2",         "env": "ZAI_API_KEY"},
    "openrouter":  {"base_url": "https://openrouter.ai/api/v1",     "model": "z-ai/glm-5.2",    "env": "OPENROUTER_API_KEY"},
    "together":    {"base_url": "https://api.together.xyz/v1",      "model": "zai-org/GLM-5.2", "env": "TOGETHER_API_KEY"},
    "requesty":    {"base_url": "https://router.requesty.ai/v1",    "model": "zai/glm-5.2",     "env": "REQUESTY_API_KEY"},
    "huggingface": {"base_url": "https://router.huggingface.co/v1", "model": "zai-org/GLM-5.2", "env": "HF_TOKEN"},
}
PROVIDER = "zai"
CFG = PROVIDERS[PROVIDER]
MODEL = CFG["model"]


def load_api_key(env_name):
    # Lookup order: Colab secrets, then environment variable, then interactive prompt.
    try:
        from google.colab import userdata
        v = userdata.get(env_name)
        if v:
            return v
    except Exception:
        pass
    if os.environ.get(env_name):
        return os.environ[env_name]
    return getpass.getpass(f"Enter your {env_name}: ")


client = OpenAI(api_key=load_api_key(CFG["env"]), base_url=CFG["base_url"])

# Prices per one million tokens, used later for cost estimation.
PRICE_IN_PER_M, PRICE_OUT_PER_M = 1.40, 4.40
_USAGE = {"in": 0, "out": 0, "calls": 0}


def _track(usage):
    # Accumulate token counts across every call in the notebook.
    if usage:
        _USAGE["in"] += getattr(usage, "prompt_tokens", 0) or 0
        _USAGE["out"] += getattr(usage, "completion_tokens", 0) or 0
        _USAGE["calls"] += 1


def get_reasoning(obj):
    """Pull GLM's hidden reasoning trace from a message/delta (a provider-extra field)."""
    val = getattr(obj, "reasoning_content", None)
    if val:
        return val
    extra = getattr(obj, "model_extra", None) or {}
    if extra.get("reasoning_content"):
        return extra["reasoning_content"]
    try:
        return obj.to_dict().get("reasoning_content")
    except Exception:
        return None


def chat(messages, effort=None, thinking=True, tools=None, tool_choice="auto",
         stream=False, max_tokens=2048, temperature=1.0, tool_stream=False):
    """
    effort:   None | "high" | "max"  (GLM-5.2 thinking-effort level; max is the model default)
    thinking: True -> deep thinking on; False -> off (fast, cheap, low-latency)
    GLM-specific params go through extra_body so any OpenAI client works.
    """
    extra = {"thinking": {"type": "enabled" if thinking else "disabled"}}
    if effort and thinking:
        extra["reasoning_effort"] = effort
    if tool_stream:
        extra["tool_stream"] = True
    kwargs = dict(model=MODEL, messages=messages, max_tokens=max_tokens,
                  temperature=temperature, stream=stream, extra_body=extra)
    if tools:
        kwargs.update(tools=tools, tool_choice=tool_choice)
    if stream:
        # Ask the server to attach token usage to the final streamed chunk.
        kwargs["stream_options"] = {"include_usage": True}
    return client.chat.completions.create(**kwargs)

This block lays the foundation for the rest of the notebook. We define several provider options, load the API key without hard-coding it (Colab secrets first, then an environment variable, then an interactive prompt), create the OpenAI client, and set up token tracking for cost estimation. We also build a reusable chat wrapper so that every later demo can use thinking mode, reasoning effort, streaming, tool calling, and provider-specific parameters through a single function.
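The token tracking set up here reduces to simple arithmetic on the running totals. The following is a minimal sketch of that calculation, assuming the per-million-token prices from the setup code (which may not match your provider's current rates) and a hypothetical helper named `estimate_cost` that is not part of the original notebook. The sample totals are illustrative, not measured.

```python
# Prices per one million tokens, copied from the setup code.
# Assumption: these match your provider's pricing; verify before relying on them.
PRICE_IN_PER_M, PRICE_OUT_PER_M = 1.40, 4.40


def estimate_cost(usage, price_in=PRICE_IN_PER_M, price_out=PRICE_OUT_PER_M):
    """Estimate spend for a usage dict shaped like _USAGE ({'in', 'out', 'calls'})."""
    cost_in = usage["in"] / 1_000_000 * price_in
    cost_out = usage["out"] / 1_000_000 * price_out
    return round(cost_in + cost_out, 6)


# Illustrative totals only (hypothetical numbers, not real results).
sample_usage = {"in": 12_000, "out": 3_500, "calls": 5}
print(f"estimated cost: {estimate_cost(sample_usage):.4f}")
```

Because output tokens are priced higher than input tokens in this configuration, long reasoning traces are likely to dominate the bill, which is worth keeping in mind when comparing effort levels.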

Basic Chat, Thinking-Effort Control, and Streamed Reasoning with GLM-5.2


def demo_basic():
    print("\n=== 1. BASIC CHAT / SANITY CHECK =========================")
    resp = chat([{"role": "system", "content": "You are a concise technical assistant."},
                 {"role": "user", "content": "In one sentence, what is GLM-5.2 best at?"}],
                thinking=False, max_tokens=200)
    _track(resp.usage)
    print(resp.choices[0].message.content.strip())


def demo_effort():
    print("\n=== 2. THINKING-EFFORT CONTROL (off / high / max) ========")
    problem = ("Train A leaves city A at 9:00 going 60 km/h toward city B. "
               "Train B leaves B (420 km away) at 9:30 going 90 km/h toward A. "
               "At what clock time do they meet? Show the key steps briefly.")
    # Run the same problem at three settings and compare latency and output tokens.
    for label, kw in [("thinking OFF", dict(thinking=False)),
                      ("effort=high", dict(thinking=True, effort="high")),
                      ("effort=max", dict(thinking=True, effort="max"))]:
        t0 = time.time()
        resp = chat([{"role": "user", "content": problem}], max_tokens=2000, **kw)
        dt = time.time() - t0
        _track(resp.usage)
        msg, u = resp.choices[0].message, resp.usage
        print(f"\n--- {label} | {dt:0.1f}s | out_tokens={getattr(u,'completion_tokens',0)} ---")
        r = get_reasoning(msg)
        if r:
            print(" [reasoning, first 220 chars]: " + " ".join(r.split())[:220] + " ...")
        print(" : " + " ".join((msg.content or '').split())[:350])


def demo_streaming():
    print("\n=== 3. STREAMING: reasoning channel vs answer channel ====")
    stream = chat([{"role": "user", "content": "Explain why the sky is blue, then give a one-line TL;DR."}],
                  thinking=True, effort="high", stream=True, max_tokens=1200)
    saw_r = saw_a = False
    usage = None
    for chunk in stream:
        # The usage-only chunk arrives last and has no choices.
        if getattr(chunk, "usage", None):
            usage = chunk.usage
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        r = get_reasoning(delta)
        if r:
            if not saw_r:
                print("\n[thinking] ", end="", flush=True)
                saw_r = True
            print(r, end="", flush=True)
        if getattr(delta, "content", None):
            if not saw_a:
                print("\n\n ", end="", flush=True)
                saw_a = True
            print(delta.content, end="", flush=True)
    print()
    _track(usage)

We start testing GLM-5.2 with basic chat, reasoning-effort control, and streaming output. We first run a simple sanity check, then compare the same problem across thinking-off, high-effort, and max-effort modes to observe changes in latency and output tokens. We also stream the model response so we can view the reasoning channel and the final answer separately as the response is being generated.
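When comparing effort levels it helps to know the correct answer to the train problem in advance, so that you can judge accuracy as well as latency and token counts. The sketch below works it out directly; the helper name `meeting_time` is hypothetical and the function assumes train A departs first and the trains have not yet met when train B departs.

```python
from datetime import datetime, timedelta


def meeting_time(distance_km, speed_a, start_a, speed_b, start_b):
    """Clock time at which two trains heading toward each other meet.

    Assumes train A departs first and the trains have not met before B departs.
    """
    head_start_h = (start_b - start_a).total_seconds() / 3600
    gap_km = distance_km - speed_a * head_start_h   # distance left when B departs
    hours_to_meet = gap_km / (speed_a + speed_b)    # closing speed is the sum of both speeds
    return start_b + timedelta(hours=hours_to_meet)


day = datetime(2026, 1, 1)  # arbitrary date; only the clock time matters
t = meeting_time(420, 60, day.replace(hour=9), 90, day.replace(hour=9, minute=30))
print(t.strftime("%H:%M"))  # 12:06
```

Train A covers 30 km in its half-hour head start, leaving 390 km to close at a combined 150 km/h, which takes 2.6 hours (2 h 36 min) from 9:30. A response that does not arrive at 12:06 is wrong regardless of how much reasoning it produced.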

Function Calling and a Multi-Step Tool-Using GLM-5.2 Agent


def tool_calculator(expression: str):
    # Allow only digits, arithmetic operators, parentheses, dots, spaces, and %.
    if not re.fullmatch(r"[0-9+\-*/(). %]+", expression or ""):
        return {"error": "unsupported characters"}
    try:
        # An empty __builtins__ mapping keeps eval from reaching built-in functions.
        return {"result": eval(expression, {"__builtins__": {}}, {})}
    except Exception as e:
        return {"error": str(e)}


_CITY_POP = {"tokyo": 37_400_068, "delhi": 32_900_000, "shanghai": 28_500_000,
             "sao paulo": 22_400_000, "mexico city": 21_800_000}


def tool_city_population(city: str):
    return {"city": city, "population": _CITY_POP.get((city or "").strip().lower())}


TOOLS = [
    {"type": "function", "function": {
        "name": "calculator",
        "description": "Evaluate basic arithmetic like '37400068/21800000'.",
        "parameters": {"type": "object",
                       "properties": {"expression": {"type": "string"}},
                       "required": ["expression"]}}},
    {"type": "function", "function": {
        "name": "city_population",
        "description": "Look up the metro population of a city.",
        "parameters": {"type": "object",
                       "properties": {"city": {"type": "string"}},
                       "required": ["city"]}}},
]
TOOL_IMPLS = {"calculator": tool_calculator, "city_population": tool_city_population}


def run_tool_loop(messages, max_rounds=6, effort="max"):
    """Full loop: model -> tool_calls -> execute -> feed results back -> repeat."""
    for _ in range(max_rounds):
        resp = chat(messages, tools=TOOLS, thinking=True, effort=effort,
                    max_tokens=1500, temperature=0.3)
        _track(resp.usage)
        m = resp.choices[0].message
        # No tool calls means the model has produced its final answer.
        if not getattr(m, "tool_calls", None):
            return m.content
        messages.append({
            "role": "assistant", "content": m.content or "",
            "tool_calls": [{"id": tc.id, "type": "function",
                            "function": {"name": tc.function.name,
                                         "arguments": tc.function.arguments}}
                           for tc in m.tool_calls]})
        for tc in m.tool_calls:
            try:
                args = json.loads(tc.function.arguments or "{}")
            except json.JSONDecodeError:
                args = {}
            # Unpack the JSON arguments as keyword arguments for the matching tool.
            result = TOOL_IMPLS.get(tc.function.name, lambda **k: {"error": "unknown"})(**args)
            print(f" ↳ {tc.function.name}({args}) -> {result}")
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": json.dumps(result)})
    return "(stopped: max tool rounds reached)"


def demo_tools():
    print("\n=== 4. FUNCTION / TOOL CALLING ===========================")
    q = ("How many times larger is Tokyo's metro population than Mexico City's? "
         "Use the tools, then answer with the ratio to one decimal place.")
    print("Final:", " ".join((run_tool_loop([{"role": "user", "content": q}]) or "").split()))


def demo_agent():
    print("\n=== 5. MINI MULTI-STEP AGENT (tools + max effort) ========")
    task = ("Rank Tokyo, Delhi, and Shanghai by metro population (largest first), "
            "then compute the combined population of the top two and report it. "
            "Use the tools for every lookup and sum; never guess numbers.")
    ans = run_tool_loop([{"role": "system", "content": "You are a careful analyst."},
                         {"role": "user", "content": task}])
    print("Final:", " ".join((ans or "").split()))

We connect GLM-5.2 to external tools and build a small tool-using workflow. We define a calculator and a city-population lookup tool, register them in an OpenAI-style tool schema, and create a loop in which the model requests tool calls and receives tool results. We then use this setup for a direct function-calling task and a small multi-step agent that looks up populations, ranks cities, and performs calculations without guessing.
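The calculator tool in this section relies on `eval` behind a character whitelist. That whitelist still admits `**`, so a model-supplied expression such as a very large power could be expensive to evaluate. One possible alternative, offered here as a sketch and not as the tutorial's method, is to walk the parsed syntax tree with Python's standard `ast` module and allow only specific operators. The name `safe_calculator` is hypothetical; it returns the same `{"result": ...}` or `{"error": ...}` shape as the original tool.

```python
import ast
import operator

# Only these operators are permitted; exponentiation is deliberately left out.
_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
            ast.Div: operator.truediv, ast.Mod: operator.mod}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node):
    """Recursively evaluate a parsed arithmetic expression node."""
    if (isinstance(node, ast.Constant) and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError("unsupported expression")


def safe_calculator(expression: str):
    """Evaluate +, -, *, /, % arithmetic without calling eval."""
    try:
        tree = ast.parse(expression, mode="eval")
        return {"result": _eval_node(tree.body)}
    except (SyntaxError, ValueError, ZeroDivisionError, TypeError) as e:
        return {"error": str(e)}


# Tokyo vs. Mexico City, using the population figures from the lookup table.
print(round(safe_calculator("37400068/21800000")["result"], 1))  # 1.7
```

Under these assumptions the function could be swapped in by pointing the `"calculator"` entry of `TOOL_IMPLS` at it, since it takes the same `expression` keyword argument.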

Structured JSON Output and Long-Context Retrieval with GLM-5.2


[truncated for AI cost control]