
AI Engineering for Developers

Source: Hacker News AI. Author: kiyanwang

This post is what I wish someone had handed me the first time I had to ship an AI feature. I spent fifteen years writing backends, operating Kubernetes clusters, debugging Terraform, and arguing about API design. Then LLMs landed in production and a lot of the rules I trusted stopped applying. The system is now non-deterministic by default, the input is a string of natural language, and your unit tests cannot tell you whether the output is good.

This is a tour through AI engineering for engineers who already know how to ship software. I will assume you can read Python, you understand HTTP and queues, you have rolled out things on Kubernetes, and you have not yet trained or finetuned a model. We will go from "what is a foundation model" to "how do you run agents in production on Google Cloud" without skipping the parts that matter.

Two notes before we start. First, I work mostly on GCP, so we go deeper there. Second, the model and pricing landscape is moving every quarter. I am writing this in May 2026, with Gemini 3.1 Pro, Claude Opus 4.7, and GPT-5.5 as the current frontier. Whenever you read this, check the docs.

Introduction to AI Engineering

The rise of AI engineering: from language models to LLMs to foundation models

Language models started as statistical machinery for predicting the next token. Then transformers showed up, scale kept paying off, and "large language model" became an industry. Foundation models are the next abstraction: pretrained on enormous, mixed corpora, exposed via an API, and capable of being adapted to many tasks without retraining. The same Gemini 3.1 Pro that drafts a marketing email can also classify support tickets, generate SQL, summarize a 1M-token codebase, and call tools.

What changed for engineers: the model is no longer the product. The product is the system around the model. That system is what AI engineering is about.
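To make "the system around the model" concrete, here is a minimal sketch of the kind of wrapper you end up writing around even a trivial call. Everything named here is an assumption for illustration: `call_model` stands in for whichever vendor SDK you use, and the label set, prompt wording, and `fake_model` stub are hypothetical. The point is the shape: validate the reply, retry on garbage, and fall back deterministically.

```python
import json
from typing import Callable

# Hypothetical label set for a support-ticket classifier.
ALLOWED_LABELS = {"billing", "bug", "other"}


def classify_ticket(
    ticket: str,
    call_model: Callable[[str], str],  # assumed: prompt in, raw text out
    max_attempts: int = 2,
) -> str:
    """Ask a model for a label and validate the reply before trusting it."""
    prompt = (
        "Classify the support ticket as one of: billing, bug, other.\n"
        'Reply with JSON like {"label": "bug"}.\n\n'
        f"Ticket: {ticket}"
    )
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            label = json.loads(raw).get("label")
        except (json.JSONDecodeError, AttributeError):
            continue  # malformed or non-object reply: try again
        if isinstance(label, str) and label in ALLOWED_LABELS:
            return label
    return "other"  # deterministic fallback when the model never complies


def fake_model(prompt: str) -> str:
    """Stand-in for a real API call so the sketch runs offline."""
    return '{"label": "billing"}' if "refund" in prompt.lower() else "not json"


print(classify_ticket("I want a refund for last month", fake_model))
```

Swapping `fake_model` for a real client changes one argument. The validation, retry, and fallback logic is the part you own and test.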
Foundation model use cases

Roughly speaking, foundation models are good at: code (Copilot, Cursor, Codex), writing (drafts, edits, summaries), image and video (Imagen 4, Veo 3.1, Gemini 3 Pro Image), education (tutoring, explanation, grading), conversational bots (support, sales, internal helpdesks), information aggregation (search assistants, research agents), data organization (extracting structure from unstructured text), and workflow automation (agents that touch JIRA, GitHub, Salesforce).

They are mediocre or dangerous at: precise arithmetic without tools, real-time facts without grounding, and anything where being subtly wrong is unacceptable.

If a use case maps cleanly to "transform unstructured input into structured output, with a tolerance for noise", it is probably a fit. If it maps to "must be exactly right, every time, on adversarial inputs", do not start there.

AI engineering vs ML engineering vs full-stack engineering

ML engineering is about building and training models: data pipelines, feature engineering, hyperparameter tuning, distributed training. AI engineering is about building applications on top of pretrained models: prompts, retrieval, evaluation, agents, inference serving, observability. Full-stack engineering is what most of you already do.

In practice, an AI engineer is a backend engineer with three extra responsibilities: keeping the system grounded (RAG, tools, structured outputs), keeping it evaluated (eval pipelines, online metrics, regression tests), and keeping it cheap and fast enough (model routing, caching, inference optimization). You usually do not train models. You orchestrate them.

The AI engineering stack and its three layers

Three layers, top to bottom:

1. Application layer. Your code. Prompts, RAG, agents, UI, business logic.
2. Model development layer. Finetuning, model merging, distillation, dataset engineering. Optional for most teams. You buy from a vendor or finetune a small open model.
3. Infrastructure layer. GPUs, inference servers (vLLM, TGI, TensorRT-LLM), vector databases, gateways, observability, CI/CD.

Most teams live in layer 1, occasionally dip into layer 2, and rent layer 3 from a cloud. That is fine. The art is knowing when you actually need to go down a layer.

Layer 1 is larger than the bullet makes it look. Writing prompts, building retrieval pipelines, wiring tools together, running evals, deploying endpoints, instrumenting traces, and maintaining all of it as models change underneath you: that is a full-time job. The craft is in the application layer.

You go to layer 2 when prompt engineering and RAG have plateaued and you need the model to behave differently in a way you cannot get by changing the input. You go to layer 3 when cost, data residency, or hardware constraints make rented inference impractical. Most teams that reach layer 3 didn't plan to; they got pushed there by one of those constraints. Start in layer 1 and be honest about why you're moving down.

How to adapt an LLM: prompt engineering, RAG, finetuning

Three knobs, in order of cost:

Prompt engineering. Cheapest, fastest, most underrated. You change the input. The model is unchanged.
RAG. You give the model new context at runtime by retrieving from your data. Solves "the model does not know about my company".
Finetuning. You change the model weights. Solves "I need a specific style, format, or behavior the model will not give me consistently with prompts".

Default: prompt first, then RAG, then finetune. Do not skip steps. Teams can burn six weeks finetuning when better retrieval and a system prompt rewrite would have shipped the same week.

Choosing an LLM

You are picking among five rough buckets in 2026:

Closed frontier: GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro. Best quality, highest cost.
Closed mid-tier: Claude Sonnet 4.6, Gemini 2.5 Pro, GPT-5.4. Still excellent for most tasks.
Closed cheap: Claude Haiku 4.5, Gemini Flash-Lite, GPT-5.4 nano. The default for high-volume work.
Open weights: Llama 4, Gemma, Mistral, DeepSeek, Qwen. Run on your own GPUs.
Specialized: Voyage embeddings, Cohere rerankers, code-specific models.

Pick by: task fit, cost at projected volume, latency, context window, output structure (does it support JSON mode, tool use, structured outputs), and where it can run (data residency, regional endpoints). Almost no one should be using just one model. Route cheap requests to cheap models.

Planning AI applications

AI features are different to plan because output quality is not binary. A regular CRUD endpoint either works or it doesn't. An AI feature sits on a quality gradient, and where it lands depends on factors you don't fully control: model behavior, prompt iteration, data distribution, and the edge cases your actual users bring.

That uncertainty doesn't mean you can't plan. It means your plan needs explicit quality checkpoints, not just delivery dates. Four checkpoints I always run through before committing:

Use case evaluation. Is this a real problem? Is it tolerant of probabilistic output? What is the cost of being wrong?
Setting expectations. A demo is not a product. Plan for at least 2x the dev time of a regular feature, mostly spent on evaluation and edge cases.
Milestone planning. Get to "barely works" fast. Eval pipeline second. Production hardening third.
Maintenance. Models drift. Prompts rot. Data changes. Budget for ongoing eval, not just initial dev.

The "barely works" milestone matters more than it sounds. Ship it to real users, watch what breaks, then fix. Trying to perfect an AI feature in isolation before anyone touches it is how teams spend three months and ship nothing.

Challenges in development, deployment, and maintenance

The challenges split cleanly across the project lifecycle. Development problems hit you first. Deployment problems hit you at launch. Maintenance problems never stop.

Development. Prompts are not code in the traditional sense. They cannot be unit-tested deterministically. You need eval datasets the same day you start writing prompts.
Deployment. Inference is slow, expensive, and bursty. Caching, batching, and routing matter more than they do in regular APIs.
Maintenance. Vendors deprecate models. Tokenizers change underneath you (Anthropic noted that Opus 4.7 ships with a new tokenizer that "may use up to 35% more tokens for the same fixed text" at the same rate card as Opus 4.6). Hallucinations evolve. You need monitoring and red-teaming, not just uptime alerts.

Industry use cases and ROI

Where I have seen AI features pay back in production:

Customer support deflection: cheap, measurable, often a real chunk of tier-1 ticket volume diverted away from agents.
Internal search and RAG over docs: hard to measure, but eats Slack-as-a-search-engine quickly.
Code assistance: every serious dev team is using something now.
Document automation: contracts, invoices, claims, anything with structured extraction.

Where I have seen it not pay back: anything user-facing where a wrong answer is a brand crisis, anything trying to replace a deterministic API, and demos that someone built without ever talking to the people who would maintain them.

Understanding Foundation Models

Training data: multilingual and domain-specific models

Foundation models are shaped by their training data more than by their architecture. A model trained 80% on English internet text will be visibly worse at, say, Italian legal text than at English product reviews. Multilingual models like Gemini and Claude do reasonably well across major languages, but coverage is uneven and the long tail (smaller languages, dialects) is rough.

Domain-specific models exist (Med-PaLM, BloombergGPT, Codestral) and they outperform general models on their domain by a measurable but not huge margin. Most of the time, RAG over your domain data plus a strong general model wins on both quality and operational simplicity.
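The "RAG over your domain data plus a strong general model" pattern can be sketched in a few lines. This is a toy under stated assumptions, not a production design: it ranks documents with bag-of-words cosine similarity where a real system would use an embedding model and a vector database, and the `DOCS` contents, function names, and prompt wording are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase word counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda d: cosine(q, tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Put the retrieved context in front of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical company knowledge the base model has never seen.
DOCS = [
    "Refunds are processed within 14 days of the return request.",
    "The on-call rotation changes every Monday at 09:00.",
    "Enterprise plans include single sign-on and audit logs.",
]

print(build_prompt("How long do refunds take?", DOCS, k=1))
```

The string returned by `build_prompt` is what you would send to the general model. The model stays untouched; only the input changes, which is why this route tends to be operationally simpler than maintaining a domain-specific model.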
Model architecture and model size

Almost everything in production today is a decoder-only transformer, occasionally with a mixture-of-experts (MoE) twist. Size still matters but is no longer destiny. A well-tuned 70B can beat a poorly-tuned 400B for many tasks. Reasoning models (Gemini 3.1 Pro thinking levels, o-series, Claude with extended thinking) have shifted the relevant axis from "how many parameters" to "how much test-time compute do you give it".

Small Language Models (SLMs), multimodal models, domain-specific and reasoning models

Not every task needs a frontier model, and not every input is text. This section maps the model taxonomy to the engineering decisions they affect.

SLMs. Gemma 3, Phi-4, Llama 3.1 8B. Run on a single GPU or even a laptop. Great for classification, routing, simple summarization, on-device inference.
Multimodal. Gemini 3 Pro takes text, images, video, audio; Claude takes text and images; GPT-5.5 handles text and images. Vision is now a default capability, not a bolt-on.
Domain-specific. Worth it only if you have evaluated and a general model fails consistently.
Reasoning. Models that emit long internal chains of thought before responding. Better at math, code, planning. Slower and pricier per call.

SLMs are the workhorses for tasks where a heavy model is wasteful: route a request, classify an intent, detect a language, summarize a short paragraph. A Gemma 3 or Phi-4 running on a single L4 GPU handles thousands of requests per minute at a fraction of the cost of a frontier API call. The tradeoff is a capability ceiling: push SLMs past their sweet spot and quality drops fast.

Multimodal support has quietly become the default rather than a feature. The practical shift is that you no longer need to treat images, PDFs, charts, and screenshots as edge cases that require a separate pipeline. They're first-class inputs. The engineering question is whether to pass them to the model raw or to pre-process them (extract text, describe images) to control cost and latency.

Reasoning models add a third axis beyond capability and cost: time. The model thinks before it answe [truncated for AI cost control]