Meituan Trained a 1.6T-Parameter AI Model Without Nvidia GPUs
Meituan released LongCat-2.0, a 1.6-trillion-parameter Mixture-of-Experts model that it says was trained and deployed entirely on domestic AI ASIC superpods, without Nvidia GPUs. Before launch, the model reportedly appeared on OpenRouter under the anonymous name Owl Alpha and reached the top three by total usage. It is not the strongest model available, but it suggests that training trillion-parameter models on China's domestic compute infrastructure is viable, which would reduce reliance on Nvidia.
XYZ Labs
Jul 02, 2026
Meituan just released LongCat-2.0, a 1.6-trillion-parameter foundation model. But the real story is not the parameter count.
It is the hardware. According to a Chinese-language analysis of the launch and Meituan’s own model-card language, LongCat-2.0 was trained and deployed on AI ASIC superpods, without relying on Nvidia GPUs. For China’s AI industry, that is the line that turns a normal model launch into a geopolitical hardware story.
LongCat-2.0 uses a Mixture-of-Experts architecture with 1.6T total parameters and about 48B activated parameters per token. Before its official release, it reportedly appeared on OpenRouter under the anonymous name Owl Alpha, where it entered the top three by total usage. In Claude Code Agent scenarios, the article says LongCat-2.0 ranked second globally by usage, behind only Claude Opus 4.8.
If you only look at model performance, LongCat-2.0 is not a clean “best in the world” story.
The technical community generally sees its agentic capability as close to Claude Opus 4.6, but behind the newer Claude Opus 4.8. It is also not necessarily China’s strongest coding model; the source article says community feedback places it slightly above GLM-5.1 in coding, but behind GLM-5.2.
It is not even the highest-usage Chinese model on OpenRouter, because free or heavily subsidized models from Tencent, Alibaba, and DeepSeek have also appeared near the top.
But add one qualifier and the whole story changes:
LongCat-2.0 is a trillion-parameter-class model with “zero Nvidia content” in the training-to-inference loop.
The article’s core claim is that from training to inference deployment, the model ran on a domestic compute cluster. Meituan’s public model card phrases this more carefully: both the full training run and large-scale deployment were built entirely on AI ASIC superpods, with pretraining across more than 35 trillion tokens and no rollbacks or irrecoverable loss spikes.
That matters because China’s previous domestic-compute narratives often covered narrower milestones: running inference on local chips, or doing post-training on local chips. LongCat-2.0 is presented as something more ambitious: a full trillion-parameter training-and-serving pipeline.
One developer quoted in the Chinese article put it neatly: earlier efforts were like building a house elsewhere, then using domestic compute to decorate it. LongCat-2.0 is more like laying the foundation, building the house, moving in, and finding that it is actually livable.
Why “From Scratch” Matters
The article is careful about the hardware details, and that caution is important.
Meituan’s official material says “domestic AI compute chips” and “AI ASIC superpod.” It does not publicly name the exact chip model, nor does it officially state the total number of cards.
The widely repeated “50,000 Ascend 910C cards” figure comes from Chinese media reports and community inference. Other reports use vaguer language such as “ten-thousand-card scale,” while some put the range around 50,000 to 60,000 cards. The Ascend 910C identification is also a community inference based on clues such as 200Gbps RDMA and 64GB HBM per die, not a Meituan or Huawei confirmation in the LongCat context.
So the precise wording should be: LongCat-2.0 appears to have been trained on a large domestic AI ASIC cluster, widely reported as roughly 50,000 cards and widely inferred to involve Huawei Ascend 910C-class hardware.
That is still a big deal.
The source article draws a sharp distinction between two types of breakthroughs.
Earlier domestic-chip milestones often involved taking an existing large model and doing continued training or full-parameter post-training. That is difficult and valuable, but it is not the same as training a trillion-parameter model from random initialization.
Meituan’s claim is much harder: starting from zero, training a 1.6T-parameter model on more than 30T tokens, reducing daily failure rates by more than 70%, improving training MFU by 1.5x, and completing the process without rollback or unrecoverable loss spikes.
From-scratch pretraining is brutally unforgiving. A loss spike, communication timeout, or silent data corruption event can waste millions in electricity and compute time. Doing it on a non-Nvidia stack means the challenge is not only raw FLOPs. It is the whole system: chips, interconnect, operators, communication libraries, fault recovery, monitoring, and training stability.
That is why the article argues the key question has shifted from “can Chinese chips train a giant model?” to “can they do it stably enough to become normal?”
The Real Bottleneck Is the Software Stack
The article does not pretend domestic chips have magically erased Nvidia’s lead.
One practitioner quoted in the piece says the challenge is not simply compute. Domestic cards often have less memory per card, require more cards, and may have weaker communication bandwidth. That creates utilization problems. On the software side, Nvidia’s CUDA ecosystem, operators, tooling, and debugging stack are mature. Moving to a domestic compute platform means rebuilding and re-optimizing a great deal of infrastructure.
In other words: the hard part is not only making an AI chip. It is making the whole machine behave like a reliable AI factory.
The article says Meituan’s engineering indicators include a 1.5x improvement in training MFU (model FLOPs utilization), bringing MFU above 30%, along with a reduction of more than 70% in average daily failure rate and a 14% improvement in key operator efficiency. Those numbers point to the real work: operator adaptation, communication optimization, HCCL exception handling, and automatic fault recovery.
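To make the MFU figures concrete, here is a minimal Python sketch of how the metric is commonly estimated. It uses the standard approximation of roughly 6 training FLOPs per activated parameter per token and ignores attention FLOPs. The throughput and per-chip peak values are placeholders chosen for illustration, not numbers from Meituan or any chip vendor, and the 50,000-card count is the widely reported but unconfirmed figure. The baseline calculation assumes the "over 30%" value is the MFU after the 1.5x improvement.

```python
def training_mfu(active_params, tokens_per_second, num_chips, peak_flops_per_chip):
    """Estimate model FLOPs utilization for a training run.

    Uses the common approximation of ~6 FLOPs per activated parameter
    per token (forward plus backward), ignoring attention FLOPs.
    """
    achieved_flops = 6.0 * active_params * tokens_per_second
    peak_flops = num_chips * peak_flops_per_chip
    return achieved_flops / peak_flops


def implied_baseline_mfu(reported_mfu, improvement_factor):
    """MFU before optimization, assuming reported_mfu is the improved value."""
    return reported_mfu / improvement_factor


# Hypothetical inputs: only the 48B activated-parameter figure comes from
# the article; the rest are illustrative placeholders.
example_mfu = training_mfu(
    active_params=48e9,          # ~48B activated parameters per token
    tokens_per_second=2.0e7,     # placeholder cluster throughput
    num_chips=50_000,            # widely reported, not confirmed by Meituan
    peak_flops_per_chip=4e14,    # placeholder peak, not a real chip spec
)
print(f"example MFU: {example_mfu:.1%}")
print(f"implied baseline MFU: {implied_baseline_mfu(0.30, 1.5):.1%}")
```

Under that reading, a 1.5x gain that lands above 30% implies a starting point of roughly 20%, which is why the operator and communication work matters as much as the chips themselves.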
The significance, then, is not that LongCat-2.0 beats every frontier model. It does not.
The significance is that Meituan appears to have shown an industrial-scale path for training and serving a trillion-parameter model without a single Nvidia card in the loop.
For a Chinese AI industry shaped by U.S. export controls, that is a meaningful step from “possible” toward “operational.”
The 97% Sparsity Question
If the hardware story is the external shock, the architecture story may be the more interesting technical signal.
LongCat-2.0 inherits two core ideas from LongCat-Flash: Zero-computation Experts and Shortcut-connected MoE, or ScMoE.
Zero-computation Experts are exactly what they sound like: experts inside the MoE pool that do no computation and return the input unchanged. A router dynamically decides, for each token, how many real experts and how many zero-computation experts to use.
This turns activated parameters from a fixed value into a range. In LongCat-Flash, activated parameters per token ranged from about 18.6B to 31.3B, averaging 27B. In LongCat-2.0, the range is about 33B to 56B, averaging 48B.
The clever part is not merely saving compute. The model learns to spend more compute on harder tokens and less compute on easier ones.
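The routing idea described above can be sketched in a few lines of Python. This is a toy illustration, not Meituan's implementation: the expert functions, router logits, and top-k value are all hypothetical, and real experts are stood in for by simple scaling functions. It shows only the core mechanism, where the router picks from a pool that mixes real experts with identity experts, so the number of real experts used varies per token.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_real_expert(scale):
    # Stand-in for a feed-forward expert (hypothetical elementwise transform).
    return lambda x: [scale * v for v in x]


def zero_computation_expert(x):
    # Does no work: returns the input unchanged.
    return x


def moe_forward(x, router_logits, experts, is_zero, top_k):
    """Route one token to its top-k experts and mix their outputs.

    Returns the mixed output and how many *real* experts were activated.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:
        weight = probs[i] / norm
        out = [o + weight * v for o, v in zip(out, experts[i](x))]
    real_used = sum(1 for i in chosen if not is_zero[i])
    return out, real_used


experts = [make_real_expert(2.0), make_real_expert(3.0),
           zero_computation_expert, zero_computation_expert]
is_zero = [False, False, True, True]
token = [1.0, 2.0]

# An "easy" token: the router prefers the zero-computation experts.
easy_out, easy_real = moe_forward(token, [0.0, 0.0, 5.0, 5.0], experts, is_zero, top_k=2)
# A "hard" token: the router prefers the real experts.
hard_out, hard_real = moe_forward(token, [5.0, 5.0, 0.0, 0.0], experts, is_zero, top_k=2)
print("easy token real experts:", easy_real)
print("hard token real experts:", hard_real)
```

Because the top-k budget is fixed but some slots can be filled by identity experts, the compute spent on a token depends on what the router chooses, which is what turns the activated-parameter count into a range.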
The article highlights one line from Meituan’s official blog as more important than the 1.6T parameter count: excluding N-gram Embedding, LongCat-2.0’s MoE sparsity has reached about 97%, and adding another 135B expert parameters brought negligible performance gains.
That suggests top MoE models may be approaching a sparsity wall. DeepSeek-V3 had about 671B total parameters and 37B activated, roughly 94% sparse. DeepSeek-V4-Pro is 1.6T / 49B, around 97% sparse. LongCat-2.0 is also around 97%.
If adding more experts no longer buys measurable gains, the next improvements may have to come from attention mechanisms, context efficiency, post-training data, routing quality, and inference optimization rather than simply expanding the expert pool.
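The sparsity percentages quoted above follow from a simple ratio: one minus activated parameters over total parameters. The short Python check below uses the headline figures from the text. It is a rough calculation only: Meituan's 97% figure excludes N-gram Embedding parameters, which this ratio does not separate out, so the LongCat-2.0 result here is an approximation of that number rather than a reproduction of it.

```python
def moe_sparsity(total_params_b, active_params_b):
    """Fraction of parameters not activated for a given token."""
    return 1.0 - active_params_b / total_params_b


# (total, activated) in billions of parameters, as quoted in the text.
models = {
    "DeepSeek-V3": (671, 37),
    "DeepSeek-V4-Pro": (1600, 49),
    "LongCat-2.0": (1600, 48),
}

for name, (total, active) in models.items():
    print(f"{name}: {moe_sparsity(total, active):.1%} sparse")
```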
Owl Alpha Was the Better Benchmark
The most convincing “benchmark” for LongCat-2.0 may not be a public leaderboard.
Before release, the model reportedly appeared anonymously on OpenRouter as Owl Alpha. According to the article, by the end of June it ranked third globally by total monthly token usage, first in the Hermes harness, and second in Claude Code monthly usage, behind Claude Opus itself.
That matters because OpenRouter usage is paid token consumption, not a static benchmark. Developers did not know whose model it was. They used it to write code, call tools, and run agents. That is a different kind of vote.
The article argues this is why LongCat-2.0 is compelling despite not being the absolute strongest model: it seems to hit real developer pain points around repository-level code understanding and end-to-end task execution.
What This Changes
LongCat-2.0 probably does not single-handedly change the direction of AI.
But together with DeepSeek-V4, GLM-5.2, and Kimi K2.7, it pushes a combination that used to sound like a lab demo into something closer to industrial reality: trillion-parameter open models, domestic compute, low-cost agentic capability, and serious developer usage.
For China, the value is strategic. It reduces dependence on a single foreign supplier. It creates a path around export controls. It gives domestic AI companies a stronger argument that frontier-scale training does not have to mean Nvidia-only training.
For global readers, the important point is simpler:
If a food-delivery giant can train a 1.6T-parameter model without Nvidia GPUs and get real agentic-coding usage, then China’s AI stack is broader than most outsiders assume.
The next question is not whether LongCat-2.0 is better than Claude or GPT. It is whether the “no Nvidia” training path becomes repeatable.