Summer in Charlotte (夏洛特的夏)

Life

Charlotte was especially lively on Friday and Saturday this week. The streets were packed, so I went downstairs to see what was happening. The roads had been blocked off, and a Black lady was standing guard at one of the barricades. We started chatting. She told me the Savannah Bananas and another team were in town playing a game, with all kinds of entertainment going on around it, so everyone had come out to join the fun.

That led us into a conversation about American popular culture. I told her I play basketball but do not really understand baseball. She said she was from a small Caribbean island and used to live in Brooklyn, New York. We ended up talking about basketball, baseball, soccer, rugby, and football. She mentioned that there had been a soccer event here just last week and that it had been just as crowded. I remembered that too.

Later that night there was also a night market. It must have had well over a hundred vendors, though I only spotted two or three Asian faces. A few of them quietly told me how the economics worked: a booth cost $1,000 for three days, each token was worth $2, a bowl of wontons cost four tokens, and the organizer split the proceeds fifty-fifty with the vendors. I told them I might apply for a booth myself next time. I had a few bowls, and when I tried to pay, they did not want to take my money. In the end I still insisted on leaving $20 on the table.

AI

Incompressible Knowledge Probes

The whole saga surrounding Bojie Li's "Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity" is a perfect microcosm of where the AI industry is right now: profound breakthroughs obscured by layers of hype, flawed methodologies, and a desperate obsession with scale. When the preprint first dropped, claiming GPT-5.5 was sitting at an astronomical 9.7 trillion parameters and Claude Opus 4.6 at 5.3 trillion, the noise was deafening. The estimate for Gemini 3.1 Pro was pegged at an even more ridiculous 40 trillion. As someone who spends hours optimizing vLLM and writing custom PagedAttention implementations just to squeeze a few extra tokens of throughput out of our GPU clusters, I found those numbers entirely divorced from hardware reality.

Thankfully, the sanity check published by the METR researchers Benjamin and Lawrence, Sanity-checking "Incompressible Knowledge Probes", brought things back down to earth. Reading their takedown of the methodology was incredibly satisfying. It was a classic engineering post-mortem. The premise of the original paper is conceptually clever: use a model's capacity to memorize obscure, incompressible facts as a proxy, then extrapolate its parameter count via a regression fitted on open-weight models of known size. But execution is everything. The fact that the codebase was almost entirely generated by Claude Code, riddled with unchecked "AI slop," and lacking human validation is a stark warning.

The fatal flaw they uncovered, the hidden floor=0 scoring mechanism, was a masterclass in how subtle bugs can completely destroy a regression model. By capping the penalty for hallucinated answers from the smaller open-weight calibration models (like the Qwen and DeepSeek variants), the scoring compressed the performance gap between small and massive models. When you fit a regression line through scores that are inflated at the low end, the slope flattens out. Consequently, to account for the genuinely high performance of frontier models, the extrapolated parameter count on the x-axis had to stretch into the stratosphere.
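To make the flattening effect concrete, here is a minimal sketch with made-up numbers. The calibration sizes, the scores, and the `frontier_score` value are all hypothetical, and the linear fit of score against log10(parameters) is my simplification, not the paper's actual pipeline. It only shows the direction of the bias: flooring the small models' scores at zero lowers the slope and pushes the extrapolated parameter count up.

```python
import numpy as np

def fit_line(log_params, scores):
    """Least-squares fit of score = intercept + slope * log10(params)."""
    slope, intercept = np.polyfit(log_params, scores, 1)
    return slope, intercept

def infer_params(score, slope, intercept):
    """Invert the fitted line to estimate a black-box model's parameter count."""
    return 10 ** ((score - intercept) / slope)

# Hypothetical calibration models from 1B to 1T parameters (log10 scale).
log_params = np.array([9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0])

# Made-up raw scores: wrong answers are penalized, so small models go negative.
raw_scores = 0.3 * (log_params - 9.0) - 0.4

# The floor=0 behaviour: negative scores are silently clipped to zero.
floored_scores = np.maximum(raw_scores, 0.0)

slope_raw, intercept_raw = fit_line(log_params, raw_scores)
slope_floored, intercept_floored = fit_line(log_params, floored_scores)

# A hypothetical frontier model whose own score is unaffected by the floor.
frontier_score = 0.56
estimate_raw = infer_params(frontier_score, slope_raw, intercept_raw)
estimate_floored = infer_params(frontier_score, slope_floored, intercept_floored)

print(f"slope without floor: {slope_raw:.3f}, with floor: {slope_floored:.3f}")
print(f"estimate without floor: {estimate_raw / 1e12:.2f}T parameters")
print(f"estimate with floor:    {estimate_floored / 1e12:.2f}T parameters")
```

The frontier model's score is identical in both fits; only the calibration points at the small end move, and that alone is enough to multiply the estimate several times over.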

Once Benjamin and Lawrence removed that zero floor and cleaned up the ambiguously phrased questions in the dataset (which were penalizing frontier models for being too accurate or nuanced), the numbers collapsed. GPT-5.5 dropped by a factor of about 6.6, from 9.7 trillion to a much more reasonable 1.5 trillion parameters. Claude Opus 4.7 settled around 1.1 trillion.

The LLM Parameter Lie

That correction was necessary, but it still does not give us ground truth, and the corrected regression may have swung too far in the other direction. For the leading closed models, no lab has published an authoritative parameter count. OpenAI, Anthropic, and Google treat frontier scale and architecture as trade secrets. What we have instead are leaks, third-party reconstructions, and the kind of careful industry mapping done by outlets like LifeArchitect.ai and DoDataThings' "The LLM Parameter Lie". As of early 2026, that is the best anyone can do.

The picture that emerges is more coherent than the headline numbers on Twitter, even if every figure comes with an asterisk. GPT-4 is widely believed to be an MoE model with roughly 1.76 trillion total parameters, activating on the order of 280 billion per token, meaning the vast majority of weights sit idle on any given forward pass. Independent analysts place the Claude Opus 4.x line (4.6 / 4.7) in the same architectural regime: MoE, with total parameters somewhere around 5 trillion. GPT-5, GPT-5.5, and Gemini 3.1 Pro remain genuinely opaque; the 9.7T and 40T figures from Li's paper were artifacts of bad methodology, not credible disclosures. And then there is Claude Mythos 5, still unreleased but already leaking into public view through an internal Anthropic draft that surfaced in late March 2026. Coverage from AI Analytics Diaries and AI Magicx puts it at 10 trillion total parameters, with estimated active compute per forward pass still in the 800 billion to 1.2 trillion range thanks to MoE routing. That is a staggering headline number, but it is not the number that should drive engineering decisions.
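The gap between total and active parameters is easy to put in numbers. A small sketch using only the rumored figures quoted above; the helper names are mine, and the "2 FLOPs per active parameter per token" rule for a forward pass is a common back-of-the-envelope approximation that I am assuming here, not something taken from the sources.

```python
TRILLION = 1e12
BILLION = 1e9

def active_fraction(total_params: float, active_params: float) -> float:
    """Share of a model's weights that participate in one forward pass."""
    return active_params / total_params

def forward_flops_per_token(active_params: float) -> float:
    """Rough forward-pass cost per token (assumed rule of thumb: ~2 FLOPs per active weight)."""
    return 2.0 * active_params

# Rumored figures quoted in the text, not official disclosures.
gpt4_fraction = active_fraction(1.76 * TRILLION, 280 * BILLION)
mythos_low = active_fraction(10 * TRILLION, 800 * BILLION)
mythos_high = active_fraction(10 * TRILLION, 1.2 * TRILLION)

print(f"GPT-4 (rumored): {gpt4_fraction:.1%} of weights active per token")
print(f"Mythos 5 (leaked): {mythos_low:.0%} to {mythos_high:.0%} active per token")
print(f"GPT-4 forward pass: ~{forward_flops_per_token(280 * BILLION):.2e} FLOPs per token")
```

On these figures, the rumored GPT-4 touches roughly one weight in six per token, and the leaked Mythos 5 numbers imply roughly one in ten.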

The industry consensus has shifted accordingly, and the DoDataThings piece names it plainly: the LLM parameter lie. Nearly every frontier model now runs on MoE—so a 10T model might activate only a fraction of its weights for any given task. More importantly, inference-time compute has become a separate scaling axis. OpenAI demonstrated this with o4-mini: estimated somewhere between 100B and 300B parameters, it scores 92.7% on AIME 2025 not because it is a bigger memorization engine, but because it is allowed to think longer at inference. The practical takeaway is that top closed models probably cluster in the 2T–5T total-parameter band—Mythos 5 being the outlier chasing extreme scale—but what actually matters for deployment is active parameter count and the compute budget available at inference time, not the vanity metric printed on a slide.

Juniper Vasara

That framing is exactly what I have been arguing in the recent drafts for Modern Distributed AI Systems. We are no longer in the era of brute-force parameter bloat. We have firmly entered the era of MoE and inference-time compute—and that brings me to the architectural debates we are having internally regarding Juniper Vasara. When we are building a risk computation platform for capital markets, we don't need a monolithic 10-trillion-parameter encyclopedia that has memorized the capital of every obscure province on Earth. That is System 1 thinking—fast, associative, and prone to hallucination when faced with novel stochastic calculus problems. What we need for advanced derivatives pricing, mortgage modeling, and dynamic risk simulation is System 2 thinking.

Inference-time compute is arguably the most elegant paradigm shift we've seen since the transformer itself. Instead of forcing a model to pre-compute the entire universe of logic during its training phase, we allocate a massive compute budget at the moment of inference. Watching a smaller, highly optimized model utilize a hidden chain-of-thought, explore multiple algorithmic pathways via tree-search, and run self-correcting verifiers before outputting a final Python script or a risk metric is beautiful. It mirrors how a quant actually thinks. You don't just blurt out the price of a complex exotic option; you derive it, check your boundary conditions, realize you dropped a negative sign, backtrack, and recalculate.
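The derive, check, backtrack loop can be sketched in a few lines. This is a toy illustration, not a real reasoning system: the `drafts` list stands in for successive model attempts (all values hypothetical), and the verifier only checks the standard no-arbitrage bounds for a European call, which is a necessary condition and not proof that the price is right.

```python
import math
from typing import Callable, Iterable, Optional

def call_price_bounds_ok(price: float, spot: float, strike: float,
                         rate: float, maturity: float) -> bool:
    """No-arbitrage bounds for a European call: max(S - K*exp(-rT), 0) <= C <= S."""
    lower = max(spot - strike * math.exp(-rate * maturity), 0.0)
    return lower <= price <= spot

def first_verified(candidates: Iterable[float],
                   verify: Callable[[float], bool]) -> Optional[float]:
    """Return the first candidate the verifier accepts, or None if all are rejected."""
    for candidate in candidates:
        if verify(candidate):
            return candidate
    return None

# Hypothetical drafts: a dropped negative sign, an overshoot, then a plausible value.
drafts = [-10.45, 120.0, 10.45]

answer = first_verified(
    drafts,
    lambda p: call_price_bounds_ok(p, spot=100.0, strike=100.0, rate=0.05, maturity=1.0),
)
print(answer)
```

Spending more inference compute here simply means generating more drafts and running more checks before committing to an answer.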

With inference-time compute, a 100-billion-parameter base model given sixty seconds to "think" can outperform a trillion-parameter behemoth that is forced to answer instantly. Scaling has gained a second axis: what used to be a question of training FLOPs is now just as much a question of inference FLOPs. For our infrastructure, this means our focus on deploying highly efficient C++ and Python backends to handle asynchronous, long-running agentic reasoning loops is exactly the right bet. We can leverage smaller, cheaper open-source models, hook them into robust Python environments, and let them chew on complex pricing engines using extended inference compute, rather than paying astronomical API costs for bloated frontier models.
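As a rough sketch of what a budgeted reasoning loop looks like on the backend side, here is a minimal asyncio version. Everything in it is a placeholder: `reasoning_step` stands in for one model call, and the budgets are tiny so the example finishes quickly. A real service would wrap an actual inference client and use budgets measured in seconds or minutes.

```python
import asyncio

async def reasoning_step(state: int) -> int:
    """Stand-in for one model call; a real backend would await an inference request here."""
    await asyncio.sleep(0.01)
    return state + 1

async def run_with_budget(budget_s: float, max_steps: int) -> int:
    """Keep reasoning until the time budget or the step limit runs out; return steps done."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + budget_s
    state = 0
    while state < max_steps and loop.time() < deadline:
        state = await reasoning_step(state)
    return state

async def main() -> list:
    # Two concurrent jobs: one bounded by its time budget, one by its step limit.
    return await asyncio.gather(
        run_with_budget(budget_s=0.05, max_steps=1000),
        run_with_budget(budget_s=1.0, max_steps=3),
    )

results = asyncio.run(main())
print(results)
```

Because each job awaits its model call, many long-running loops can share one process, and the budget becomes an explicit, per-request knob.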

Closing Thoughts

Reflecting on all this, I feel a renewed sense of clarity about our engineering roadmap. The noise of the internet will always chase the largest numbers, the flashiest benchmarks, and the most sensational preprint papers. But the real edge lies in efficiency, rigorous evaluation, and designing systems that actually reason rather than just recall.

The sun is starting to dip below the treeline outside the window. I think it's time to close the laptop, step away from the terminal, and perhaps order some hot pot ingredients for dinner. A bit of calligraphy practice as Xuanxin Jushi might be the perfect way to clear the mind and reset. Tomorrow, it's back to the code, back to building, and back to pushing the boundaries of what these systems can actually do when we teach them how to think, rather than just what to remember.