A 22-year-old built Open Mythos, an open-source attempt to reconstruct Anthropic's secretive Claude Mythos architecture from public research and speculation. The core innovation replaces deep stacked layers with a recurrent loop that reuses the same weights multiple times, challenging the assumption that better models require more parameters. Research shows this approach can match larger transformers with roughly 40% fewer parameters while enabling deeper reasoning during inference.
Key Points
Open Mythos Architecture
Kai Gomez built Open Mythos as a fully open-source PyTorch implementation based on public research and speculation about Claude Mythos. The architecture uses a Recurrent Depth Transformer (RDT) that runs a smaller set of layers up to 16 times instead of stacking hundreds of layers with different weights.
How RDT Works
RDT consists of three parts: a prelude that encodes the input once, a recurrent block that loops multiple times, and a coda that produces the output. Each iteration updates the hidden state using the previous state, the original input signal, and the transformer computation. The input is reinjected every loop to prevent the model from drifting away from what it was supposed to process.
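The three-part structure can be sketched in a few lines of plain Python. This is a toy illustration, not the Open Mythos code: the function names, the `tanh` stand-in for the transformer computation, and the `decay` and `inject` coefficients are all assumptions chosen so the loop stays bounded.

```python
import math

def prelude(x):
    # Hypothetical encoder: runs exactly once on the raw input.
    return [math.tanh(v) for v in x]

def recurrent_block(h, e, decay=0.5, inject=0.5):
    # One loop iteration. The new state combines the previous state (h),
    # the reinjected input encoding (e), and a stand-in for the
    # transformer computation (a small tanh term here).
    return [decay * hi + inject * ei + 0.1 * math.tanh(hi)
            for hi, ei in zip(h, e)]

def coda(h):
    # Hypothetical output head: collapses the final state to one number.
    return sum(h)

def rdt_forward(x, n_loops=16):
    e = prelude(x)               # encode the input once
    h = [0.0] * len(e)           # initial hidden state
    for _ in range(n_loops):     # same weights reused on every pass
        h = recurrent_block(h, e)
    return coda(h)

output = rdt_forward([1.0, -2.0, 0.5], n_loops=16)
```

Because `e` is passed into every iteration, the state is pulled back toward the original input on each pass rather than drifting freely.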
Open Mythos uses approximately 384 experts in a mixture-of-experts setup, but only activates a small subset per input. For comparison, Kimiko 2.6 selects only eight experts per input. Each loop can activate different experts, so the computation is not simply repeating the same process over and over.
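A minimal sketch of top-k expert routing follows. The expert count comes from the text; the choice of `TOP_K = 8` borrows the Kimiko 2.6 figure and is an assumption for Open Mythos, and the random logits are stand-ins for router scores that a real model would compute from the hidden state at each loop.

```python
import math
import random

N_EXPERTS = 384
TOP_K = 8  # assumption: borrowed from the Kimiko 2.6 figure in the text

def top_k_route(logits, k=TOP_K):
    # Pick the k highest-scoring experts, then softmax over just those k
    # so the selected experts' weights sum to 1.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: v / total for i, v in exps.items()}

# Stand-ins for router scores at two different loop iterations.
rng = random.Random(0)
loop1_logits = [rng.gauss(0, 1) for _ in range(N_EXPERTS)]
loop2_logits = [rng.gauss(0, 1) for _ in range(N_EXPERTS)]

loop1_experts = top_k_route(loop1_logits)
loop2_experts = top_k_route(loop2_logits)
```

Since the hidden state changes between loops, the router scores change too, which is why successive loops can send the same token to different experts.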
Parameter Efficiency
Research demonstrates that a 770 million parameter RDT can match the performance of a 1.3 billion parameter standard transformer trained on the same data. That is similar output quality with roughly 40% fewer parameters, which challenges a core assumption in AI scaling.
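The size of the saving follows directly from the two parameter counts quoted above; the variable names are only for illustration.

```python
rdt_params = 770e6        # recurrent-depth model
baseline_params = 1.3e9   # standard transformer

ratio = rdt_params / baseline_params   # fraction of the baseline size
savings = 1 - ratio                    # fraction of parameters saved

print(f"RDT uses {ratio:.0%} of the baseline parameters ({savings:.0%} fewer)")
```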
Latent Space Reasoning
All reasoning in RDT happens entirely in latent space with no intermediate tokens generated. Up to sixteen iterations run on the hidden state vectors, then a single output is produced. This differs fundamentally from chain-of-thought prompting, which makes reasoning visible through text. Because RDT operates in continuous space, it can represent multiple reasoning paths simultaneously, similar to breadth-first search.
Systematic Generalization
Experiments on systematic generalization show that RDT can handle combinations of knowledge it never saw during training, whereas standard transformers tend to fail when exact combinations are not in the dataset. Another test on depth extrapolation found that when models were trained on reasoning chains up to 20 steps then tested on 30-step problems, standard transformers collapsed while the recurrent model added more loops and continued.
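The depth extrapolation result can be pictured with a toy analogy, which is not the experiment from the research. Assume each layer or loop resolves one step of a reasoning chain: a model with a fixed depth of 20 stops 10 steps short on a 30-step problem, while a looped model simply runs the same step more times. The `links` table and `follow_chain` helper are hypothetical.

```python
def follow_chain(links, start, n_steps):
    # Apply the same "one hop" operation n_steps times.
    node = start
    for _ in range(n_steps):
        node = links[node]
    return node

# A chain 0 -> 1 -> 2 -> ... where the answer to an n-step problem is node n.
links = {i: i + 1 for i in range(40)}

FIXED_DEPTH = 20  # depth baked in at training time
fixed_answer = follow_chain(links, 0, FIXED_DEPTH)   # stops short on a 30-step problem
looped_answer = follow_chain(links, 0, 30)           # same step, more loops
```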
Stability Mechanisms
Recurrent architectures face stability issues where the hidden state can explode with too many loops. Open Mythos addresses this using linear time-invariant injection based on the Park K paper, which constrains the system so the hidden state remains stable regardless of loop count. Adaptive computation time prevents overthinking by giving each token a learned signal that decides when to stop looping.
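Both mechanisms can be shown in scalar form. The sketch below is an illustration under stated assumptions, not the Open Mythos implementation: the update `h = a * h + b * e` stands in for the linear injection, and `act_loops` follows the usual adaptive computation time rule of halting once the accumulated halting probability reaches a threshold. The coefficients and the `0.99` threshold are assumed values.

```python
def iterate(a, b, e, n_loops):
    # Linear time-invariant update: the same a and b on every loop.
    h = 0.0
    for _ in range(n_loops):
        h = a * h + b * e
    return h

# |a| < 1: the state converges to b * e / (1 - a), however many loops run.
stable = iterate(a=0.9, b=0.1, e=1.0, n_loops=1000)

# |a| > 1: the state grows without bound as loops are added.
unstable = iterate(a=1.1, b=0.1, e=1.0, n_loops=200)

def act_loops(halt_probs, threshold=0.99, max_loops=16):
    # Stop looping for a token once its accumulated halting
    # probability reaches the threshold, or at the loop cap.
    total = 0.0
    for step, p in enumerate(halt_probs[:max_loops], start=1):
        total += p
        if total >= threshold:
            return step
    return min(len(halt_probs), max_loops)

easy_token = act_loops([0.6, 0.5, 0.1])   # confident early, halts quickly
hard_token = act_loops([0.05] * 16)       # never confident, runs to the cap
```

Keeping the recurrence coefficient inside the unit interval is what makes the loop count safe to raise at inference time; the halting signal then decides how much of that budget each token actually uses.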
Memory Efficiency
Open Mythos uses multi-latent attention similar to DeepSeek, which compresses key-value pairs into a lower-rank representation and reduces memory usage by a factor of 10 to 20. Depth-wise LoRA adapters add small parameter modifications at each loop step, so even though base weights are shared, each iteration is not identical.
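A rough sketch of both ideas, with every dimension assumed for illustration. The first part compares how many numbers are cached per token per layer with full keys and values versus a single compressed latent vector; it ignores extras such as a separate positional key. The second part shows a shared weight matrix `W` receiving a different low-rank update `B @ A` at each loop.

```python
# Assumed attention dimensions, for illustration only.
N_HEADS = 32
HEAD_DIM = 128
LATENT_DIM = 512

standard_cache = 2 * N_HEADS * HEAD_DIM   # keys + values for every head
latent_cache = LATENT_DIM                 # one compressed vector per token
compression = standard_cache / latent_cache

def lora_weight(W, A, B):
    # Effective weight W + B @ A, where B is d_out x r and A is r x d_in.
    r = len(A)
    return [[W[i][j] + sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(len(W[0]))]
            for i in range(len(W))]

W = [[1.0, 0.0],
     [0.0, 1.0]]  # shared base weights, reused by every loop

# One rank-1 (A, B) adapter pair per loop step (hypothetical values).
loop_adapters = [
    ([[1.0, 0.0]], [[0.5], [0.0]]),
    ([[0.0, 1.0]], [[0.0], [0.25]]),
]

w_loop0 = lora_weight(W, *loop_adapters[0])
w_loop1 = lora_weight(W, *loop_adapters[1])
```

With these assumed sizes the cache shrinks by a factor of 16, inside the 10 to 20 range quoted above, and the two loops end up with different effective weights even though `W` itself is never copied.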
Scaling Hypothesis
The research suggests scaling might shift from training bigger models toward letting models think longer during inference. This represents a completely different direction from current approaches focused primarily on parameter count.
Moonshot AI Kimiko 2.6
Moonshot AI released Kimiko 2.6, a 1 trillion parameter model using mixture of experts with 384 experts, multi-head latent attention, SwiGLU activation, and a 400 million parameter vision encoder for multimodal capability. The model can spawn up to 300 agents for complex workflows, breaking tasks into sub-steps and executing them in parallel. Claw groups allow the model to bring humans into the loop, splitting tasks between AI agents and real people.
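The fan-out pattern described here, where a task is broken into sub-steps that run in parallel under an agent cap, can be sketched with `asyncio`. This is a generic illustration and not Moonshot AI's API: `run_agent` and `orchestrate` are hypothetical names, and the sleep is a placeholder for real agent work.

```python
import asyncio

MAX_AGENTS = 300  # cap taken from the figure quoted above

async def run_agent(step, limit):
    # The semaphore keeps at most MAX_AGENTS sub-tasks running at once.
    async with limit:
        await asyncio.sleep(0)  # placeholder for real agent work
        return f"done: {step}"

async def orchestrate(task, n_steps):
    limit = asyncio.Semaphore(MAX_AGENTS)
    steps = [f"{task} / step {i}" for i in range(n_steps)]
    # gather runs the sub-tasks concurrently and returns results in order.
    return await asyncio.gather(*(run_agent(s, limit) for s in steps))

results = asyncio.run(orchestrate("build report", 5))
```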
Kimiko Benchmark Performance
Company-reported benchmarks claim Kimiko 2.6 outperforms GPT 5.4 and Claude Opus 4.6 on multiple benchmarks. On HLE full, which contains approximately 2,500 doctorate-level questions across more than 100 fields, Kimiko 2.6 scored 54, Opus scored 53, and GPT 5.4 scored 52.1.
xAI Voice APIs
xAI released new speech-to-text (STT) and text-to-speech (TTS) APIs under the Grok ecosystem. The technology is already deployed in Tesla vehicles, Starlink support systems, and mobile apps. The STT supports 25 languages, real-time and batch transcription, speaker diarization, word-level timestamps, and 12 audio formats. The TTS offers five voices (Aura, Eve, Leo, Rex, and Sal) across 20 languages with expressive tags like laughter or sighs.
xAI Performance Claims
xAI reports that Grok's STT has a 5% error rate on phone call entity recognition, compared to ElevenLabs at 12%, Deepgram at 13.5%, and AssemblyAI at 21.3%. Pricing is $0.10 per hour for batch transcription, $0.20 for streaming, and $4.20 per 1 million characters for text-to-speech.
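To make the pricing concrete, here is a small cost estimator. It reads the figures as $0.10 per hour for batch transcription, $0.20 per hour for streaming (the per-hour unit for streaming is an assumption), and $4.20 per 1 million text-to-speech characters. The function name and the sample workload are hypothetical.

```python
BATCH_PER_HOUR = 0.10       # USD per audio hour, batch transcription
STREAM_PER_HOUR = 0.20      # USD per audio hour, streaming (unit assumed)
TTS_PER_MILLION = 4.20      # USD per 1,000,000 characters

def estimate_cost(batch_hours, stream_hours, tts_chars):
    # Sum the three usage-based charges for one billing period.
    return (batch_hours * BATCH_PER_HOUR
            + stream_hours * STREAM_PER_HOUR
            + tts_chars / 1_000_000 * TTS_PER_MILLION)

# Example workload: 100 h batch, 50 h streaming, 2 million TTS characters.
monthly = estimate_cost(batch_hours=100, stream_hours=50, tts_chars=2_000_000)
print(f"Estimated cost: ${monthly:.2f}")
```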
If You Remember Nothing Else
RDT reuses a smaller set of layers multiple times instead of stacking more layers, achieving similar performance with roughly 40% fewer parameters.
Reasoning happens entirely in latent space with no intermediate tokens, enabling multiple reasoning paths simultaneously.
Recurrent transformers handle novel knowledge combinations and extend reasoning depth beyond training, whereas standard transformers collapse.
Moonshot AI's Kimiko 2.6 can spawn 300 agents for parallel task execution.
xAI claims significantly lower error rates for its voice APIs than competitors, and the APIs are already deployed in production at scale.
Watch Out For
All benchmark performance numbers for Kimiko 2.6 are company-reported. Every company tends to highlight their strongest results.
xAI's voice API performance metrics are self-reported. ElevenLabs has years of optimization in voice quality and nuance that might not show up in benchmark tests.
The research on RDT efficiency is based on specific experiments. Real-world performance at scale remains to be proven.