Real-time AI is hitting a wall. As models grow more capable, their ability to process and generate responses quickly has not kept pace. The reason? Each token a model produces requires its own computational pass—a constraint that turns latency into a hard limit for applications demanding instant feedback, from chatbots to autonomous systems.

Now, a team of researchers has introduced multi-token prediction (MTP), a training method that sidesteps this limitation by teaching models to generate multiple tokens in a single forward pass. The approach doesn't rely on speculative decoding's auxiliary draft model or complex sampling strategies. Instead, it uses self-distillation, in which the model acts as its own teacher: its standard next-token predictions supervise its multi-token ones, refining predictions across several future positions at once. The result? A threefold reduction in inference time for the same computational budget.
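To make the idea concrete, here is a minimal sketch of how multi-token training targets differ from standard next-token targets. The function name and the simple k-token windowing are assumptions for illustration, not the researchers' actual implementation:

```python
def multi_token_targets(tokens, k):
    """For each position t, an MTP model is trained to predict the
    next k tokens, tokens[t+1 : t+1+k], rather than only tokens[t+1].
    Positions lacking k future tokens are dropped for simplicity."""
    return [tokens[t + 1 : t + 1 + k] for t in range(len(tokens) - k)]

# With k = 1 this reduces to ordinary next-token prediction.
print(multi_token_targets([10, 11, 12, 13, 14], 2))
# → [[11, 12], [12, 13], [13, 14]]
```

Because each training position now carries k supervision signals, the model learns to commit to several future tokens at once, which is what later allows it to emit them in one forward pass.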

The implications are immediate. In benchmarks, MTP-enabled models achieve 75% faster token generation than traditional autoregressive approaches while maintaining near-identical accuracy. For industries where latency directly impacts user experience—such as customer service bots or live translation tools—this could be a game-changer.


Unlike speculative decoding, which requires an auxiliary draft model and an intricate verify-and-accept decoding pipeline, MTP integrates seamlessly into existing training workflows. By adjusting a single hyperparameter, the number of tokens predicted per forward pass, the method scales easily across model sizes. Early tests suggest it outperforms speculative decoding in both simplicity and real-world performance, particularly in low-latency environments.
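A toy decode loop shows why this single hyperparameter translates directly into latency: if a forward pass emits k tokens, generating n tokens takes roughly n/k passes instead of n. The forward functions below are stand-ins for a real model, assumed purely for illustration:

```python
def generate(forward_fn, prompt, n_new):
    """Decode loop: call forward_fn until n_new tokens are emitted,
    counting forward passes. forward_fn returns one or more next tokens."""
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n_new:
        tokens = forward_fn(out)
        out.extend(tokens[: n_new - (len(out) - len(prompt))])
        passes += 1
    return out, passes

autoregressive = lambda seq: [0]        # standard: one token per pass
mtp_k4 = lambda seq: [0, 0, 0, 0]       # hypothetical 4-token MTP head

_, ar_passes = generate(autoregressive, [1, 2], 12)
_, mtp_passes = generate(mtp_k4, [1, 2], 12)
print(ar_passes, mtp_passes)  # → 12 3
```

Twelve forward passes drop to three, with no change to the decode loop itself beyond how many tokens each call returns.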

The shift toward multi-token prediction isn’t just about speed. It’s a fundamental rethinking of how models process information. If adopted widely, it could unlock new classes of applications where real-time interaction was once impossible—from ultra-low-latency search engines to AI-driven live event transcription. The question now isn’t whether models can handle more tokens faster, but how quickly the industry will embrace this new standard.