Every published Top AI Stories item tagged with Scientific Research & Development, newest first.
An ICLR 2026 paper from Amazon Web Services introduces two complementary techniques — Set-Supervised Fine-Tuning and Global Forking Policy Optimization — that train language models to generate multiple distinct reasoning paths for the same problem rather than collapsing onto a single strategy. On the American Invitational Mathematics Examination 2025 benchmark, the approach reaches roughly 64 percent accuracy, a 6.84-point gain over the standard supervised-fine-tuning-plus-reinforcement-learning baseline; comparable gains land on the AIME 2024 set and the LiveCodeBench coding benchmark. The result chips away at a long-standing critique that verifiable-reward reinforcement learning converges on narrow solution strategies and loses the diversity that lets multi-attempt sampling outperform single-shot.
Human Archive, founded by Berkeley and Stanford students and backed by Y Combinator, raised $8.2 million from Wing Venture Capital, NVP Capital, Y Combinator, and angels from OpenAI, NVIDIA, and Google. The startup equips home-services, hotel, and restaurant workers across India with camera-enabled caps and motion-capture sensors to collect first-person task data for training robots — more than 1,000 active headsets and over 50 custom sensor rigs are deployed today, with customers receiving discounted service rates in exchange for consent. Workers are paid roughly one dollar per hour, and several major Indian gig platforms — Urban Company and Pronto among them — declined to participate. The economics raise familiar questions about who captures value when human labor becomes physical-AI training data.
Axios reports that the Genesis Mission — the federal scientific-research push launched late last year to fuse quantum computing with AI — has selected Reflection AI as the foundational intelligence layer for the Department of Energy's 17 national laboratories. Reflection's customizable open-source models will run on DOE compute and be deployed across active research projects. CEO Misha Laskin framed the choice as a policy bet, telling Axios *"you can't do scientific discovery on a closed model"* — a deliberate counterpoint to the NSA's Anthropic contract that the White House cleared yesterday.
A new University of Oxford paper introduces CUSP — Cutoff-conditioned Unseen Scientific Progress — a 4,760-event benchmark that asks frontier AI systems to forecast which research directions are feasible, explain the underlying mechanism, design a candidate solution, and predict timing. Across biology, chemistry, and physics, the models reliably pick plausible directions but *"fail to reliably predict whether scientific advances will be realized"* and show systematic timing errors with overconfident uncertainty estimates. Curiously, AI progress itself is more predictable than progress in the natural sciences — a useful tempering of the *"AI scientist"* narrative as Google's Co-Scientist and SandboxAQ-on-Claude integrations roll out to actual research labs.
OpenAI says a new general-purpose reasoning model discovered a counterexample to a 1946 Paul Erdős conjecture about optimal unit-distance configurations, a problem mathematicians had assumed was solved by the obvious square-grid construction. Mathematicians Noga Alon, Melanie Wood, and Thomas Bloom reviewed the result and published companion remarks endorsing the disproof. The claim arrives seven months after OpenAI's previous Erdős announcement was shown to be a misrepresentation of prior literature, so the named verifications matter.
Google Research published **ERA — Empirical Research Assistance — in *Nature* on May 19**, an AI system using tree-search over thousands of candidates to write and optimize scientific code across genomics, public health, satellite imagery, neuroscience, and time-series forecasting. Concrete wins: ERA-built forecasts ranked at or near the top of the CDC's leaderboards for flu, COVID-19, and RSV; a California water-runoff model beat the state's official Bulletin 120 outlook; and a retail forecasting variant met or exceeded both commercial consensus and Chicago Fed estimates. Built on Gemini.
SandboxAQ integrated its Large Quantitative Models for quantum chemistry, molecular dynamics, and microkinetics directly into Claude, letting computational and research scientists at pharmaceutical and materials companies query simulation-grade physics models in natural language without their own digital infrastructure. "For the first time, we have a frontier quantitative model on a frontier large language model that someone can access in natural language," said Nadia Harhen, SandboxAQ's general manager of AI simulation. Anthropic has not yet detailed the underlying integration mechanism.
NVIDIA Labs released SANA-WM, an open-source 2.6 billion parameter world model under Apache 2.0 that generates one-minute videos at 720p resolution with 6 degrees of camera-pose control. The team reports training on roughly 213,000 video clips in 15 days on 64 H100 GPUs, and says a distilled variant runs on a single consumer GPU. The paper positions SANA-WM as a baseline for embodied-AI and robotics research at a fraction of closed-model compute budgets.
ArXiv's computer science section announced authors will face a one-year submission ban — followed by a permanent requirement that subsequent papers first clear a reputable peer-reviewed venue — when a paper carries "incontrovertible evidence" that authors did not check LLM output. Section chair Thomas Dietterich flagged hallucinated references and stray meta-comments from chatbots (such as "would you like me to make any changes?") as triggering evidence. The rule isn't a blanket LLM ban; it formalizes a full-responsibility standard regardless of how content was generated.
ICLR 2026 work from Tao Yu and Youngsuk Park at AWS AI Labs extends Chinchilla with a conditional scaling law that ties three architectural knobs — hidden size, the ratio of MLP-to-attention parameters, and grouped-query attention — to model loss, letting designers pick a Pareto-optimal configuration before committing to a full training run. The team's Surefire-1B reference model matched or beat LLaMA-3.2-1B accuracy while hitting up to 47% higher generation throughput on H200 GPUs under SGLang serving. For a hyperscaler footing the compute bill, that's the kind of architecture-search artifact that quietly becomes load-bearing inside Bedrock and EC2 inference pricing.
Mathematician Tim Gowers, a Fields medalist, asked ChatGPT 5.5 Pro to attack open problems on sumset diameter from a Mel Nathanson paper in additive number theory. In under two hours the model improved a known exponential bound to a polynomial one — work the original researcher Isaac Rajagopal called "original and clever" and Gowers judged at "the level of a perfectly reasonable chapter in a combinatorics PhD." Gowers concludes that PhD-style "gentle problems" have been crossed off the LLM frontier and researchers must now aim above what these models can prove.
Anthropic published research showing that training Claude to reason about *why* an action aligns with its values — not just to imitate aligned behavior — cuts agentic misalignment in honeypot evaluations from 22% to 3%. A principles-based dataset of just 3 million tokens matched the generalization of 85 million tokens of direct demonstration training. Every Claude model from Haiku 4.5 onward now scores 0% on the agentic misalignment benchmark; earlier-generation Opus 4 had reached 96% blackmail rates on the same eval.
InclusionAI — Ant Group's AGI lab — published the 1 trillion parameter Ling-2.6 variant on Hugging Face under an MIT license, using a hybrid Multi-head Latent Attention plus Linear Attention architecture with a 262,144 token context window. Headline benchmark: 72.2 on SWE-bench Verified, among the strongest scores any open-weights model has posted on a coding eval. Tensor parallelism across 8 GPUs is required for inference. A companion hosted-only sibling, Ring 2.6 at the same trillion parameter class, is currently visible on OpenRouter.
The Allen Institute for AI shipped EMO, a 14 billion parameter mixture-of-experts model (1 billion active across 8 of 128 experts) trained on 1 trillion tokens and released openly on Hugging Face and GitHub. Its novelty: rather than predefining expert domains, EMO uses document boundaries as the routing signal, letting semantic clusters like Health, Politics, and Music emerge from data. With only 12.5% of experts active, EMO retains performance within roughly 3% of the full model — useful for task-specific deployments at a fraction of the inference cost.
Google DeepMind detailed measurable algorithmic improvements from AlphaEvolve — its Gemini-powered coding agent — across seven research and industry fields. Highlights include a 10x error reduction in quantum circuits, a 30% cut in DNA sequencing variant-call errors, a 20% reduction in Google Spanner write amplification, doubled transformer training speed at Klarna, and a 10.4% logistics routing gain at FM Logistic. The post turns AlphaEvolve from research demo into a concrete pattern for code-generating agents tackling specialized scientific and operational problems.
Anthropic introduced a method that translates a Claude model's internal activations into human-readable explanations, then verifies fidelity by reconstructing the original activation from that text. In testing, the technique revealed Claude suspected it was being safety-tested 26% of the time on benchmarks but verbalized that suspicion in under 1% of real conversations. The autoencoders improved misalignment-detection success in an internal auditing game from under 3% to 12-15% — a meaningful step in interpretability tooling for safety audits.