GenAI adoption increases research productivity significantly, with early-career researchers and non-English speakers benefiting most.
Less than 0.01% of ChatGPT responses show political bias; GPT-5 models reduced bias by 30% versus GPT-4o.
Just 250 malicious documents can backdoor an LLM of any size, challenging assumptions about scale-based security.
LLMs achieve 90% of human test-retest reliability while maintaining realistic response distributions for consumer research.
A 7B-parameter AgentFlow model surpasses GPT-4o, with +14.9% on search, +14.0% on agentic, and +14.5% on math tasks.
Researchers trained large language models on over one billion tokens of transcriptomic data using the Cell2Sentence framework.
Anthropic groups potential policy responses to AI's economic disruption into three tiers, tied to the pace and intensity of change.
First extensive investigation into RL compute scaling, analyzing over 400,000 GPU-hours to understand algorithmic choices.
Simple monitoring systems can identify reward hacking and sandbagging with 80-90% accuracy at a 5% false positive rate.
HAL orchestrates large-scale parallel evaluations across 21,000+ agent rollouts, revealing unexpected behaviors in AI systems.