AI
The Silent Crisis in AI Evaluation: Why Our Benchmarks Are Blind to the Next Leap
5/20/2026

AI
OpenAI updates Agents SDK with native sandboxing and an in-distribution harness for deploying and testing agents on long-horizon tasks
4/16/2026

AI
QCon London 2026: Reliable Retrieval for Production AI Systems
3/17/2026

Cloud
Evaluating Azure Local: Strategies for Testing and Deployment
3/17/2026

AI
AI Agent Evaluation: Building Quality into Your Cloud-Native Ecosystem
3/17/2026

AI
Evaluating AI Agents in Practice: Benchmarks, Frameworks, and Lessons Learned
3/16/2026

AI
Improving LLM-as-a-Judge Evaluators: Calibration, Bias Mitigation, and Statistical Validation
3/5/2026

AI
Microsoft Open Sources Evals for Agent Interop Starter Kit to Benchmark Enterprise AI Agents
2/27/2026

AI
AI's Mathematical Renaissance: How Reasoning Models Are Transforming Mathematics and Evaluation
2/17/2026

AI
When AI Takes the Couch: Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models
2/5/2026

LLMs
FACTS Benchmark Suite Introduced to Evaluate Factual Accuracy of Large Language Models
1/12/2026

LLMs
LLM poetry and the 'greatness' question
1/11/2026