Making Claude a Chemist: Anthropic's Quiet Push Into Scientific AI

Trends Reporter

Anthropic just showed a general-purpose language model can match dedicated chemistry software at predicting NMR spectra, hinting at a future where AI handles the grunt work of molecular analysis.

Anthropic published a white paper last week that signals something worth paying attention to: their latest Claude models can predict NMR spectra as accurately as ChemDraw and MestReNova, the tools chemists have relied on for decades. This is not a flashy demo or a hypothetical roadmap. It is a measured, benchmarked comparison showing a general-purpose AI model performing a task that previously required specialized, domain-locked software.

The work, led by Anthropic chemist David Kamber, tested three Claude models (Opus 4.7, Opus 4.6, and Sonnet 4.6) against the two dominant NMR prediction tools on 20 compounds pulled from recent ChemRxiv preprints. The compounds were chosen to represent distinct structural challenges: chloropyridazines with tricky hydrogen bonding, Boc-protected maleimides with unusual carbonyl environments, spiroketones with diastereotopic protons, and silyl methanesulfonamides with shielded carbon atoms. Each family probes a different failure mode that trips up prediction algorithms.

The results land in a zone that should make both AI skeptics and chemistry traditionalists reconsider their assumptions. On hydrogen NMR prediction, Opus 4.7 achieved an average error of 0.079 ppm, well under the 0.20 ppm tolerance within which chemists consider a predicted shift correct, and outperformed both ChemDraw and MestReNova on peak placement accuracy. On carbon NMR, it essentially tied with MestReNova at around 1.4 ppm average error. The model also matched splitting patterns more often than either dedicated tool, and predicted sub-peak spacing (the coupling constants) within 0.5 Hz roughly 80 percent of the time, compared to 26 to 35 percent for the classical software.
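For readers who want to see what an "average error" and a 0.20 ppm tolerance mean in practice, here is a minimal sketch of how such figures are typically computed. It assumes predicted and observed shifts have already been paired peak by peak, which is an assumption of this example and not a description of the white paper's pipeline. The function name and the sample shifts are hypothetical.

```python
import math

def prediction_errors(predicted_ppm, observed_ppm, tolerance_ppm=0.20):
    """Return MAE, RMSE, and the within-tolerance fraction for paired shifts.

    Assumes the two lists are already aligned peak by peak (hypothetical setup).
    """
    if len(predicted_ppm) != len(observed_ppm) or not predicted_ppm:
        raise ValueError("need two non-empty lists of equal length")
    abs_errors = [abs(p - o) for p, o in zip(predicted_ppm, observed_ppm)]
    n = len(abs_errors)
    mae = sum(abs_errors) / n                             # mean absolute error
    rmse = math.sqrt(sum(e * e for e in abs_errors) / n)  # penalizes large misses
    within = sum(e <= tolerance_ppm for e in abs_errors) / n
    return {"mae": mae, "rmse": rmse, "within_tolerance": within}

# Hypothetical shifts for illustration only; not data from the white paper.
predicted = [7.26, 3.81, 2.05, 1.22]
observed = [7.31, 3.75, 2.35, 1.20]
print(prediction_errors(predicted, observed))
```

In this toy case one peak misses by 0.30 ppm, so three of four peaks fall within tolerance even though the mean error stays near 0.11 ppm. That is why the paper's summary reports both kinds of figure.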

A graph of the four scaffold classes

A graphic depicting the per-tool MAE/RMSE summary across 20 compounds

What makes this noteworthy is not that an AI model can do chemistry. Machine learning tools have been applied to NMR prediction for years, and specialized neural networks have shown promise in academic benchmarks. The difference here is that Claude is a general-purpose language model without chemistry-specific fine-tuning. It was not trained on NMR datasets or optimized for spectral prediction. It arrived at these results through the same reasoning capabilities it uses for writing code or analyzing documents. That suggests the underlying architecture and training data have captured something fundamental about how molecular structure maps to electromagnetic response.

The inverse task, structure elucidation from spectra, is where the results get more interesting and more speculative. Anthropic gave Opus 4.7 fifteen problems: given an NMR spectrum and molecular formula, propose the correct structure. On eight simpler targets, single-ring or two-fragment molecules, Claude recovered the correct structure on all three attempts using only spectra and mass spectrometry data. On seven denser targets involving fused rings and spirocycles, it needed a hint, the starting material structure, but still returned the correct answer on all three runs for four of those seven and on two of three for the remaining three.
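The run counts above are easier to compare when tallied. The short sketch below does only that arithmetic, using the figures quoted in this article and keeping the hinted and unhinted problems separate because they were run under different conditions. The variable and function names are illustrative.

```python
RUNS_PER_PROBLEM = 3

# (number of problems, correct runs per problem), as reported in the article.
no_hint_targets = [(8, 3)]         # simpler targets: all three runs correct
hinted_targets = [(4, 3), (3, 2)]  # denser targets, given the starting material

def tally(groups, runs_per_problem=RUNS_PER_PROBLEM):
    """Return (correct runs, total runs) for a list of (problems, hits) groups."""
    correct = sum(count * hits for count, hits in groups)
    total = sum(count for count, _ in groups) * runs_per_problem
    return correct, total

for label, groups in [("no hint", no_hint_targets), ("with hint", hinted_targets)]:
    correct, total = tally(groups)
    print(f"{label}: {correct}/{total} runs correct ({correct / total:.0%})")
```

That works out to 24 of 24 runs on the simpler targets and 18 of 21 on the denser ones, with every problem solved on at least two of its three runs.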

Chart showing the structure elucidation

This is the part of the paper that feels like it is reaching beyond what the current evidence supports, and Anthropic acknowledges this. The evaluation was small, twenty compounds for forward prediction and fifteen for inverse, and each structural class contributes a single type of failure mode. The authors are explicit that these numbers should be read as indicative rather than precise. A proper evaluation would require hundreds of compounds across dozens of scaffold classes, with enough within-class variation to separate model performance from structural difficulty.
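To see why a set of twenty compounds yields only indicative numbers, consider the uncertainty on a simple success rate. The sketch below computes a Wilson score interval, a standard confidence interval for a proportion. It treats each compound as an independent pass/fail trial, which is a simplifying assumption of this example, and the counts are hypothetical, not results from the paper.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials ** 2)) / denom
    return center - half, center + half

# Hypothetical counts: the same 80 percent success rate at two sample sizes.
for successes, trials in [(16, 20), (160, 200)]:
    low, high = wilson_interval(successes, trials)
    print(f"{successes}/{trials}: {low:.2f} to {high:.2f}")
```

At the same observed rate, the interval for twenty trials is roughly three times as wide as the interval for two hundred. That is the statistical reason a larger benchmark is needed before the headline figures can be treated as precise.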

There are also gaps in what was tested. The assessment covered only three solvents, DMSO-d6, CDCl3, and D2O, leaving methanol-d4, benzene-d6, and acetone-d6 untouched. 2D NMR experiments like COSY, HSQC, and HMBC, which provide connectivity information that 1D spectra cannot, were out of scope. Stereochemistry, which often requires NOE data or coupling constant analysis, was not evaluated. Complex natural products, the compounds where structure elucidation is most laborious and most valuable, were excluded entirely.

The counter-argument writes itself: this is a carefully constructed benchmark on a narrow slice of chemistry, and real-world structure elucidation is messier than any controlled evaluation can capture. Chemists do not just need to match peaks to atoms. They need to account for sample impurities, instrument calibration differences, concentration effects, temperature dependence, and the dozens of practical variables that make every real spectrum slightly different from its theoretical prediction. Dedicated NMR software handles some of these through user-configurable parameters. Claude, at least in this evaluation, was given clean inputs and expected to produce clean outputs.

But the counter-argument to the counter-argument is that this is exactly how progress works in applied science. You start with controlled benchmarks, demonstrate feasibility, identify failure modes, and then expand the scope incrementally. Anthropic is not claiming Claude replaces chemists or even replaces NMR software. They are claiming it can now participate in the workflow as a useful tool for routine spectral analysis and, with some caveats, structure elucidation.

A chart depicting within-tolerance accuracy per compound

The broader context matters here. Chemistry has been one of the harder scientific domains for AI to crack, not because the underlying physics is intractable, but because the data is scattered, inconsistent, and often locked behind journal paywalls. The CAS registry contains over 290 million substances, but structured, annotated spectral data exists for only a fraction of those. Traditional machine learning approaches have struggled with this sparsity. Frontier language models, trained on the full breadth of scientific literature, patents, and preprints, may have a structural advantage: they can reason about chemistry from first principles and textual context rather than requiring curated datasets.

Anthropic lists four areas where they plan to push next: chemical structure reading and rendering, synthetic route planning, reaction mechanism explanation, and chemical literature understanding. These are progressively harder problems. NMR prediction is relatively constrained, with well-defined inputs and outputs. Retrosynthesis planning requires navigating vast combinatorial spaces with incomplete information. Mechanism explanation demands understanding electron flow, thermodynamics, and kinetics simultaneously. Literature understanding requires parsing the messy, inconsistent way chemists actually communicate.

The paper is modest in its claims and specific about its limitations, which is refreshing in a field where press releases often outrun the evidence. Anthropic is not saying Claude is a chemist. They are saying Claude can now do one of the things chemists spend a lot of time doing, and do it competently. That is a smaller claim, but a more credible one, and potentially more useful in practice.

The question worth watching is whether this pattern holds as the evaluation expands. If Claude maintains competitive performance across hundreds of compounds and dozens of structural families, it changes the economics of routine spectral analysis. If performance degrades on more complex or unusual scaffolds, it defines the current boundary of what general-purpose models can do in chemistry. Either way, the field now has a clearer benchmark for where AI-assisted chemistry actually stands, not where we hope it will stand.

The full white paper is available on Anthropic's research blog, and the evaluation compounds were drawn from preprints on ChemRxiv. Anthropic is accepting applications for their AI for Science program at [email protected] for researchers working on problems where Claude could help with multimodal scientific reasoning.
