When Peer Review Meets Code: The Hidden Challenge of Vetting Research Software
The Unseen Workload of Peer‑Reviewed Software
In the traditional scientific workflow, a paper is written, a peer reviewer reads it, and an editor decides whether to accept it. The peer‑review process is designed to catch errors in reasoning, methodology, and data analysis. But when the methodology is a simulation or a diagnostic algorithm written in MATLAB, Python, or C++, the reviewer’s job turns into a code audit.
The blog post by Mirawelner (2024) highlights a real‑world scenario: a research team translating a 20‑year‑old MATLAB codebase into C++ to detect heart arrhythmia. The author notes that the original code was written by graduate students with little software‑engineering training, resulting in a tangled, undocumented codebase that would be daunting even for seasoned developers.
"The code that goes along with my spectroscopy project from Purdue … doesn’t actually do anything that hasn’t been done before—it just does it with less data. The ‘output’ is just a plot describing what happened inside the code. If you look at the ‘output,’ all you can surmise is that the code displays plots, suggesting that the algorithm being simulated works the way we say it will. It would be entirely possible to write code that fakes those plots…" – Mirawelner, 2024
This excerpt underscores a key issue: the output of simulation code is easy to mimic. A reviewer who sees only the plotted results cannot confirm that the underlying algorithm is correct or that the implementation faithfully follows the paper’s description.
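To make the concern concrete, consider a deliberately contrived sketch (hypothetical, not taken from the post): a short program that prints a plausible‑looking “convergence” curve computed from a cosmetic formula, with no algorithm behind it. Plotted, its output is indistinguishable from the output of a genuine simulation.

// Hypothetical C++ sketch: output that *looks* like a result but is not.
// The "error" values come from a hand-picked decay formula, not from any
// algorithm, yet a plot of this data would appear perfectly convincing.
#include <cmath>
#include <cstdio>

int main() {
    for (int iter = 1; iter <= 100; ++iter) {
        const double fake_error = std::exp(-0.08 * iter) + 1e-3;  // cosmetic decay curve
        std::printf("%d %.6f\n", iter, fake_error);
    }
    return 0;
}

Running such a program on a reviewer’s machine proves only that it runs; it says nothing about whether the claimed method was ever implemented.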
Why Peer Review of Code Is Hard
- Expertise Gap – Peer reviewers are typically domain experts (physicists, biologists, etc.) rather than software engineers. They may lack the skills to understand low‑level implementation details.
- Time Constraints – Reviewers already juggle multiple manuscripts. Diving into hundreds of lines of legacy code is a significant time investment.
- Reproducibility vs. Verification – Running the code on the reviewer’s machine can confirm that it runs, but not that it matches the scientific claim.
- Incentive Structure – Reviewers are not paid, and the academic reward system rarely values code review. As the author notes, "reviewers would have to look for hidden bugs in huge codebases… such an undertaking is difficult."
A Technical Illustration
Consider a simple MATLAB snippet that plots the closed‑form response of a damped harmonic oscillator:
% Original MATLAB code
t = 0:0.01:10;                   % time vector
x0 = 1; v0 = 0;                  % initial position and velocity
k = 0.5; c = 0.1; m = 1;         % stiffness, damping coefficient, mass
% Analytical solution for the underdamped case (c^2 < 4*k*m)
gamma = c/(2*m);                 % decay rate
wd = sqrt(k/m - gamma^2);        % damped natural frequency
x = exp(-gamma*t).*(x0*cos(wd*t) + (v0 + gamma*x0)/wd*sin(wd*t));
plot(t, x); xlabel('t'); ylabel('x(t)')
Translating this to C++ requires careful handling of numerical integration, floating‑point precision, and memory management. A reviewer who only sees the final plot cannot know whether the C++ implementation uses the same equations or a different discretization.
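As a minimal sketch of what such a translation might look like (assuming, purely for illustration, a semi‑implicit Euler time‑stepping scheme; this is not code from the arrhythmia project), the C++ version below integrates the same equation of motion numerically rather than evaluating the closed‑form solution:

// Hypothetical C++ translation of the MATLAB snippet above, using a
// time-stepping scheme instead of the closed-form expression. Sketch only.
#include <cstdio>

int main() {
    const double k = 0.5, c = 0.1, m = 1.0;    // same parameters as the MATLAB code
    const double dt = 0.01, t_end = 10.0;
    double x = 1.0, v = 0.0;                   // initial conditions x0, v0

    // Semi-implicit (symplectic) Euler: a different discretization than the
    // analytical MATLAB expression, yet the resulting curve looks nearly
    // identical at this step size.
    for (double t = 0.0; t <= t_end; t += dt) {
        std::printf("%.2f %.6f\n", t, x);      // print t, x(t) for plotting
        const double a = (-k * x - c * v) / m; // acceleration from F = -kx - cv
        v += a * dt;
        x += v * dt;
    }
    return 0;
}

Both versions produce a decaying cosine; only by reading the source (or comparing the curves quantitatively) can a reviewer tell that one integrates the equation of motion step by step while the other evaluates a closed‑form formula.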
Potential Solutions
- Mandatory Software Submissions – Some leading journals already require code availability for reproducibility. Extending this requirement to all journals would put the code in front of reviewers and push authors toward cleaner, documented code.
- Automated Test Suites – Authors should provide unit and integration tests that validate key physical or statistical properties, not just that the code runs; reviewers can then execute these tests locally (see the sketch after this list).
- Open‑Source Review Platforms – Open review platforms such as OpenReview (openreview.net), paired with a public code repository where reviewers can comment on specific files and lines, would foster a collaborative audit process.
- Domain‑Specific Code Reviewers – Journals could maintain a roster of software‑engineering experts who specialize in scientific code.
- Incentivization – Journals could publish reviewer contributions or offer badges for thorough code reviews, similar to GitHub’s review system.
- Education and Training – Embedding software‑engineering modules in PhD curricula would raise the baseline quality of research code.
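As a sketch of the automated‑test idea (hypothetical, tied to the oscillator example above rather than to any real submission), the test below checks a physical property — damping must never increase the mechanical energy — and cross‑checks the numerical result against the analytical solution:

// Hypothetical property test for the semi-implicit Euler sketch above.
// It verifies physics (energy never increases under damping) and accuracy
// (agreement with the analytical solution), not merely that the code runs.
#include <cassert>
#include <cmath>

int main() {
    const double k = 0.5, c = 0.1, m = 1.0, dt = 0.01;
    const int steps = 1000;                              // integrate to t = 10
    double x = 1.0, v = 0.0;
    double prev_energy = 0.5 * k * x * x + 0.5 * m * v * v;

    for (int i = 0; i < steps; ++i) {
        const double a = (-k * x - c * v) / m;
        v += a * dt;
        x += v * dt;
        const double energy = 0.5 * k * x * x + 0.5 * m * v * v;
        assert(energy <= prev_energy + 1e-12);           // damping must not add energy
        prev_energy = energy;
    }

    // Cross-check the final position against the analytical solution (x0 = 1, v0 = 0).
    const double t = steps * dt;
    const double gamma = c / (2.0 * m);
    const double wd = std::sqrt(k / m - gamma * gamma);
    const double x_exact = std::exp(-gamma * t) *
        (std::cos(wd * t) + (gamma / wd) * std::sin(wd * t));
    assert(std::fabs(x - x_exact) < 2e-2);               // loose tolerance for a first-order scheme
    return 0;
}

A reviewer can run a test like this in seconds; a failing assertion points to a concrete discrepancy instead of asking the reviewer to audit the whole codebase line by line.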
The Future of Peer Review for Software‑Intensive Research
The tension between scientific rigor and software quality is not a new one, but the stakes are higher now that many discoveries depend on complex code. As the author concludes, "I think there is a way to solve this problem, but it is not so trivial as requiring reviewers to inspect the simulation code that goes along with the paper. They aren't going to do that unless you pay them or incentivize them somehow."
A pragmatic path forward blends automation (tests, continuous integration) with human oversight (expert reviewers, open‑review comments). By treating code as a first‑class research artifact—subject to the same scrutiny as data and theory—journals can preserve the credibility of scientific claims while acknowledging the realities of modern research software.
Source: Mirawelner, “How Should We Peer Review Software?” (2024).