
For many developers, PyTorch represents both opportunity and intimidation. As one developer discovered, "PyTorch is the biggest codebase I have ever had to deal with, and the first time I opened it I wasn't trying to contribute at all, I was just trying to figure out why my PyTorch code was not working."

This common experience—frustration leading to contribution—forms the basis of a valuable lesson about engaging with large open-source projects. What began as a debugging session for a NumPy compatibility issue evolved into a systematic approach for understanding and improving the PyTorch ecosystem.

## The Journey Begins: Debugging in the Depths

The developer's entry point into PyTorch was familiar territory: broken code. A NumPy upgrade had introduced a new numpy.bool_ scalar type that PyTorch's interop code couldn't handle, resulting in the error:

```
TypeError: 'numpy.bool' object cannot be interpreted as an integer
```
This wasn't just a theoretical problem. PyTorch's own NumPy tests were failing (tracked in issue #157973), and downstream projects like pyhf—a statistical modeling library used in high-energy physics—were forced to pin their NumPy dependency just to keep CI running.

"Chasing those problems pulled me into parts of PyTorch I'd never expected to touch," the developer explains. "And along the way I ended up shipping three upstream fixes and effecting a small change in pyhf itself."

## The Three-Step Approach to Contribution

Through this process, a pattern emerged that would guide subsequent contributions: a repeatable loop of three steps, Stabilize, Isolate, Generalize.

### Step 1: Stabilize

The first contribution addressed the NumPy compatibility issue directly. In PR #158036, the developer fixed the root cause in PyTorch's Python C API:

```cpp
// When PyTorch sees a NumPy scalar, it now checks explicitly whether it is a
// numpy.bool_ and calls PyObject_IsTrue to get its truth value instead of
// treating it as an integer
if (torch::utils::is_numpy_bool(scalar)) {
  return PyObject_IsTrue(scalar);
}
```

This targeted fix included specific tests to ensure future refactors wouldn't silently reintroduce the bug. This stabilization step addressed the immediate pain point while preventing future regressions.
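The same idea can be sketched at the Python level. This is not PyTorch's actual code path; `to_int_scalar` is a hypothetical helper invented here to illustrate why boolean scalars need their own branch:

```python
import numpy as np

def to_int_scalar(x):
    # Hypothetical helper mirroring the spirit of the fix: boolean scalars
    # are converted via their truth value rather than being routed through
    # the integer-conversion path that produced the TypeError.
    if isinstance(x, (bool, np.bool_)):
        return 1 if bool(x) else 0
    return int(x)

print(to_int_scalar(np.bool_(True)))  # 1
print(to_int_scalar(np.int64(7)))     # 7
```

The explicit `isinstance` branch is the whole trick: once booleans are handled first, every remaining scalar can safely take the integer path.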

### Step 2: Isolate

The second class of issues involved mysterious compiler errors from torch.compile. When users called ndarray.astype("O") or astype(object), Dynamo would fail with unhelpful error messages:

```
torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors:
... got TypeError("data type 'O' not understood")
```

The solution, implemented in PR #157810, was to add an explicit guardrail:

```python
# When NumpyNdarrayVariable.call_method sees .astype("O") or .astype(object)
if dtype == "O" or dtype == object:
    raise torch._dynamo.exc.Unsupported(
        "object-dtype NumPy arrays are not supported by torch.compile"
    )
```

This isolation step transformed an opaque internal failure into a clear, actionable error message. The change also revealed a previously "expected to fail" test that could now be updated to reflect the correct behavior.
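The guardrail pattern generalizes beyond Dynamo. As a standalone sketch (the `compile_friendly_astype` function and `UnsupportedDtype` error are invented for illustration, standing in for `torch._dynamo.exc.Unsupported`), the shape of the check looks like this:

```python
import numpy as np

class UnsupportedDtype(TypeError):
    """Stand-in for torch._dynamo.exc.Unsupported in this sketch."""

def compile_friendly_astype(arr, dtype):
    # Reject object dtype up front with an actionable message instead of
    # letting a lower layer fail with an opaque internal error.
    if dtype == "O" or dtype is object:
        raise UnsupportedDtype(
            "object-dtype NumPy arrays are not supported here; "
            "use a concrete numeric dtype instead"
        )
    return arr.astype(dtype)

print(compile_friendly_astype(np.arange(3), "float64"))  # [0. 1. 2.]
```

Validating at the boundary costs one comparison; debugging the same failure three layers down costs an afternoon.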

### Step 3: Generalize

The third contribution addressed a more subtle issue with torch.func.jacfwd failing when encountering torch.nn.functional.one_hot inside torch.compile(dynamic=True). The existing implementation used a scatter-based approach that confused shape inference and dynamic tracing.

In PR #160837, the developer rewrote the behavior using a purely functional approach:

```cpp
// Instead of building a zeros tensor and scattering,
// the new implementation uses an elementwise comparison
eq(self.unsqueeze(-1), arange(num_classes)).to(kLong)
```

This generalization step didn't just fix one failing test—it redesigned the operation to work predictably across the entire transform ecosystem: vmap, jacfwd, torch.compile(dynamic=True), and related functions. The change required updates to C++ batch rules and comprehensive testing across different execution modes.
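The comparison-based idea is easy to see in a NumPy analogue (this is an illustrative sketch, not PyTorch's C++ batch-rule code): broadcasting an equality test against `arange(num_classes)` yields the one-hot matrix with no intermediate buffer and no scatter.

```python
import numpy as np

def one_hot_functional(indices, num_classes):
    # Purely functional one-hot: broadcast an elementwise comparison
    # instead of allocating zeros and scattering into them. The output
    # shape is a pure function of the input shapes, which is exactly
    # what shape inference and dynamic tracing need.
    return (indices[..., None] == np.arange(num_classes)).astype(np.int64)

print(one_hot_functional(np.array([0, 2, 1]), 3))
# [[1 0 0]
#  [0 0 1]
#  [0 1 0]]
```

Because the result is computed in one expression with no in-place mutation, every transform that reasons about the graph sees the same simple data flow.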

## Lessons from the Trenches

Working inside PyTorch wasn't glamorous, the developer admits. "It felt like reading unfamiliar subsystems until my brain couldn't take it anymore, running the same failing test again and again, and slowly building a mental map of where things lived."

What made the process productive was treating every review comment and CI failure as a clue, not a verdict. The maintainers weren't asking for perfection—they valued clarity, good tests, and changes that fit the project's design.

Communication proved essential. The first pull request attracted over forty comments from PyTorch maintainers. "At first, it felt overwhelming, but actively engaging with their feedback—asking clarifying questions, explaining my thought process, and being open to suggestions—really helped move the discussion forward."

## The Path Forward

"If you want to contribute to a large project like PyTorch, you do not need to wait until you feel like an expert," the developer concludes. "You can start from the same place I did, with a real bug that affects you and a desire to make it go away in a principled way."

The three-step approach—Stabilize, Isolate, Generalize—scales from tiny bug fixes to significant features. It provides a framework for navigating any codebase that feels larger than you are:

  1. Stabilize what you can see
  2. Isolate the behavior you want to change
  3. Generalize the fix so it helps more than one user

This approach transforms the intimidating prospect of contributing to massive projects into an achievable, methodical process. As the developer discovered, "you can make meaningful changes even when the system feels too big to hold in your head."

Source: Michael Gathara's Blog