Simon Willison ships the first public version of inaturalist-clumper, a Python utility that groups nearby iNaturalist sightings into spatial “clumps”. The release bundles a simple CLI, a JSON output format, and a few heuristics for handling duplicate records, but it remains an early‑stage prototype that assumes static data and offers limited control over clustering parameters.
What’s claimed
The author announces inaturalist-clumper 0.1, a small library that “groups iNaturalist sightings into clumps”. It is presented as part of the workflow that powers a personal blog where the author republishes his own observations. The release note points to an example JSON file that shows how the tool aggregates nearby records.
What’s actually new
- A command‑line interface – `inaturalist-clumper` can be invoked with a CSV or JSON export from iNaturalist and will emit a new JSON file where each entry contains a `clump_id` and a list of the original observation IDs that belong to that clump.
- Simple spatial heuristic – The current algorithm uses a fixed radius (default 500 m) and a basic "first‑come, first‑served" assignment: the first observation in a batch becomes the seed of a clump, and any later observation whose great‑circle distance to the seed is below the radius is added to that clump. If an observation falls outside all existing seeds, it starts a new clump.
- Duplicate handling – Observations that share the same `taxon_id` and are within the radius are collapsed into a single representative entry, reducing noise when a user uploads multiple photos of the same organism.
- Packaging – The project is published on PyPI and the source lives on GitHub (github.com/simonw/inaturalist-clumper). Installation is as simple as `pip install inaturalist-clumper`.
- Example output – The author provides a sample JSON file that demonstrates the schema: each clump includes the centroid coordinates, a list of member observation IDs, and a count of distinct taxa.
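The release note does not include the library's source, but the described heuristic is simple enough to sketch. The following is a minimal illustration of the first‑come, first‑served radius check, not the actual implementation; the function and field names (`clump`, `haversine_m`, `id`/`lat`/`lon` keys) are assumptions for the example:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    r = 6_371_000  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def clump(observations, radius_m=500):
    """Greedy single pass: each observation joins the first existing clump
    whose seed is within radius_m; otherwise it seeds a new clump."""
    clumps = []  # each clump: {"seed": (lat, lon), "ids": [...]}
    for obs in observations:
        for c in clumps:
            if haversine_m(obs["lat"], obs["lon"], *c["seed"]) < radius_m:
                c["ids"].append(obs["id"])
                break
        else:
            clumps.append({"seed": (obs["lat"], obs["lon"]), "ids": [obs["id"]]})
    return clumps
```

Two observations roughly 110 m apart end up in one clump, while an observation a degree of latitude away seeds its own.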
Limitations and open questions
- Static radius only – The radius is hard‑coded (or set via a single CLI flag). Real‑world clustering often needs adaptive bandwidths that reflect observation density; a fixed 500 m circle will over‑cluster in dense urban parks and under‑cluster in remote areas.
- No support for temporal constraints – iNaturalist observations can span years. The current version ignores timestamps, so a 2015 sighting and a 2025 sighting that happen to be geographically close will be merged, which may not be desirable for longitudinal analyses.
- Single‑pass greedy algorithm – Because the first observation becomes the seed, the order of the input file influences the resulting clumps. A more robust approach would run a full DBSCAN‑style density‑based clustering or allow multiple passes to refine seed selection.
- Limited metadata propagation – Aside from IDs and coordinates, most observation fields (e.g., quality grade, user confidence, media URLs) are dropped in the output. Users who need richer context will have to re‑join the original dataset manually.
- Scalability – The implementation loads the entire dataset into memory and performs O(n²) distance checks in the worst case. For a typical power‑user export (a few thousand rows) this is acceptable, but larger projects (tens of thousands of records) will hit performance walls.
- No test suite or benchmarks – The release does not include unit tests or any performance numbers. Until those are added, it is hard to gauge reliability across edge cases such as observations on opposite sides of the International Date Line.
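The order-dependence limitation is easy to demonstrate. The sketch below (hypothetical code, reduced to 1-D positions in metres to keep the distance check trivial) feeds the same three points to a greedy pass in two different orders and gets a different number of clumps:

```python
def clump_ids(points, radius=500):
    """Greedy single-pass clumping over 1-D positions (metres),
    mirroring the first-come-first-served seed logic: a point joins
    the first clump whose seed is within the radius."""
    clumps = []  # each clump: (seed_position, [ids])
    for pid, pos in points:
        for seed, ids in clumps:
            if abs(pos - seed) < radius:
                ids.append(pid)
                break
        else:
            clumps.append((pos, [pid]))
    return [ids for _, ids in clumps]

# Three points 450 m apart, presented in two orders:
forward = clump_ids([("A", 0), ("B", 450), ("C", 900)])
middle_first = clump_ids([("B", 450), ("A", 0), ("C", 900)])
# forward → [["A", "B"], ["C"]] (C is 900 m from seed A)
# middle_first → [["B", "A", "C"]] (both neighbours are within 450 m of seed B)
```

Whichever point happens to come first becomes the seed, so the same dataset can yield one clump or two depending on file order.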
Practical use cases
- Blogging workflow – The author’s primary use case is to condense a personal observation feed into a tidy list of “hotspots” for a weekly blog post. The tool reduces the visual clutter of many photos taken at the same location.
- Pre‑processing for visualisation – Researchers who want to plot observation density on a map can use the clump centroids as a lightweight proxy for heat‑map generation.
- Data cleaning – Duplicate uploads (common when users add new photos to an existing observation) can be collapsed before downstream analysis.
Where to go from here
If you plan to adopt inaturalist-clumper in a production pipeline, consider the following steps:
- Add a density‑based algorithm – Fork the repo and replace the greedy radius check with scikit‑learn’s DBSCAN or HDBSCAN, which automatically adapts to varying point densities.
- Expose temporal filters – A `--max-age-days` flag would let you ignore stale observations when forming clumps.
- Stream processing – Refactor the code to read observations in chunks and use a spatial index (e.g., an R‑tree via `rtree` or `geopandas`) to keep memory usage low.
- Unit tests and CI – Contribute a minimal test suite that checks edge cases such as observations at the poles, crossing the antimeridian, and handling empty inputs.
- Documentation – The current README is brief; a more thorough guide with example commands, expected JSON schema, and a troubleshooting section would lower the barrier for non‑Python users.
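A density-based replacement along these lines is a few lines of scikit-learn. This is a sketch of the suggested approach, not code from the project; it assumes scikit-learn is installed and uses DBSCAN's built-in haversine metric, which expects coordinates in radians and an `eps` expressed as metres divided by the Earth's radius:

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000

def dbscan_clumps(coords_deg, radius_m=500, min_samples=1):
    """Density-based alternative to the greedy pass.

    coords_deg: (n, 2) array of (lat, lon) in degrees.
    Returns one cluster label per observation; with min_samples=1 every
    observation is assigned to a cluster (no noise points), and the
    grouping does not depend on which point is seen first.
    """
    coords_rad = np.radians(coords_deg)
    eps = radius_m / EARTH_RADIUS_M  # convert metres to radians on the sphere
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="haversine")
    return model.fit_predict(coords_rad)
```

With the same three test points as before, the two observations ~110 m apart share a label and the distant one gets its own, regardless of input order.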
Bottom line
inaturalist-clumper 0.1 is a functional, if simplistic, utility that solves a narrow problem: turning a flat list of iNaturalist observations into grouped clusters for personal publishing. It works, the code is openly available, and the author is clearly iterating based on real‑world usage. However, the tool is not yet ready for large‑scale data science projects without additional clustering logic, performance optimisations, and better handling of metadata. Users should treat it as a starting point rather than a finished product.