Why macOS Is Underrepresented in Public AI Research Datasets

Startups Reporter

MacPaw Research reveals the significant gap in macOS representation in public AI datasets and introduces GUIrilla, a new framework to address this issue.

Computer-use AI has moved from research demos to the mainstream over the past year, with major AI labs now shipping products that explicitly support desktop control on macOS. Gartner predicts that 40% of enterprise applications will embed task-specific AI agents by the end of 2026, up from less than 5% in 2025. This category is no longer experimental. What remains less visible is that almost none of the open research data behind these systems comes from macOS. {{IMAGE:2}}

The problem lies in how computer-use AI learns. These systems learn by watching what happens on screens, with training data typically consisting of screenshots, accessibility metadata, and reasoning explanations paired with tasks. Models trained on this data learn what buttons look like, how windows behave, how applications connect their screens, and what sequences of actions accomplish specific goals. The more diverse and representative the data, the better the resulting agent performs across real applications.
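To make the shape of this training data concrete, here is a minimal Python sketch of what one sample might look like. The class names, field names, and example values are illustrative assumptions, not the schema of any particular dataset.

```python
from dataclasses import dataclass, field


@dataclass
class UIElement:
    """One interactive element taken from accessibility metadata (hypothetical fields)."""
    role: str                          # e.g. "button", "menu_item", "text_field"
    label: str                         # human-readable name of the element
    bbox: tuple[int, int, int, int]    # x, y, width, height in screenshot pixels


@dataclass
class TrainingSample:
    """A task paired with a screenshot, element metadata, reasoning, and an action."""
    task: str
    screenshot_path: str
    elements: list[UIElement] = field(default_factory=list)
    reasoning: str = ""
    action: str = ""


def clickable_labels(sample: TrainingSample) -> list[str]:
    """Return the labels of elements an agent could plausibly click."""
    clickable_roles = {"button", "menu_item"}
    return [e.label for e in sample.elements if e.role in clickable_roles]


# Made-up example values, for illustration only.
sample = TrainingSample(
    task="Open the Settings window",
    screenshot_path="screens/0001.png",
    elements=[
        UIElement("menu_item", "Settings", (40, 60, 180, 22)),
        UIElement("text_field", "Search", (300, 10, 200, 24)),
    ],
    reasoning="Settings is reachable from the application menu.",
    action="click(menu_item='Settings')",
)

print(clickable_labels(sample))  # ['Settings']
```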

The issue is that the open research datasets the field relies on are heavily skewed toward Windows and Android. In an analysis of OS-ATLAS, one of the largest publicly available synthetic datasets for computer-use AI with over 13 million GUI elements across multiple platforms, macOS accounts for just 0.06% of all samples. That is not a typo. Out of every ten thousand interface samples in the dataset, only six are from Mac.
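As a rough back-of-the-envelope check, the same share can be turned into an absolute count. The sketch below assumes a round 13 million elements (the dataset is described as having "over 13 million") and assumes the 0.06% share applies to that element count, so the result is an approximation rather than a reported figure.

```python
# Assumption: rounded down from "over 13 million GUI elements".
TOTAL_ELEMENTS = 13_000_000
# 0.06% expressed as samples per ten thousand.
MACOS_PER_TEN_THOUSAND = 6

# Integer arithmetic avoids floating-point rounding noise.
macos_elements = TOTAL_ELEMENTS * MACOS_PER_TEN_THOUSAND // 10_000
print(macos_elements)  # 7800
```

Under those assumptions, that is on the order of eight thousand macOS elements out of thirteen million.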

The reason for this disparity is primarily technical. macOS does not expose its application interfaces in the same way Windows or Android do. The accessibility APIs that exist are powerful, but working with them at scale requires specialist platform knowledge. Tooling to automate this kind of collection has not existed in any practical, open-source form.

This matters in 2026 because products are now shipping to Mac users at scale. Recent benchmarks such as OSWorld show major progress on computer-use task completion, yet the same agent running against macOS-specific workflows works with a fraction of the underlying knowledge. Closing that gap requires more and better open data.

To address this issue, MacPaw Research has developed GUIrilla, a framework born from its work in human-computer interaction. The team built three components and open-sourced all of them.

The first is GUIrilla itself, an automated system that installs macOS applications, navigates through them screen by screen, and maps everything it finds without any human annotation. The framework produces a graph-based representation of an application: which screens exist, how they connect, what interactive elements live on each, and what actions transition between them. The full implementation is available on GitHub.

The second component is GUIrilla-Task, the dataset that came out of running the framework at scale. It contains 27,171 tasks across 1,108 macOS applications, each paired with screenshots and structured interface data. The team believes it is the largest publicly available dataset of Mac app interactions released to date. It is hosted on Hugging Face, free to use under permissive terms.

The third, and probably the most practical for the broader developer community, is macapptree. This is a small Python library that lets any developer or researcher extract the accessibility metadata of any Mac application in a clean, readable format. That metadata covers buttons, menus, text fields, view hierarchies, and how screens connect: the same structural layer that Apple originally built for screen readers, exposed in a format that AI systems and developers can actually work with. It requires no specialist Mac platform knowledge to use. The code is available on GitHub.
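Once an accessibility tree has been extracted, working with it is ordinary tree traversal. The sketch below does not call macapptree and does not reproduce its output schema; it assumes a nested dictionary with hypothetical "role", "name", and "children" keys, using role strings in the style of the macOS accessibility API.

```python
from typing import Iterator


def iter_elements(node: dict, depth: int = 0) -> Iterator[tuple[int, dict]]:
    """Yield (depth, node) pairs in depth-first, top-to-bottom order."""
    yield depth, node
    for child in node.get("children", []):
        yield from iter_elements(child, depth + 1)


def find_by_role(tree: dict, role: str) -> list[str]:
    """Collect the names of all elements with the given role."""
    return [n.get("name", "") for _, n in iter_elements(tree) if n.get("role") == role]


# Hand-written example tree; keys and values are assumptions for illustration.
tree = {
    "role": "AXWindow",
    "name": "Calculator",
    "children": [
        {
            "role": "AXGroup",
            "name": "Keypad",
            "children": [
                {"role": "AXButton", "name": "7"},
                {"role": "AXButton", "name": "+"},
            ],
        },
        {"role": "AXStaticText", "name": "0"},
    ],
}

print(find_by_role(tree, "AXButton"))  # ['7', '+']

# Print an indented outline of the hierarchy.
for depth, node in iter_elements(tree):
    print("  " * depth + f'{node["role"]}: {node["name"]}')
```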

For researchers training computer-use agents, GUIrilla-Task serves as a drop-in expansion of macOS coverage in any existing computer-use training pipeline. Combined with existing datasets like OS-ATLAS or AndroidWorld, it provides the macOS slice that has been missing. For researchers building UI-understanding benchmarks, the dataset includes both screenshots and structured accessibility data, supporting both vision-based and structure-based models.

For developers building anything that needs to programmatically understand a Mac application, macapptree offers the lightest-weight option available. The original paper includes practical examples of using it for screen representation, vision-based accessibility generation, and UI search use cases.

The performance of computer-use AI on Mac is a research problem before it is a product problem. The models that ship in consumer and enterprise products are downstream of the data and tooling that exist in the open research community. If macOS continues to be underrepresented in that research, the agents that operate on Mac will continue to lag those that operate on Windows and Android, regardless of how good the underlying models become.

The broader shift the industry is calling Software 3.0 is, in practice, the shift to systems where AI agents take actions on behalf of users rather than only chatting with them. That shift cannot happen well on Mac without open, high-quality data about how Mac applications actually work.

GUIrilla, GUIrilla-Task, and macapptree represent MacPaw's contribution to making this possible. Everything is open-source under permissive licenses, with the paper available on arXiv and the full collection of datasets and models living on the MacPaw Research page on Hugging Face.

MacPaw Research is the research unit of MacPaw, a global technology company founded in Kyiv, Ukraine, with offices in Boston, MA and the EU. Its core focus is deep and applied research in Local LLM Inference and AI Memory, with broader directions such as human-computer interaction also in scope.
