Researchers from Nvidia, Microsoft, and Basecamp Research used AI to analyze genomic data from over 1 million species, generating potential new gene editing tools and drug therapies including AI-designed enzymes.

An international research consortium including Nvidia, Microsoft, and London-based Basecamp Research has applied machine learning to genomic data from over 1 million species, generating potential new pathways for gene editing and drug development. The project, described in research that has not yet been published, used Basecamp's large-scale genomic database to train models that generated novel enzyme designs targeting cancers and antibiotic-resistant superbugs.
What's Claimed
According to project documentation, the team applied transformer-based neural networks to predict protein structures and functions across evolutionary boundaries. The AI systems reportedly identified previously unknown enzymatic functions across microbial, plant, and animal genomes, with a particular focus on CRISPR-Cas variants for gene editing and on antimicrobial compounds. Nvidia's involvement centered on accelerating the protein-folding predictions using its BioNeMo framework and GPU clusters.
Technical Approach
The methodology combines three key innovations:
- Dataset Scale: Basecamp's database aggregates genomic sequences from extreme environments (deep-sea vents, arctic ice, tropical rainforests) covering organisms not found in public repositories like GenBank
- Multi-Modal Modeling: Researchers combined protein language models with geometric deep learning to predict how hypothetical enzyme designs would interact with biological targets
- Active Learning: The system prioritized sampling from genomic regions showing high evolutionary divergence, based on the hypothesis that extremophiles might yield novel protein functions
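The prioritization step in the active-learning item can be illustrated with a minimal Python sketch. It is not the consortium's code: the `Region` type, the `divergence` score, the region names, and the sampling `budget` are all hypothetical, and a full active-learning loop would also retrain the model and re-score regions after each round.

```python
from dataclasses import dataclass


@dataclass
class Region:
    region_id: str     # hypothetical identifier for a genomic region
    divergence: float  # hypothetical score in [0, 1]; higher means more divergent


def prioritize(regions, budget):
    """Return the `budget` most divergent regions, highest score first."""
    ranked = sorted(regions, key=lambda r: r.divergence, reverse=True)
    return ranked[:budget]


# Illustrative values only; none of these come from the project.
regions = [
    Region("vent_A", 0.91),
    Region("soil_B", 0.12),
    Region("ice_C", 0.67),
    Region("forest_D", 0.45),
]

selected = prioritize(regions, budget=2)
print([r.region_id for r in selected])
```

Ranking by a single score is the simplest possible policy; a real system would likely balance divergence against sequence quality and sampling cost.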
Early wet-lab validation confirmed that several AI-designed enzymes could catalyze reactions that existing databases suggested were biologically implausible. One enzyme variant showed unexpected efficiency in breaking down polyethylene terephthalate (PET) plastics under low-temperature conditions.
Limitations and Challenges
Despite promising in vitro results, significant hurdles remain:
- Functional Validation: Only 12 of 1,800 AI-generated enzyme designs showed measurable biological activity in initial tests
- Computational Cost: Training the largest models required approximately 3.2 million GPU hours on Nvidia H100 clusters
- Biological Complexity: The models struggle with post-translational modifications and protein behavior in living systems
- Scalability: Current infrastructure can only process 7% of the database's full genomic sequences
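To put the first two figures in perspective, the short Python calculation below works out the hit rate implied by 12 active designs out of 1,800, and the wall-clock time that 3.2 million GPU hours would represent. The cluster size of 1,000 GPUs is an assumption chosen for illustration; the source does not say how many GPUs were used.

```python
# Figures reported in the article.
active_designs = 12
total_designs = 1_800
gpu_hours = 3_200_000

# Assumption for illustration only: the cluster size is not given in the source.
assumed_gpus = 1_000

hit_rate = active_designs / total_designs         # fraction of designs with activity
designs_per_hit = total_designs / active_designs  # designs screened per active enzyme
wall_clock_days = gpu_hours / assumed_gpus / 24   # days if all GPUs ran in parallel

print(f"Hit rate: {hit_rate:.2%}")
print(f"Designs screened per active enzyme: {designs_per_hit:.0f}")
print(f"Wall-clock time on {assumed_gpus} GPUs: {wall_clock_days:.0f} days")
```

Under that assumption, roughly 150 designs must be synthesized and tested for each one that shows activity, and training would occupy the cluster for about four and a half months.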
Microsoft researchers noted in supplementary materials that the protein-folding accuracy dropped from 92% on known structures to 74% when predicting entirely novel protein folds absent from training data.
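The size of that drop depends on how it is measured. The Python snippet below computes the absolute drop, the relative drop, and the change in error rate from the two reported figures; it assumes both percentages use the same accuracy metric, which the source does not specify.

```python
# Accuracy figures reported in the article, in percent.
known_accuracy = 92
novel_accuracy = 74

absolute_drop = known_accuracy - novel_accuracy  # percentage points
relative_drop = absolute_drop / known_accuracy   # fraction of the original accuracy
error_known = 100 - known_accuracy               # error rate on known structures
error_novel = 100 - novel_accuracy               # error rate on novel folds
error_ratio = error_novel / error_known          # how many times more errors

print(f"Absolute drop: {absolute_drop} percentage points")
print(f"Relative drop: {relative_drop:.1%}")
print(f"Error rate: {error_known}% -> {error_novel}% ({error_ratio:.2f}x)")
```

Seen as an error rate, the change is from 8% to 26%, more than a threefold increase on novel folds.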
Practical Implications
The most immediate application may be in enzyme engineering for industrial biotechnology rather than human therapeutics. Basecamp has partnered with synthetic biology firms to develop plastic-degrading enzymes based on the AI-generated designs. For drug discovery, the approach could accelerate the target-identification phase but faces the same clinical trial bottlenecks as traditional methods. The database architecture itself may prove more valuable than the initial therapeutic candidates, providing a template for future biodiversity-based AI training.
This project exemplifies the emerging paradigm of 'evolutionary AI' – using nature's diversity as training data rather than purely synthetic datasets. However, the translation from computational prediction to clinical application remains uncertain, with typical drug development timelines suggesting any therapies would take 7-10 years to reach market if successful.
