SandboxAQ and NVIDIA Create Digital Super Brain With 5 Million Synthetic Molecules To Discover New Drugs

Lidi Garcia
Jun 20

SandboxAQ, an NVIDIA-backed AI company, has created a massive dataset called SAIR to help scientists discover new drugs faster. The database contains millions of computer simulations that show how potential drugs might bind to proteins in the body. With this, AI can quickly predict which drugs are most likely to work, saving years of testing and a lot of resources. This promises to speed up the creation of new treatments for a variety of diseases.

Discovering new drugs is a time-consuming, expensive, and extremely complex process. Before a drug can be tested in people, scientists need to figure out whether the drug molecule binds correctly to the target protein in the human body.

This binding is essential because it determines whether the drug will be able to stop a biological process related to a disease. To check it, researchers typically follow a sequence of steps like this:

1- First, they need to know the 3D structure of the protein, because its shape helps them understand how it works. Determining that structure usually requires expensive and time-consuming experiments.

2- Then, on the computer, they test thousands of candidate molecules that could become medicines.

3- They simulate how each molecule could fit into the protein, using a technique called “docking”, which shows the position and orientation of the molecule at the site where it attaches.

4- Finally, they calculate how well this molecule can attach to the protein, using models based on physics or artificial intelligence.

This process is repeated until they find a molecule that binds strongly enough to become a possible new medicine. From there, this molecule undergoes new tests and moves on to drug development.
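
The four steps above amount to a screening loop. The sketch below is a minimal illustration in Python, not SandboxAQ's code: `dock` and `score` are hypothetical stand-ins for a real docking program and a real physics- or AI-based scoring model, and the toy functions and numbers exist only so the loop runs.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical type: a "pose" here is just a label; a real pose would hold 3D coordinates.
Pose = str


def screen(
    protein: str,
    candidates: Iterable[str],
    dock: Callable[[str, str], Pose],
    score: Callable[[str, Pose], float],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Return (molecule, score) pairs that meet the threshold, strongest first."""
    hits = []
    for molecule in candidates:          # step 2: go through the candidate molecules
        pose = dock(protein, molecule)   # step 3: place the molecule in the binding site
        strength = score(protein, pose)  # step 4: estimate how well it binds
        if strength >= threshold:
            hits.append((molecule, strength))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy stand-ins so the example runs; the scores are invented, not real measurements.
TOY_SCORES = {"mol_A": 4.2, "mol_B": 7.8, "mol_C": 6.1}


def toy_dock(protein: str, molecule: str) -> Pose:
    return molecule  # pretend the pose is identified by the molecule name


def toy_score(protein: str, pose: Pose) -> float:
    return TOY_SCORES[pose]


hits = screen("target_protein", TOY_SCORES, toy_dock, toy_score, threshold=6.0)
print(hits)
```

In a real project each call to `dock` and `score` can be costly, which is why replacing them with one fast prediction is attractive.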

For some time now, scientists have been looking for a way to make this process much faster and more straightforward, by creating artificial intelligence models that can predict the potency of a drug right away, based solely on information about the protein and the molecule, without even having to generate the 3D structure first.

To help speed up this crucial phase of research, SandboxAQ, an artificial intelligence startup that spun out of Google and is supported by NVIDIA, has launched an innovative dataset called SAIR (Structurally Augmented IC50 Repository).

The idea is to compress steps that used to take a long time into a single, quick prediction, made with the help of artificial intelligence.

SAIR is a gigantic database with more than 5.2 million pairs of three-dimensional structures of proteins and potential drugs (small molecules). These structures were created using computers, but are based on real data from scientific experiments.

The goal is to enable scientists to use artificial intelligence to quickly predict whether a drug will bind to the protein they are studying. This is something that previously required several time-consuming and expensive lab tests. By speeding up this step, scientists can save resources and move faster toward finding new treatments.

How was the data created? Rather than generating this information in the lab, which would take years, SandboxAQ used powerful NVIDIA computer chips to simulate the interactions between proteins and drugs.

They created several different “poses” (positions and orientations in which the drug molecule could sit in the protein) and calculated the strength of that binding using real experimental data as a basis. This technique generated synthetic (i.e., computer-generated) data that is extremely close to what would be observed in real experiments.
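
One way to picture a single entry is as a protein-molecule pair that carries several candidate poses plus the measured potency. The layout below is an assumption made for illustration only: the class and field names (`Entry`, `Pose`, `confidence`, `ic50_nm`) and all values are hypothetical and do not reflect SAIR's actual file format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pose:
    pose_id: int
    confidence: float  # hypothetical model confidence, between 0 and 1


@dataclass
class Entry:
    protein_id: str
    molecule_id: str
    poses: List[Pose]
    ic50_nm: float  # measured potency in nanomolar (placeholder value here)


def best_pose(entry: Entry) -> Pose:
    """Pick the pose the model is most confident in."""
    return max(entry.poses, key=lambda pose: pose.confidence)


entry = Entry(
    protein_id="protein_X",
    molecule_id="molecule_Y",
    poses=[Pose(0, 0.61), Pose(1, 0.88), Pose(2, 0.74)],
    ic50_nm=50.0,
)
print(best_pose(entry).pose_id)  # prints 1
```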

This work was made possible by a model called Boltz-1x, which specializes in predicting how proteins and small molecules fold and fit together in 3D, with accuracy reported to reach up to 94%.

Examples of 3D co-folded protein-drug complexes found in the SAIR release. Credit: SandboxAQ.

To better understand what the SAIR dataset offers, let’s start by looking at the figure above, which shows three examples. In the image, you see shapes that look like coiled, colored ribbons; these shapes represent proteins that exist in the human body. The smaller parts highlighted by a kind of gray “cloud” are the drug molecules (or pharmaceuticals) that are “attached” to these proteins. The SAIR dataset provides these predicted protein structures and shows exactly where and how the drug molecules bind, which is what we call the “pose.” In addition, SAIR includes experimental data on the “potency” of the drug, that is, how strongly it can bind to the protein. This information, about how the drug fits and how well it binds, is essential in the search for new drugs.
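
The potency in SAIR's name is IC50: the concentration of a compound needed to cut a target's activity in half, so a lower IC50 means a more potent compound. Because IC50 values span many orders of magnitude, they are often put on a logarithmic scale, pIC50 = -log10(IC50 in mol/L). Here is a short sketch of that conversion; the function name is ours for illustration and is not taken from SAIR.

```python
import math


def pic50_from_nanomolar(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 mol/L, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm)
    return 9.0 - math.log10(ic50_nm)


print(round(pic50_from_nanomolar(1.0), 3))     # 1 nM
print(round(pic50_from_nanomolar(1000.0), 3))  # 1000 nM, i.e. 1 micromolar
```

On this scale a higher number means a more potent compound, and each step of 1 corresponds to a tenfold change in IC50.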

With SAIR, scientists can train artificial intelligence models to predict in a matter of minutes or hours what could previously take months or years. This means that the development of new drugs could become much faster, and the cost of testing drug candidates should fall significantly.

In the future, it is hoped that these models will even be able to create new molecules from scratch, rather than testing them one by one against large databases.

In addition, by publicly releasing this dataset, SandboxAQ allows other researchers around the world to benefit from this innovation, while the company also offers its own models for commercial use.

The expectation is that these tools will produce results that are as reliable as laboratory tests, but in a virtual, cheaper and much faster way.

Interesting Engineering

https://interestingengineering.com/innovation/5-million-ai-drug-structures-sandboxaq

Reuters, “Nvidia-backed AI startup SandboxAQ creates new data to speed up drug discovery,” by Stephen Nellis

https://www.reuters.com/business/healthcare-pharmaceuticals/nvidia-backed-ai-startup-sandboxaq-creates-new-data-speed-up-drug-discovery-2025-06-18/

SandboxAQ

https://www.sandboxaq.com/post/sair-the-structurally-augmented-ic50-repository