How Ginkgo Bioworks is teaching AI to speak DNA and transform bioengineering
Last year Ginkgo Bioworks announced a major AI partnership with Google Cloud to apply large language models to biology, especially focused on proteins.
During a presentation at Google, Ginkgo’s Chief Scientific Officer Barry Canton described how their team is teaching AI how to speak DNA.
Key takeaways from Barry’s talk
Why apply AI to biology: Because AI handles complexity well, and biology is complex.
The hope is that AI will reduce the amount of experimentation required, making it cheaper and faster for Ginkgo to deliver cell programs to customers. This could accelerate drug discovery and many other applications.
In particular, large language models make sense for biology because there are similarities between natural human language and the biological “language” of DNA.
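One way to picture the language analogy is to split a DNA sequence into short fixed-length "words" (k-mers) and map each to an integer ID, much as text is tokenized before it is fed to a language model. The sketch below is purely illustrative and is not a description of Ginkgo's or Google's method: the function names `kmer_tokens` and `build_vocab`, the choice of k = 3, and the example sequence are all assumptions made for this example.

```python
from collections import Counter


def kmer_tokens(sequence: str, k: int = 3, stride: int = 3) -> list[str]:
    """Split a DNA string into k-length tokens (hypothetical helper).

    With k == stride == 3 the tokens are non-overlapping triplets;
    a stride of 1 gives overlapping k-mers instead.
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Assign integer IDs to tokens, most frequent first, ties alphabetical."""
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(ordered)}


# Made-up example sequence, not real data from the talk.
dna = "ATGGCCATTGTAATGGGCCGC"
tokens = kmer_tokens(dna)
vocab = build_vocab(tokens)
ids = [vocab[tok] for tok in tokens]

print(tokens)  # the DNA "sentence" as a list of 3-letter "words"
print(ids)     # the same sentence as integer IDs a model could consume
```

Real biological language models may use very different tokenization schemes (single nucleotides, amino acids, or learned subword units); the point here is only that a sequence of letters can be treated like a sentence.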
Biology is fundamentally programmable, creating opportunities for bioengineering platforms to impact multiple markets
The common genetic code of living things is what enables Ginkgo to be a platform company serving Therapeutics, Industrial Biotech, and Agriculture.
If you get really good at manipulating that code in one context, that expertise translates into other areas as well. The same enzyme classes show up in many different commercial programs.
Yet software alone is not enough—experimental data is essential to success in biology
Data from physical experiments is still crucial because biology is too complex to be predictable yet. When we make a genetic change, we can’t predict what will happen; we must observe the results in experiments.
Physical experiments are expensive and slow. The hope is that AI may help reduce the time and cost of these experiments.
Ginkgo’s large dataset benefits from economies of scale: the data grows with every partnership, allowing the company to deliver better results to each new partner.
Ginkgo’s library of metagenomic, wild-type gene sequences contains over 2 billion genes vs. 246 million genes in public databases. This amount has more than doubled since 2018.
In particular, AI is revolutionizing protein science and design.
Machine learning (e.g., AlphaFold) combined with the Protein Data Bank has led to disruptive developments.
AI and large datasets are powering a revolution in science because they are expanding the opportunity space of questions and solutions. Barry quoted Brian VanDahl, Head of Global Research Technologies at Novo Nordisk:
“Science is currently undergoing a revolution, driven by scientists being able to ask bigger questions by the combination of expanded data sets and the groundbreaking tools to decipher these. Large scale datasets coupled with AI is opening up a greater opportunity space within biology – we no longer have to limit ourselves to the questions that can be addressed by traditional methods.” -Brian VanDahl
There is a migration from population-level cell data toward single-cell data, which has the potential to provide more useful results.
Barry is excited about increasing amounts of cellular-level data, for example high-content imaging at the single-cell level.
For more, see the Nov 2023 Turing Post interview with Barry Canton: “Creating large language model that "speaks DNA"” (turingpost.com)