Omic.ly Weekly 53

December 9, 2024

Hey There!

Thanks for spending part of your week with Omic.ly!


This Week's Headlines

1) Foundation models are all the rage today - Now there's one for genomes!

2) Epigenetics! You've heard of it, but what do you know about it?

3) Dolly the sheep stole our hearts in 1997


Here's what you missed in this week's Premium Edition:

HOT TAKE: With lawsuits percolating in the background, CAP, Congress and others start working on VALID 2.0

Upgrade Here

Or if you already have a premium sub:

Read It Here


EvoAI: a genome foundation model for building prokaryotes

Generative AI models stormed into our lives just a few short years ago.

And they've allowed us to prompt engineer them to create new images or text.

For example, a user could ask OpenAI's DALL-E image generator to create a picture of salmon in a stream and it might sometimes create images of what appear to be live fish swimming in water.

Other times it spits out pictures of salmon fillets floating through those same scenes, but those kinds of issues just come with the tokenized statistical territory of generative AI models.

These models work by breaking down images or text into smaller, more digestible units (tokens) and then learning how those components relate to one another.

They can then use the statistics generated during that learning to create a new image or some text based on what they've seen before.
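To make "tokens plus statistics" a bit more concrete, here is a deliberately tiny sketch in Python. It is not how DALL-E or any real foundation model works (those use neural networks, not lookup tables); it is just a toy that counts which word follows which and then samples from those counts. The corpus, the function names, and the word-level tokenizer are all made up for illustration.

```python
import random
from collections import Counter, defaultdict

def tokenize(text):
    """Split text into word tokens (real models use subword or image-patch tokens)."""
    return text.lower().split()

def learn_statistics(tokens):
    """Count how often each token is followed by each other token."""
    follows = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows

def generate(follows, start, length, seed=0):
    """Sample a new sequence one token at a time from the learned counts."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        options = follows.get(out[-1])
        if not options:  # dead end: this token was never followed by anything
            break
        words, counts = zip(*options.items())
        out.append(rng.choices(words, weights=counts, k=1)[0])
    return out

# Hypothetical training text: live fish and fillets both appear after "salmon".
corpus = "salmon swim in a stream . salmon fillets sit on a plate . trout swim in a lake ."
model = learn_statistics(tokenize(corpus))
print(" ".join(generate(model, "salmon", 8)))
```

Because "salmon" is followed by "swim" half the time and "fillets" the other half in this toy corpus, the sampler is just as happy to put fillets in the stream as live fish, which is the same failure mode described above.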

In the example above, a model trained on images of fish (the living kind and the sushi kind) could generate a wide variety of interesting images of "salmon" in a stream!

But the results of these models generally get better as you add context, so if you're really interested in seeing live fish swimming in a stream, be explicit!

"Brian, why did you just go on some weird tangent about AI model hallucinations?"

Because there's a new foundation model for prokaryotic genomes, and it too suffers from odd hallucinations!

But, if you understand how these models work, it is possible to prompt them to give you pictures of swimming fish, or in the case of a genome model, novel and functional protein complexes!

In this week's paper, researchers at the Arc Institute and Stanford developed Evo, a foundation model "trained on 2.7 million raw prokaryotic and phage genome sequences" (prokaryotes being bacteria and archaea).

The exact details of how the model was trained are a bit too math-y for my simple brain, but what makes this model unique is that it was trained at single-nucleotide resolution while still handling a long context length (131,000 base pairs).

The authors say this makes it better at predicting how things interact at a distance, which is important when you're developing a genome foundation model because lots of regulatory sequences act at a distance.

After some in silico benchmarking of their model they then got the genius idea to see if they could prompt it to generate new CRISPR proteins and guide RNAs.

But because they didn't want their model to spit out "fillets" when they were looking for "live fish," they fine-tuned it on the sequence context around CRISPR genes (about 8 kb), and the results of that experiment can be seen in the figure above:

a) Overview of a CRISPR protein-RNA complex
b) Overview of model fine-tuning on an 8 kb CRISPR-specific context
c) Shows that prompts mostly spit out what they're prompted to spit out!
d) Pre-training (bottom) creates more structures than no pre-training (top)
e) Positional entropy (randomness of sequence) for the best CRISPR-Cas the model generated, EvoCas9-1
f) Performance of the new CRISPR (right side - worked really well) compared with the old CRISPR (left)
g-j) Comparison of new CRISPR components to old CRISPR

The authors go on to state that "the EvoCas9-1 amino acid sequence shares 79.9% identity with the closest Cas9 in the database of Cas proteins used for model fine-tuning and 73.1% identity with SpCas9. Evo-designed sgRNA is 91.1% identical to the canonical SpCas9 sgRNA."
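If "percent identity" is a fuzzy term for you, it's just the share of aligned positions where two sequences have the same letter. Here's a minimal Python sketch of that arithmetic. It assumes the two sequences have already been aligned (gaps written as "-") and that gap columns count toward the denominator; alignment tools differ on that choice, and I don't know which convention the authors used. The two peptides are made-up toy strings, not real Cas9 sequences.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned columns where both sequences carry the same residue.

    Assumes the inputs are pre-aligned and equal in length, with '-' for gaps.
    Columns that are gaps in both sequences are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must be the same length")
    matches = 0
    columns = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue  # a shared gap tells us nothing about identity
        columns += 1
        if a == b:
            matches += 1
    if columns == 0:
        raise ValueError("no aligned columns to compare")
    return 100.0 * matches / columns

# Toy, invented peptides for illustration only.
toy_a = "MKRNYILGLD"
toy_b = "MKKNYSL-LD"
print(percent_identity(toy_a, toy_b))  # 7 matches out of 10 columns -> 70.0
```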

They also showed that they could use this model to generate novel transposon systems, and scale it up to create de novo prokaryotic genomes.

While they didn't validate the latter and said that these "genomes" definitely wouldn't be functional because they suffer from the same problem as other generative models, i.e., hallucinations, future refinements could allow Evo to scale to eukaryotic organisms (those whose cells have a nucleus, like ours).

###

Nguyen E, et al. 2024. Sequence modeling and design from molecular to genome scale with Evo. Science. DOI: 10.1126/science.ado9336


There are lots of ‘omes: There's the genome, transcriptome, proteome, and metabolome - but the one above them all is the epigenome.

Before I get there though, I need to introduce you to chromatin!

We all know about chromosomes, which we instantly picture as the highly compacted 'X' formations they take on during cell division, but chromosomes are actually made up of a supercomplex of DNA and protein that we refer to as chromatin.

Chromatin is made of millions of nucleosomes, each of which is composed of about 146 base pairs of DNA coiled roughly 1.67 times around a core complex of eight proteins called histones.
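For a feel of what "millions" means here, a quick back-of-the-envelope calculation in Python. Only the roughly 146 bp of wrapped DNA comes from the text above; the linker length and the genome size are my own round-number assumptions (a repeat of about 200 bp per nucleosome and a human diploid genome of about 6.2 billion bp), so treat the result as an order-of-magnitude estimate.

```python
def estimate_nucleosomes(genome_bp, repeat_bp):
    """Rough nucleosome count: genome length divided by DNA used per nucleosome."""
    if repeat_bp <= 0:
        raise ValueError("repeat length must be positive")
    return genome_bp // repeat_bp

CORE_BP = 146                      # DNA wrapped around the histone core (from the text)
LINKER_BP = 54                     # ASSUMPTION: spacer DNA, chosen to give a ~200 bp repeat
DIPLOID_GENOME_BP = 6_200_000_000  # ASSUMPTION: approximate human diploid genome size

count = estimate_nucleosomes(DIPLOID_GENOME_BP, CORE_BP + LINKER_BP)
print(f"{count:,} nucleosomes per cell (very roughly)")
```

Under those assumptions you land in the tens of millions of nucleosomes per cell.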

Now, chromatin can exist in a cell in two different formats: euchromatin (open and accessible for expression) and heterochromatin (closed and not accessible for expression).

"But what controls whether a region of the genome is accessible for expression or not?"

That's the epigenome!

In Greek, epi means 'upon' or 'above', and epigenetics is why regular old genetics isn't super straightforward.

While the raw sequence of the bases within our genome is important, when, where and how much of the different parts of a genome are expressed determines cellular function.

And what controls expression is whether the cellular machinery that converts DNA to RNA has access to the genome.

Epigenetics is the study of all of the alterations that impact gene expression without changing the DNA sequence itself.

The two most common of these changes that we know about involve DNA methylation and histone modification.

Methylation: Cytosine, one of the DNA bases, can be modified with the addition of a methyl group (CH3) to the C5 position. This typically acts as a repressor of transcription, meaning this mark can prevent the conversion of DNA into RNA, for example by keeping the proteins that start transcription from recognizing the region.

Histone Modifications: Histones, the protein core of the nucleosomes that make up chromatin, can also be modified. Histone acetylation (COCH3) is associated with open chromatin, while histone methylation can either open or compact chromatin depending on which part of the histone is modified.

But don't worry, this gets even more complicated, because the addition of those regulatory marks to DNA and histones can be controlled by cis and trans acting elements!

Cis acting elements: regions of non-coding DNA that regulate gene expression, i.e., binding sites for proteins.

Trans acting elements: proteins that bind to DNA to regulate gene expression.

In the context of epigenetics, cis acting DNA elements are recognized and recruit trans-acting proteins that can modify DNA and histones to regulate the expression of the genome. Cis and trans acting elements aren't limited to epigenetic controls of expression though - this is just one of their many functions.

Now I remember why I avoid writing about epigenetics.

Because it gets extra complicated!

How these elements function, what regions of the genome are open and closed, and which genes are expressed DIFFERS from cell type to cell type, and even from cell to cell, in our bodies.

Sequencing to the rescue?


6LL3, better known as Dolly the sheep, showed in 1997 that a mammal could be cloned from an adult cell.

So, why was a sheep the first choice for this ground-breaking work?

Well, they’re cheap, and Ian Wilmut and his team at the Roslin Institute in Scotland were interested in creating a mammalian platform for the development of biological pharmaceuticals to treat human disease.

The first step of that pioneering vision was to show that it was possible to clone a mammal!

Dolly was created by transferring the nucleus (contains the DNA!) of a mammary epithelial cell from a 6-year-old Finn Dorset ewe into an oocyte (fancy term for an egg) of a Scottish Blackface ewe that had been enucleated (had its nucleus removed).

The egg with the new nucleus was then implanted in a surrogate and Dolly was born!

Basically, Dolly is the 'clone' of the 6-year-old Finn Dorset sheep.

Although the authors do not use the word 'clone' anywhere at all in the original article.

But you may be wondering why this is such a big deal.

And the answer to that is because it’s really hard to clone animals!

Cells in adults have undergone differentiation – meaning, parts of their genomes have been deactivated so that the cells only produce the proteins required for their specific function within the body.

The researchers coaxed the donor cells into a resting (quiescent) state, transferred their nuclei into oocytes, where those deactivation marks could be 'reprogrammed', and then checked whether the resulting embryos could develop into live lambs.

Technically, 8 lambs were reported in this paper, but Dolly was the only one that was derived from adult tissue.

The figure above shows the microsatellite markers of the recipient ewes (the surrogates), the cells (embryo derived – SEC1, fetal-derived – BLW1, and mammary-derived – OME), and the lambs that were born.

6LL3 shares the same microsatellite markers as the mammary cells she was derived from, but not the same microsatellite markers as any of the surrogate ewes.
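The logic of that comparison is simple enough to sketch in a few lines of Python: a lamb "matches" a sample if it carries the same pair of alleles at every marker. The sample labels 6LL3 and OME come from the paper's figure, but the locus names and allele sizes below are invented placeholders, not the real data, and the function names are mine.

```python
def same_profile(sample_a, sample_b):
    """True if two samples carry the same alleles at every locus (order ignored)."""
    if sample_a.keys() != sample_b.keys():
        return False
    return all(sorted(sample_a[locus]) == sorted(sample_b[locus]) for locus in sample_a)

def matching_samples(lamb, candidates):
    """Names of the candidate samples whose marker profile matches the lamb's."""
    return [name for name, profile in candidates.items() if same_profile(lamb, profile)]

# HYPOTHETICAL allele sizes (in bp) at three made-up microsatellite loci.
lamb_6ll3 = {"locus_A": (128, 120), "locus_B": (201, 201), "locus_C": (88, 96)}
candidates = {
    "OME (mammary-derived cells)": {"locus_A": (120, 128), "locus_B": (201, 201), "locus_C": (88, 96)},
    "surrogate ewe": {"locus_A": (118, 128), "locus_B": (199, 205), "locus_C": (88, 90)},
}

print(matching_samples(lamb_6ll3, candidates))
```

With these placeholder numbers, only the mammary-derived cells match the lamb, which is the pattern the real figure shows for Dolly.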

Dolly is a bona fide copy!

And it should come as no surprise that the popular reporting on Dolly at the time highlighted how close humanity was to creating designer babies.

Or, that scientists should stop playing god.

But this work, along with earlier work in amphibians by John Gurdon, fueled the later discovery of induced pluripotent stem cells (iPSC) by Shinya Yamanaka and led to his subsequent Nobel Prize in 2012.

The scientific value of cellular reprogramming and proof that it could be done in mammals (Dolly!) was a tremendous first step in using this and similar techniques to help us understand the complexities of our genomes, how they function, and, ultimately, the usefulness of cloning (and stem cells!) for the development of treatments for human disease.

###

Wilmut I et al. 1997. Viable offspring derived from fetal and adult mammalian cells. Nature. DOI: 10.1038/385810a0


Were you forwarded this newsletter?

LOVE IT.

If you liked what you read, consider signing up for your own subscription here:

Subscribe to Omic.ly