A genome is the genetic blueprint that determines an organism's characteristics. Genomic sequences are built from deoxyribonucleic acid (DNA) or, in the case of many viruses, ribonucleic acid (RNA), and manipulating these nucleic acids directly can lead to tangible changes in the organism.
As such, developments in genetic engineering focus on our ability to manipulate genomic sequences, but this is a daunting task. For example, precisely controlling a class of engineered RNA molecules called "toehold switches" can lend vital insight into cellular environments and potential diseases. However, previous experiments have shown that toehold switches are hard to engineer reliably: many do not perform as intended even though they were designed, using known RNA folding rules, to produce the desired output in response to a given input.
Considering this, two teams of researchers from the Wyss Institute at Harvard University and MIT have developed a set of machine learning algorithms that can improve this process. Specifically, they used deep learning to analyze a large volume of toehold switch sequences and accurately predict which toeholds perform their intended tasks reliably, allowing researchers to identify high-quality toeholds for their experiments. Their findings were published today in two separate papers in Nature Communications.
As with any machine learning problem, the first step was to collect domain-specific data on which to train the models. The researchers assembled a large dataset of toehold switch sequences. Alex Garruss, co-first author and a graduate student at the Wyss Institute, stated:
"We designed and synthesized a massive library of toehold switches, nearly 100,000 in total, by systematically sampling short trigger regions along the entire genomes of 23 viruses and 906 human transcription factors."
Because there were two separate teams, the researchers approached the problem with two different techniques. The authors of the first paper analyzed toehold switches not as sequences of bases but as 2D images of base-pair possibilities. This approach, called Visualizing Secondary Structure Saliency Maps, or VIS4Map, identified physical elements of the toehold switches that influenced their performance, providing insight into RNA folding mechanisms that had not been discovered using traditional analysis techniques.
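To make the "2D image" idea concrete, here is a minimal Python sketch of one way a sequence could be turned into a grid of base-pair possibilities. It is an illustration only, not the encoding used in the paper: it assumes that a cell is "on" whenever two positions could form a Watson-Crick (A-U, G-C) or wobble (G-U) pair, and the names `PAIRS` and `pairing_map` are hypothetical.

```python
# Assumption: a position pair is marked 1 if the two bases could pair
# (Watson-Crick A-U and G-C, plus the G-U wobble pair), else 0.
PAIRS = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def pairing_map(seq: str) -> list[list[int]]:
    """Return an n x n grid where cell [i][j] is 1 if bases i and j could pair."""
    rna = seq.upper().replace("T", "U")  # accept DNA-style input as well
    return [[1 if (a, b) in PAIRS else 0 for b in rna] for a in rna]


if __name__ == "__main__":
    for row in pairing_map("GAUC"):
        print(row)
```

The resulting grid is symmetric, and a model that reads images can scan it for diagonal stripes, which correspond to runs of consecutive bases that could fold into a stem.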
The authors of the second paper created two deep learning architectures that approached the challenge of identifying "susceptible" toehold switches using orthogonal techniques. The first model combined a convolutional neural network (CNN) with a multi-layer perceptron (MLP) and treated the toehold sequences as 1D images, or lines of nucleotide bases. Using an optimization technique called the Sequence-based Toehold Optimization and Redesign Model (STORM), it identified patterns of bases and potential interactions between those bases to mark the toeholds of interest.
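The "1D image" view can be sketched in a few lines of Python. The example below is a simplified illustration, not the STORM architecture: it assumes a one-hot encoding over the bases A, C, G and U, and slides a single hand-written filter along the sequence the way one convolutional filter would. The names `one_hot` and `conv1d` are hypothetical, and a real CNN would learn many filters from data instead of using a fixed one.

```python
BASES = "ACGU"


def one_hot(seq: str) -> list[list[float]]:
    """Encode each base as a 4-element vector with a single 1.0."""
    rna = seq.upper().replace("T", "U")
    return [[1.0 if base == b else 0.0 for b in BASES] for base in rna]


def conv1d(encoded: list[list[float]], kernel: list[list[float]]) -> list[float]:
    """Slide one filter over the sequence (no padding) and return its responses."""
    width = len(kernel)
    responses = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(len(BASES)):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        responses.append(total)
    return responses


if __name__ == "__main__":
    # A toy filter that responds most strongly to the motif "GG".
    gg_filter = one_hot("GG")
    print(conv1d(one_hot("AGGU"), gg_filter))
```

Each output value measures how well the window at that position matches the filter, which is how a convolutional layer picks up local patterns of bases.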
The second architecture cast the problem in the domain of natural language processing (NLP), treating each toehold sequence as a phrase made up of patterns of words. The task was then to train a model to combine these words, or nucleotide bases, into a coherent phrase. This model was integrated with the CNN-based model to create Nucleic Acid Speech (NuSpeak). This optimization technique redesigned the last nine nucleotides of a given toehold switch while keeping the remaining 21 nucleotides intact, which allowed for the creation of specialized toeholds that detect the presence of specific pathogenic RNA sequences and could be used to develop new diagnostic tests.
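The two ideas in this description, splitting a sequence into "words" and redesigning only the last nine nucleotides, can be illustrated with a short Python sketch. It is not the NuSpeak implementation. It assumes that words are overlapping k-mers, that the redesign is a simple greedy search over single-base substitutions, and that a scoring function (here the hypothetical `score` argument, standing in for a trained model) rates each candidate. The names `kmer_tokens` and `redesign_tail` are likewise hypothetical.

```python
from typing import Callable


def kmer_tokens(seq: str, k: int = 3) -> list[str]:
    """Split a sequence into overlapping k-mer "words" (k=3 is an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def redesign_tail(
    switch: str,
    score: Callable[[str], float],
    n_mutable: int = 9,
    alphabet: str = "ACGU",
) -> str:
    """Greedily mutate only the last `n_mutable` bases to raise `score`."""
    if not 0 < n_mutable <= len(switch):
        raise ValueError("n_mutable must be between 1 and len(switch)")
    fixed = switch[:-n_mutable]            # left untouched
    tail = list(switch[-n_mutable:])       # the only part that may change
    best = score(fixed + "".join(tail))
    improved = True
    while improved:                        # stop when no substitution helps
        improved = False
        for i in range(len(tail)):
            for base in alphabet:
                if base == tail[i]:
                    continue
                candidate = tail.copy()
                candidate[i] = base
                value = score(fixed + "".join(candidate))
                if value > best:
                    best, tail, improved = value, candidate, True
    return fixed + "".join(tail)


if __name__ == "__main__":
    # Toy score: count of G bases. A real pipeline would call a trained model.
    print(kmer_tokens("AUGGC"))
    print(redesign_tail("A" * 30, lambda s: s.count("G")))
```

With a 30-nucleotide input and the default `n_mutable=9`, the first 21 nucleotides are returned unchanged, matching the split described above.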
To test both models, the researchers used their optimized toehold switches to sense fragments of the genome of SARS-CoV-2, the virus that causes COVID-19. NuSpeak improved the sensors' performance by an average of 160%, while STORM created better versions of four SARS-CoV-2 viral RNA sensors, improving their performance by up to 28 times. Regarding these results, Katie Collins, co-first author of the second paper and an MIT student at the Wyss Institute, stated:
"A real benefit of the STORM and NuSpeak platforms is that they enable you to rapidly design and optimize synthetic biology components, as we showed with the development of toehold sensors for a COVID-19 diagnostic."
Diogo Camacho, a corresponding author of the second paper, a Senior Bioinformatics Scientist, and co-lead of the Predictive BioAnalytics Initiative at the Wyss Institute, stated:
“Perhaps the most important aspect of the tools we developed in these papers is that they are generalizable to other types of RNA-based sequences such as inducible promoters and naturally occurring riboswitches, and therefore can be applied to a wide range of problems and opportunities in biotechnology and medicine.”
Moving forward, as Camacho envisioned, the teams are looking to generalize their algorithms to other problems in synthetic biology, potentially accelerating the development of biotechnology tools.