Stanford researchers present HyenaDNA: a long-range genomic foundation model with context lengths of up to 1 million tokens at single-nucleotide resolution

https://hazyresearch.stanford.edu/blog/2023-06-29-hyena-dna

In recent years, rapid advances in Artificial Intelligence (AI) have shown the potential to transform industries and push the boundaries of what is possible. One area that has garnered significant attention is the development of more robust and efficient models for natural language tasks. Researchers are constantly striving to build models that can handle longer inputs, since the number of tokens a model can process determines how much context it can take into account when interpreting a sequence. However, most work on long-context models has focused on natural language, leaving a significant oversight in a field that inherently deals with long sequences: genomics, the study of an organism's genetic material, including its structure, evolutionary elements, and regulatory features. Following the approach taken in natural language modeling, researchers have proposed foundation models (FMs) for genomics that capture generalizable features from unstructured genome data. These FMs can then be fine-tuned for downstream tasks such as gene localization and identification of regulatory elements.

However, existing genomic models based on the Transformer architecture face unique challenges when applied to DNA. One limitation is the quadratic scaling of attention with sequence length, which restricts the modeling of long-range interactions within DNA. Furthermore, prevailing approaches rely on fixed k-mer tokenizers to aggregate DNA into meaningful units, often losing single-nucleotide detail in the process. Unlike in natural language, this loss of resolution is critical: even a single-base genetic variant can have a profound impact on protein function. Hyena, a recently introduced attention-free architecture built on implicit convolutions, has emerged as a promising alternative to attention-based models. It has demonstrated quality comparable to attention while supporting longer context lengths at significantly lower computational cost. Inspired by these findings, a team of researchers from Stanford and Harvard University set out to investigate whether Hyena's capabilities could be harnessed to capture both the long-range dependencies and the single-nucleotide detail needed to analyze genomic sequences.
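The difference between fixed k-mer tokenization and single-nucleotide tokenization can be sketched in a few lines. The sequences and the SNP position below are illustrative examples, not data from the paper:

```python
def kmer_tokenize(seq, k=3):
    """Split a DNA sequence into non-overlapping k-mers (the prevailing approach)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

def nucleotide_tokenize(seq):
    """One token per base: the single-nucleotide resolution HyenaDNA uses."""
    return list(seq)

ref = "ATGCGT"
alt = "ATGCAT"  # hypothetical single-base variant at position 4 (G -> A)

# k-mer view: the variant swallows a whole 3-mer token ('CGT' -> 'CAT'),
# so the model cannot tell which base inside the window actually changed.
print(kmer_tokenize(ref))  # ['ATG', 'CGT']
print(kmer_tokenize(alt))  # ['ATG', 'CAT']

# single-nucleotide view: exactly one token differs, at the exact position.
diff = [i for i, (a, b) in enumerate(zip(nucleotide_tokenize(ref),
                                         nucleotide_tokenize(alt)))
        if a != b]
print(diff)  # [4]
```

The sketch shows why k-mer tokenization blurs single-base variants while per-nucleotide tokens localize them precisely.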

This led to the development of HyenaDNA, a genomic FM with an unprecedented ability to process context lengths of up to 1 million tokens at the single-nucleotide level, a 500-fold increase over existing attention-based models. Harnessing Hyena's long-range capabilities, HyenaDNA exhibits strong scalability, training up to 160 times faster than FlashAttention-equipped Transformers. HyenaDNA uses a stack of Hyena operators as the basis for modeling DNA and its complex interactions. The model is pre-trained with unsupervised learning to capture the distribution of DNA sequences, learning how genes are encoded and how non-coding regions perform regulatory functions in gene expression. It performs exceptionally well on several challenging genomic tasks, such as long-range species classification, and achieves best-in-class results on 12 of 17 datasets compared to the Nucleotide Transformer while using significantly fewer parameters and less pre-training data.
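The unsupervised pre-training described above amounts to causal language modeling over nucleotides: the model reads a sequence and predicts the next base at every position. A minimal sketch of how such training pairs are constructed (the vocabulary mapping and helper name are illustrative assumptions, not the paper's code):

```python
# Hypothetical nucleotide-to-id mapping; real implementations also include
# special tokens (padding, unknown bases such as 'N', etc.).
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

def make_training_pair(seq):
    """Causal LM on DNA: inputs are seq[:-1], targets are seq[1:],
    so the model learns to predict each next nucleotide."""
    ids = [VOCAB[base] for base in seq]
    return ids[:-1], ids[1:]

x, y = make_training_pair("ATGC")
print(x)  # [0, 3, 2]  -> A, T, G
print(y)  # [3, 2, 1]  -> T, G, C (each target is the following base)
```

With million-token contexts, each such pair lets the model relate a nucleotide to regulatory elements that may sit hundreds of thousands of bases away.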


As mentioned above, HyenaDNA reaches a context length of up to 1 million tokens during pre-training, allowing it to capture long-range dependencies within genomic sequences. Modeling capability is further enhanced by single-nucleotide tokenization, with global context available at every layer. To address training instability and speed up training, the researchers introduced a sequence length warm-up scheduler, which yielded a 40% reduction in training time on the species classification task. Another significant advantage of HyenaDNA is its parameter efficiency: the researchers observe that with longer sequences and a small vocabulary, HyenaDNA achieves superior quality despite being significantly smaller than previous genomic FMs.
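A sequence length warm-up scheduler of the kind described can be sketched as a schedule that starts training on short sequences and grows them toward the full context. The starting length, doubling rule, and steps-per-stage below are illustrative assumptions, not the paper's actual hyperparameters:

```python
def sequence_length_at(step, start_len=64, max_len=1_048_576,
                       steps_per_stage=1000):
    """Double the training sequence length every `steps_per_stage` steps,
    capped at the model's full context (here, ~1M tokens).

    Starting short stabilizes early training; later stages expose the
    model to progressively longer-range dependencies."""
    stage = step // steps_per_stage
    return min(start_len * (2 ** stage), max_len)

for step in (0, 1000, 2500, 100_000):
    print(step, sequence_length_at(step))
# 0 -> 64, 1000 -> 128, 2500 -> 256, 100000 -> 1048576 (capped)
```

In practice the training loop would re-batch its data at each stage boundary to the new sequence length.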

The researchers evaluated HyenaDNA on several downstream tasks. On the GenomicBenchmarks suite, the pre-trained models set new state-of-the-art (SOTA) performance across all eight datasets, significantly outperforming previous approaches. On the Nucleotide Transformer benchmarks, HyenaDNA achieved SOTA results on 12 of 17 datasets with significantly fewer parameters and less pre-training data. To explore the potential of in-context learning (ICL) in genomics, the researchers also conducted a series of experiments. They introduced soft prompt tokens, which let the input steer the output of a frozen pre-trained HyenaDNA model without updating model weights or attaching a decoding head; increasing the number of soft prompt tokens markedly improved accuracy on the GenomicBenchmarks datasets. The model also performed strongly on long-range tasks: it competed effectively against BigBird, a SOTA sparse Transformer, on a challenging chromatin profile prediction task, and on long-range species classification it maintained strong results as the context length was scaled to 450k and 1 million tokens.
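The soft prompt idea above can be sketched as learnable vectors prepended to the input embeddings of a frozen model, so that only those few vectors are tuned per task. Everything below (dimensions, the stand-in embedding table, names) is a hypothetical illustration of the mechanism, not the paper's implementation:

```python
import random

random.seed(0)
EMBED_DIM = 8   # illustrative embedding width
N_PROMPT = 4    # number of soft prompt tokens (a tunable hyperparameter)

# The learnable part: N_PROMPT small random vectors. During tuning, only
# these would receive gradient updates; the pre-trained model stays frozen.
soft_prompt = [[random.gauss(0, 0.02) for _ in range(EMBED_DIM)]
               for _ in range(N_PROMPT)]

def embed(seq):
    """Stand-in for the frozen model's nucleotide embedding lookup."""
    table = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
    return [[table[base]] * EMBED_DIM for base in seq]

def with_soft_prompt(seq):
    """Prepend the soft prompt vectors to the sequence embeddings before
    feeding them to the frozen model."""
    return soft_prompt + embed(seq)

inputs = with_soft_prompt("ATGC")
print(len(inputs))  # 8 = 4 prompt vectors + 4 nucleotide embeddings
```

Because only the prompt vectors are optimized, a single frozen HyenaDNA checkpoint can be adapted to many tasks with a handful of extra parameters each.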

These results highlight HyenaDNA's capabilities on complex genomic tasks and its potential to address long-range dependencies and species differentiation. The researchers predict that this advance will help drive AI-assisted drug discovery and therapeutic innovation, and that it could eventually enable genomic models to learn from and analyze complete patient genomes in a personalized way, further deepening the understanding and application of genomics.


Check out the Paper and Blog. If you have any questions regarding the above article or if you have missed anything, please do not hesitate to email us at Asif@marktechpost.com


Khushboo Gupta is a Consulting Intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT) Goa. She is passionate about Machine Learning, Natural Language Processing, and Web Development, and likes to learn more about the technical field by participating in different challenges.


