University of Toronto researchers present scGPT: a foundation model for single-cell biology built on a generative transformer pre-trained on a repository of more than 33 million cells

https://www.biorxiv.org/content/biorxiv/early/2023/07/02/2023.04.30.538439.full.pdf

Natural language processing and computer vision are just two of the fields where pre-trained generative models have been remarkably successful. In particular, combining large-scale diverse datasets with pre-trained transformers has proven to be a viable strategy for building foundation models. The study investigates the feasibility of foundation models for advancing research in cell biology and genetics by drawing an analogy between language and biology, in which genes play the role of words and cells the role of sentences. Leveraging the growing body of single-cell sequencing data, the researchers built scGPT, a foundation model for single-cell biology based on a generative transformer pre-trained on a repository of more than 33 million cells. The results show that the pre-trained scGPT efficiently extracts key biological information about genes and cells. Through transfer learning, the model can be adapted to a range of downstream applications, including gene network inference, genetic perturbation prediction, and multi-batch integration. The scGPT source code is publicly available.

By enabling the detailed characterization of individual cell types, single-cell RNA sequencing (scRNA-seq) paves the way for studying cellular heterogeneity, tracing lineages, elucidating pathogenetic mechanisms, and developing patient-specific therapeutic approaches.

Given the exponential growth of sequencing data, there is an urgent need for methods that can effectively exploit, improve upon, and adapt to these new data. Generative pre-training of foundation models is an effective strategy for meeting this challenge. By learning from massive datasets, generative pre-training has recently seen tremendous success in various domains, most prominently natural language generation (NLG) and computer vision. Foundation models such as DALL-E 2 and GPT-4 are based on the principle of pre-training transformers on large-scale heterogeneous datasets, after which they can be readily tailored to specific downstream tasks and scenarios. Moreover, these pre-trained generative models often outperform counterparts trained from scratch for a single task.


The researchers draw on the self-supervised pre-training methods used in natural language generation to model massive amounts of single-cell sequencing data. The self-attention transformer has proven to be a useful and efficient framework for modeling input tokens in text.
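To make the transformer analogy concrete, one way to picture the input is a cell represented as a sequence of gene tokens paired with binned expression values, the kind of token stream a transformer encoder can consume. The following is a minimal sketch of that idea in PyTorch; the gene vocabulary, bin count, and model sizes are illustrative placeholders, not scGPT's actual configuration.

```python
# Hypothetical sketch: one cell as a sequence of gene tokens plus binned
# expression values, fed to a small transformer encoder.
import torch
import torch.nn as nn

gene_vocab = {"<pad>": 0, "<cls>": 1, "CD3D": 2, "CD19": 3, "NKG7": 4, "MS4A1": 5}
n_expr_bins = 51  # expression values discretized into bins (illustrative count)

gene_emb = nn.Embedding(len(gene_vocab), 64)
expr_emb = nn.Embedding(n_expr_bins, 64)

# One cell: the genes it expresses and their binned expression levels.
genes = torch.tensor([[1, 2, 4, 5]])      # <cls>, CD3D, NKG7, MS4A1
bins = torch.tensor([[0, 12, 30, 3]])     # binned counts (0 for the <cls> token)

tokens = gene_emb(genes) + expr_emb(bins)  # (1, 4, 64) input to the transformer
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
hidden = encoder(tokens)                   # per-gene hidden states
cell_embedding = hidden[:, 0]              # <cls> position summarizes the cell
print(cell_embedding.shape)                # torch.Size([1, 64])
```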

Using generative pre-training on more than 33 million cells, the researchers offer the first attempt at building a single-cell foundation model, dubbed scGPT. They present novel approaches to pre-training on massive amounts of single-cell omics data, addressing both methodological and engineering challenges as they arise. They use a fast-access, in-memory data structure to store hundreds of datasets, allowing them to handle huge volumes of data. They modify the transformer architecture to learn cell and gene representations simultaneously, and build a unified generative pre-training approach tailored to non-sequential omics data. To enable the use of the pre-trained model in various downstream tasks, they also provide standard fine-tuning pipelines with task-specific objectives.
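One way to read "generative pre-training tailored to non-sequential omics data" is as an objective in which the model predicts hidden expression values from the genes and values it can see. Below is a minimal sketch of such a masked-value objective; the masking ratio, loss function, and tensor shapes are assumptions for illustration rather than the paper's exact recipe.

```python
# Hedged sketch of a masked-expression pre-training objective: predict the
# expression values that were hidden from the model, given the rest of the cell.
import torch
import torch.nn as nn

d_model, n_genes_in_cell = 64, 128
hidden = torch.randn(8, n_genes_in_cell, d_model)   # transformer outputs for a batch of cells
true_expr = torch.rand(8, n_genes_in_cell)          # normalized expression values

value_head = nn.Linear(d_model, 1)                  # predicts expression from a hidden state
mask = torch.rand(8, n_genes_in_cell) < 0.15        # hide ~15% of expression values (assumed ratio)

pred = value_head(hidden).squeeze(-1)
loss = nn.functional.mse_loss(pred[mask], true_expr[mask])  # loss only on the masked positions
loss.backward()
```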

The scGPT model highlights the transformative potential of single-cell foundation models through three contributions. First, scGPT is the first large-scale generative foundation model that supports transfer learning to a range of downstream tasks. The authors demonstrate the effectiveness of the pre-train universally, fine-tune on demand approach as a generalist solution for computational applications in single-cell omics, achieving state-of-the-art performance on cell type annotation, genetic perturbation prediction, batch correction, and multi-omics integration.

Notably, scGPT is the only foundation model among those compared that can incorporate scATAC-seq and other single-cell omics data. Second, scGPT reveals important biological insights into condition-specific gene-gene interactions by comparing gene embeddings and attention weights between fine-tuned and raw pre-trained models. Third, the results reveal a scaling law: using more data in the pre-training phase yields better pre-trained embeddings and higher performance on downstream tasks. This finding underscores the promising possibility that foundation models can steadily improve as more sequencing data becomes available to the research community. In light of these findings, the authors hypothesize that pre-trained foundation models will significantly deepen our understanding of cell biology and lay the groundwork for future advances in the field. Making the scGPT models and workflows publicly available helps strengthen and accelerate research in these and related areas.

scGPT is a new generative foundation model that uses pre-trained transformers to make sense of a large volume of single-cell data, as described by the study authors. Self-supervised pre-training has proven effective in language models such as ChatGPT and GPT-4. To study single cells, the authors apply the same strategy to decipher intricate biological connections. To better model the different facets of cellular processes, scGPT uses transformers to learn gene and cell embeddings simultaneously. scGPT captures gene-to-gene interactions at the single-cell level, adding a new degree of interpretability through the transformer's attention mechanism.
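As an illustration of this interpretability claim, self-attention weights over gene tokens can be read as rough gene-gene interaction scores. The snippet below sketches that idea with a single attention layer and random embeddings; the symmetrization and layer choice are assumptions, not the authors' exact analysis.

```python
# Illustrative only: reading self-attention weights as a rough gene-gene
# interaction score for one cell.
import torch
import torch.nn as nn

d_model, n_genes = 64, 6
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
tokens = torch.randn(1, n_genes, d_model)   # embedded gene tokens for one cell

_, attn_weights = attn(tokens, tokens, tokens,
                       need_weights=True, average_attn_weights=True)
# attn_weights: (1, n_genes, n_genes); entry [i, j] = how much gene i attends to gene j
interaction = (attn_weights[0] + attn_weights[0].T) / 2   # symmetrize into a score matrix
print(interaction.shape)                                   # torch.Size([6, 6])
```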

The researchers conducted extensive studies in zero-shot and fine-tuning scenarios to demonstrate the value of pre-training. The pre-trained model already serves as a feature extractor for any dataset and shows impressive extrapolation ability, producing meaningful cell clustering in zero-shot studies. Furthermore, the gene networks learned by scGPT show a high degree of agreement with previously established functional relationships. Because the models capture gene-gene interactions and effectively reflect known biology, the researchers are confident in their ability to uncover relevant discoveries in single-cell biology. Moreover, with modest fine-tuning, the knowledge learned by the pre-trained model can be transferred to various downstream tasks. The fine-tuned scGPT model routinely outperforms models trained from scratch on tasks such as cell-type annotation and multi-batch and multi-omic integration, showing how the pre-trained model benefits downstream applications in both accuracy and biological relevance. Overall, the experiments demonstrate the utility of pre-training in scGPT: it generalizes well, captures gene networks, and improves performance on downstream tasks through transfer learning.
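The zero-shot feature-extractor behavior can be pictured as running new cells through the frozen encoder and comparing the pooled embeddings, with no gradient updates at all. Here is a minimal, self-contained sketch of that workflow; the encoder below is a random stand-in rather than the released scGPT weights, and the pooling and similarity choices are assumptions.

```python
# Minimal sketch of zero-shot feature extraction: a frozen encoder produces
# cell embeddings, which are then compared without any fine-tuning.
import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)
encoder.eval()                                   # frozen: no gradient updates

with torch.no_grad():
    tokens = torch.randn(100, 32, 64)            # 100 cells, 32 gene tokens each
    cell_embs = encoder(tokens)[:, 0]            # pooled cell embeddings, shape (100, 64)

# Cosine similarity between cells; high-similarity groups suggest shared cell states.
normed = nn.functional.normalize(cell_embs, dim=1)
similarity = normed @ normed.T                   # (100, 100) cell-cell similarity matrix
print(similarity.shape)
```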

Main features

  • The generalist approach enables integrated multi-omics analysis and perturbation prediction with a single model in single-cell studies.
  • Condition-specific gene-gene interactions can be identified from the learned attention weights and gene embeddings.
  • The researchers identify a scaling law showing that model performance improves continuously as the amount of pre-training data grows.
  • Pre-trained foundation models for a range of solid organs, as well as a pan-cancer model, are now available in the scGPT model zoo (see GitHub), giving researchers the best possible starting point for their own data; a minimal fine-tuning sketch follows this list.
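As referenced in the last item above, a typical way to start from a pre-trained checkpoint is to fine-tune it with a small task head, for example for cell-type annotation. This is a hedged sketch of that pattern; `CellTypeClassifier`, the label count, and the randomly initialized encoder are placeholders, not the actual scGPT fine-tuning pipeline, where the encoder would instead be loaded from the model zoo.

```python
# Hedged sketch: a classification head on top of a (placeholder) pre-trained
# encoder, fine-tuned jointly for cell-type annotation.
import torch
import torch.nn as nn

class CellTypeClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, d_model: int, n_cell_types: int):
        super().__init__()
        self.encoder = encoder            # pre-trained transformer, fine-tuned with the head
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, n_cell_types)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(tokens)     # (batch, genes, d_model)
        cell_emb = hidden[:, 0]           # pooled cell representation at the <cls> position
        return self.head(cell_emb)        # logits over cell-type labels

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)
model = CellTypeClassifier(encoder, d_model=64, n_cell_types=10)
logits = model(torch.randn(4, 32, 64))    # 4 cells, 32 gene tokens each
print(logits.shape)                        # torch.Size([4, 10])
```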

Future pre-training should take place on much larger datasets that include multi-omics data, spatial omics, and a broad range of disease states. If perturbation and temporal data are included in the pre-training phase, the model could also learn causal links and estimate how genes and cells respond over time. To better understand and interpret what pre-trained models learn, it would be ideal to validate the model on a broader set of biologically meaningful tasks. The authors also aim to study context-aware knowledge for single-cell data: in a zero-shot setup, the pre-trained model must grasp and adapt to new tasks and contexts without further fine-tuning. By teaching scGPT to understand the subtleties and unique requirements of different studies, they can enhance its utility and applicability across a variety of research settings. They expect the pre-training paradigm to be readily adopted in single-cell research and to lay the foundation for capitalizing on the knowledge accumulating in rapidly expanding cell atlases.


Check out the Paper and GitHub link. Don't forget to join our 25k+ ML SubReddit, Discord channel, and Email newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, please do not hesitate to email us at Asif@marktechpost.com



Dhanshree Shenwai is a software engineer with solid experience in FinTech companies spanning finance, cards & payments, and banking, and a keen interest in AI applications. She is enthusiastic about exploring new technologies and advancements in today's changing world, making everyone's life easier.


Image source: www.marktechpost.com
