Scientists at the Icahn School of Medicine at Mount Sinai have developed a new artificial intelligence (AI) model, the Gene Set Foundation Model (GSFM), that helps reveal how genes function together within human cells. Described in a study published in 'Patterns,' the model takes inspiration from large language models (LLMs) such as ChatGPT: it learns the functional patterns of genes from their biological context. Dr. Avi Ma'ayan, a senior corresponding author, explains that just as a word can carry different meanings in different sentences, a gene can play different roles depending on its cellular environment. The approach offers a new lens for understanding fundamental biology and disease mechanisms.
The GSFM offers a fresh perspective on the structural and functional organization of genes and their products within human cells. That insight could support advances in diagnostics, biomarker discovery, and therapeutic development. By mapping gene relationships across many biological scenarios, the GSFM provides a reference framework that helps scientists interpret complex multi-omics datasets. Dr. Ma'ayan emphasizes that the model tackles a major challenge in biology, understanding how genes are organized, by learning from millions of gene groupings derived from published research and gene expression data.
The model offers several capabilities for biomedical research. It can help identify the functions of genes that are not yet well understood, potentially reducing the need for immediate laboratory experiments. It can also highlight genes involved in disease processes, which could lead to the discovery of new drug targets and biomarkers. Finally, it serves as a reusable knowledge system that can strengthen routine data analysis tasks in biomedical research, such as gene set enrichment analysis, thereby acting as a new 'map' of gene interactions.
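For readers unfamiliar with gene set enrichment analysis, the sketch below shows the classical over-representation version of the idea: a hypergeometric test that asks whether a list of genes overlaps a known gene set more than chance would predict. It is a standard textbook baseline, not the GSFM's method; the article does not describe how the model changes the analysis. The function name and the numbers in the usage lines are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(overlap, query_size, set_size, background_size):
    """Return P(X >= overlap) under the hypergeometric null.

    overlap:         genes shared by the query list and the annotated gene set
    query_size:      number of genes in the query list
    set_size:        number of genes in the annotated gene set
    background_size: number of genes that could have been drawn
    """
    max_overlap = min(query_size, set_size)
    total = comb(background_size, query_size)
    # Sum the probability of every overlap at least as large as the observed one.
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, query_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 8 of 100 query genes fall in a 200-gene set,
# drawn from a background of 20,000 genes.
p_value = hypergeom_enrichment_p(8, 100, 200, 20000)
print(f"enrichment p-value: {p_value:.3g}")
```

A small p-value suggests the overlap is unlikely to be a chance event. In practice, results are corrected for the number of gene sets tested.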
To construct the GSFM, the researchers compiled millions of gene sets from published scientific studies and gene expression datasets, integrating data from hundreds of thousands of independent research efforts. Training resembled solving a puzzle: the model was presented with incomplete gene sets and tasked with predicting the missing genes. Through this process, it learned fundamental patterns in how genes are grouped and how they interact. Benchmarked against existing methods, the GSFM performed better at predicting gene-gene and gene-function relationships. It also anticipated relationships before they were experimentally confirmed, as shown by its ability to forecast discoveries reported after its training cutoff date.
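The puzzle-style training task described above can be sketched as follows. This is a minimal illustration that assumes a simple random masking scheme and a recall-at-k score; the article does not specify the masking fraction, the model architecture, or the evaluation metric, so `make_masked_example`, `recall_at_k`, and the example gene set are hypothetical.

```python
import random


def make_masked_example(gene_set, mask_fraction=0.2, rng=None):
    """Split a gene set into visible genes (model input) and hidden genes (targets)."""
    rng = rng or random.Random()
    genes = sorted(set(gene_set))
    if len(genes) < 2:
        raise ValueError("need at least two genes to hide one and keep one")
    # Hide at least one gene, but always leave at least one visible.
    n_hidden = max(1, min(len(genes) - 1, round(len(genes) * mask_fraction)))
    hidden = set(rng.sample(genes, n_hidden))
    visible = [g for g in genes if g not in hidden]
    return visible, sorted(hidden)


def recall_at_k(ranked_genes, hidden, k):
    """Fraction of hidden genes that appear among the top-k predictions."""
    top = set(ranked_genes[:k])
    return len(top & set(hidden)) / len(hidden)


# Hypothetical gene set, used only to show the shape of one training example.
example_set = ["TP53", "MDM2", "CDKN1A", "BAX", "ATM"]
visible, hidden = make_masked_example(example_set, mask_fraction=0.4,
                                      rng=random.Random(0))
print("input genes: ", visible)
print("target genes:", hidden)
```

A model trained on this task would take the visible genes as input and rank candidate genes, and a score such as `recall_at_k` would then measure how many hidden genes it recovers.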
Unlike many previous biological AI models, which relied primarily on gene expression data, the GSFM is trained on gene sets, an underutilized type of biological information. This allows the model to integrate data spanning many diseases, experimental techniques, and research conditions into a unified representation of gene relationships across biology. The research team plans to expand the system by integrating the GSFM with other AI foundation models. Future goals include combining it with language-based models to generate natural-language explanations of gene functions, and with drug-focused AI models to predict drug-cell interactions and support the design of new therapeutics.