A long-context language model for deciphering and generating bacteriophage genomes

Bin Shao, Jiawei Yan

2024 · DOI: 10.1038/s41467-024-53759-4
Nature Communications · 14 Citations

TLDR

MegaDNA, a long-context genomic language model, generates DNA sequences up to 96 K base pairs with annotated proteins and potential regulatory elements. It predicts essential genes, genetic variant effects, and regulatory element activity.

Abstract

Inspired by the success of large language models (LLMs), we develop a long-context generative model for genomes. Our multiscale transformer model, megaDNA, is pre-trained on unannotated bacteriophage genomes with nucleotide-level tokenization. We demonstrate the foundational capabilities of our model, including the prediction of essential genes, genetic variant effects, regulatory element activity and taxonomy of unannotated sequences. Furthermore, it generates de novo sequences up to 96 K base pairs, which contain potential regulatory elements and annotated proteins with phage-related functions.
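
To illustrate what nucleotide-level tokenization means in practice, here is a minimal sketch in which every base becomes exactly one token. The vocabulary, the token ids, and the `encode`/`decode` helper names are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
# Hypothetical vocabulary: the ids and special tokens are assumptions,
# not taken from the megaDNA paper or code.
VOCAB = {"<pad>": 0, "A": 1, "C": 2, "G": 3, "T": 4, "<unk>": 5}
ID_TO_TOKEN = {idx: tok for tok, idx in VOCAB.items()}


def encode(seq: str) -> list[int]:
    """Map each nucleotide to one integer token.

    Input is upper-cased first; any symbol outside A/C/G/T
    (for example the ambiguity code N) maps to <unk>.
    """
    return [VOCAB.get(base, VOCAB["<unk>"]) for base in seq.upper()]


def decode(ids: list[int]) -> str:
    """Invert encode: padding is dropped and <unk> is rendered as N."""
    bases = []
    for idx in ids:
        tok = ID_TO_TOKEN[idx]
        if tok == "<pad>":
            continue
        bases.append("N" if tok == "<unk>" else tok)
    return "".join(bases)
```

With one token per base, a 96 K base pair genome becomes a sequence of roughly 96,000 tokens, which is why a long-context architecture is needed for this kind of model.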