Chikhi Rayan, Lemane Téo, Loll-Krippleber Raphaël, Montoliu-Nerin Mercè, Raffestin Brice, Camargo Antonio Pedro, Miller Carson J, Fiamenghi Mateus Bernabe, Agustinho Daniel Paiva, Majidian Sina, Autric Greg, Hugues Maxime, Lee Junkyoung, Faure Roland, Curry Kristen D, Moura de Sousa Jorge A, Rocha Eduardo P C, Koslicki David, Medvedev Paul, Gupta Purav, Shen Jessica, Morales-Tapia Alejandro, Sihuta Kate, Roy Peter J, Brown Grant W, Edgar Robert C, Korobeynikov Anton, Steinegger Martin, Lareau Caleb A, Peterlongo Pierre, Babaian Artem
Institut Pasteur, Université Paris Cité, CNRS UMR3525, Paris, France.
Génomique Métabolique, Genoscope, Institut de Biologie François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, Evry, France.
bioRxiv. 2025 Sep 1:2024.07.30.605881. doi: 10.1101/2024.07.30.605881.
The breadth of life's diversity is unfathomable, but public nucleic acid sequencing data offer a window into the dispersion and evolution of genetic diversity across Earth. However, the rapid growth and accumulation of sequence data have outpaced efficient analysis capabilities. The largest collection of freely available sequencing data is the Sequence Read Archive (SRA), comprising 27.3 million datasets or 5 × 10^16 basepairs. To realize the potential of the SRA, we constructed Logan, a massive sequence assembly transforming short reads into long contigs and compressing the data over 100-fold, enabling highly efficient petabase-scale analysis. We created Logan-Search, a k-mer index of Logan for free planetary-scale sequence search, returning matches in minutes. We used Logan contigs to identify >200 million plastic-degrading enzyme homologs, and validated novel enzymes with catalytic activities exceeding current reference standards. Further, we vastly expand the known diversity of proteins (30-fold over UniRef50), plasmids (22-fold over PLSDB), P4 satellites (4.5-fold), and the recently described Obelisk RNA elements (3.7-fold). Logan also enables ecological and biomedical data mining, such as global tracking of antimicrobial resistance genes and the characterization of viral reactivation across millions of human BioSamples. By transforming the SRA, Logan democratizes access to the world's public genetic data and opens frontiers in biotechnology, molecular ecology, and global health.