Ultrafast search of all deposited bacterial and viral genomic data.

Affiliations

Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK.

EMBL-EBI, Hinxton, UK.

Publication information

Nat Biotechnol. 2019 Feb;37(2):152-159. doi: 10.1038/s41587-018-0010-1. Epub 2019 Feb 4.

Abstract

Exponentially increasing amounts of unprocessed bacterial and viral genomic sequence data are stored in the global archives. The ability to query these data for sequence search terms would facilitate both basic research and applications such as real-time genomic epidemiology and surveillance. However, this is not possible with current methods. To solve this problem, we combine knowledge of microbial population genomics with computational methods devised for web search to produce a searchable data structure named BItsliced Genomic Signature Index (BIGSI). We indexed the entire global corpus of 447,833 bacterial and viral whole-genome sequence datasets using four orders of magnitude less storage than previous methods. We applied our BIGSI search function to rapidly find resistance genes MCR-1, MCR-2, and MCR-3, determine the host-range of 2,827 plasmids, and quantify antibiotic resistance in archived datasets. Our index can grow incrementally as new (unprocessed or assembled) sequence datasets are deposited and can scale to millions of datasets.
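The abstract does not spell out how the index is organised. As a rough illustration only, the sketch below shows one way a bitsliced signature index can work, assuming each dataset is summarised by a Bloom filter over its k-mers and the filters are stored as columns of a bit matrix, so that a query only reads the rows its k-mers hash to. The names `ToyBitslicedIndex`, `kmers`, and `row_indices`, the tiny parameter values, and the use of SHA-256 as the hash are all assumptions made for this example; it is not the authors' implementation.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def row_indices(kmer, num_rows, num_hashes):
    """Map a k-mer to num_hashes row positions (hypothetical hashing scheme)."""
    rows = []
    for seed in range(num_hashes):
        digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
        rows.append(int.from_bytes(digest[:8], "big") % num_rows)
    return rows


class ToyBitslicedIndex:
    """Toy bit matrix: one column per dataset, one row per Bloom filter bit."""

    def __init__(self, num_rows=1024, num_hashes=3, k=5):
        self.num_rows = num_rows
        self.num_hashes = num_hashes
        self.k = k
        self.names = []
        # rows[r] is an integer bitset; bit j is set when dataset j has
        # Bloom filter bit r set.
        self.rows = [0] * num_rows

    def add(self, name, sequence):
        """Append a dataset as a new column; existing columns are untouched."""
        column = len(self.names)
        self.names.append(name)
        for kmer in kmers(sequence, self.k):
            for r in row_indices(kmer, self.num_rows, self.num_hashes):
                self.rows[r] |= 1 << column

    def search(self, query):
        """Return datasets whose filter contains every k-mer of the query."""
        if len(query) < self.k:
            raise ValueError("query is shorter than k")
        hits = (1 << len(self.names)) - 1  # start with every column set
        for kmer in kmers(query, self.k):
            for r in row_indices(kmer, self.num_rows, self.num_hashes):
                hits &= self.rows[r]  # bitwise AND across all datasets at once
        return [n for j, n in enumerate(self.names) if (hits >> j) & 1]


index = ToyBitslicedIndex()
index.add("sample_a", "ATGGCGTACGTTAGCCTAGGATCC")
index.add("sample_b", "TTTTTTTTGGGGGGGGCCCCCCCC")
print(index.search("GCGTACGTTAGC"))
```

Because a Bloom filter never produces false negatives, any dataset that truly contains the query is always returned; false positives are possible and become rarer as `num_rows` grows. Adding a dataset only sets bits in a new column, which is the property that lets an index of this shape grow incrementally.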

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cfc6/6420049/e7f6e15258d4/emss-80982-f001.jpg