E2FM: an encrypted and compressed full-text index for collections of genomic sequences.

Author information

Centro Reti, Sistemi e Servizi Informatici/CRESSI, Università degli Studi della Campania "Luigi Vanvitelli," Napoli 80133, Italy.

Istituto di Calcolo e Reti ad Alte Prestazioni/ICAR, Consiglio Nazionale delle Ricerche, Napoli 80131, Italy.

Publication information

Bioinformatics. 2017 Sep 15;33(18):2808-2817. doi: 10.1093/bioinformatics/btx313.

Abstract

MOTIVATION

Next Generation Sequencing (NGS) platforms and, more generally, high-throughput technologies are giving rise to an exponential growth in the size of nucleotide sequence databases. Moreover, many emerging applications of nucleotide datasets, such as those related to personalized medicine, require compliance with regulations about the storage and processing of sensitive data.

RESULTS

We have designed and carefully engineered E2FM-index, a new full-text index in minute space, optimized for compressing and encrypting nucleotide sequence collections in FASTA format and for performing fast pattern-search queries. E2FM-index builds self-indexes that occupy as little as 1/20 of the storage required by the input FASTA file, saving about 95% of storage when indexing collections of highly similar sequences. Moreover, it can perform exact pattern searches on the built indexes in times ranging from a few milliseconds to a few hundred milliseconds, depending on pattern length.
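As a rough illustration of what exact pattern search on a full-text self-index involves, the following is a minimal sketch of a plain FM-index with backward search. It is not the E2FM-index implementation: it performs no compression or encryption, it builds the suffix array naively, and the function names (build_fm_index, count_and_locate) are hypothetical names chosen for this example.

```python
def build_fm_index(text):
    """Build a toy FM-index for `text`: suffix array, C table, occurrence table."""
    text += "$"  # sentinel; sorts before A, C, G, T
    # Naive suffix array construction: acceptable for a toy input only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c] = number of characters in text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ


def count_and_locate(pattern, sa, C, occ):
    """Return the sorted start positions of exact matches of `pattern`."""
    lo, hi = 0, len(sa)  # half-open range of suffix array rows
    for ch in reversed(pattern):  # backward search: last character first
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, C, occ = build_fm_index("ACGTACGTTACG")
print(count_and_locate("ACG", sa, C, occ))  # [0, 4, 9]
```

The number of backward-search steps grows with the pattern length rather than with the size of the indexed text, which is consistent with the abstract's remark that query times depend on pattern length.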

AVAILABILITY AND IMPLEMENTATION

Source code is available at https://github.com/montecuollo/E2FM.

CONTACT

ferdinando.montecuollo@unicampania.it.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
