Centro Reti, Sistemi e Servizi Informatici/CRESSI, Università degli Studi della Campania "Luigi Vanvitelli," Napoli 80133, Italy.
Istituto di Calcolo e Reti ad Alte Prestazioni/ICAR, Consiglio Nazionale delle Ricerche, Napoli 80131, Italy.
Bioinformatics. 2017 Sep 15;33(18):2808-2817. doi: 10.1093/bioinformatics/btx313.
Next Generation Sequencing (NGS) platforms and, more generally, high-throughput technologies are driving exponential growth in the size of nucleotide sequence databases. Moreover, many emerging applications of nucleotide datasets, such as those related to personalized medicine, require compliance with regulations on the storage and processing of sensitive data.
We have designed and carefully engineered E2FM-index, a new full-text index in minute space optimized for compressing and encrypting nucleotide sequence collections in FASTA format and for performing fast pattern-search queries. E2FM-index builds self-indexes that occupy as little as 1/20 of the storage required by the input FASTA file, thus saving about 95% of storage when indexing collections of highly similar sequences. Moreover, it can perform exact pattern searches on the built indexes in times ranging from a few milliseconds to a few hundred milliseconds, depending on pattern length.
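As an illustration of the exact pattern search that an FM-index supports, the following is a minimal sketch of the textbook backward-search procedure over a Burrows-Wheeler transform. It is not the E2FM-index implementation and omits both compression and encryption; the function names `build_fm_index` and `count_and_locate`, the `$` sentinel, and the naive suffix-array construction are assumptions made only for this example.

```python
def build_fm_index(text):
    """Build a naive, uncompressed FM-index for `text` (toy sizes only)."""
    assert "$" not in text, "'$' is reserved as the end-of-text sentinel"
    text += "$"
    # Suffix array by direct sorting: fine for a demo, far too slow for genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (index -1 wraps to '$').
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[c] = number of characters in text strictly smaller than c.
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, c_table, occ


def count_and_locate(pattern, sa, c_table, occ):
    """Return the sorted start positions of `pattern` via backward search."""
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, c_table, occ = build_fm_index("ACGTACGTTACG")
print(count_and_locate("ACG", sa, c_table, occ))  # [0, 4, 9]
```

The search loop performs one step per pattern character, which is consistent with query times that depend on pattern length rather than on the size of the indexed collection. A practical index would replace the full suffix array and occurrence tables with sampled, compressed structures.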
Source code is available at https://github.com/montecuollo/E2FM .
ferdinando.montecuollo@unicampania.it.
Supplementary data are available at Bioinformatics online.