Graduate Program in Informatics (PPGia), Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil.
Polytechnic School, Centro Universitário UniDomBosco, Curitiba, Paraná, Brazil.
PLoS One. 2020 May 26;15(5):e0232942. doi: 10.1371/journal.pone.0232942. eCollection 2020.
The recent decrease in the cost and time needed to sequence and assemble complete genomes has increased the demand for data storage. As a consequence, several strategies for compressing assembled biological data have been created. Vertical compression tools take advantage of the high level of similarity between multiple assembled genomic sequences to achieve better compression results. However, current reviews on vertical compression do not compare the execution flow of each tool, which consists of preprocessing, transformation, and data encoding phases. We performed a systematic literature review to identify and compare existing tools for vertical compression of assembled genomic sequences. The review was centered on PubMed and Scopus, from which 45,726 distinct papers were considered. Of these, 32 papers were selected according to the following criteria: the paper presents a lossless vertical compression tool; the tool uses the information contained in other sequences for the compression; the tool can handle genomic sequences in FASTA format; and the tool requires no prior knowledge. Although we extracted compression performance results, we did not compare them because the tools did not use a standardized evaluation protocol. Thus, we conclude that the field lacks a defined evaluation protocol that every tool must apply.
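The three-phase execution flow named above (preprocessing, transformation, data encoding) can be illustrated with a minimal sketch. It is not the method of any reviewed tool: the function names, the greedy k-mer matching, the JSON serialization, and the use of `zlib` as a stand-in for the final encoder are all assumptions made for illustration.

```python
import json
import zlib


def read_fasta_sequence(text):
    """Preprocessing: drop FASTA header lines and join the sequence lines."""
    return "".join(
        line.strip()
        for line in text.splitlines()
        if line and not line.startswith(">")
    )


def build_index(reference, k):
    """Map each k-mer of the reference to the first position where it occurs."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], i)
    return index


def transform(target, reference, k=4):
    """Transformation: rewrite the target as copy and literal operations.

    ["C", pos, length] copies `length` bases from the reference at `pos`;
    ["L", text] stores bases that were not matched in the reference.
    """
    index = build_index(reference, k)
    ops, literal, i = [], [], 0
    while i < len(target):
        pos = index.get(target[i:i + k])
        if pos is None:
            literal.append(target[i])
            i += 1
            continue
        # Greedily extend the seed match as far as both sequences agree.
        length = k
        while (i + length < len(target)
               and pos + length < len(reference)
               and target[i + length] == reference[pos + length]):
            length += 1
        if literal:
            ops.append(["L", "".join(literal)])
            literal = []
        ops.append(["C", pos, length])
        i += length
    if literal:
        ops.append(["L", "".join(literal)])
    return ops


def encode(ops):
    """Data encoding: serialize the operations and compress them (zlib here)."""
    return zlib.compress(json.dumps(ops).encode("utf-8"))


def restore(blob, reference):
    """Decode the operations and rebuild the target from the reference."""
    parts = []
    for op in json.loads(zlib.decompress(blob)):
        if op[0] == "C":
            parts.append(reference[op[1]:op[1] + op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


# Hypothetical toy sequences that differ by a single base.
reference = read_fasta_sequence(">ref\nACGTACGT\nTTGACCA\n")
target = read_fasta_sequence(">sample\nACGTACGT\nTAGACCA\n")
blob = encode(transform(target, reference))
assert restore(blob, reference) == target  # the round trip is lossless
```

On such short toy inputs the compressed blob is not smaller than the raw text. The sketch only shows why the approach pays off on highly similar genomes: long stretches of the target collapse into a few copy operations against the reference.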