Silva Milton, Pratas Diogo, Pinho Armando J
IEETA-Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, 3810-193 Aveiro, Portugal.
Department of Electronics Telecommunications and Informatics, University of Aveiro, 3810-193 Aveiro, Portugal.
Entropy (Basel). 2021 Apr 26;23(5):530. doi: 10.3390/e23050530.
Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, raising two increasingly important challenges: efficient storage and improved data analysis. Data compression is a natural tool for both. However, few compressors in the literature are designed specifically for protein sequences, and those that exist improve the compression ratio only marginally over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts with a stacked generalization approach, and it uses individual cache-hash memory models for the highest context orders. Compared to the previous compressor (AC), we show gains of 2-9% and 6-7% in reference-free and reference-based modes, respectively. These gains come at the cost of computations that are three times slower. AC2 also improves on the memory usage of AC, requiring about seven times less memory, and its memory usage is not affected by the size of the input sequences. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence and each viral protein sequence in the whole UniProt database. The results consistently show higher similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, contributing critical results to a currently controversial subject. AC2 is available for free download under the GPLv3 license.
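To make the idea of compression-based similarity concrete, the following is a minimal sketch, not the method used by AC2. The abstract does not state which similarity measure is computed, so this example assumes the normalized compression distance (NCD) and uses Python's general-purpose zlib compressor as a stand-in for a protein-specific compressor. The function names (compressed_size, ncd) and the three toy sequences are hypothetical and chosen only for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression (stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: lower values suggest more shared content.

    Assumption: NCD is used here for illustration only; the abstract does not
    say which measure AC2 computes.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy amino acid strings for illustration only (not curated database entries).
# Each is repeated so the general-purpose compressor has something to model.
query = b"MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHS" * 4
related = b"MFVFLVLLPLVSSQCVNLITRTQSYTNSFTRGVYYPDKVFRSSVLHSTQD" * 4
unrelated = b"GSHMKKLLEAARAGQDDEVRILMANGADVNAKDKDGYTPLHLAAREGHLE" * 4

for name, candidate in [("related", related), ("unrelated", unrelated)]:
    print(f"NCD(query, {name}) = {ncd(query, candidate):.3f}")
```

With a measure of this kind, ranking database sequences by distance to a query gives an alignment-free similarity ordering. A compressor tuned to amino acid statistics would be expected to give sharper estimates than zlib does on such short inputs.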