Graduate Program in Health Sciences, Universidade Federal de Ciências da Saúde de Porto Alegre (UFCSPA), Rua Sarmento Leite, 245 - Centro Histórico, Porto Alegre, RS 90050-170, Brazil.
Database (Oxford). 2023 Aug 11;2023. doi: 10.1093/database/baad053.
The advancement of genetic sequencing techniques has led to the production of large volumes of data. Extracting genetic material from a sample is one of the early steps of a metagenomic study. As these processes evolved, analysis of the sequenced data made it possible to discover etiological agents and, as a corollary, to diagnose infections. One of the biggest challenges of the technique is the huge volume of data generated with each new technology. The aim of this work is to introduce an algorithm that may reduce the data volume, allowing faster DNA matching against reference databases. Using techniques such as lossy compression and a substitution matrix, it is possible to match nucleotide sequences without losing the subject. This lossy compression exploits the nature of DNA mutations, insertions and deletions, and the possibility that different sequences belong to the same subject. The algorithm can reduce the overall size of the database to 15% of the original size and, depending on the parameters, to as little as 5%. Although the approach is the same as on other platforms, the match algorithm is more sensitive because it ignores transitions and transversions, resulting in a faster way to obtain diagnostic results. In the first experiment, the algorithm ran 10 times faster than BLAST while maintaining high sensitivity. This performance gain can be extended by combining other techniques already used in other studies, such as hash tables. Database URL: https://github.com/ghc4/metagens.
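The idea of lossy compression that tolerates point mutations can be illustrated with a minimal sketch. It is not the encoding used by metagens, which the abstract does not specify. It assumes a hypothetical reduced alphabet in which purines (A, G) and pyrimidines (C, T) each collapse to one bit, so only transitions become invisible to the matcher; transversions, insertions and deletions are not handled here. The names `REDUCED`, `compress` and `same_subject` are illustrative.

```python
# Hypothetical reduced alphabet (an assumption for illustration only):
# purines A/G -> 0, pyrimidines C/T -> 1. Ambiguity codes such as N are
# not supported and raise KeyError.
REDUCED = {"A": 0, "G": 0, "C": 1, "T": 1}


def compress(seq: str) -> bytes:
    """Pack a DNA string into 1 bit per base (lossy: A/G and C/T merge)."""
    bits = [REDUCED[base] for base in seq.upper()]
    # Pad with zeros to a whole number of bytes.
    bits += [0] * (-len(bits) % 8)
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


def same_subject(a: str, b: str) -> bool:
    """Equal-length reads match when their compressed forms are identical."""
    return len(a) == len(b) and compress(a) == compress(b)


reference = "ACGTACGT"
with_transition = "GCGTACGT"    # A -> G at the first position
with_transversion = "CCGTACGT"  # A -> C at the first position

print(same_subject(reference, with_transition))
print(same_subject(reference, with_transversion))
print(len(compress(reference)), "byte(s) for", len(reference), "bases")
```

With one bit per base, eight bases fit in one byte, which is 12.5% of a one-byte-per-base text encoding. That figure is a property of this sketch only and should not be read as the source of the 15% reported in the abstract.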