Irisarri Iker, Burki Fabien, Whelan Simon
Department of Organismal Biology (Program in Systematic Biology), Uppsala University, Uppsala, Sweden.
Department of Biodiversity and Evolutionary Biology, Museo Nacional de Ciencias Naturales, Madrid, Spain.
Methods Mol Biol. 2021;2231:147-162. doi: 10.1007/978-1-0716-1036-7_10.
Large-scale multigene datasets used in phylogenomics and comparative genomics often contain sequence errors inherited from source genomes and transcriptomes. These errors typically manifest as stretches of non-homologous characters and derive from sequencing, assembly, and/or annotation errors. The lack of automatic tools to detect and remove sequence errors leads to their propagation in large-scale datasets. PREQUAL is a command-line tool that identifies and masks regions with non-homologous adjacent characters in sets of unaligned homologous sequences. PREQUAL uses a fully probabilistic approach based on pair hidden Markov models. On the front end, PREQUAL is simple to use while also allowing full customization to adjust filtering sensitivity. It is primarily aimed at amino acid sequences but can also handle protein-coding nucleotide sequences. PREQUAL is computationally efficient and shows high sensitivity and accuracy. In this chapter, we briefly introduce the motivation for PREQUAL and its underlying methodology, describe basic and advanced usage, and conclude with some notes and recommendations. PREQUAL fills an important gap in the current bioinformatics toolkit for phylogenomics, contributing toward increased accuracy and reproducibility in future studies.