Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, Belgium.
Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium.
Bioinformatics. 2018 Sep 15;34(18):3118-3125. doi: 10.1093/bioinformatics/bty309.
Evolutionary information is crucial for the annotation of proteins in bioinformatics. The number of retrieved homologs often correlates with the quality of predicted protein annotations related to structure or function. With a growing number of sequences available, fast and reliable methods for homology detection are essential, as they have a direct impact on predicted protein annotations.
We developed a discriminative, alignment-free algorithm for homology detection with quasi-linear complexity, which in theory enables much faster homology searches. To reach this goal, we convert the protein sequence into numeric biophysical representations. These are shrunk to a fixed length using a novel vector quantization method based on Discrete Cosine Transform compression. We then compute, for each compressed representation, similarity scores between proteins with the Dynamic Time Warping algorithm and feed them into a Random Forest. The performance of WARP is comparable with that of state-of-the-art methods.
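The first steps of such a pipeline can be illustrated with a minimal sketch. It is not the WARP implementation: the Kyte-Doolittle hydrophobicity scale stands in for the (unspecified) biophysical representations, the compression is a plain truncated-DCT resampling rather than the authors' vector quantization method, and the function names `encode`, `compress` and `dtw` as well as the two example sequences are hypothetical. The Random Forest step is omitted.

```python
import math

# Kyte-Doolittle hydrophobicity, used here only as an example of a
# numeric biophysical scale (assumption: the abstract names no scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(seq):
    """Map a protein sequence to a numeric signal, one value per residue."""
    return [KD[aa] for aa in seq]


def dct2(x):
    """Unnormalized DCT-II of a non-empty list of floats (naive O(n^2))."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
        for k in range(n)
    ]


def compress(signal, m):
    """Resample a signal to fixed length m by keeping its first m DCT
    coefficients and inverting the transform on an m-point grid.
    Illustrative stand-in, not the paper's vector quantization."""
    n = len(signal)
    coeffs = [c / n for c in dct2(signal)][:m]
    coeffs += [0.0] * (m - len(coeffs))  # zero-pad if the signal is shorter than m
    return [
        coeffs[0]
        + 2.0 * sum(coeffs[k] * math.cos(math.pi / m * (j + 0.5) * k)
                    for k in range(1, m))
        for j in range(m)
    ]


def dtw(a, b):
    """Classic Dynamic Time Warping distance with absolute-difference cost."""
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(a)][len(b)]


# Hypothetical sequences of different lengths, compared after compression.
query = compress(encode("MKTAYIAKQR"), 8)
target = compress(encode("MKTAYIAKQRQISFVK"), 8)
score = dtw(query, target)
```

Because every compressed representation has the same length, each DTW comparison in this sketch costs a fixed amount of work regardless of the original sequence lengths; a score of this kind, computed per representation, is what would then be passed to a classifier.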
The method is available at http://ibsquare.be/warp.
Supplementary data are available at Bioinformatics online.