IEEE/ACM Trans Comput Biol Bioinform. 2021 Nov-Dec;18(6):2471-2482. doi: 10.1109/TCBB.2020.2974221. Epub 2021 Dec 8.
Recognition of the functional sites of genes, such as translation initiation sites, donor and acceptor splice sites, and stop codons, is a relevant part of many current problems in bioinformatics. The best approaches use sophisticated classifiers, such as support vector machines. However, with the rapid accumulation of sequence data, methods for combining many sources of evidence are necessary, as it is unlikely that a single classifier can solve this problem with the best possible performance. A major issue is that the number of possible models to combine is large, and using all of these models is impractical. In this paper we present a methodology for combining many sources of information to recognize any functional site using "floating search", a powerful heuristic applicable when the cost of evaluating each solution is high. We present experiments on four functional sites in the human genome, which is used as the target genome, and use another 20 species as sources of evidence. The proposed methodology shows significant improvement over state-of-the-art methods. The results also challenge the standard assumption that only genomes neither very close to nor very far from the human genome should be used to improve the recognition of functional sites.
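The abstract does not specify which floating search variant, classifiers, or evaluation measure were used, so the following is only a minimal sketch of the generic sequential forward floating selection idea, applied to choosing which candidate models to combine. The function name `floating_search`, the model labels `A` to `D`, and the toy score table are all hypothetical. In a real setting, `score` would presumably train and validate a combination of models, which is why each subset's score is cached.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Tuple

Subset = FrozenSet[str]


def floating_search(
    candidates: Iterable[str],
    score: Callable[[Subset], float],
    max_size: int,
) -> Tuple[float, Subset]:
    """Generic sequential forward floating selection (illustrative sketch)."""
    pool = list(candidates)
    cache: Dict[Subset, float] = {}

    def evaluate(members: List[str]) -> float:
        # Each subset is scored at most once, since scoring is assumed costly.
        key = frozenset(members)
        if key not in cache:
            cache[key] = score(key)
        return cache[key]

    selected: List[str] = []
    best: Dict[int, Tuple[float, Subset]] = {}  # subset size -> best found

    while len(selected) < max_size:
        remaining = [c for c in pool if c not in selected]
        if not remaining:
            break
        # Forward step: add the candidate that helps the current set most.
        add = max(remaining, key=lambda c: evaluate(selected + [c]))
        selected = selected + [add]
        current = evaluate(selected)
        size = len(selected)
        if size not in best or current > best[size][0]:
            best[size] = (current, frozenset(selected))
        # Floating step: drop members while doing so beats the best
        # subset previously recorded for the smaller size.
        while len(selected) > 2:
            drop = max(
                selected,
                key=lambda c: evaluate([m for m in selected if m != c]),
            )
            reduced = [m for m in selected if m != drop]
            reduced_score = evaluate(reduced)
            if reduced_score > best[len(reduced)][0]:
                selected = reduced
                best[len(reduced)] = (reduced_score, frozenset(reduced))
            else:
                break

    if not best:
        return float("-inf"), frozenset()
    return max(best.values(), key=lambda entry: entry[0])


# Hypothetical toy scores: B and C are strong together, A is best alone.
TOY_SCORES: Dict[Subset, float] = {
    frozenset({"A"}): 0.60, frozenset({"B"}): 0.50,
    frozenset({"C"}): 0.50, frozenset({"D"}): 0.10,
    frozenset({"A", "B"}): 0.65, frozenset({"A", "C"}): 0.65,
    frozenset({"A", "D"}): 0.62, frozenset({"B", "C"}): 0.90,
    frozenset({"A", "B", "C"}): 0.85, frozenset({"A", "B", "D"}): 0.66,
    frozenset({"B", "C", "D"}): 0.70,
}


def toy_score(subset: Subset) -> float:
    return TOY_SCORES.get(subset, 0.0)


best_score, best_subset = floating_search(["A", "B", "C", "D"], toy_score, max_size=3)
print(best_score, sorted(best_subset))
```

On this toy table, a purely greedy forward search would pick `A`, then `A, B`, then `A, B, C`, and would never evaluate the pair `B, C`. The backward floating step removes `A` from the three-model set and recovers that higher-scoring pair, which is the behavior that distinguishes floating search from plain forward selection.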