Department of Structural and Molecular Biology, UCL, London WC1E 6BT, UK.
Bioinformatics. 2010 Mar 15;26(6):745-51. doi: 10.1093/bioinformatics/btq034. Epub 2010 Jan 29.
Accurate prediction of the domain content and arrangement in multi-domain proteins (which make up >65% of the large-scale protein databases) provides a valuable tool for function prediction, comparative genomics and studies of molecular evolution. However, scanning a multi-domain protein against a database of domain sequence profiles can often produce conflicting and overlapping matches. We have developed a novel method that employs heaviest weighted clique-finding (HCF), which we show significantly outperforms standard published approaches based on successively assigning the best non-overlapping match (Best Match Cascade, BMC).
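As an illustration of why the two strategies can disagree, the following minimal Python sketch contrasts a greedy best-match cascade with an exhaustive search for the heaviest set of mutually compatible matches. It is not the DomainFinder3 implementation: the `Match` type, the use of a single additive `score` as the weight, the strict no-overlap compatibility test and the example values are all assumptions made for this sketch.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    domain: str   # hypothetical domain family label
    start: int    # first residue of the match (inclusive)
    end: int      # last residue of the match (inclusive)
    score: float  # assumed additive weight, e.g. derived from an HMM score


def overlaps(a: Match, b: Match) -> bool:
    """True if the two matches share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def best_match_cascade(matches: List[Match]) -> List[Match]:
    """Greedy BMC: repeatedly keep the best remaining non-overlapping match."""
    chosen: List[Match] = []
    for m in sorted(matches, key=lambda m: m.score, reverse=True):
        if not any(overlaps(m, c) for c in chosen):
            chosen.append(m)
    return sorted(chosen, key=lambda m: m.start)


def heaviest_clique(matches: List[Match]) -> List[Match]:
    """Exhaustive search for the compatible set of matches with the largest
    total score (a heaviest weighted clique in the compatibility graph).
    Worst-case exponential; intended for small illustrative inputs only."""
    best: List[Match] = []
    best_score = 0.0

    def extend(clique: List[Match], candidates: List[Match], score: float) -> None:
        nonlocal best, best_score
        if score > best_score:
            best, best_score = list(clique), score
        for i, m in enumerate(candidates):
            # Keep only later candidates that are compatible with m.
            rest = [c for c in candidates[i + 1:] if not overlaps(m, c)]
            extend(clique + [m], rest, score + m.score)

    extend([], list(matches), 0.0)
    return sorted(best, key=lambda m: m.start)


# Made-up example: one long, high-scoring match masks two shorter ones.
example = [
    Match("X", 1, 200, 60.0),
    Match("A", 1, 100, 40.0),
    Match("B", 101, 200, 45.0),
    Match("C", 201, 300, 30.0),
]
print([m.domain for m in best_match_cascade(example)])  # ['X', 'C']
print([m.domain for m in heaviest_clique(example)])     # ['A', 'B', 'C']
```

In this made-up case the cascade commits to the single best match X (total score 90 with C), whereas considering all combinations recovers the three-domain arrangement A, B, C (total score 115).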
We created a benchmark data set of structural domain assignments in the CATH database and a corresponding set of Hidden Markov Model-based domain predictions. Using these, we demonstrate that by considering all possible combinations of matches using the HCF approach, we achieve much higher prediction accuracy than the standard BMC method. We also show that it is essential to allow overlapping domain matches to a query in order to identify correct domain assignments. Furthermore, we introduce a straightforward and effective protocol for resolving any overlapping assignments and producing a single set of non-overlapping predicted domains.
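The abstract does not specify how overlaps are tolerated or resolved, so the sketch below shows only one plausible scheme, under stated assumptions: two matches are treated as compatible if they overlap by at most a hypothetical `max_overlap` residues, and each remaining overlap between neighbouring domains is split at its midpoint. Both the threshold and the midpoint rule are illustrative choices, not the published protocol.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, end), residue positions, both inclusive


def overlap_length(a: Segment, b: Segment) -> int:
    """Number of residues shared by two segments (0 if disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def compatible(a: Segment, b: Segment, max_overlap: int = 10) -> bool:
    """Assumed tolerance rule: small overlaps do not exclude a pair.
    The default of 10 residues is a hypothetical value."""
    return overlap_length(a, b) <= max_overlap


def resolve_overlaps(segments: List[Segment]) -> List[Segment]:
    """Trim neighbouring segments so that none overlap, splitting each
    overlap at its midpoint. Assumes overlaps are short relative to the
    domains (as enforced by `compatible`); nested segments are not handled."""
    resolved: List[Segment] = []
    for start, end in sorted(segments):
        if resolved and start <= resolved[-1][1]:
            prev_start, prev_end = resolved[-1]
            cut = (start + prev_end) // 2      # last residue kept by the previous domain
            resolved[-1] = (prev_start, cut)
            start = cut + 1
        resolved.append((start, end))
    return resolved


print(compatible((1, 105), (96, 200)))          # True (overlap of 10 residues)
print(resolve_overlaps([(1, 105), (96, 200)]))  # [(1, 100), (101, 200)]
```

A tolerance of this kind would let two true neighbouring domains both survive the clique search even when their predicted boundaries overlap slightly, with the boundaries tidied afterwards.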
The new approach will be used to determine multi-domain architectures (MDAs) for UniProt and Ensembl, and will be made available via the Gene3D website: http://gene3d.biochem.ucl.ac.uk/Gene3D/. The software has been implemented in C++ and compiled for Linux; source code and binaries can be found at: ftp://ftp.biochem.ucl.ac.uk/pub/gene3d_data/DomainFinder3/
Supplementary data are available at Bioinformatics online.