Data augmentation algorithms for detecting conserved domains in protein sequences: a comparative study.

Author information

Bi Chengpeng

Affiliation

Bioinformatics and Intelligent Computing Lab, Children's Mercy Hospitals and Clinics, Schools of Medicine, Computing and Engineering, University of Missouri, Kansas City, Missouri 64108, USA.

Publication information

J Proteome Res. 2008 Jan;7(1):192-201. doi: 10.1021/pr070475q. Epub 2007 Dec 15.

DOI:10.1021/pr070475q

PMID:18081244

Abstract

Protein conserved domains are distinct units of molecular structure, usually associated with particular aspects of molecular function such as catalysis or binding. These conserved subsequences are often unobserved and therefore need to be detected. Given a set of sequences, motif discovery methods can be used to find these unobserved domains. This paper presents a data augmentation (DA) framework that unifies a suite of motif-finding algorithms, all of which maximize the same likelihood function by imputing the unobserved data. Data augmentation refers to methods that formulate iterative optimization by exploiting the unobserved data. Two categories of maximum-likelihood-based motif-finding algorithms are illustrated under the DA framework. The first comprises deterministic algorithms, which maximize the likelihood function by iteratively performing an optimal local search in the alignment space. The second comprises stochastic algorithms, which iteratively draw motif location samples via Monte Carlo simulation while keeping track of the best-likelihood solution found so far. Four DA motif discovery algorithms are described, evaluated, and compared by aligning real and simulated protein sequences.
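
To make the two categories concrete, the following is a minimal Python sketch, not the paper's own implementation. The abstract does not name its four algorithms, so everything here is an assumption for illustration: each sequence is taken to contain exactly one motif site of known width, the deterministic variant is a hard-assignment, EM-style local search, the stochastic variant is a Gibbs-style site sampler, and the score is a log-odds ratio against a background model with pseudocounts. All function and parameter names are hypothetical.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumes sequences use only the 20 standard residues

def background(seqs, pseudo=1.0):
    """Residue frequencies over all sequences, smoothed with pseudocounts."""
    counts = {a: pseudo for a in ALPHABET}
    for seq in seqs:
        for ch in seq:
            counts[ch] += 1.0
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def build_profile(seqs, starts, width, pseudo=1.0):
    """Per-column residue frequencies estimated from the imputed motif sites."""
    cols = [{a: pseudo for a in ALPHABET} for _ in range(width)]
    for seq, s in zip(seqs, starts):
        for j in range(width):
            cols[j][seq[s + j]] += 1.0
    return [{a: c / sum(col.values()) for a, c in col.items()} for col in cols]

def window_score(seq, i, profile, bg):
    """Log-odds (motif vs. background) of the window of seq starting at i."""
    return sum(math.log(col[seq[i + j]] / bg[seq[i + j]])
               for j, col in enumerate(profile))

def log_likelihood(seqs, starts, width, bg):
    """Log-likelihood ratio of an alignment (one site per sequence)."""
    profile = build_profile(seqs, starts, width)
    return sum(window_score(seq, s, profile, bg) for seq, s in zip(seqs, starts))

def deterministic_search(seqs, width, starts, bg, max_iter=50):
    """Hard-assignment local search: re-estimate the profile, then move each
    site to its best-scoring window, until the alignment stops changing."""
    starts = list(starts)
    for _ in range(max_iter):
        profile = build_profile(seqs, starts, width)
        new = [max(range(len(seq) - width + 1),
                   key=lambda i: window_score(seq, i, profile, bg))
               for seq in seqs]
        if new == starts:
            break
        starts = new
    return starts

def stochastic_search(seqs, width, starts, bg, sweeps=200, rng=None):
    """Gibbs-style site sampler: resample one site at a time from its
    conditional distribution and remember the best alignment seen."""
    rng = rng or random.Random(0)
    starts = list(starts)
    best, best_ll = list(starts), log_likelihood(seqs, starts, width, bg)
    for _ in range(sweeps):
        for k, seq in enumerate(seqs):
            # Profile built from all sequences except the one being resampled.
            profile = build_profile(seqs[:k] + seqs[k + 1:],
                                    starts[:k] + starts[k + 1:], width)
            scores = [window_score(seq, i, profile, bg)
                      for i in range(len(seq) - width + 1)]
            top = max(scores)  # subtract the max before exp for stability
            weights = [math.exp(x - top) for x in scores]
            starts[k] = rng.choices(range(len(scores)), weights=weights)[0]
        ll = log_likelihood(seqs, starts, width, bg)
        if ll > best_ll:
            best, best_ll = list(starts), ll
    return best, best_ll
```

Both functions take `seqs` as a list of strings and `starts` as an initial guess of one start position per sequence. The deterministic search can stall in a local optimum that depends on that guess, whereas the sampler can escape such optima at the cost of more iterations, which is one plausible reason to compare the two families side by side.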