Frishman D, Argos P
European Molecular Biology Laboratory, Heidelberg, Germany.
J Mol Biol. 1992 Dec 5;228(3):951-62. doi: 10.1016/0022-2836(92)90877-m.
A sensitive technique for protein sequence motif recognition based on neural networks has been developed. It involves three major steps. (1) At each appropriate position of a multiple alignment of N matched sequences, a set of N aligned oligopeptides is defined using a preselected window length. N neural nets are then trained in succession, each on the N-1 oligopeptides that remain after the ith oligopeptide has been removed, and each net is tested for its ability to recognize the omitted ith span. The average neural net recognition over the N trials is used as a measure of conservation for that windowed region of the multiple alignment. This process is repeated for all possible spans of the given length in the multiple alignment. (2) The M most conserved regions are regarded as motifs, and the oligopeptides within each are used to train M individual neural networks intensively. (3) The M networks are then applied in a search for related primary structures in a databank of known protein sequences. For each of the M networks, the oligopeptide spans in the database sequence that give the strongest neural net output are saved and then scored according to the output signals and to how well their combination follows the expected N- to C-terminal sequence order. The motifs from the database with the highest similarity scores can then be used to retrain the M neural nets, which can be used for further searches in the databank, thus providing even greater sensitivity in recognizing distantly related members of a protein family. This technique was successfully applied to the integrase, DNA-polymerase and immunoglobulin families.
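The leave-one-out conservation scan of step (1) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the network architecture or training details, so the neural net is replaced here by a simple column-frequency scorer (`profile_score`) that stands in for "train on N-1 oligopeptides, then test on the omitted one". All function names, the scorer itself, and the toy data in the usage note are assumptions made for illustration. Any callable with the same signature, including a real trained network, could be passed in as `scorer`.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# A scorer takes the N-1 training oligopeptides and the held-out one,
# and returns a recognition value (higher means better recognized).
Scorer = Callable[[Sequence[str], str], float]


def profile_score(train: Sequence[str], held_out: str) -> float:
    """Hypothetical stand-in for a trained neural net (NOT the paper's method).

    Returns the mean, over window columns, of the fraction of training
    oligopeptides that carry the same residue as the held-out oligopeptide.
    """
    total = 0.0
    for col, residue in enumerate(held_out):
        column_counts = Counter(peptide[col] for peptide in train)
        total += column_counts[residue] / len(train)
    return total / len(held_out)


def window_conservation(
    alignment: Sequence[str],
    start: int,
    width: int,
    scorer: Scorer = profile_score,
) -> float:
    """Average leave-one-out recognition for one window of the alignment."""
    peptides = [seq[start:start + width] for seq in alignment]
    scores = []
    for i, held_out in enumerate(peptides):
        # "Train" on the N-1 oligopeptides left after removing the ith one.
        train = peptides[:i] + peptides[i + 1:]
        scores.append(scorer(train, held_out))
    return sum(scores) / len(scores)


def most_conserved_windows(
    alignment: Sequence[str],
    width: int,
    m: int,
    scorer: Scorer = profile_score,
) -> List[Tuple[int, float]]:
    """Return (start, conservation) for the m best-scoring windows.

    Simplifications in this sketch: every window position is scanned
    (no gap handling), and overlapping windows are not filtered out.
    """
    n_columns = len(alignment[0])
    scored = [
        (start, window_conservation(alignment, start, width, scorer))
        for start in range(n_columns - width + 1)
    ]
    # Highest conservation first; ties broken by leftmost start.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:m]
```

For example, calling `most_conserved_windows(["ACDEFG", "ACDKLG", "ACDMNG"], width=3, m=2)` on this made-up three-sequence alignment ranks the fully conserved window starting at column 0 first, followed by the window starting at column 1. Steps (2) and (3) would then train one network per selected window and scan a sequence databank with it.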