Gilks Walter R, Audit Benjamin, de Angelis Daniela, Tsoka Sophia, Ouzounis Christos A
Medical Research Council Biostatistics Unit, Institute of Public Health, University Forvie Site, Robinson Way, Cambridge CB2 2SR, UK.
Math Biosci. 2005 Feb;193(2):223-34. doi: 10.1016/j.mbs.2004.08.001.
Databases of protein sequences have grown rapidly in recent years as a result of genome sequencing projects. Annotating protein sequences with descriptions of their biological function ideally requires careful experimentation, but this work lags far behind. Instead, biological function is often imputed by copying annotations from similar protein sequences. This gives rise to annotation errors and, more seriously, to chains of misannotation. An earlier paper, "Percolation of annotation errors in a database of protein sequences" (2002), developed a probabilistic framework for exploring the consequences of this percolation of errors through protein databases and applied the theory to a simple database model. Here we apply the theory to hierarchically structured protein sequence databases, and draw conclusions about database quality at different levels of the hierarchy.