Saunders Rebecca E, Perkins Stephen J
Department of Biochemistry and Molecular Biology, University College London, London, United Kingdom.
Hum Mutat. 2008 Mar;29(3):333-44. doi: 10.1002/humu.20629.
Central repositories of mutations that combine structural, sequence, and phenotypic information in related proteins will facilitate the diagnosis and molecular understanding of diseases associated with them. Coagulation involves the sequential activation of serine proteases and regulators to yield stable blood clots while maintaining hemostasis. Five coagulation serine proteases, namely factor VII (F7), factor IX (F9), factor X (F10), protein C (PROC), and thrombin (F2), exhibit high sequence similarity, and all require vitamin K. All five were incorporated into an interactive database of mutations named CoagMDB (http://www.coagMDB.org; last accessed: 9 August 2007). The large number of mutations involved (especially for factor IX) and the increasing problem of out-of-date databases required the development of new database management tools. A text mining tool automatically scans full-length references to identify and extract mutations. Recall rates of 96 to 99% and precision rates of 87 to 93% were achieved. Text mining significantly reduces the time and expertise required to maintain the databases and offers a solution to the problem of locus-specific database management and upkeep. A total of 875 mutations were extracted from 1,279 literature sources. Of these, 116 correspond to Gla domains, 86 to the N-terminal EGF domain, 73 to the C-terminal EGF domain, and 477 to the serine protease domain. The combination of text mining and consensus domain structures enables mutations to be correlated with experimentally measurable phenotypes, classified as either low protein levels (Type I) or reduced functional activities (Type II). A tendency for the conservation of phenotype with structural location was identified.
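The abstract does not describe how the text mining tool works internally. As a rough illustration of the general idea only, the sketch below pulls amino acid substitutions out of free text with regular expressions and scores the result against a hand-curated set using precision and recall. The patterns, the function names `extract_mutations` and `precision_recall`, and the example sentence are all assumptions made for illustration; they are not taken from CoagMDB.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

_AA3 = "|".join(THREE_TO_ONE)
_AA1 = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical patterns: "Arg145His" / "p.Arg145His" and "R145H".
THREE_LETTER = re.compile(rf"\b(?:p\.)?({_AA3})(\d+)({_AA3})\b")
ONE_LETTER = re.compile(rf"\b([{_AA1}])(\d+)([{_AA1}])\b")


def extract_mutations(text):
    """Return the set of substitutions found in text, normalized to one-letter form."""
    found = set()
    for wild, pos, mutant in THREE_LETTER.findall(text):
        found.add(f"{THREE_TO_ONE[wild]}{pos}{THREE_TO_ONE[mutant]}")
    for wild, pos, mutant in ONE_LETTER.findall(text):
        found.add(f"{wild}{pos}{mutant}")
    return found


def precision_recall(predicted, gold):
    """Precision = TP / predicted, recall = TP / gold (0.0 when a set is empty)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    sentence = ("The Arg145His and G60S substitutions were found; "
                "p.Cys222Tyr was novel.")
    print(sorted(extract_mutations(sentence)))
```

A simple pattern like the one-letter form will also match tokens that are not mutations (cell line names such as T47D, for example), which is one reason precision tends to fall below recall in this kind of extraction.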