Andrade M A, Valencia A
Protein Design Group, CNB-CSIC, Cantoblanco, E-28049 Madrid, Spain.
Bioinformatics. 1998;14(7):600-7. doi: 10.1093/bioinformatics/14.7.600.
Annotation of the biological function of different protein sequences is a time-consuming process currently performed by human experts. Genome analysis tools encounter great difficulty in performing this task. Database curators, developers of genome analysis tools and biologists in general could benefit from access to tools able to suggest functional annotations and facilitate access to functional information.
We present here the first prototype of a system for the automatic annotation of protein function. The system is triggered by collections of abstracts related to a given protein, and it is able to extract biological information directly from scientific literature, i.e. MEDLINE abstracts. Relevant keywords are selected by their relative accumulation in comparison with a domain-specific background distribution. Simultaneously, the most representative sentences and MEDLINE abstracts are selected and presented to the end-user. Evolutionary information is considered a predominant characteristic in the domain of protein function. Our system consequently extracts domain-specific information from the analysis of a set of protein families.
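The keyword selection described above can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so the z-score used here is an assumption chosen for illustration: a word's frequency in the abstracts of the target family is compared with the mean and standard deviation of its frequency across a set of background protein families. All function names, the tokenizer, the `min_std` floor and the `threshold` parameter are hypothetical and are not taken from the published system.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word-like tokens (two or more characters)."""
    return re.findall(r"[a-z][a-z0-9-]+", text.lower())


def word_frequencies(abstracts):
    """Fraction of abstracts in which each word occurs at least once."""
    counts = Counter()
    for abstract in abstracts:
        counts.update(set(tokenize(abstract)))
    return {word: n / len(abstracts) for word, n in counts.items()}


def keyword_scores(target_abstracts, background_families, min_std=0.05):
    """Score each word of the target family against the background families.

    Assumed scoring (not stated in the source): a z-score of the word's
    frequency in the target family relative to its frequency distribution
    over the background families. min_std is a hypothetical floor that
    avoids division by zero for words with no background variation.
    """
    background = [word_frequencies(family) for family in background_families]
    target = word_frequencies(target_abstracts)
    scores = {}
    for word, freq in target.items():
        values = [family.get(word, 0.0) for family in background]
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        scores[word] = (freq - mean) / max(math.sqrt(variance), min_std)
    return scores


def rank_sentences(abstracts, scores, threshold=1.0):
    """Rank sentences by the summed score of their high-scoring words,
    normalised by sentence length (again an illustrative choice)."""
    ranked = []
    for abstract in abstracts:
        for sentence in re.split(r"(?<=[.!?])\s+", abstract):
            words = tokenize(sentence)
            if words:
                total = sum(scores.get(w, 0.0) for w in words
                            if scores.get(w, 0.0) >= threshold)
                ranked.append((total / len(words), sentence))
    return sorted(ranked, reverse=True)
```

Under these assumptions, a word that is frequent in the target family but rare across the background families receives a high score, while words common to every family (for example "the" or "protein") score near or below zero and are not selected as keywords.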
The system has been tested with different protein families, of which three examples are discussed in detail here: 'ataxia-telangiectasia associated protein', 'ran GTPase' and 'carbonic anhydrase'. We found generally good correlation between the amount of information provided to the system and the quality of the annotations. Finally, the current limitations and future developments of the system are discussed.
The current system is a prototype. As such, it can be accessed as a server at http://columba.ebi.ac.uk:8765/andrade/abx. The system accepts text related to the protein or proteins to be evaluated (optimally, the result of a MEDLINE search by keyword), and the results are returned in the form of Web pages for keywords, sentences and abstracts.
Web pages containing full information on the examples mentioned in the text are available at: http://www.cnb.uam.es/~cnbprot/keywords/