Shatkay Hagit, Brady Scott, Wong Andrew
Dept. of Computer and Information Sciences, University of Delaware, Newark, DE 19716, USA; Delaware Biotechnology Institute, University of Delaware, Newark, DE 19711, USA; Computational Biology and Machine Learning Lab, School of Computing, Queen's University, Kingston, ON K7L 3N6, Canada.
School of Medicine, University of Toronto, Toronto, ON M5S 1A8, Canada; Computational Biology and Machine Learning Lab, School of Computing, Queen's University, Kingston, ON K7L 3N6, Canada.
Methods. 2015 Mar;74:54-64. doi: 10.1016/j.ymeth.2014.10.027. Epub 2014 Nov 15.
The current era of large-scale biology is characterized by fast-paced growth in the number of sequenced genomes and, consequently, by a multitude of identified proteins whose function has yet to be determined. Simultaneously, any known or postulated information concerning genes and proteins is part of the ever-growing published scientific literature, which is expanding at a rate of over a million new publications per year. Computational tools that attempt to automatically predict and annotate protein characteristics, such as function and localization patterns, are being developed along with systems that aim to support the process via text mining. Most work on protein characterization focuses on features derived directly from protein sequence data. Protein-related work that does aim to utilize the literature typically concentrates on extracting specific facts (e.g., protein interactions) from text. In the past few years, we have taken a different route, treating the literature as a source of text-based features, which can be employed just as sequence-based protein features were used in earlier work, for predicting protein subcellular location and possibly also function. We discuss here in detail the overall approach, along with results from work we have done in this area demonstrating the value of this method and its potential use.