Gonzalez Graciela, Uribe Juan C, Tari Luis, Brophy Colleen, Baral Chitta
Department of Biomedical Informatics, Ira A. Fulton School of Engineering, Arizona State University, Tempe, Arizona 85281, USA.
Pac Symp Biocomput. 2007:28-39.
The promise of disease-related discoveries and advances in the post-genome era has yet to be fully realized, with many opportunities for discovery hidden in the millions of biomedical papers published since. Public databases give access to data extracted from the literature by teams of experts, but their coverage is often limited and lags behind recent discoveries. We present a computational method that combines data extracted from the literature with data from curated sources to uncover possible gene-disease relationships that are not directly stated or were missed by the initial mining.
An initial set of genes and proteins is obtained from gene-disease relationships extracted from PubMed abstracts using natural language processing. Interactions involving the corresponding proteins are similarly extracted and integrated with interactions from curated databases (such as BIND and DIP), and each interaction is assigned a confidence measure that depends on its source. The augmented list of genes and gene products is then ranked by combining two scores: one that reflects the strength of the relationship with the initial set of genes and incorporates user-defined weights, and another that reflects the importance of the gene in maintaining the connectivity of the network. We applied the method to atherosclerosis to assess its effectiveness.
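The two-score ranking described above can be illustrated with a minimal sketch. The abstract does not give the scoring formulas, so everything specific here is an assumption: the confidence values in `SOURCE_CONFIDENCE`, the use of summed edge confidence to seed genes as the relationship score, the number of extra components created by removing a node as the connectivity score, the mixing weight `alpha`, and the placeholder gene names.

```python
from collections import defaultdict

# Hypothetical confidence per interaction source (not taken from the paper).
SOURCE_CONFIDENCE = {"curated": 1.0, "text_mined": 0.6}


def build_graph(interactions):
    """Return an adjacency map: node -> {neighbor: confidence}."""
    graph = defaultdict(dict)
    for a, b, source in interactions:
        conf = SOURCE_CONFIDENCE[source]
        # Keep the highest confidence when several sources report one pair.
        best = max(conf, graph[a].get(b, 0.0))
        graph[a][b] = best
        graph[b][a] = best
    return dict(graph)


def count_components(graph, skip=None):
    """Count connected components, optionally pretending `skip` is removed."""
    seen = {skip} if skip is not None else set()
    count = 0
    for start in graph:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
    return count


def relation_score(graph, node, seeds):
    """Assumed score 1: summed confidence of edges linking node to seed genes."""
    return sum(c for nb, c in graph[node].items() if nb in seeds)


def connectivity_score(graph, node):
    """Assumed score 2: extra components that appear when node is removed."""
    return max(0, count_components(graph, skip=node) - count_components(graph))


def rank(graph, seeds, alpha=0.5):
    """Rank every node by a weighted mix of the two scores (alpha is assumed)."""
    scored = [
        (n, alpha * relation_score(graph, n, seeds)
            + (1 - alpha) * connectivity_score(graph, n))
        for n in graph
    ]
    # Highest score first; ties broken by name so the order is deterministic.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Placeholder data: two seed genes and three candidates.
interactions = [
    ("SEED_A", "CAND_X", "curated"),
    ("SEED_B", "CAND_X", "text_mined"),
    ("SEED_A", "CAND_Y", "text_mined"),
    ("CAND_Y", "CAND_Z", "curated"),
]
graph = build_graph(interactions)
ranking = rank(graph, seeds={"SEED_A", "SEED_B"})
```

In this toy network `CAND_X` ranks first because it touches both seeds and its removal disconnects `SEED_B`. A real implementation would likely use a more refined connectivity measure; the component count is only a stand-in for "importance in maintaining connectivity".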
Top-ranked proteins from the method are related to atherosclerosis with accuracy between 0.85 and 1.00 for the top 20 and between 0.64 and 0.80 for the top 90 if duplicates are ignored, with 45% of the top 20 and 75% of the top 90 derived by the method rather than extracted from text. Thus, although the initial gene set and interactions were automatically extracted from text (and are subject to the imprecision of automatic extraction), their use for further hypothesis generation is valuable given adequate computational analysis.
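A top-k accuracy of this kind can be computed as in the short sketch below. It assumes that "ignoring duplicates" means counting each protein name once at its first appearance in the ranked list; the abstract does not define this, and the function name and sample data are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the first k unique names in `ranked` that are in `relevant`."""
    # dict.fromkeys drops repeats while keeping first-occurrence order.
    unique = list(dict.fromkeys(ranked))
    top = unique[:k]
    if not top:
        return 0.0
    return sum(name in relevant for name in top) / len(top)


# Placeholder ranking with one repeated entry and a hand-labeled relevant set.
ranked = ["P1", "P2", "P1", "P3", "P4"]
relevant = {"P1", "P3"}
score = precision_at_k(ranked, relevant, k=4)
```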