Kolárik Corinna, Klinger Roman, Hofmann-Apitius Martin
Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, D-53754 Sankt Augustin, Germany.
BMC Bioinformatics. 2009 Jan 30;10 Suppl 1(Suppl 1):S28. doi: 10.1186/1471-2105-10-S1-S28.
Posttranslational modifications of histones influence chromatin structure and thereby take part in the regulation of gene expression. Certain histone modification patterns, distributed over the genome, are connected to cell and tissue differentiation and to the adaptation of organisms to their environment. Abnormal changes, in contrast, influence the development of disease states such as cancer. The mechanisms that regulate histone modification, and the functions of these modifications, are the subject of epigenomics research and are still not completely understood. The scientific literature is a rich source of knowledge on epigenomics, and on histone modifications in particular: it contains information about experimental studies, the conditions used, and the results. To our knowledge, no approach for identifying histone modifications in text has been published so far.
We have developed an approach that identifies histone modifications in biomedical literature with Conditional Random Fields (CRF) and resolves the recognized histone modification term variants by term standardization. For term identification, we obtained an F1 measure of 0.84 in 10-fold cross-validation on the training corpus and 0.81 on an independent test corpus. Standardization correctly transformed 96% of the terms from the training corpus and 98% of the terms from the test corpus. Because no existing terminology exhaustively covers specific histone modification types, we developed a histone modification term hierarchy for use in a semantic text retrieval system.
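To illustrate what resolving term variants by standardization can mean in practice, the following is a minimal rule-based sketch. It is not the authors' method: the abstract does not describe the standardization rules, so the compact target form (for example `H3K4me3`), the regular expressions, the lookup tables, and the function name `standardize` are all assumptions made for illustration.

```python
import re

# Hypothetical lookup tables (illustration only, not taken from the paper).
RESIDUES = {"lysine": "K", "arginine": "R", "serine": "S", "threonine": "T"}
STEMS = {"methyl": "me", "acetyl": "ac", "phosphoryl": "ph", "ubiquitin": "ub"}
COUNTS = {"mono": "1", "di": "2", "tri": "3"}

# Histone type, e.g. "H3" or "H2A".
HISTONE = re.compile(r"\bH(1|2A|2B|3|4)", re.I)
# Modified residue and position, e.g. "K4" or "lysine 4".
# One-letter codes are matched case-sensitively to limit false hits.
SITE = re.compile(r"((?i:lysine|arginine|serine|threonine)|[KRST])[\s-]?(\d+)")
# Modification: long form ("trimethylation") or a short code after a digit ("me3").
MOD = re.compile(
    r"(mono|di|tri)?[\s-]?(methyl|acetyl|phosphoryl|ubiquitin)"
    r"|(?<=\d)[\s-]?(me[123]?|ac|ph|ub)\b",
    re.I,
)


def standardize(term):
    """Map a histone modification term variant to a compact canonical form.

    Returns None when histone, site, or modification cannot be found.
    """
    histone = HISTONE.search(term)
    site = SITE.search(term)
    mod = MOD.search(term)
    if not (histone and site and mod):
        return None

    residue = site.group(1)
    if len(residue) > 1:  # long residue name such as "lysine"
        residue = RESIDUES[residue.lower()]

    if mod.group(3):  # short code such as "me3"
        code = mod.group(3).lower()
    else:  # long form such as "trimethylation"
        prefix = (mod.group(1) or "").lower()
        code = STEMS[mod.group(2).lower()] + COUNTS.get(prefix, "")

    return f"H{histone.group(1).upper()}{residue}{site.group(2)}{code}"


if __name__ == "__main__":
    for variant in [
        "H3K4me3",
        "trimethylation of histone H3 at lysine 4",
        "histone H3 lysine 9 acetylation",
    ]:
        print(variant, "->", standardize(variant))
```

Extracting the three components independently keeps the sketch short and tolerant of word order, at the cost of precision on longer sentences. A real system would apply such rules only to spans already recognized as histone modification terms.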
The developed approach substantially improves the retrieval of articles describing histone modifications. Since text contains context information about the studies and experiments performed, identifying histone modifications is a basis for literature-based knowledge discovery and hypothesis generation that can accelerate epigenomic research.