School of Informatics, Indiana University Purdue University Indianapolis, Indianapolis, IN 46202, USA.
BMC Cancer. 2012 Aug 1;12:331. doi: 10.1186/1471-2407-12-331.
Biological entities do not act in isolation; often, it is the nature and degree of the interactions among numerous biological entities that ultimately determine an outcome. Hence, experimental data on any single biological entity can be of limited value when considered alone. To address this, we propose that augmenting individual entity data with the literature will not only better define the entity's own significance but also uncover relationships with novel biological entities. To test this notion, we developed a comprehensive text mining and computational methodology that focused on discovering new targets of one class of molecular entities, transcription factors (TFs), within one particular disease, colorectal cancer (CRC).
We used 39 molecular entities known to be associated with CRC, along with six colorectal cancer terms, as the bait list (the list of search terms) for mining the biomedical literature to identify CRC-specific genes and proteins. Using the literature-mined data, we constructed a global TF interaction network for CRC. We then developed a multi-level, multi-parametric methodology to identify TFs associated with CRC.
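As a rough illustration of the network step described above, the following minimal sketch builds an undirected interaction network from entity pairs and ranks nodes by one simple topological feature, degree centrality. It is not the authors' pipeline: the function names, the choice of degree centrality, and the placeholder entities (`TF_A`, `GENE_1`, and so on) are assumptions made purely for illustration, and the pairs are toy data rather than literature-mined results.

```python
from collections import defaultdict


def build_network(pairs):
    """Build an undirected adjacency map from (entity_a, entity_b) pairs."""
    adjacency = defaultdict(set)
    for a, b in pairs:
        if a == b:
            continue  # skip self-pairs; they carry no interaction information
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


def degree_centrality(adjacency):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(neighbors) / (n - 1) for node, neighbors in adjacency.items()}


def rank_nodes(adjacency):
    """Sort nodes by descending centrality, breaking ties alphabetically."""
    scores = degree_centrality(adjacency)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy, hypothetical pairs (placeholders, NOT results from the study).
toy_pairs = [
    ("TF_A", "GENE_1"),
    ("TF_A", "GENE_2"),
    ("TF_A", "TF_B"),
    ("TF_B", "GENE_2"),
]

network = build_network(toy_pairs)
ranking = rank_nodes(network)

for node, score in ranking:
    print(f"{node}\t{score:.2f}")
```

In a multi-parametric setting, a topological score of this kind would be only one input; it would be combined with functional annotations before any entity is prioritized.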
The small bait list, when augmented with literature-mined data, identified a large number of biological entities associated with CRC. The relative importance of these TFs and their associated modules was determined using functional and topological features. Additional validation of the highly ranked TFs against the literature strengthened our findings. The novel TFs that we identified included SLUG, RUNX1, IRF1, HIF1A, ATF-2, ABL1, ELK-1 and GATA-1. Some of these TFs are associated with functional modules in known pathways of CRC, including the Beta-catenin/development, immune response, transcription, and DNA damage pathways.
Our methodology, which combines text mining data with a multi-level, multi-parameter scoring technique, identified both known and novel TFs that have roles in CRC. Starting with just one TF (SMAD3) in the bait list, the literature mining process identified an additional 116 CRC-associated TFs. Our network-based analysis showed that these TFs all belonged to one of 13 major functional groups that are known to play important roles in CRC. Among the identified TFs, we obtained a novel six-node module consisting of ATF2-P53-JNK1-ELK1-EPHB2-HIF1A, within which the novel JNK1-ELK1 association could potentially be a significant marker for CRC.