Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio 43210, USA.
J Am Med Inform Assoc. 2011 Dec;18 Suppl 1(Suppl 1):i125-31. doi: 10.1136/amiajnl-2011-000434. Epub 2011 Oct 7.
Investigational studies that involve large-scale data sets present significant challenges for the discovery and testing of novel hypotheses capable of supporting in silico discovery science. Conceptual Knowledge Discovery in Databases (CKDD) methods offer a potential means of scaling hypothesis discovery and testing to large data sets. Such methods enable the high-throughput generation and evaluation of knowledge-anchored relationships between complexes of variables found in targeted data sets.
The authors conducted a multipart model formulation and validation process, focusing on the development of a methodological and technical approach to using CKDD to support hypothesis discovery for in silico science. The resulting model is the Translational Ontology-anchored Knowledge Discovery Engine (TOKEn). It uses a specific CKDD approach, Constructive Induction, to identify and prioritize potential hypotheses about meaningful semantic relationships between variables found in large-scale, heterogeneous biomedical data sets.
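As a rough illustration of what ontology-anchored hypothesis generation can look like, the sketch below maps data set variables to concepts in a small concept graph, links pairs of variables whose concepts are connected, and ranks the pairs by path length. It is a toy stand-in built on breadth-first search, not TOKEn's actual algorithm, which the abstract does not specify. The concept names, variable names, edges, the `max_hops` cutoff, and the shortest-path ranking rule are all assumptions invented for this example.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy ontology: undirected links between concepts (invented).
ONTOLOGY_EDGES = [
    ("ZAP-70 expression", "Protein expression"),
    ("Protein expression", "Biomarker"),
    ("Biomarker", "Disease prognosis"),
    ("Time to first treatment", "Disease prognosis"),
    ("Patient age", "Demographic attribute"),
]

# Hypothetical mapping from data set variables to ontology concepts (invented).
VARIABLE_TO_CONCEPT = {
    "zap70_pct": "ZAP-70 expression",
    "ttft_months": "Time to first treatment",
    "age_years": "Patient age",
}


def build_graph(edges):
    """Build an undirected adjacency map from a list of concept pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of concepts on a shortest
    path from start to goal, or None if the two are not connected."""
    if start == goal:
        return [start]
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt in seen:
                continue
            if nxt == goal:
                return path + [nxt]
            seen.add(nxt)
            queue.append(path + [nxt])
    return None


def candidate_hypotheses(var_map, edges, max_hops=4):
    """Return (hops, var1, var2, path) tuples for variable pairs whose
    concepts are linked within max_hops, shortest links first."""
    graph = build_graph(edges)
    results = []
    for v1, v2 in combinations(sorted(var_map), 2):
        path = shortest_path(graph, var_map[v1], var_map[v2])
        if path is not None and len(path) - 1 <= max_hops:
            results.append((len(path) - 1, v1, v2, path))
    results.sort(key=lambda r: r[:3])
    return results


if __name__ == "__main__":
    for hops, v1, v2, path in candidate_hypotheses(
        VARIABLE_TO_CONCEPT, ONTOLOGY_EDGES
    ):
        print(f"{v1} <-> {v2} ({hops} hops): " + " -> ".join(path))
```

In this toy graph only the pair `ttft_months` and `zap70_pct` yields a candidate, because `age_years` maps to a concept with no link to the others. A real system would presumably draw on curated ontologies and a richer prioritization scheme than path length alone.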
The authors verified and validated TOKEn in the context of a translational research data repository maintained by the NCI-funded Chronic Lymphocytic Leukemia Research Consortium. These studies showed that TOKEn is: (1) computationally tractable; and (2) able to generate valid and potentially useful hypotheses concerning relationships between phenotypic and biomolecular variables in that data collection.
The TOKEn model represents a potentially useful and systematic approach to knowledge synthesis for in silico discovery science in the context of large-scale and multidimensional research data sets.