Computational Biology and Bioinformatics, Pacific Northwest National Laboratory, Richland, WA 99352, USA.
Dis Markers. 2013;35(5):513-23. doi: 10.1155/2013/613529. Epub 2013 Oct 10.
The availability of large, complex data sets generated by high-throughput technologies has enabled the recent proliferation of disease biomarker studies. However, a recurring problem in deriving biological information from large data sets is how best to incorporate expert knowledge into the biomarker selection process.
The objective was to develop a generalizable framework that can incorporate expert knowledge into data-driven processes in a semiautomated way while providing a metric for optimization in a biomarker selection scheme.
The framework was implemented as a pipeline consisting of five components for the identification of signatures from integrated clustering (ISIC). Expert knowledge was integrated into the biomarker identification process by combining two distinct approaches: a distance-based clustering approach and an expert knowledge-driven functional selection.
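As a rough illustration of how the two approaches might be combined, the sketch below groups features by pairwise distance and then keeps one expert-relevant feature per group. It is not the ISIC implementation: the abstract does not describe the five components, so the single-linkage rule, the Euclidean distance, the threshold, the alphabetical tie-break, and every name and value in the example data (`cluster_by_distance`, `select_candidates`, `P1` to `P5`, the annotation terms) are assumptions made purely for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def cluster_by_distance(profiles, threshold):
    """Group features whose profiles lie closer than `threshold`.

    Assumed rule: single linkage, implemented with a small union-find.
    """
    names = list(profiles)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if dist(profiles[a], profiles[b]) < threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())


def select_candidates(profiles, annotations, expert_terms, threshold):
    """Keep one expert-relevant feature per distance-based cluster.

    Clusters with no feature annotated by an expert term are dropped.
    The alphabetical tie-break is arbitrary and only for illustration.
    """
    selected = []
    for cluster in cluster_by_distance(profiles, threshold):
        relevant = [n for n in cluster if annotations.get(n, set()) & expert_terms]
        if relevant:
            selected.append(sorted(relevant)[0])
    return sorted(selected)


# Hypothetical abundance profiles (one value per sample) and annotations.
profiles = {
    "P1": [1.0, 2.0, 3.0],
    "P2": [1.1, 2.1, 3.1],
    "P3": [5.0, 1.0, 0.5],
    "P4": [5.2, 0.9, 0.4],
    "P5": [9.0, 9.0, 9.0],
}
annotations = {
    "P1": {"inflammation"},
    "P2": {"inflammation"},
    "P3": {"metabolism"},
    "P4": {"oxidative stress"},
    "P5": {"inflammation"},
}
expert_terms = {"inflammation", "oxidative stress"}

candidates = select_candidates(profiles, annotations, expert_terms, threshold=0.5)
print(candidates)  # ['P1', 'P4', 'P5']
```

Selecting a single representative per cluster is one simple way to favor mutually dissimilar (orthogonal) candidates, while the annotation filter stands in for the expert knowledge-driven functional selection.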
The utility of the developed framework ISIC was demonstrated on proteomics data from a study of chronic obstructive pulmonary disease (COPD). Biomarker candidates were identified in a mouse model using ISIC and validated in a study of a human cohort.
Expert knowledge can be introduced into a biomarker discovery process in different ways to enhance the robustness of selected marker candidates. Developing strategies for extracting orthogonal and robust features from large data sets increases the chances of success in biomarker identification.