Hainke Katrin, Szugat Sebastian, Fried Roland, Rahnenführer Jörg
Department of Statistics, TU Dortmund University, Dortmund, 44221, Germany.
BMC Bioinformatics. 2017 Aug 1;18(1):358. doi: 10.1186/s12859-017-1762-1.
Disease progression models are important for understanding the critical steps in the development of diseases. The models are embedded in a statistical framework to account for random variation arising from biology and from the sampling process, since only a finite population is observed. Conditional probabilities describe the dependencies between the events that characterise the critical steps of the disease process. Many model classes have been proposed in the literature, ranging from simple path models to complex Bayesian networks. Oncogenetic trees are a popular model class that is easy to understand yet flexible. They have been applied to describe the accumulation of genetic aberrations in cancer and HIV data. However, the number of potentially relevant aberrations is often far larger than the maximum number of events that can be used to estimate a progression model reliably. Despite this, only a few approaches to variable selection exist, and they have not yet been investigated in detail.
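To illustrate how conditional probabilities encode dependencies between events, the following is a minimal sketch of an oncogenetic tree, assuming the usual formulation in which an event can occur only if its parent event has occurred. The tree `TREE`, its event names and its probabilities are hypothetical and are not taken from the study.

```python
import random

# Hypothetical tree: event -> (parent, P(event | parent has occurred)).
# "root" stands for the event-free starting state.
TREE = {
    "A": ("root", 0.8),
    "B": ("A", 0.5),
    "C": ("A", 0.3),
    "D": ("root", 0.2),
}


def sample_events(tree, rng):
    """Draw one observation: the set of events that occurred."""
    children = {}
    for child, (parent, prob) in tree.items():
        children.setdefault(parent, []).append((child, prob))
    occurred = set()
    stack = ["root"]
    while stack:
        node = stack.pop()
        # A child can only occur once its parent has occurred.
        for child, prob in children.get(node, []):
            if rng.random() < prob:
                occurred.add(child)
                stack.append(child)
    return occurred


def pattern_probability(tree, pattern):
    """Probability of observing exactly the events in `pattern`."""
    pattern = set(pattern)
    prob = 1.0
    for child, (parent, p) in tree.items():
        parent_on = parent == "root" or parent in pattern
        if child in pattern:
            if not parent_on:
                return 0.0  # child without its parent is impossible
            prob *= p
        elif parent_on:
            prob *= 1.0 - p
    return prob


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_events(TREE, rng))
    print(pattern_probability(TREE, {"A", "B"}))
```

In this sketch, a pattern such as `{"B"}` has probability zero because `B` cannot occur without its parent `A`.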
We fill this gap by proposing ten variable selection methods designed specifically for oncogenetic trees, some of which are completely new. We compare them in an extensive simulation study and on real data from cancer and HIV. Preselecting events with clique identification algorithms performs best. Here, an event is selected if it belongs to the largest clique or to the maximum-weight clique, where a clique is a subgraph in which every pair of vertices is connected.
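The clique-based preselection can be sketched as follows. This is an illustrative example only: it enumerates maximal cliques with a basic Bron-Kerbosch recursion and then picks the largest clique and the maximum-weight clique. How the graph is built from the data is not described here, so the edge list `EDGES` and the positive vertex weights `WEIGHTS` are assumed inputs, and all names are hypothetical.

```python
def maximal_cliques(adj):
    """Yield all maximal cliques (Bron-Kerbosch without pivoting)."""
    def expand(r, p, x):
        if not p and not x:
            yield set(r)
            return
        for v in list(p):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from expand(set(), set(adj), set())


def select_events(edges, weights):
    """Return (largest clique, maximum-weight clique).

    Assumes strictly positive weights, so the maximum-weight
    clique is always one of the maximal cliques.
    """
    adj = {v: set() for v in weights}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    cliques = list(maximal_cliques(adj))
    largest = max(cliques, key=len)
    heaviest = max(cliques, key=lambda c: sum(weights[v] for v in c))
    return largest, heaviest


# Hypothetical events: A, B, C are pairwise connected; D and E are connected.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "E")]
# Hypothetical weights, e.g. standing in for event frequencies.
WEIGHTS = {"A": 0.1, "B": 0.1, "C": 0.1, "D": 0.6, "E": 0.5}

if __name__ == "__main__":
    largest, heaviest = select_events(EDGES, WEIGHTS)
    print(sorted(largest), sorted(heaviest))
```

With these inputs the two criteria select different event sets: the largest clique favours the three mutually connected events, while the maximum-weight clique favours the two heavily weighted ones. Exhaustive enumeration is only practical for small graphs.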
The variable selection method of identifying cliques finds both the important frequent events and those related to disease pathways.