Wróbel Łukasz, Gudyś Adam, Sikora Marek
Institute of Informatics, Silesian Univ. of Technology, Akademicka 16, Gliwice, 44-100, Poland.
Institute of Innovative Technologies, EMAG, Leopolda 31, Katowice, 40-189, Poland.
BMC Bioinformatics. 2017 May 30;18(1):285. doi: 10.1186/s12859-017-1693-x.
Survival analysis is an important element of reasoning from data. Applied in a number of fields, it has become particularly useful in medicine, where it is used to estimate the survival rate of patients on the basis of their condition, examination results, and ongoing treatment. Recent developments in next-generation sequencing open new opportunities in survival studies, as they allow vast numbers of genome-, transcriptome-, and proteome-related features to be investigated. These include single nucleotide and structural variants, expression of genes and microRNAs, DNA methylation, and many others.
We present LR-Rules, a new algorithm for rule induction from survival data. It follows the separate-and-conquer heuristic and uses the log-rank test to establish the rule body. Extensive experiments show that LR-Rules generates models of superior accuracy and comprehensibility. A detailed analysis of the rules rendered by the presented algorithm on four medical datasets, concerning leukemia as well as breast, lung, and thyroid cancers, reveals its ability to discover true relations between attributes and patients' survival rate. Two of the case studies incorporate features obtained with high-throughput technologies, showing the usability of the algorithm in the analysis of bioinformatics data.
LR-Rules is a viable alternative to existing approaches to survival analysis, particularly when the interpretability of the resulting model is crucial. The presented algorithm may be especially useful when applied to genomic and proteomic data, as it may contribute to a better understanding of the background of diseases and support their treatment.
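The combination of separate-and-conquer rule induction with the log-rank test can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (logrank_statistic, grow_rule, induce_rules), the threshold conditions on numeric attributes, the min_cov and max_rules parameters, and the choice to score a candidate rule by comparing the examples it covers against those it does not cover are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical rule condition: (attribute index, "<" or ">=", threshold).
Condition = Tuple[int, str, float]


def logrank_statistic(times: Sequence[float], events: Sequence[int],
                      in_group: Sequence[bool]) -> float:
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    obs_minus_exp = 0.0
    variance = 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        n = sum(1 for x in times if x >= t)                    # at risk, all
        n1 = sum(1 for x, g in zip(times, in_group) if g and x >= t)
        d = sum(1 for x, e in zip(times, events) if e and x == t)
        d1 = sum(1 for x, e, g in zip(times, events, in_group)
                 if g and e and x == t)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:  # hypergeometric variance of d1 at this event time
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / variance if variance > 0 else 0.0


def covers(rule: Sequence[Condition], x: Sequence[float]) -> bool:
    """True if example x satisfies every condition in the rule body."""
    return all((x[a] < v) if op == "<" else (x[a] >= v) for a, op, v in rule)


def grow_rule(X, times, events, min_cov: int = 2):
    """Greedily add the condition that most increases the log-rank statistic
    between covered and uncovered examples (an assumed quality measure)."""
    rule: List[Condition] = []
    best_q = 0.0
    while True:
        best_cond = None
        for a in range(len(X[0])):
            for v in sorted({x[a] for x in X}):
                for op in ("<", ">="):
                    cand = rule + [(a, op, v)]
                    mask = [covers(cand, x) for x in X]
                    k = sum(mask)
                    if k < min_cov or k == len(X):
                        continue  # too few covered, or nothing to compare to
                    q = logrank_statistic(times, events, mask)
                    if q > best_q:
                        best_q, best_cond = q, (a, op, v)
        if best_cond is None:  # no condition improves the rule any further
            return rule, best_q
        rule.append(best_cond)


def induce_rules(X, times, events, max_rules: int = 10):
    """Separate-and-conquer: learn a rule, remove what it covers, repeat."""
    remaining = list(range(len(X)))
    rules = []
    while remaining and len(rules) < max_rules:
        rule, q = grow_rule([X[i] for i in remaining],
                            [times[i] for i in remaining],
                            [events[i] for i in remaining])
        if not rule:
            break
        rules.append((rule, q))
        remaining = [i for i in remaining if not covers(rule, X[i])]
    return rules
```

In this sketch each rule is grown only on the examples not yet covered by earlier rules, which is a simplification chosen to keep the example short. A call such as induce_rules(X, times, events), with X a list of numeric attribute vectors, times the observed survival times, and events the 0/1 event indicators, returns a list of (rule body, log-rank statistic) pairs.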