Sebban M, Mokrousov I, Rastogi N, Sola C
French West Indies and Guiana University, TRIVIA, Department of Mathematics and Computer Science, Campus Fouillole, 97159 Pointe-à-Pitre Cedex, Guadeloupe.
Bioinformatics. 2002 Feb;18(2):235-43. doi: 10.1093/bioinformatics/18.2.235.
The Direct Repeat (DR) locus of Mycobacterium tuberculosis is a suitable model to study (i) the molecular epidemiology and (ii) the evolutionary genetics of tuberculosis. It is analyzed by a DNA genotyping technique called spacer oligonucleotide typing (spoligotyping). In this paper, we investigated data analysis methods to discover intelligible knowledge rules from spoligotyping data, an approach that had not yet been applied to this representation. The C4.5 induction algorithm was applied to produce the knowledge rules. Finally, a Prototype Selection (PS) procedure was applied to eliminate noisy data. This simplified the decision rules and reduced the number of spacers that must be tested to solve classification tasks. In the second part of this paper, the contribution of 25 additional spacers and the knowledge rules inferred were studied from a machine learning point of view. From a statistical point of view, the correlations between spacers were analyzed; the results suggest that both negative and positive correlations may be related to potential structural constraints within the DR locus that may shape its evolution directly or indirectly.
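As an illustration of the induction step, the split criterion that C4.5 uses (gain ratio) can be sketched for binary spacer data. This is a minimal sketch and not the authors' implementation: the function names, the four toy patterns, and the class labels "A" and "B" are all assumptions made for the example, and each pattern is assumed to be a list of 0/1 values (spacer absent/present).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def gain_ratio(patterns, labels, spacer):
    """Gain ratio of splitting on one binary spacer (0 = absent, 1 = present)."""
    total = len(labels)
    remainder = 0.0   # weighted entropy left after the split
    split_info = 0.0  # entropy of the split itself
    for value in (0, 1):
        subset = [lab for pat, lab in zip(patterns, labels) if pat[spacer] == value]
        if not subset:
            continue
        weight = len(subset) / total
        remainder += weight * entropy(subset)
        split_info -= weight * log2(weight)
    if split_info == 0.0:
        return 0.0  # the spacer is constant, so it cannot split the data
    return (entropy(labels) - remainder) / split_info


# Toy data, invented for illustration only (not real spoligotypes).
patterns = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
]
labels = ["A", "A", "B", "B"]

# Rank spacers by gain ratio; the tree would test the best one first.
best_spacer = max(range(4), key=lambda s: gain_ratio(patterns, labels, s))
print(best_spacer)
```

In this toy set, the spacer at index 2 separates the two classes perfectly, so a tree built with this criterion would test it at the root and ignore the uninformative spacers.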
By generating knowledge rules induced from decision trees, it was shown that expert knowledge may be not only modeled but also improved and simplified to solve automatic classification tasks on unknown patterns. A practical consequence of this study may be a simplification of the spoligotyping technique, resulting in a reduction of the experimental constraints and an increase in the number of samples processed.
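To make the practical consequence concrete, the sketch below applies a small set of decision rules to unknown patterns and lists the spacers those rules actually test. Everything in it is hypothetical: the rules, the spacer indices (0-based), and the class names are invented for illustration and are not the rules reported in the paper.

```python
# Hypothetical rules: each rule maps {spacer index: required value} to a class.
# Indices are 0-based; 1 means the spacer is present, 0 means it is absent.
RULES = [
    ({2: 0}, "class_A"),
    ({2: 1, 5: 0}, "class_B"),
]
DEFAULT = "unknown"


def classify(pattern, rules=RULES, default=DEFAULT):
    """Return the class of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(pattern[i] == v for i, v in conditions.items()):
            return label
    return default


def spacers_needed(rules=RULES):
    """Spacers that must be tested to evaluate every rule."""
    return sorted({i for conditions, _ in rules for i in conditions})


# Toy 7-spacer patterns, invented for illustration only.
print(classify([1, 1, 0, 1, 1, 1, 0]))
print(classify([1, 1, 1, 1, 1, 0, 0]))
print(spacers_needed())
```

Under these assumed rules only two of the seven spacers are ever inspected, which is the kind of reduction in experimental work that a simplified rule set would allow.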