A data-mining approach to spacer oligonucleotide typing of Mycobacterium tuberculosis.

Author information

Sebban M, Mokrousov I, Rastogi N, Sola C

Affiliation

French West Indies and Guiana University, TRIVIA, Department of Mathematics and Computer Science, Campus Fouillole, 97159 Pointe-à-Pitre Cedex, Guadeloupe.

Publication information

Bioinformatics. 2002 Feb;18(2):235-43. doi: 10.1093/bioinformatics/18.2.235.

PMID:11847071

Abstract

MOTIVATION

The Direct Repeat (DR) locus of Mycobacterium tuberculosis is a suitable model for studying (i) the molecular epidemiology and (ii) the evolutionary genetics of tuberculosis. It is analyzed with a DNA analysis (genotyping) technique called spacer oligonucleotide typing (spoligotyping). In this paper, we investigated data analysis methods for discovering intelligible knowledge rules from spoligotyping data, methods that had not yet been applied to this kind of representation. The C4.5 induction algorithm was applied to produce the knowledge rules. Finally, a Prototype Selection (PS) procedure was applied to eliminate noisy data, which simplified the decision rules and reduced the number of spacers that must be tested to solve classification tasks. In the second part of the paper, the contribution of 25 additional new spacers, and of the knowledge rules inferred from them, was studied from a machine learning point of view. From a statistical point of view, the correlations between spacers were analyzed; the results suggest that both negative and positive correlations may be related to potential structural constraints within the DR locus that may shape its evolution directly or indirectly.
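As a rough illustration of the two analyses mentioned above, the sketch below computes a C4.5-style gain ratio for a single spacer and the phi (Pearson) correlation between two spacers. It is not the authors' implementation: the four-spacer patterns, the class labels "A" and "B", and the function names `entropy`, `gain_ratio`, and `phi` are all hypothetical, and phi is only one plausible choice of correlation measure for binary data, since the abstract does not say which one was used.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())

def gain_ratio(patterns, labels, spacer):
    """C4.5-style gain ratio for splitting on one binary spacer (1 = present)."""
    groups = {}
    for pattern, label in zip(patterns, labels):
        groups.setdefault(pattern[spacer], []).append(label)
    total = len(labels)
    # Expected class entropy after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    # Entropy of the split itself; zero when the spacer never varies.
    split_info = entropy([p[spacer] for p in patterns])
    if split_info == 0:
        return 0.0
    return (entropy(labels) - remainder) / split_info

def phi(patterns, a, b):
    """Phi correlation between the presence of spacers a and b."""
    n11 = sum(1 for p in patterns if p[a] and p[b])
    n10 = sum(1 for p in patterns if p[a] and not p[b])
    n01 = sum(1 for p in patterns if not p[a] and p[b])
    n00 = sum(1 for p in patterns if not p[a] and not p[b])
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

# Hypothetical toy data: each tuple is a presence/absence pattern over 4 spacers.
patterns = [(1, 1, 0, 1), (1, 1, 0, 0), (1, 1, 0, 1),
            (0, 0, 1, 1), (0, 0, 1, 0), (0, 0, 1, 1)]
labels = ["A", "A", "A", "B", "B", "B"]

for spacer in range(4):
    print(spacer, round(gain_ratio(patterns, labels, spacer), 3))
print("phi(0, 1) =", phi(patterns, 0, 1))
print("phi(0, 2) =", phi(patterns, 0, 2))
```

In this toy set, spacer 0 separates the two classes perfectly while spacer 3 carries no class information, so a tree builder would split on the former and never need to test the latter. Spacers 0 and 1 always co-occur (positive correlation), whereas spacers 0 and 2 are mutually exclusive (negative correlation).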

RESULTS

Generating knowledge rules induced from decision trees showed that expert knowledge can not only be modeled but also improved and simplified for the automatic classification of unknown patterns. A practical consequence of this study may be a simplification of the spoligotyping technique, reducing the experimental constraints and increasing the number of samples that can be processed.
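To make the idea of rules induced from a decision tree concrete, the following minimal sketch flattens a small tree into if-then rules and applies them to an unseen pattern. The tree, the spacer indices, and the class names are invented for illustration and do not come from the paper; `tree_to_rules` and `classify` are hypothetical helper names.

```python
def tree_to_rules(node, conditions=()):
    """Flatten a decision tree into (conditions, label) rules, one per leaf."""
    if isinstance(node, str):  # a leaf holds a class label
        return [(list(conditions), node)]
    spacer = node["spacer"]
    return (tree_to_rules(node["present"], conditions + ((spacer, True),))
            + tree_to_rules(node["absent"], conditions + ((spacer, False),)))

def classify(rules, pattern):
    """Return the label of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(bool(pattern[s]) == present for s, present in conditions):
            return label
    return None

# Hypothetical tree: test spacer 0 first, then spacer 2 if spacer 0 is absent.
toy_tree = {
    "spacer": 0,
    "present": "class A",
    "absent": {"spacer": 2, "present": "class B", "absent": "class C"},
}

rules = tree_to_rules(toy_tree)
for conditions, label in rules:
    tests = " AND ".join(
        f"spacer {s} {'present' if present else 'absent'}"
        for s, present in conditions)
    print(f"IF {tests} THEN {label}")

# Only the spacers named in the rules need to be tested experimentally.
tested_spacers = sorted({s for conditions, _ in rules for s, _ in conditions})
print("spacers to test:", tested_spacers)
print(classify(rules, (0, 0, 1, 1)))
```

Because the rules here mention only spacers 0 and 2, a classifier built this way would not need the remaining spacers, which is the kind of experimental simplification the abstract points to.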
