Harris Cole, Ghaffari Noushin
Exagen Diagnostics, Inc, Houston, TX, USA.
BMC Genomics. 2008 Sep 16;9 Suppl 2(Suppl 2):S7. doi: 10.1186/1471-2164-9-S2-S7.
The growing body of DNA microarray data has the potential to advance our understanding of the molecular basis of disease. However, annotating microarray datasets with clinically useful information is not always possible, as this often requires access to detailed patient records. In this study we introduce GLAD, a new semi-supervised learning (SSL) method that combines independent annotated and unannotated datasets with the aim of identifying more robust sample classifiers. In our method, independent models are developed using subsets of genes for the annotated and unannotated datasets. These models are evaluated according to a scoring function that incorporates terms for classification accuracy on annotated data and relative cluster separation in unannotated data. Improved models are iteratively generated using a genetic algorithm feature selection technique. Our results show that adding unannotated data to training significantly improves classifier robustness.
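The scoring idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' GLAD implementation: the nearest-centroid classifier, the 2-means clustering, the equal weighting of the two terms, the genetic algorithm settings, the toy data, and every function name below are assumptions chosen only to make the example small and self-contained. It scores a gene subset by leave-one-out accuracy on annotated samples plus a cluster-separation ratio on unannotated samples, then searches over gene subsets with a simple genetic algorithm.

```python
import random
from math import dist

def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def project(rows, genes):
    """Keep only the selected gene columns of each sample."""
    return [[row[g] for g in genes] for row in rows]

def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (assumed stand-in)."""
    correct = 0
    for i in range(len(X)):
        cents = {}
        for label in set(y):
            members = [x for j, x in enumerate(X) if j != i and y[j] == label]
            if members:
                cents[label] = centroid(members)
        pred = min(cents, key=lambda lab: dist(X[i], cents[lab]))
        correct += pred == y[i]
    return correct / len(X)

def cluster_separation(X, iters=10):
    """2-means, then between / (between + within) distance, a value in [0, 1)."""
    c0 = X[0]
    c1 = max(X, key=lambda x: dist(x, c0))  # deterministic initialisation
    for _ in range(iters):
        a = [x for x in X if dist(x, c0) <= dist(x, c1)]
        b = [x for x in X if dist(x, c0) > dist(x, c1)]
        if not a or not b:
            return 0.0
        c0, c1 = centroid(a), centroid(b)
    within = (sum(dist(x, c0) for x in a) + sum(dist(x, c1) for x in b)) / len(X)
    between = dist(c0, c1)
    return between / (between + within) if between > 0 else 0.0

def score(genes, X_lab, y, X_unlab, weight=0.5):
    """Weighted sum of the accuracy term and the separation term (weight is assumed)."""
    acc = loo_accuracy(project(X_lab, genes), y)
    sep = cluster_separation(project(X_unlab, genes))
    return weight * acc + (1 - weight) * sep

def select_genes(X_lab, y, X_unlab, n_genes, subset_size=2,
                 pop_size=20, generations=15, seed=0):
    """Tiny genetic algorithm over gene subsets; keeps the better half each round."""
    rng = random.Random(seed)
    fitness = lambda g: score(g, X_lab, y, X_unlab)
    pop = [tuple(sorted(rng.sample(range(n_genes), subset_size)))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            p, q = rng.sample(parents, 2)
            child = rng.sample(sorted(set(p) | set(q)), subset_size)  # crossover
            if rng.random() < 0.3:                                    # mutation
                child[rng.randrange(subset_size)] = rng.randrange(n_genes)
            if len(set(child)) == subset_size:
                children.append(tuple(sorted(child)))
        pop = parents + children
    return max(pop, key=fitness)

# Toy data (invented): genes 0 and 1 track the class, genes 2 and 3 do not.
X_lab = [[0.0, 0.2, 1.0, 3.0], [0.1, 0.0, 3.0, 1.0], [0.2, 0.1, 2.0, 2.0],
         [5.0, 5.1, 1.0, 3.0], [5.1, 5.0, 3.0, 1.0], [5.2, 5.2, 2.0, 2.0]]
y_lab = [0, 0, 0, 1, 1, 1]
X_unlab = [[0.1, 0.0, 1.0, 1.0], [0.0, 0.2, 3.0, 3.0],
           [5.0, 5.1, 2.0, 2.0], [5.2, 5.0, 4.0, 4.0]]

best = select_genes(X_lab, y_lab, X_unlab, n_genes=4)
print("selected genes:", best, "score:", round(score(best, X_lab, y_lab, X_unlab), 3))
```

On this toy data the informative pair of genes scores higher than the uninformative pair on both terms, which is the behaviour the combined scoring function is meant to reward. A real application would need proper cross-validation, a clustering method suited to expression data, and a tuned weighting between the two terms.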