Natsoulis Georges, El Ghaoui Laurent, Lanckriet Gert R G, Tolley Alexander M, Leroy Fabrice, Dunlea Shane, Eynon Barrett P, Pearson Cecelia I, Tugendreich Stuart, Jarnagin Kurt
Iconix Pharmaceuticals, Mountain View, CA 94043, USA.
Genome Res. 2005 May;15(5):724-36. doi: 10.1101/gr.2807605.
A large gene expression database has been produced that characterizes the gene expression and physiological effects of hundreds of approved and withdrawn drugs, toxicants, and biochemical standards in various organs of live rats. To derive useful biological knowledge from this database, a variety of supervised classification algorithms were compared using a 597-microarray subset of the data. Our studies show that several types of linear classifiers based on Support Vector Machines (SVMs) and Logistic Regression can be used to derive readily interpretable drug signatures with high classification performance. Both methods can be tuned to produce classifiers of drug treatments in the form of short, weighted gene lists. Analysis of these lists reveals that some signature genes contribute positively to the classification decision (acting as "rewards" for the class of interest), while others contribute negatively (acting as "penalties"). The combination of reward and penalty genes enhances performance by keeping the number of false-positive treatments low. The results of these algorithms are combined with feature selection techniques that further reduce the length of the drug signatures, an important step towards the development of useful diagnostic biomarkers and low-cost assays. Multiple signatures with no genes in common can be generated for the same classification end-point. Comparison of these gene lists identifies biological processes characteristic of a given class.
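As a rough illustration of how a short, weighted gene list can act as a linear classifier, the sketch below scores a sample as a bias plus a weighted sum of expression values, and separates the signature into reward genes (positive weight) and penalty genes (negative weight). The gene names, weights, bias, expression values, and function names are all invented for illustration and are not taken from the paper. The sketch only shows how an already-fitted signature might be applied, not how the SVM or Logistic Regression models were trained.

```python
from typing import Dict, List, Tuple

# Hypothetical signature: gene name -> weight (all values invented).
SIGNATURE: Dict[str, float] = {
    "gene_a": 1.5,   # reward gene (positive weight)
    "gene_b": 0.5,   # reward gene (positive weight)
    "gene_c": -1.0,  # penalty gene (negative weight)
}
BIAS = -0.5  # hypothetical intercept of the linear classifier


def split_signature(signature: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Return (reward_genes, penalty_genes), each sorted by gene name."""
    rewards = sorted(g for g, w in signature.items() if w > 0)
    penalties = sorted(g for g, w in signature.items() if w < 0)
    return rewards, penalties


def decision_score(
    signature: Dict[str, float], bias: float, expression: Dict[str, float]
) -> float:
    """Linear decision value: bias + sum(weight * expression) over signature genes.

    Genes missing from the sample are treated as having expression 0.0
    (an assumption made here for simplicity).
    """
    return bias + sum(w * expression.get(g, 0.0) for g, w in signature.items())


def in_class(
    signature: Dict[str, float], bias: float, expression: Dict[str, float]
) -> bool:
    """A sample is assigned to the class of interest when its score is positive."""
    return decision_score(signature, bias, expression) > 0


# Two hypothetical samples with identical reward-gene expression.
# Only the penalty gene differs between them.
sample_without_penalty = {"gene_a": 1.0, "gene_b": 1.0, "gene_c": 0.0}
sample_with_penalty = {"gene_a": 1.0, "gene_b": 1.0, "gene_c": 2.0}

if __name__ == "__main__":
    print(split_signature(SIGNATURE))
    print(in_class(SIGNATURE, BIAS, sample_without_penalty))
    print(in_class(SIGNATURE, BIAS, sample_with_penalty))
```

In this toy setup the two samples share the same reward-gene expression, but the second also expresses the penalty gene strongly enough to push its score below zero. This is the mechanism by which penalty genes can keep false positives low.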