Qi Yuan, Missiuro Patrycja E, Kapoor Ashish, Hunter Craig P, Jaakkola Tommi S, Gifford David K, Ge Hui
Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory, 32 Vassar Street, Cambridge, MA 02139, USA.
Bioinformatics. 2006 Jul 15;22(14):e417-23. doi: 10.1093/bioinformatics/btl256.
Gene expression profiling is a powerful approach for identifying, on a global scale, genes that may be involved in a specific biological process. For example, gene expression profiling of mutant animals that lack or contain an excess of certain cell types is a common way to identify genes that are important for the development and maintenance of those cell types. However, it is difficult for traditional computational methods, both unsupervised and supervised, to detect relevant genes in a large collection of expression profiles with high sensitivity and specificity. Unsupervised methods group similar gene expression profiles together while ignoring important prior biological knowledge. Supervised methods use training data derived from prior biological knowledge to classify gene expression profiles. However, for many biological problems, little prior knowledge is available, which limits the prediction performance of most supervised methods.
We present a Bayesian semi-supervised learning method, called BGEN, that improves upon supervised and unsupervised methods by both capturing relevant expression profiles and using prior biological knowledge from literature and experimental validation. Unlike currently available semi-supervised learning methods, this new method trains a kernel classifier based on labeled and unlabeled gene expression examples. The semi-supervised trained classifier can then be used to efficiently classify the remaining genes in the dataset. Moreover, we model the confidence of microarray probes and probabilistically combine multiple probe predictions into gene predictions. We apply BGEN to identify genes involved in the development of a specific cell lineage in the C. elegans embryo, and to further identify the tissues in which these genes are enriched. Compared to K-means clustering and SVM classification, BGEN achieves higher sensitivity and specificity. We confirm certain predictions by biological experiments.
The results are available at http://www.csail.mit.edu/~alanqi/projects/BGEN.html.
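To make the two ideas in the abstract concrete (learning from labeled and unlabeled expression profiles, and merging several probe-level predictions into one gene-level prediction), here is a minimal sketch. It is not the BGEN algorithm, whose details are not given in this abstract. It swaps in a simple kernel-weighted label propagation for the Bayesian kernel classifier and a confidence-weighted average of log-odds for the probabilistic probe combination. The function names, the RBF kernel, the `gamma` parameter, and the combination rule are all assumptions made for illustration.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF similarity between two expression profiles (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def propagate_labels(profiles, labels, gamma=1.0, n_iter=50):
    """Score every probe using labeled and unlabeled profiles together.

    labels[i] is 1.0 (relevant), 0.0 (not relevant), or None (unlabeled).
    Labeled probes stay clamped to their labels. On each pass, every
    unlabeled probe takes the kernel-weighted mean of the other probes'
    scores from the previous pass.
    """
    n = len(profiles)
    weights = [
        [rbf_kernel(profiles[i], profiles[j], gamma) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    # Unlabeled probes start from an uninformative score of 0.5.
    scores = [0.5 if lab is None else float(lab) for lab in labels]
    for _ in range(n_iter):
        updated = list(scores)
        for i in range(n):
            if labels[i] is None:
                total = sum(weights[i])
                if total > 0:
                    updated[i] = sum(
                        w * s for w, s in zip(weights[i], scores)
                    ) / total
        scores = updated
    return scores


def combine_probes(probe_probs, probe_confidences):
    """Merge probe-level probabilities into one gene-level probability.

    Hypothetical rule: average the probes' log-odds, weighting each probe
    by its confidence, then map the result back to a probability.
    """
    total_conf = sum(probe_confidences)
    if total_conf <= 0:
        raise ValueError("at least one probe needs positive confidence")
    eps = 1e-9  # keeps the log finite when a probability is exactly 0 or 1
    log_odds = sum(
        c * math.log((p + eps) / (1.0 - p + eps))
        for p, c in zip(probe_probs, probe_confidences)
    ) / total_conf
    return 1.0 / (1.0 + math.exp(-log_odds))
```

For example, calling `propagate_labels` on four probe profiles of which only two are labeled returns a score for every probe, and `combine_probes` can then merge the scores of the probes that map to the same gene. A probe with confidence 0 contributes nothing to the gene-level result, which is the behavior one would want from an unreliable probe under this assumed rule.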