Extracting glycan motifs using a biochemically weighted kernel.

Author information

Jiang Hao, Aoki-Kinoshita Kiyoko F, Ching Wai-Ki

Publication information

Bioinformation. 2011;7(8):405-12. doi: 10.6026/97320630007405. Epub 2011 Dec 21.

DOI:10.6026/97320630007405

PMID:22347783

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3280441/

Abstract

Carbohydrates, or glycans, are among the most abundant and structurally diverse biopolymers and constitute the third major class of biomolecules, following DNA and proteins. The study of carbohydrate sugar chains has nevertheless lagged behind that of DNA and proteins, mainly because of their inherent structural complexity. Their analysis is important because they play various roles in biological processes, including signal transduction and cellular recognition. To shed light on glycan function based on carbohydrate structure, kernel methods have been developed, in particular to extract potential glycan biomarkers by classifying glycan structures found in different tissue samples. The recently developed weighted q-gram method (LK-method) performs well on glycan structure classification but is limited in feature selection: it was unable to extract biologically meaningful features from the data. We therefore propose a biochemically weighted tree kernel (BioLK-method), which is based on a glycan similarity matrix and also incorporates biochemical information about individual q-grams when constructing the kernel matrix. We applied the new method to the classification and recognition of motifs in publicly available glycan data. Used with a Support Vector Machine (SVM), the BioLK-method detects biologically important motifs accurately, whereas the LK-method failed to do so. It was tested on three glycan data sets from the Consortium for Functional Glycomics (CFG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) GLYCAN, and the results are consistent with the literature. The BioLK-method also maintains classification performance comparable to that of the LK-method. These results indicate that incorporating biochemical information about q-grams adds flexibility and capability in feature extraction to the kernel, which may aid in the prediction of glycan biomarkers.
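
As a rough illustration of what weighting q-grams in a tree kernel can look like, the sketch below counts q-grams in two toy glycan trees and sums the products of the counts, scaled by a per-q-gram weight. It is not the BioLK-method itself. The tree encoding, the definition of a q-gram as a downward path of q residues, the `label_weight` table, and the two example structures are all assumptions made for illustration.

```python
from collections import Counter

# Assumption: a glycan is a rooted tree written as (residue_label, [child_subtrees]).


def qgrams(tree, q):
    """Count every downward path of exactly q residue labels in the tree."""
    counts = Counter()

    def walk(node, prefix):
        label, children = node
        # Keep only the last q labels on the path leading to this node.
        path = (prefix + (label,))[-q:]
        if len(path) == q:
            counts[path] += 1
        for child in children:
            walk(child, path)

    walk(tree, ())
    return counts


def weighted_kernel(a, b, q, label_weight):
    """Sum over shared q-grams of weight * count_in_a * count_in_b.

    The weight of a q-gram is the product of hypothetical per-residue
    weights; residues missing from label_weight default to 1.0.
    """
    ca, cb = qgrams(a, q), qgrams(b, q)
    total = 0.0
    for gram in ca.keys() & cb.keys():
        w = 1.0
        for label in gram:
            w *= label_weight.get(label, 1.0)
        total += w * ca[gram] * cb[gram]
    return total


# Toy structures, not taken from CFG or KEGG GLYCAN.
g1 = ("Man", [("GlcNAc", [("Gal", [])]), ("Man", [])])
g2 = ("Man", [("GlcNAc", [("Gal", [("Neu5Ac", [])])])])

unweighted = weighted_kernel(g1, g2, 2, {})
weighted = weighted_kernel(g1, g2, 2, {"Gal": 2.0})
print(unweighted, weighted)
```

With non-negative weights, this function is an inner product of rescaled count vectors, so a matrix of such values over a data set is a valid kernel matrix. It could in principle be passed to an SVM that accepts precomputed kernels.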
