Predicting gene phenotype by multi-label multi-class model based on essential functional features.

Affiliations

School of Life Sciences, Shanghai University, Shanghai, 200444, People's Republic of China.

College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, People's Republic of China.

Publication information

Mol Genet Genomics. 2021 Jul;296(4):905-918. doi: 10.1007/s00438-021-01789-8. Epub 2021 Apr 29.

Abstract

Phenotype is one of the most important concepts in genetics: it describes all the observable characteristics of a research object. Because a phenotype reflects the combined effects of genotype and environmental factors, phenotypic characteristics are hard to define, and unknown phenotypes are even harder to predict. With current biological techniques, obtaining sufficient structural information on phenotype-associated genes/proteins at large scale remains expensive and time-consuming. Various bioinformatics methods have been proposed to address this problem, and researchers have confirmed the efficacy and accuracy of functional network-based prediction. However, general functional descriptions have highly complicated internal structures for the purposes of phenotype prediction. To address this issue and improve prediction across more than ten kinds of phenotypes, we first extract functional enrichment features from GO and KEGG and then use node2vec to learn functional embedding features of genes from a gene-gene network. These features are analyzed with feature selection methods (Boruta and minimum redundancy maximum relevance) to generate a feature list. The list is fed into incremental feature selection, which incorporates multi-label classifiers built with RAkEL and several classic base classifiers, to build an optimal multi-label multi-class classification model for phenotype prediction. According to recent studies, our method has identified many literature-supported genes/proteins and their associated phenotypes, as well as some candidate genes assigned new phenotypes, providing a new computational tool for accurate and effective phenotype prediction.
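The incremental feature selection step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the feature list is already ordered (in the paper it comes from Boruta and minimum redundancy maximum relevance), substitutes a 1-nearest-neighbor classifier over whole label sets for the RAkEL-based classifiers, and scores each candidate subset by leave-one-out exact-match accuracy. The toy data and all function names are hypothetical.

```python
# Illustrative sketch only: incremental feature selection (IFS) on toy data.
# Assumptions (not from the paper): 1-NN stands in for the RAkEL classifiers,
# and leave-one-out exact-match accuracy stands in for the evaluation metric.

def nearest_neighbor_predict(train_X, train_Y, x):
    """Return the label set of the training sample closest to x."""
    def sq_dist(row):
        return sum((a - b) ** 2 for a, b in zip(row, x))
    best = min(range(len(train_X)), key=lambda i: sq_dist(train_X[i]))
    return train_Y[best]

def loo_exact_match(X, Y, feature_idx):
    """Leave-one-out exact-match accuracy using only the given features."""
    proj = [[row[j] for j in feature_idx] for row in X]
    hits = 0
    for i in range(len(proj)):
        train_X = proj[:i] + proj[i + 1:]
        train_Y = Y[:i] + Y[i + 1:]
        if nearest_neighbor_predict(train_X, train_Y, proj[i]) == Y[i]:
            hits += 1
    return hits / len(proj)

def incremental_feature_selection(X, Y, ranked_features):
    """Score the top-1, top-2, ... feature subsets and keep the best one."""
    best_k, best_score, scores = 0, -1.0, []
    for k in range(1, len(ranked_features) + 1):
        score = loo_exact_match(X, Y, ranked_features[:k])
        scores.append(score)
        if score > best_score:  # ties keep the smaller subset
            best_k, best_score = k, score
    return ranked_features[:best_k], best_score, scores

# Toy "genes": features 0 and 1 jointly determine the phenotype set,
# feature 2 is noise on a larger scale.
X = [
    [0.0, 0.0, 9.0], [0.1, 0.1, 1.0],
    [1.0, 0.0, 8.0], [0.9, 0.1, 2.0],
    [0.0, 1.0, 7.0], [0.1, 0.9, 3.0],
    [1.0, 1.0, 6.0], [0.9, 0.9, 4.0],
]
Y = [
    frozenset({"P3"}), frozenset({"P3"}),
    frozenset({"P1"}), frozenset({"P1"}),
    frozenset({"P2"}), frozenset({"P2"}),
    frozenset({"P1", "P2"}), frozenset({"P1", "P2"}),
]
ranked = [0, 1, 2]  # assumed to come from a prior feature-ranking step

selected, best_score, scores = incremental_feature_selection(X, Y, ranked)
print("selected features:", selected)
print("best exact-match accuracy:", best_score)
```

On this toy data the loop keeps the first two features, since adding the noisy third feature cannot improve on a perfect score. In a real setting the scoring function would be replaced by cross-validation of the chosen multi-label classifier.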
