College of Plant Protection, Hunan Agricultural University, Changsha 410128, China.
Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Nongda Road, Furong District, Changsha 410128, China.
Genes (Basel). 2023 Feb 6;14(2):421. doi: 10.3390/genes14020421.
Gene families, which form part of a genome's information storage hierarchy, play a significant role in the development and diversity of multicellular organisms. Several studies have focused on characteristics of gene families such as function, homology, or phenotype. However, statistical and correlation analyses of how gene family members are distributed across the genome have yet to be conducted. Here, we report a novel framework that combines gene family analysis with genome selection based on NMF-ReliefF. Specifically, the method first obtains gene families from the TreeFam database and determines the number of gene families within the feature matrix. NMF-ReliefF, a new feature selection algorithm that overcomes the inefficiencies of traditional methods, is then used to select features from the gene feature matrix. Finally, a support vector machine is used for classification on the selected features. The framework achieved an accuracy of 89.1% and an AUC of 0.919 on the insect genome test set. We also used four microarray gene data sets to evaluate the performance of the NMF-ReliefF algorithm. The outcomes suggest that the proposed method may strike a balance between robustness and discrimination. Additionally, the classification performance of the proposed method is superior to that of state-of-the-art feature selection approaches.
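To make the feature selection step more concrete, the following is a minimal sketch of ReliefF-style feature scoring for binary labels. It is not the authors' NMF-ReliefF algorithm: the abstract does not specify how NMF and ReliefF are combined, so only the classic nearest-hit/nearest-miss weighting is illustrated. The function name `relieff_weights`, the parameter `k`, and the toy data are assumptions introduced for illustration.

```python
def relieff_weights(X, y, k=1):
    """Simplified ReliefF scores for binary labels (illustrative only)."""
    n, d = len(X), len(X[0])

    # Scale per-feature differences by the feature's range so that
    # features measured on different scales are comparable.
    spans = []
    for j in range(d):
        column = [row[j] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(a, b, j):
        return abs(a[j] - b[j]) / spans[j]

    def dist(a, b):
        # Manhattan distance over range-normalised features.
        return sum(diff(a, b, j) for j in range(d))

    weights = [0.0] * d
    for i in range(n):
        # Rank all other samples by distance to sample i.
        others = sorted((m for m in range(n) if m != i),
                        key=lambda m: dist(X[i], X[m]))
        hits = [m for m in others if y[m] == y[i]][:k]
        misses = [m for m in others if y[m] != y[i]][:k]
        for j in range(d):
            # A feature loses weight when it differs among same-class
            # neighbours and gains weight when it differs across classes.
            if hits:
                weights[j] -= sum(diff(X[i], X[h], j) for h in hits) / (len(hits) * n)
            if misses:
                weights[j] += sum(diff(X[i], X[m], j) for m in misses) / (len(misses) * n)
    return weights


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1],
     [0.9, 0.8], [1.0, 0.2], [0.8, 0.5]]
y = [0, 0, 0, 1, 1, 1]

weights = relieff_weights(X, y, k=1)
print(weights)
```

On this toy data the informative feature receives a higher score than the noise feature. In a pipeline like the one described above, the top-scoring features would then be passed to a classifier such as a support vector machine.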