Institute of Computing Science and Technology, Guangzhou University, Guangzhou, 510006, China.
School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, 510006, China.
Infect Dis Poverty. 2022 May 4;11(1):50. doi: 10.1186/s40249-022-00974-0.
Influenza B virus can cause epidemics with high pathogenicity and therefore poses a serious threat to public health. This paper proposes a feature representation algorithm for identifying the pathogenicity phenotype of influenza B virus.
The dataset comprised all 11 influenza virus proteins encoded by the eight genome segments of 1724 strains. Two types of features were used hierarchically to build the prediction model. Amino acid features were derived directly from 67 feature descriptors and fed into random forest classifiers, which output informative features in the form of class labels and probabilistic predictions. A sequential forward search strategy was then used to optimize the informative features. The final features for each strain were low-dimensional, combined knowledge from different perspectives, and were used to build the machine learning model for pathogenicity identification.
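The sequential forward search step can be illustrated with a minimal Python sketch. It assumes the probabilistic features are already available as one column per descriptor-specific random forest. The toy numbers, the function names, and the scoring rule (thresholding the mean of the selected probabilities at 0.5) are all hypothetical stand-ins. The abstract does not state which model or criterion scored each candidate subset.

```python
from typing import Callable, List, Sequence


def sequential_forward_search(
    n_features: int, score: Callable[[List[int]], float]
) -> List[int]:
    """Greedily add the feature that raises `score` most; stop when none does."""
    selected: List[int] = []
    best = float("-inf")
    while len(selected) < n_features:
        candidates = [j for j in range(n_features) if j not in selected]
        # Score every one-feature extension; ties go to the lowest index.
        top_score, top_j = max(
            ((score(selected + [j]), j) for j in candidates),
            key=lambda t: (t[0], -t[1]),
        )
        if top_score <= best:  # no strict improvement: stop searching
            break
        selected.append(top_j)
        best = top_score
    return selected


def mean_probability_accuracy(
    prob_features: Sequence[Sequence[float]], labels: Sequence[int]
) -> Callable[[List[int]], float]:
    """Hypothetical scorer: accuracy of thresholding the mean probability at 0.5."""

    def score(subset: List[int]) -> float:
        correct = 0
        for row, label in zip(prob_features, labels):
            total = sum(row[j] for j in subset)
            predicted = 1 if total >= 0.5 * len(subset) else 0
            correct += int(predicted == label)
        return correct / len(labels)

    return score


# Toy data (made up): 6 strains x 3 probabilistic features, one per descriptor model.
prob_features = [
    [0.9, 0.6, 0.5],
    [0.8, 0.3, 0.5],
    [0.4, 0.9, 0.5],
    [0.2, 0.4, 0.5],
    [0.3, 0.6, 0.5],
    [0.1, 0.2, 0.5],
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = high pathogenicity (toy labels)

score = mean_probability_accuracy(prob_features, labels)
selected = sequential_forward_search(3, score)
print(selected)  # -> [0, 1]
```

In this toy case the search keeps features 0 and 1 and drops the uninformative third column, which mirrors how the optimized feature set can end up with far fewer dimensions than the 67 candidates.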
Forty signature positions were identified by entropy screening. Mutations at position 135 of the hemagglutinin protein had the highest entropy value (1.06). After the informative features were generated directly from the 67 random forest models, the dimensions of the class and probabilistic features were optimized to 4 and 3, respectively. The optimal class features reached a maximum accuracy of 94.2% and a maximum Matthews correlation coefficient of 88.4%, while the optimal probabilistic features reached a maximum accuracy of 94.1% and a maximum Matthews correlation coefficient of 88.2%. The optimized features outperformed both the original informative features and the amino acid features from individual descriptors. The sequential forward search strategy also outperformed the classical ensemble method.
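Two calculations behind these figures can be sketched in Python: the per-position Shannon entropy used for screening, and the Matthews correlation coefficient (MCC). This is a sketch under stated assumptions. The abstract gives neither the logarithm base nor the screening cut-off, so the natural logarithm and the top-k ranking are assumptions, and the toy alignment and function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def shannon_entropy(column: Sequence[str], base: float = math.e) -> float:
    """Shannon entropy of the residues observed at one alignment position."""
    counts = Counter(column)
    total = len(column)
    return -sum(
        (c / total) * math.log(c / total, base) for c in counts.values()
    )


def signature_positions(alignment: List[str], top_k: int) -> List[int]:
    """Return the 0-based indices of the top_k highest-entropy positions."""
    n_positions = len(alignment[0])
    scored = [
        (shannon_entropy([seq[i] for seq in alignment]), i)
        for i in range(n_positions)
    ]
    scored.sort(key=lambda t: (-t[0], t[1]))  # high entropy first, then index
    return [i for _, i in scored[:top_k]]


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC from a binary confusion matrix; ranges from -1 to 1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy aligned protein fragments (made up, equal length, no gaps).
alignment = ["MKTA", "MKSA", "MRTA", "MKGA"]
top_positions = signature_positions(alignment, top_k=2)
print(top_positions)  # -> [2, 1]

# MCC is often reported as a percentage, as in the text above.
mcc_percent = 100 * matthews_corrcoef(tp=45, tn=45, fp=5, fn=5)
print(round(mcc_percent, 1))  # -> 80.0
```

Fully conserved positions score zero and drop out of the ranking, while positions where several residues are observed rank highest.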
The optimized informative features performed best and were used to build a predictive model that identifies highly pathogenic influenza B virus phenotypes and provides early risk warning for disease control.