Laboratorio de Virología Molecular, Centro de Investigaciones Nucleares, Facultad de Ciencias, Universidad de la República, Montevideo, Uruguay.
Laboratorio de Evolución Experimental de Virus, Institut Pasteur de Montevideo, Montevideo, Uruguay.
PeerJ. 2022 Apr 22;10:e11683. doi: 10.7717/peerj.11683. eCollection 2022.
Plant innate immunity relies on a broad repertoire of receptor proteins that detect pathogens and trigger an effective defense response. Bioinformatic tools based on conserved domains and sequence similarity are among the most popular strategies for protein identification and characterization. However, the multi-domain nature, high sequence diversity and complex evolutionary history of disease resistance (DR) proteins make their prediction a real challenge. Here we present RFPDR, which pioneers the application of Random Forest (RF) to plant DR protein prediction.
A recently published collection of experimentally validated DR proteins was used as the positive dataset, while 10x10 nested datasets, ranging from 400 to 4,000 non-DR proteins, were used as negative datasets. A total of 9,631 features were extracted from each protein sequence and included in a full-dimension (FD) RFPDR model. Feature selection was then performed to generate a reduced-dimension (RD) RFPDR model. Model performance was evaluated using an 80/20 (training/testing) partition with 10-fold cross-validation, and compared with baseline, sequence-based and state-of-the-art strategies. To gain insight into the underlying biology, the most discriminatory sequence-based features in the RF classifier were identified.
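The evaluation protocol described above can be illustrated with a minimal sketch of the index bookkeeping only. It assumes the 80/20 partition is stratified by class and that the 10 folds are drawn from the training portion; the text states neither. The function names, the seed and the toy dataset sizes are hypothetical, and no classifier is trained here.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping class proportions in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

def k_fold(indices, k=10, seed=0):
    """Yield (fit_idx, validation_idx) pairs; each index is validated exactly once."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield sorted(fit), sorted(folds[i])

# Toy labels: 1 = DR protein, 0 = non-DR protein (sizes are made up for illustration).
labels = [1] * 150 + [0] * 400
train_idx, test_idx = stratified_split(labels)
folds = list(k_fold(train_idx))

print(len(train_idx), len(test_idx), len(folds))
```

In such a setup, a model would be fitted on each `fit_idx` subset and scored on the matching `validation_idx` subset, with the held-out 20% reserved for the final test.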
RD-RFPDR proved sensitive (86.4 ± 4.0%) and specific (96.9 ± 1.5%) in identifying DR proteins, while remaining robust to data imbalance. This high performance and robustness, together with the valuable information RD-RFPDR provides about the underlying properties of DR proteins, make it an interesting approach for DR protein prediction that complements state-of-the-art strategies.
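For readers less familiar with the two reported metrics, the sketch below shows how sensitivity and specificity are computed from binary predictions, and how a "mean ± spread" summary could be formed over repeated runs. It assumes the ± values denote a sample standard deviation, which the text does not state. All labels, predictions and per-run values are made up for illustration, and the function names are hypothetical.

```python
from statistics import mean, stdev

def sensitivity_specificity(y_true, y_pred):
    """Compute (sensitivity, specificity) for binary labels, 1 = DR, 0 = non-DR.

    Assumes both classes are present in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # fraction of true DR proteins recovered
    specificity = tn / (tn + fp)  # fraction of non-DR proteins correctly rejected
    return sensitivity, specificity

def summarize(values):
    """Format per-run fractions as 'mean ± standard deviation' in percent."""
    return f"{100 * mean(values):.1f} ± {100 * stdev(values):.1f}%"

# Made-up example: 4 DR proteins and 6 non-DR proteins.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)

# Made-up per-run sensitivities, summarized across runs.
summary = summarize([0.82, 0.86, 0.90])
print(sens, spec, summary)
```

Neither metric depends on the ratio of positives to negatives in the test set, since each is computed within a single class, which is one reason both are convenient when comparing results across differently sized negative datasets.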