Gonay Valentin, Dunne Michael P, Caceres-Delpiano Javier, Kajava Andrey V
CRBM UMR 5237 CNRS, Université Montpellier, Montpellier, France.
PROTERA SAS, Paris, France.
Alzheimers Dement. 2025 Feb;21(2):e14510. doi: 10.1002/alz.14510. Epub 2025 Jan 8.
The importance of protein amyloidogenesis, associated with various diseases and functional roles, has driven the creation of computational predictors of amyloidogenicity. The accuracy of these predictors, particularly those utilizing artificial intelligence technologies, heavily depends on the quality of the data.
We built Cross-Beta DB, a database containing high-quality data on known cross-β amyloids formed under natural conditions. We used it to train and benchmark several machine-learning (ML) algorithms for predicting the amyloid-forming potential of proteins.
We developed the Cross-Beta predictor using an extremely randomized trees (Extra Trees) ML algorithm. It outperforms existing amyloid predictors, achieving the highest F1 score (0.852) and accuracy (0.844).
The development of Cross-Beta DB and a new ML-based Cross-Beta predictor may enable the creation of personalized risk profiles for neurodegenerative diseases and other amyloidoses, especially as genome sequencing becomes more affordable.
Accuracy of ML-based predictors depends on the quality of training data.
We built Cross-Beta DB, a database of high-quality data on naturally occurring amyloids.
Using this data, we developed an amyloid predictor that outperforms other predictors.
This computational tool enables the creation of risk profiles for neurodegenerative diseases.
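For readers less familiar with the two metrics quoted for the Cross-Beta predictor, the following is a minimal sketch of how F1 score and accuracy are computed from binary predictions (1 = amyloid-forming, 0 = non-amyloid). The labels and the helper names (`ConfusionCounts`, `count_outcomes`) are hypothetical and purely illustrative; they are not taken from Cross-Beta DB or from the authors' code, and the numbers do not reproduce the reported results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts for a binary classifier (hypothetical helper)."""
    tp: int  # amyloid sequences predicted as amyloid
    fp: int  # non-amyloid sequences predicted as amyloid
    tn: int  # non-amyloid sequences predicted as non-amyloid
    fn: int  # amyloid sequences predicted as non-amyloid


def count_outcomes(y_true, y_pred):
    """Tally true/false positives and negatives from 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return ConfusionCounts(tp=tp, fp=fp, tn=tn, fn=fn)


def accuracy(c):
    """Fraction of all predictions that are correct."""
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def f1_score(c):
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


# Illustrative labels only (assumed, not real benchmark data).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

counts = count_outcomes(y_true, y_pred)
print(counts)
print("accuracy:", accuracy(counts))
print("F1:", f1_score(counts))
```

Accuracy weighs both classes by their frequency, whereas F1 ignores true negatives, so reporting both gives a fuller picture when amyloid and non-amyloid examples are not equally represented.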