Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, 600036, India.
Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, 600036, India; School of Computing, Tokyo Tech World Research Hub Initiative (WRHI), Institute of Innovative Research, Tokyo Institute of Technology, Midori-ku, Kanagawa, 226-8503, Yokohama, Japan.
Comput Biol Med. 2020 Sep;124:103933. doi: 10.1016/j.compbiomed.2020.103933. Epub 2020 Aug 5.
Alzheimer's disease (AD) is a complex, heterogeneous disease that progressively affects neuronal cells, and it is the most prevalent of the neurodegenerative diseases. Next Generation Sequencing (NGS) techniques are widely used to develop high-throughput screening methods for identifying biomarkers and variants, which aids early diagnosis and treatment.
The primary purpose of this study is to develop a classification model using machine learning for predicting the deleterious effect of variants with respect to AD.
We constructed a dataset of 20,401 deleterious and 37,452 control variants from the Genome-Wide Association Study (GWAS) and Genotype-Tissue Expression (GTEx) portals, respectively. Recursive feature elimination with cross-validation (RFECV), followed by forward feature selection, was used to select the important features, and a random forest classifier was used to distinguish deleterious from neutral variants.
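The selection-then-classification procedure described above can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic (make_classification), and the feature counts, number of trees, and fold count are placeholder values chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV, SequentialFeatureSelector
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for variant features (assumption: the real study
# used annotated GWAS/GTEx variants, which are not reproduced here).
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, n_redundant=2, random_state=0
)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Step 1: recursive feature elimination with cross-validation (RFECV).
# It drops the least important feature each round and keeps the subset
# with the best cross-validated accuracy (at least 4 features here).
rfecv = RFECV(
    estimator=forest, step=1, cv=cv, scoring="accuracy", min_features_to_select=4
)
X_rfe = rfecv.fit_transform(X, y)

# Step 2: forward feature selection on the RFECV survivors.
# The target of 2 features is an illustrative assumption.
sfs = SequentialFeatureSelector(
    forest, n_features_to_select=2, direction="forward", cv=cv, scoring="accuracy"
)
X_sel = sfs.fit_transform(X_rfe, y)

# Step 3: fit the random forest on the final feature subset.
model = forest.fit(X_sel, y)
print("features kept by RFECV:", rfecv.n_features_)
print("features kept after forward selection:", X_sel.shape[1])
```

Running the two selectors in sequence lets the cheaper elimination pass prune the feature space before the forward search, which evaluates one candidate feature at a time.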
Our method achieved an accuracy of 81.21% in 10-fold cross-validation and 70.63% on a test set of 5,785 variants. On the same test set, CADD and FATHMM achieved accuracies in the range of 54%-62%.
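The two accuracy figures above come from different evaluation protocols: cross-validation on the training data and a single held-out test set. A minimal sketch of how both are typically computed is given here, assuming synthetic data and a 10% hold-out fraction chosen purely for illustration; none of these values are taken from the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic two-class data standing in for deleterious vs. neutral variants.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out a test set that is never used during training or model selection
# (the 10% fraction is an assumption for this example).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0)

# 10-fold cross-validated accuracy, computed on the training portion only.
cv_accuracy = cross_val_score(
    forest, X_train, y_train, cv=10, scoring="accuracy"
).mean()

# Accuracy on the held-out test set after refitting on all training data.
forest.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, forest.predict(X_test))

print(f"10-fold CV accuracy: {cv_accuracy:.4f}")
print(f"held-out test accuracy: {test_accuracy:.4f}")
```

Reporting both numbers is useful because cross-validated accuracy can be optimistic when feature selection and tuning were done on the same data, whereas the held-out figure reflects performance on variants the model has not seen.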
Our model is freely available as the Variant Effect Predictor for Alzheimer's Disease (VEPAD) at http://web.iitm.ac.in/bioinfo2/vepad/. VEPAD can be used to predict the effect of new variants associated with AD.