University of Chicago, Chicago, Illinois, United States of America.
Department of Neurology, University of Chicago, Chicago, Illinois, United States of America.
PLoS One. 2022 Jun 1;17(6):e0256411. doi: 10.1371/journal.pone.0256411. eCollection 2022.
A number of neurologic diseases associated with expanded nucleotide repeats, including an inherited form of amyotrophic lateral sclerosis, exhibit an unconventional form of translation called repeat-associated non-AUG (RAN) translation. It has been speculated that the repeat regions in the RNA fold into secondary structures in a length-dependent manner, promoting RAN translation. The resulting repeat protein products accumulate and may contribute to disease pathogenesis. Nucleotides that flank the repeat region, especially those closest to the initiation site, are believed to enhance translation initiation. A machine learning model has been published to help identify ATG and near-cognate translation initiation sites; however, this model has diminished predictive power due to its extensive feature selection and limited training data. Here, we overcome this limitation and increase prediction accuracy by: a) capturing the effect of the nucleotides most critical for translation initiation via feature reduction, b) implementing an alternative machine learning algorithm better suited for limited data, c) building comprehensive and balanced training data (via sampling without replacement) that include previously unavailable sequences, and d) splitting ATG and near-cognate translation initiation codon data to train two separate models. We also design a supplementary scoring system to provide an additional prognostic assessment of model predictions. The resultant models have high performance, with ~85-88% accuracy, exceeding that of the previously published model by >18%. We use the models presented here to identify translation initiation sites in genes associated with a number of neurologic repeat expansion disorders. The results confirm a number of experimentally identified translation initiation sites upstream of the expanded repeats, and predict sites that are not yet established.
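Steps c) and d) above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout `(sequence, codon_start, label)`, the function names `split_by_codon` and `balance`, the toy sequences, and the choice of the nine single-nucleotide variants of ATG as the near-cognate set are all assumptions made for illustration. Balancing is done here by undersampling the larger class with `random.sample`, which draws without replacement.

```python
import random

# Assumption: "near-cognate" means the nine codons that differ from ATG
# by exactly one nucleotide. The paper may use a different set.
NEAR_COGNATE = {"CTG", "GTG", "TTG", "ACG", "AAG", "AGG", "ATA", "ATC", "ATT"}


def split_by_codon(records):
    """Split (sequence, codon_start, label) records into ATG and
    near-cognate groups, so each group can train its own model."""
    atg, near = [], []
    for seq, start, label in records:
        codon = seq[start:start + 3].upper()
        if codon == "ATG":
            atg.append((seq, start, label))
        elif codon in NEAR_COGNATE:
            near.append((seq, start, label))
        # Records with any other codon are ignored in this sketch.
    return atg, near


def balance(records, seed=0):
    """Return equal numbers of positive (label 1) and negative (label 0)
    records, sampling each class without replacement."""
    rng = random.Random(seed)
    positives = [r for r in records if r[2] == 1]
    negatives = [r for r in records if r[2] == 0]
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n) + rng.sample(negatives, n)


# Hypothetical toy data: label 1 = true initiation site, 0 = not a site.
records = [
    ("GCCACCATGGCG", 6, 1),
    ("TTTTTTATGTTT", 6, 0),
    ("AAAAAAATGAAA", 6, 0),
    ("GCCACCCTGGCG", 6, 1),
    ("TTTTTTGTGTTT", 6, 0),
]

atg_records, near_records = split_by_codon(records)
atg_balanced = balance(atg_records)
near_balanced = balance(near_records)
```

Because sampling is without replacement, no sequence appears twice in a balanced set, which avoids duplicating scarce examples when training data are limited.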