Davuluri R V, Suzuki Y, Sugano S, Zhang M Q
Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA.
Genome Res. 2000 Nov;10(11):1807-16. doi: 10.1101/gr.gr-1460r.
A nonredundant database of 2312 full-length human 5'-untranslated regions (UTRs) was prepared using state-of-the-art experimental and computational technologies. A comprehensive computational analysis of these data was conducted to characterize 5' UTR features. Classification and regression tree (CART) analysis was used to classify the data into three distinct classes. Class I consists of mRNAs that are believed to be poorly translated, with long 5' UTRs containing many potentially inhibitory features. Class II consists of terminal oligopyrimidine tract (TOP) mRNAs, which are regulated in a growth-dependent manner. Class III consists of mRNAs with favorable 5' UTR features that may support efficient translation. The most accurate tree we found has a classification accuracy of 92.5%, as estimated by cross-validation. The most relevant variables in the classification model were the presence of a TOP, secondary structure, 5' UTR length, and the presence of upstream AUGs (uAUGs). This classification and characterization of 5' UTRs provides valuable information for understanding the translational regulation of human mRNAs. Furthermore, the database and classification can help in building better computational models for predicting the 5'-terminal exon and for separating the 5' UTR from the coding region.
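As a rough illustration of how three of the variables named in the abstract (TOP presence, 5' UTR length, and uAUG count) could be computed from a sequence and fed into tree-like rules, the following is a minimal sketch rather than the authors' method. The function names `utr_features` and `toy_classify`, the TOP pattern (a leading C followed by at least four more pyrimidines), and the 200-nt length cutoff are all assumptions made for illustration; the abstract does not give the fitted tree or its thresholds. Secondary structure is omitted because it would require an RNA folding tool.

```python
import re

# Assumption: a TOP is approximated as a C at position 1 followed by
# at least 4 more pyrimidines (C/T). The paper's exact criterion is
# not stated in the abstract.
TOP_PATTERN = re.compile(r"^C[CT]{4,}")


def utr_features(utr: str) -> dict:
    """Compute simple 5' UTR features from an RNA or DNA sequence."""
    seq = utr.upper().replace("U", "T")  # treat RNA and DNA alike
    return {
        "length": len(seq),
        # Lookahead so that every ATG start position is counted.
        "n_uaug": len(re.findall(r"(?=ATG)", seq)),
        "has_top": bool(TOP_PATTERN.match(seq)),
    }


def toy_classify(features: dict, long_utr: int = 200) -> str:
    """Hand-written rules that mimic the shape of a decision tree.

    Hypothetical: these splits and the 200-nt cutoff are placeholders,
    not the tree reported in the paper.
    """
    if features["has_top"]:
        return "II"   # TOP mRNAs
    if features["length"] > long_utr or features["n_uaug"] > 0:
        return "I"    # long UTR or upstream AUGs: potentially inhibitory
    return "III"      # short UTR without uAUGs: favorable features


if __name__ == "__main__":
    for s in ("CTTCCTTTGGATGC", "GGCAUGGC", "GGCGGC"):
        f = utr_features(s)
        print(s, f, toy_classify(f))
```

In practice, the splits would be learned from labeled data with a CART implementation and evaluated by cross-validation, as the abstract describes, rather than written by hand.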