Shkurin Aleksei, Pour Sara E, Hughes Timothy R
Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada.
Terrence Donnelly Centre for Cellular & Biomolecular Research, Toronto, ON M5S 3E1, Canada.
NAR Genom Bioinform. 2023 Apr 5;5(2):lqad031. doi: 10.1093/nargab/lqad031. eCollection 2023 Jun.
Cleavage and polyadenylation (CPA) sites define eukaryotic gene ends. CPA sites are associated with five key sequence recognition elements: the upstream UGUA, the polyadenylation signal (PAS), and U-rich sequences; the CA/UA dinucleotide where cleavage occurs; and GU-rich downstream elements (DSEs). Currently, it is not clear whether these sequences are sufficient to delineate CPA sites. Additionally, numerous other sequences and factors have been described, often in the context of promoting alternative CPA sites and preventing cryptic CPA site usage. Here, we dissect the contributions of individual sequence features to CPA using standard discriminative models. We show that models composed only of the five primary CPA sequence features assign the highest probability scores to constitutive CPA sites at the ends of coding genes, relative to the entire pre-mRNA sequence, for 59% of all human genes. U1-hybridizing sequences provide a small boost in performance. The addition of all known RBP RNA binding motifs to the model increases this figure to only 61%, suggesting that additional factors beyond the core CPA machinery have a minimal role in delineating real from cryptic sites. To our knowledge, this high effectiveness of established features to predict human gene ends has not previously been documented.
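The evaluation criterion described above (whether the annotated CPA site receives the highest score along the whole pre-mRNA) can be illustrated with a minimal sketch. This is not the authors' model: the weights, bias, window size, motif strings and function names below are all hypothetical choices made for illustration, and a real discriminative model would learn its parameters from annotated sites.

```python
import math

# Hypothetical hand-picked weights (illustration only, not fitted values).
WEIGHTS = {"UGUA": 1.0, "PAS": 2.0, "U_rich": 0.5, "CA_UA": 1.0, "DSE": 1.0}
BIAS = -4.0
WINDOW = 40  # assumed window size (nt) on each side of a candidate site


def features(seq, pos):
    """Binary indicators for the five core elements around candidate site pos."""
    up = seq[max(0, pos - WINDOW):pos]   # sequence upstream of the site
    down = seq[pos:pos + WINDOW]         # sequence downstream of the site
    return {
        "UGUA": float("UGUA" in up),
        "PAS": float("AAUAAA" in up or "AUUAAA" in up),
        "U_rich": float("UUUU" in up),
        "CA_UA": float(seq[max(0, pos - 2):pos] in ("CA", "UA")),
        "DSE": float("GUGU" in down or "UGUG" in down),
    }


def score(seq, pos):
    """Logistic probability-like score for cleavage at position pos."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features(seq, pos).items())
    return 1.0 / (1.0 + math.exp(-z))


def top_scoring_position(seq):
    """Position with the highest score along the whole sequence (first on ties)."""
    return max(range(len(seq)), key=lambda p: score(seq, p))


def fraction_correct(genes, tolerance=0):
    """Fraction of (sequence, annotated_site) pairs whose top-scoring
    position lies within `tolerance` nt of the annotated site."""
    hits = sum(abs(top_scoring_position(seq) - site) <= tolerance
               for seq, site in genes)
    return hits / len(genes)


# Toy pre-mRNA: UGUA, PAS, U-rich stretch, CA cleavage dinucleotide, GU-rich DSE.
toy = ("G" * 30 + "UGUA" + "GG" + "AAUAAA" + "GG" + "UUUU"
       + "GGG" + "CA" + "GUGUGU" + "G" * 30)
toy_site = 53  # cleavage assumed immediately after the CA dinucleotide

print(top_scoring_position(toy) == toy_site)
print(fraction_correct([(toy, toy_site)]))
```

On this toy sequence, only position 53 has all five indicators set, so it is the unique top-scoring position. Applied to a genome-wide set of genes, `fraction_correct` corresponds to the kind of per-gene percentage quoted in the abstract, under the assumptions stated above.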