A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences.

Author information

Carels Nicolas, Frías Diego

Affiliation

Fundação Oswaldo Cruz (FIOCRUZ), Instituto Oswaldo Cruz (IOC), Laboratório de Genômica Funcional e Bioinformática, Rio de Janeiro, RJ, Brazil.

Publication information

Bioinform Biol Insights. 2013;7:35-54. doi: 10.4137/BBI.S10053. Epub 2013 Jan 23.

DOI:10.4137/BBI.S10053

PMID:23400232

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3561939/

Abstract

In this study, we investigated how coding open reading frames (cORFs) in expressed sequence tags (ESTs) can be classified using the universal feature method (UFM). The UFM algorithm is based on the scoring of purine bias (Rrr) and stop codon frequencies. UFM classifies ORFs as coding or non-coding through a score based on 5 factors: (i) stop codon frequency; (ii) the product of the probabilities of purines occurring in the three positions of nucleotide triplets; (iii) the product of the probabilities of Cytosine (C), Guanine (G), and Adenine (A) occurring in the 1st, 2nd, and 3rd positions of triplets, respectively; (iv) the probabilities of a G occurring in the 1st and 2nd positions of triplets; and (v) the probabilities of a T occurring in the 1st and an A in the 2nd position of triplets. Because UFM is based on primary determinants of coding sequences that are conserved throughout the biosphere, it is suitable for cORF classification of any sequence in eukaryote transcriptomes without prior knowledge. Using the protein sequences of the Protein Data Bank (RCSB PDB, or simply PDB) as a reference, we found that UFM classifies cORFs of ≥200 bp (if the coding strand is known) and cORFs of ≥300 bp (if the coding strand is unknown), and releases them in their coding strand and coding frame, which allows their automatic translation into protein sequences with a success rate equal to or higher than 95%. First, we established the statistical parameters of UFM using ESTs from Plasmodium falciparum, Arabidopsis thaliana, Oryza sativa, Zea mays, Drosophila melanogaster, Homo sapiens and Chlamydomonas reinhardtii in reference to the protein sequences of PDB. Second, we showed that the success rate of cORF classification using UFM is expected to apply to approximately 95% of the protein-coding genes of higher eukaryotes. Third, we used UFM in combination with CAP3 to assemble large EST samples into cORFs that we used to analyze transcriptome phenotypes in rice, maize, and humans. We discuss the error rate and the interference of noisy sequences such as pseudogenes, transposons, and retrotransposons. This method is suitable for rapid cORF extraction from transcriptome data and allows correct description of plant genome phenotypes without prior knowledge. Additional care is necessary when addressing the human transcriptome because of the interference caused by large amounts of noisy sequences. UFM can be regarded as a low-complexity tool for extracting prior knowledge about the coding fraction of the transcriptome of any eukaryote. Because of its low complexity, UFM is also very robust to variations in codon usage.
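The five factors listed in the abstract can be illustrated with a minimal Python sketch that computes only the raw per-frame quantities. It is not the authors' implementation: the abstract does not give the way these quantities are combined into the UFM score or any thresholds, so none are reproduced here. The function names `frame_features` and `six_frame_features`, the dictionary keys, and the use of the standard stop codons TAA, TAG, and TGA are assumptions made for illustration.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
PURINES = ("A", "G")
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def frame_features(seq, frame=0):
    """Raw quantities behind the five factors, for one reading frame (0, 1 or 2)."""
    seq = seq.upper().replace("U", "T")
    # Split into complete triplets starting at the frame offset.
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    n = len(codons)
    if n == 0:
        raise ValueError("sequence too short for the requested frame")

    # pos[k][b] = number of triplets carrying base b at triplet position k.
    pos = [Counter(c[k] for c in codons) for k in range(3)]

    def p(base, k):
        return pos[k][base] / n

    purine = [sum(p(b, k) for b in PURINES) for k in range(3)]
    return {
        "stop_freq": sum(c in STOP_CODONS for c in codons) / n,   # factor (i)
        "purine_product": purine[0] * purine[1] * purine[2],      # factor (ii)
        "c1_g2_a3_product": p("C", 0) * p("G", 1) * p("A", 2),    # factor (iii)
        "g1": p("G", 0),                                          # factor (iv)
        "g2": p("G", 1),
        "t1": p("T", 0),                                          # factor (v)
        "a2": p("A", 1),
    }


def six_frame_features(seq):
    """Features for all six frames, for the case where the coding strand is unknown."""
    seq = seq.upper().replace("U", "T")
    strands = {"+": seq, "-": seq.translate(_COMPLEMENT)[::-1]}
    return {
        (strand, frame): frame_features(s, frame)
        for strand, s in strands.items()
        for frame in range(3)
    }


if __name__ == "__main__":
    for key, feats in six_frame_features("ATGGCAGGAGAAGCTTAA").items():
        print(key, feats)
```

A classifier built on these quantities would still need a scoring rule and a decision threshold, which would have to be taken from the full article rather than from this sketch.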
