On the identification of group II introns in nucleotide sequence data.

Authors

Knoop V, Kloska S, Brennicke A

Affiliation

Institut für Genbiologische Forschung Berlin, Germany.

Publication

J Mol Biol. 1994 Sep 30;242(4):389-96. doi: 10.1006/jmbi.1994.1589.

DOI:10.1006/jmbi.1994.1589

PMID:7932698

Abstract

Four different consensus sequences (GTI, group II identifiers) have been derived from domains V of known group II introns and are used as query input sequences for sensitive database screenings with the FASTA and LFASTA programs. The set of four GTI sequences can identify all domains V of the 96 known group II introns in the completely sequenced chloroplast genomes of Marchantia polymorpha, Epifagus virginiana, Oryza sativa, Nicotiana tabacum and the completely sequenced mitochondrial genomes of Saccharomyces cerevisiae, Podospora anserina, Schizosaccharomyces pombe and Marchantia polymorpha. Seven moderately high-scoring hits can easily be rejected as false-positives since they do not fulfil secondary structure requirements. Large FASTA outputs obtained after screening the entire nucleotide sequence database are evaluated in a second step by a program (D5SCAN) that allows the assignment of variable selection criteria for potential domain V secondary structures. Database searches with these routines yield evidence for several group II intron sequences previously unrecognized. These include novel intron structures in the cyanobacterium Synechocystis and in the mitochondrial genomes of Marchantia, soybean, pea, broad bean, sugar beet and a heterobasidiomycete. Potential intron remnants are found contributing to the secondary structure of rRNAs in several trypanosome species. At a given sensitivity of 95% positively identified true domains V, the search routine produces one false positive hit per 10,000 kb.
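The second step described above, rejecting sequence hits that cannot form the expected secondary structure, can be illustrated with a minimal sketch. This is not the D5SCAN program and does not reproduce its selection criteria, which the abstract does not specify. The function names, the simple end-to-end hairpin model, and the `min_pairs` and `loop_len` thresholds are all assumptions chosen only for illustration.

```python
# Hypothetical illustration only: NOT the D5SCAN program or its criteria.
# It treats a candidate hit as a simple hairpin and counts how many
# consecutive base pairs can form between its 5' and 3' ends.

# Watson-Crick pairs plus G-U wobble pairs, written as RNA.
PAIRS = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def count_stem_pairs(seq: str, loop_len: int = 4) -> int:
    """Count consecutive pairs from the outermost ends inward.

    Pairing stops at the first mismatch, or when fewer than
    `loop_len` unpaired bases would remain between the two arms.
    """
    rna = seq.upper().replace("T", "U")  # accept DNA or RNA input
    i, j = 0, len(rna) - 1
    pairs = 0
    while j - i - 1 >= loop_len and (rna[i], rna[j]) in PAIRS:
        pairs += 1
        i += 1
        j -= 1
    return pairs


def passes_structure_filter(seq: str, min_pairs: int = 5,
                            loop_len: int = 4) -> bool:
    """Keep a hit only if its ends can form at least `min_pairs` pairs."""
    return count_stem_pairs(seq, loop_len) >= min_pairs


if __name__ == "__main__":
    # Made-up candidate sequences, not taken from any database.
    for candidate in ("GGGCCAAAAGGCCC", "AAAAAAAAAAAAAA"):
        print(candidate, passes_structure_filter(candidate))
```

Making the thresholds adjustable echoes the abstract's mention of variable selection criteria. A realistic filter would also need to allow bulges and internal mismatches, which this sketch does not.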
