Barquist Lars, Burge Sarah W, Gardner Paul P
Institute for Molecular Infection Biology, University of Würzburg, Würzburg, Germany.
Wellcome Trust Sanger Institute, Hinxton, Cambridge, United Kingdom.
Curr Protoc Bioinformatics. 2016 Jun 20;54:12.13.1-12.13.25. doi: 10.1002/cpbi.4.
Emerging high-throughput technologies have led to a deluge of putative non-coding RNA (ncRNA) sequences identified in a wide variety of organisms. Systematic characterization of these transcripts will be a tremendous challenge. Homology detection is critical to making maximal use of functional information gathered about ncRNAs: identifying homologous sequences allows us to transfer information gathered in one organism to another quickly and with a high degree of confidence. ncRNAs present a challenge for homology detection, as their primary sequence is often poorly conserved and de novo secondary structure prediction and search remain difficult. This unit introduces methods developed by the Rfam database for identifying "families" of homologous ncRNAs starting from single "seed" sequences, using manually curated sequence alignments to build powerful statistical models of sequence and structure conservation known as covariance models (CMs), implemented in the Infernal software package. We provide a step-by-step iterative protocol for identifying ncRNA homologs and then constructing an alignment and corresponding CM. We also work through an example for the bacterial small RNA MicA, discovering a previously unreported family of divergent MicA homologs in the genus Xenorhabdus in the process. © 2016 by John Wiley & Sons, Inc.