Margulies Elliott H, Cooper Gregory M, Asimenos George, Thomas Daryl J, Dewey Colin N, Siepel Adam, Birney Ewan, Keefe Damian, Schwartz Ariel S, Hou Minmei, Taylor James, Nikolaev Sergey, Montoya-Burgos Juan I, Löytynoja Ari, Whelan Simon, Pardi Fabio, Massingham Tim, Brown James B, Bickel Peter, Holmes Ian, Mullikin James C, Ureta-Vidal Abel, Paten Benedict, Stone Eric A, Rosenbloom Kate R, Kent W James, Bouffard Gerard G, Guan Xiaobin, Hansen Nancy F, Idol Jacquelyn R, Maduro Valerie V B, Maskeri Baishali, McDowell Jennifer C, Park Morgan, Thomas Pamela J, Young Alice C, Blakesley Robert W, Muzny Donna M, Sodergren Erica, Wheeler David A, Worley Kim C, Jiang Huaiyang, Weinstock George M, Gibbs Richard A, Graves Tina, Fulton Robert, Mardis Elaine R, Wilson Richard K, Clamp Michele, Cuff James, Gnerre Sante, Jaffe David B, Chang Jean L, Lindblad-Toh Kerstin, Lander Eric S, Hinrichs Angie, Trumbower Heather, Clawson Hiram, Zweig Ann, Kuhn Robert M, Barber Galt, Harte Rachel, Karolchik Donna, Field Matthew A, Moore Richard A, Matthewson Carrie A, Schein Jacqueline E, Marra Marco A, Antonarakis Stylianos E, Batzoglou Serafim, Goldman Nick, Hardison Ross, Haussler David, Miller Webb, Pachter Lior, Green Eric D, Sidow Arend
Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Genome Res. 2007 Jun;17(6):760-74. doi: 10.1101/gr.6034307.
A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization.
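The overlap statistic quoted in the abstract (the share of constrained sequence that does not overlap any experimentally identified functional element) can be illustrated with a small interval computation. This is a minimal sketch and not the authors' pipeline. It assumes both annotation sets are lists of half-open `(start, end)` coordinates on a single sequence. The names `merge`, `overlap_bases`, and `unexplained_fraction` are hypothetical helpers introduced here for illustration, and the coordinates in the usage lines are made-up toy values, not data from the study.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one sequence (assumed convention)


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and collapse overlapping or touching ones."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if start >= end:
            continue  # skip empty or inverted intervals
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_bases(a: List[Interval], b: List[Interval]) -> int:
    """Count the bases covered by both interval sets (two-pointer sweep)."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def unexplained_fraction(constrained: List[Interval],
                         annotated: List[Interval]) -> float:
    """Fraction of constrained bases that overlap no annotated element."""
    total = sum(end - start for start, end in merge(constrained))
    if total == 0:
        return 0.0
    return 1.0 - overlap_bases(constrained, annotated) / total


# Toy coordinates for illustration only.
constrained = [(0, 100), (200, 300)]
annotated = [(50, 150), (250, 300)]
fraction = unexplained_fraction(constrained, annotated)
```

Merging each set first keeps bases covered by several overlapping annotations from being counted more than once, which matters when elements from different experimental classes are pooled.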