Hubley Robert, Finn Robert D, Clements Jody, Eddy Sean R, Jones Thomas A, Bao Weidong, Smit Arian F A, Wheeler Travis J
Institute for Systems Biology, Seattle, WA 98109, USA
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1RQ, UK
Nucleic Acids Res. 2016 Jan 4;44(D1):D81-9. doi: 10.1093/nar/gkv1272. Epub 2015 Nov 26.
Repetitive DNA, especially that due to transposable elements (TEs), makes up a large fraction of many genomes. Dfam is an open access database of families of repetitive DNA elements, in which each family is represented by a multiple sequence alignment and a profile hidden Markov model (HMM). The initial release of Dfam, featured in the 2013 NAR Database Issue, contained 1143 families of repetitive elements found in humans, and was used to produce more than 100 Mb of additional annotation of TE-derived regions in the human genome, with improved speed. Here, we describe recent advances, most notably expansion to 4150 total families including a comprehensive set of known repeat families from four new organisms (mouse, zebrafish, fly and nematode). We describe improvements to coverage, and to our methods for identifying and reducing false annotation. We also describe updates to the website interface. The Dfam website has moved to http://dfam.org. Seed alignments, profile HMMs, hit lists and other underlying data are available for download.