Department of Pathology and Center for Cancer Research, Massachusetts General Hospital and Harvard Medical School, Boston, MA, 02114, USA.
Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA.
Nat Commun. 2020 Jul 29;11(1):3697. doi: 10.1038/s41467-020-17453-5.
As the number of genomics datasets grows rapidly, sample mislabeling has become a high-stakes issue. We present CrosscheckFingerprints (Crosscheck), a tool for quantifying sample relatedness and detecting incorrectly paired sequencing datasets from different donors. Crosscheck outperforms similar methods and is effective even when data are sparse or from different assays. Application of Crosscheck to 8851 ENCODE ChIP-, RNA-, and DNase-seq datasets enabled us to identify and correct dozens of mislabeled samples and ambiguous metadata annotations, representing ~1% of ENCODE datasets.
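The abstract does not describe how Crosscheck scores relatedness, so the following is only a toy sketch of one common way to quantify whether two datasets come from the same donor: a log-odds (LOD) comparison of "same donor" versus "different donors" summed over SNP sites. Everything here is an assumption made for illustration: biallelic sites in Hardy-Weinberg equilibrium, independence between sites, a single symmetric genotyping error rate, and the hypothetical function names `genotype_prior`, `obs_likelihood`, `site_lod`, and `total_lod`. It is not the tool's implementation.

```python
import math


def genotype_prior(alt_freq):
    """Hardy-Weinberg prior over 0, 1, or 2 copies of the alternate allele."""
    ref_freq = 1.0 - alt_freq
    return [ref_freq * ref_freq, 2.0 * alt_freq * ref_freq, alt_freq * alt_freq]


def obs_likelihood(observed, true_genotype, error):
    """P(observed call | true genotype) under a symmetric error model (assumed)."""
    return 1.0 - error if observed == true_genotype else error / 2.0


def site_lod(call_a, call_b, alt_freq, error=0.01):
    """log10 of P(calls | same donor) / P(calls | different donors) at one site."""
    prior = genotype_prior(alt_freq)
    # Same donor: both calls are noisy observations of one shared genotype.
    same = sum(
        prior[g] * obs_likelihood(call_a, g, error) * obs_likelihood(call_b, g, error)
        for g in range(3)
    )
    # Different donors: each call is drawn from an independent genotype.
    marginal_a = sum(prior[g] * obs_likelihood(call_a, g, error) for g in range(3))
    marginal_b = sum(prior[g] * obs_likelihood(call_b, g, error) for g in range(3))
    return math.log10(same / (marginal_a * marginal_b))


def total_lod(calls_a, calls_b, alt_freqs, error=0.01):
    """Sum per-site LODs, treating sites as independent (assumed)."""
    return sum(
        site_lod(a, b, p, error) for a, b, p in zip(calls_a, calls_b, alt_freqs)
    )


if __name__ == "__main__":
    freqs = [0.5, 0.5, 0.5, 0.5]
    # Concordant calls should favor "same donor" (positive LOD).
    print(total_lod([0, 1, 2, 1], [0, 1, 2, 1], freqs))
    # Opposite homozygous calls should favor "different donors" (negative LOD).
    print(total_lod([0, 0, 2, 2], [2, 2, 0, 0], freqs))
```

Under this toy model each shared site adds evidence, so a sparse dataset simply contributes fewer terms to the sum, and the score does not depend on which assay produced the calls.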