Lin Tien-Ho, Myers Eugene W, Xing Eric P
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA.
Bioinformatics. 2006 Jul 15;22(14):e298-306. doi: 10.1093/bioinformatics/btl200.
The problem of identifying victims in a mass disaster using DNA fingerprints involves a scale of computation that requires efficient and accurate algorithms. In a typical scenario, hundreds of samples taken from remains must be matched to the pedigrees of the alleged victims' surviving relatives. Moreover, the samples are often degraded by heat and exposure. A competent method for this type of forensic inference problem must handle the complicated quality issues of DNA typing appropriately, consider the match between every sample and every family, and provide a confidence for each match.
We present a unified probabilistic framework that efficiently clusters samples, conservatively eliminates implausible sample-pedigree pairings, and handles both degraded samples (missing values) and experimental errors in producing and/or reading a genotype. We present a method that, based on posterior probabilistic inference, confidently excludes forensically unambiguous sample-family matches from the large hypothesis space of candidate matches. Because disaster DNA data are highly confidential, validation is commonly done by simulation, and we use simulation experiments here. Our framework is shown to be robust to missing values and typing errors at levels typical of real applications. Furthermore, the flexibility of the probabilistic models makes it possible to extend this framework to include other biological factors such as interdependent markers, mitochondrial sequences, and blood type.
The software and data sets are available from the authors upon request.
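As a rough illustration of posterior inference over candidate sample-family matches, the sketch below scores one sample against several families. It is not the authors' implementation. It assumes a uniform prior over families, independent loci, a single hypothetical typing error rate, and that each family supplies, for every locus, a probability for every genotype possible at that locus. The names `locus_likelihood`, `family_posteriors`, and `plausible_families` are invented for this example.

```python
from math import prod


def locus_likelihood(observed, genotype_probs, error_rate):
    """P(observed genotype | family) at one locus.

    genotype_probs maps every possible genotype at the locus to its
    probability under the family's pedigree (assumed given).
    """
    if observed is None:
        # Degraded locus (missing value): marginalizing over all possible
        # observations contributes a factor of 1.
        return 1.0
    n_alternatives = len(genotype_probs) - 1
    total = 0.0
    for true_genotype, p in genotype_probs.items():
        if true_genotype == observed:
            total += p * (1.0 - error_rate)
        elif n_alternatives > 0:
            # Assumed error model: a mistyped genotype is read as any other
            # genotype with equal probability.
            total += p * error_rate / n_alternatives
    return total


def family_posteriors(sample, families, error_rate=0.01):
    """Posterior over families for one sample, under a uniform prior."""
    likelihoods = {
        name: prod(
            locus_likelihood(obs, loci[i], error_rate)
            for i, obs in enumerate(sample)
        )
        for name, loci in families.items()
    }
    total = sum(likelihoods.values())
    if total == 0.0:
        raise ValueError("sample has zero likelihood under every family")
    return {name: lik / total for name, lik in likelihoods.items()}


def plausible_families(posteriors, threshold):
    """Keep only the families whose posterior reaches the threshold."""
    return sorted(name for name, p in posteriors.items() if p >= threshold)


# Toy data: two loci, the second one missing in the degraded sample.
sample = [("A", "B"), None]
families = {
    "F1": [{("A", "B"): 0.9, ("A", "A"): 0.1}, {("C", "C"): 1.0}],
    "F2": [{("A", "B"): 0.1, ("A", "A"): 0.9}, {("C", "C"): 1.0}],
}
posteriors = family_posteriors(sample, families, error_rate=0.01)
print(posteriors)
print(plausible_families(posteriors, threshold=0.2))
```

A real system would also need a prior that allows for a sample matching no family, and would work in log space to avoid underflow when many loci are typed.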