Netherlands Forensic Institute, Division of Biological Traces, Laan van Ypenburg 6, 2497GB, The Hague, the Netherlands.
Netherlands Forensic Institute, Division of Digital and Biometric Traces, Laan van Ypenburg 6, 2497GB, The Hague, the Netherlands.
Forensic Sci Int Genet. 2019 Nov;43:102150. doi: 10.1016/j.fsigen.2019.102150. Epub 2019 Aug 23.
The number of contributors (NOC) to (complex) autosomal STR profiles cannot be determined with absolute certainty due to complicating factors such as allele sharing and allelic drop-out. The precision of NOC estimations can be improved by increasing the number of (highly polymorphic) markers, by using massively parallel sequencing instead of capillary electrophoresis, and/or by using more profile information than the allele counts alone. In this study, we focussed on machine learning approaches in order to make maximum use of the profile information. To this end, a set of 590 PowerPlex® Fusion 6C profiles with one to five contributors was generated from a total of 1174 different donors. The profiles varied in template DNA amount, mixture proportion, level of allele sharing, allelic drop-out and degradation. Each profile was labelled with its known NOC, and the dataset was split into a training, a test and a hold-out set. The training set was used to optimize ten different algorithms and to select profile characteristics. Per profile, over 250 characteristics, denoted 'features', were calculated. These features were based on allele counts, peak heights and allele frequencies. The features most strongly related to the NOC were selected by partial correlation using the training set. Next, the performance of each model (= combination of features plus algorithm) was examined using the test set. A random forest classifier with 19 features, denoted the 'RFC19-model', showed the best performance and was selected for further validation. Results showed improved accuracy compared to the conventional maximum allele count approach and to an in-house nC-tool based on the total allele count. The method is extremely fast and is regarded as useful for application in forensic casework.
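For orientation, the conventional maximum allele count (MAC) baseline mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that each contributor carries at most two alleles per autosomal locus, so the minimum NOC is the largest per-locus allele count divided by two, rounded up. The profile representation (a dictionary mapping locus names to lists of observed allele labels) and the function names are hypothetical, and the example alleles are made up.

```python
from math import ceil


def max_allele_count(profile):
    """Largest number of distinct alleles observed at any single locus."""
    return max((len(set(alleles)) for alleles in profile.values()), default=0)


def total_allele_count(profile):
    """Number of distinct alleles summed over all loci."""
    return sum(len(set(alleles)) for alleles in profile.values())


def mac_noc_estimate(profile):
    """Minimum NOC consistent with the profile: ceil(MAC / 2).

    Assumes at most two alleles per contributor per locus.
    """
    return ceil(max_allele_count(profile) / 2)


# Hypothetical three-locus profile (allele labels are illustrative only).
profile = {
    "D3S1358": ["14", "15", "16"],
    "vWA": ["16", "17", "18", "19", "20"],
    "FGA": ["21", "22"],
}

print(mac_noc_estimate(profile))    # five alleles at vWA -> at least 3 contributors
print(total_allele_count(profile))  # 3 + 5 + 2 = 10
```

Because allele sharing and drop-out both reduce the number of observed alleles, an estimate of this kind is only a lower bound on the NOC, which is the limitation that motivates using additional profile information.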