Lefèvre Thomas, Chariot Patrick, Chauvin Pierre
Inserm, UMRS 1136, Pierre Louis Institute of Epidemiology and Public Health, Department of Social Epidemiology, Paris, France; Université Pierre et Marie Curie-Paris 6, UMRS 1136, Paris, France; AP-HP, Hôpital Jean-Verdier, Department of Forensic Medicine, F-93140 Bondy, France; IRIS Institut de recherches interdisciplinaires sur les enjeux sociaux (INSERM, CNRS, EHESS, Université Paris 13, UMR 8156-723), Bobigny, France.
AP-HP, Hôpital Jean-Verdier, Department of Forensic Medicine, F-93140 Bondy, France; IRIS Institut de recherches interdisciplinaires sur les enjeux sociaux (INSERM, CNRS, EHESS, Université Paris 13, UMR 8156-723), Bobigny, France.
Forensic Sci Int. 2016 Sep;266:581.e1-581.e9. doi: 10.1016/j.forsciint.2016.05.014. Epub 2016 May 21.
Researchers handle datasets of increasingly high dimensionality, with many variables to explore. Such datasets pose several problems: they are difficult to handle and present unexpected features. As dimensionality increases, classical statistical analysis becomes ineffective. Variables can be redundant, and it is often necessary to reduce the dimensionality of a dataset to its lowest possible value. Principal components analysis (PCA) has proven useful for reducing dimensionality but presents several shortcomings. Like other fields, the forensic sciences will face issues specific to an ever-growing quantity of data to be integrated. Age estimation in living persons, so far an unsolved problem, could benefit from the integration of various sources of data, e.g., clinical, dental and radiological data. We present here novel multivariate techniques (nonlinear dimensionality reduction techniques, NLDR), applied to a theoretical example. Results were compared with those of PCA. NLDR techniques were then applied to clinical, dental and radiological data (13 variables) used for age estimation. The correlation dimension of these data was estimated. NLDR techniques outperformed PCA. They showed that two living persons sharing similar characteristics may present rather different estimated ages. Moreover, the data presented very high informational redundancy, i.e., a correlation dimension of 2. NLDR techniques should be used with, or preferred to, PCA to analyze complex and big data. Data routinely used for age estimation may not be suitable for this purpose. Whether integrating other data or approaches could improve age estimation in living persons remains uncertain.
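The abstract reports a correlation dimension of 2 for a 13-variable dataset and contrasts NLDR with PCA. As a rough illustration of what such an estimate involves, the sketch below applies a Grassberger-Procaccia style estimator (the slope of log C(r) against log r, where C(r) is the fraction of point pairs closer than r) to purely synthetic data. Everything in it is an assumption made for illustration and none of it is taken from the study: the data are a Swiss roll (a curved two-dimensional sheet) linearly embedded in 13 variables, the function names `correlation_dimension` and `n_components_for_variance` are hypothetical, and the radius range and the 95% variance threshold are arbitrary choices. It is not the authors' procedure.

```python
import numpy as np


def correlation_dimension(X, q_lo=0.002, q_hi=0.02, n_radii=12):
    """Slope of log C(r) vs log r, where C(r) is the share of pairs closer than r.

    The radii span two small quantiles of the pairwise distances (an
    illustrative choice), so that only local structure is probed.
    """
    n = X.shape[0]
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared distances
    iu = np.triu_indices(n, k=1)                        # each pair once
    d = np.sqrt(np.maximum(d2[iu], 0.0))
    radii = np.geomspace(np.quantile(d, q_lo), np.quantile(d, q_hi), n_radii)
    C = np.array([(d < r).mean() for r in radii])       # correlation sums
    slope, _ = np.polyfit(np.log(radii), np.log(C), 1)
    return float(slope)


def n_components_for_variance(X, threshold=0.95):
    """Number of principal components needed to reach the variance threshold."""
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # descending
    cum = np.cumsum(eigvals) / eigvals.sum()
    return int(np.argmax(cum >= threshold)) + 1


# Synthetic data (assumption): a Swiss roll, i.e. a 2-D sheet rolled up in 3-D.
rng = np.random.default_rng(0)
n = 1500
t = 1.5 * np.pi * (1.0 + 2.0 * rng.random(n))   # position along the roll
h = 21.0 * rng.random(n)                         # position across the roll
roll = np.column_stack([t * np.cos(t), h, t * np.sin(t)])

# Embed the roll in 13 redundant variables with an orthonormal linear map,
# which preserves all pairwise distances.
Q, _ = np.linalg.qr(rng.normal(size=(13, 3)))
X13 = roll @ Q.T

dim = correlation_dimension(X13)
n_pca = n_components_for_variance(X13)
print(f"estimated correlation dimension: {dim:.2f}")
print(f"PCA components for 95% of variance: {n_pca}")
```

On data of this kind, a linear method needs more components than the sheet has intrinsic dimensions because it cannot unroll the curvature, which is the sort of gap that motivates nonlinear techniques. The estimate depends on the radius range and the sample size, so it is best read as an approximate figure.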