Center for Clinical Epidemiology and Evidence-Based Medicine, Beijing Children's Hospital, Capital Medical University, National Center for Children Health, No.56 Nanlishi Road, Beijing, 100045, China.
Department of Orthopaedics and Neurosurgery, Keck Medical Center of USC, University of Southern California, Los Angeles, USA.
BMC Med Res Methodol. 2022 Apr 10;22(1):106. doi: 10.1186/s12874-022-01596-8.
Our study aimed to compare the reference distributions of serum creatinine and urea obtained by a direct sampling technique with those obtained by two indirect sampling techniques, Gaussian Mixture Model (GMM) clustering and Self-Organizing Map (SOM) clustering, applied to clinical laboratory records, so that the feasibility and the potential limitations of indirect sampling could be clarified.
The direct sampling technique was used in the Pediatric Reference Interval in China (PRINCE) study, which recruited 15,150 healthy volunteers aged 0 to 19 years from 11 provinces across China between January 2017 and December 2018. The indirect sampling techniques were applied to the Laboratory Information System (LIS) database of Beijing Children's Hospital: 164,710 outpatients seen from January to December 2016 were included, and GMM or SOM was used to partition out the potentially healthy individuals. The reference distributions of creatinine and urea established from the PRINCE study and from the LIS database were then compared.
The density curves of creatinine and urea based on the PRINCE data and on the GMM-partitioned and SOM-partitioned LIS data overlapped to a large extent. However, the reference intervals derived from the three populations deviated from one another.
Both GMM and SOM can identify potentially healthy individuals from LIS data. The performance of GMM is consistent and stable; however, because GMM relies on Gaussian fitting, it is not suitable for skewed data. SOM is applicable to high-dimensional data and adapts to the data distribution, but it is sensitive to sample size and to the outlier detection strategy.
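The idea of partitioning potentially healthy individuals with a GMM can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes a one-dimensional, two-component mixture fitted by plain expectation-maximization, synthetic creatinine-like values (all means, spreads, and sample sizes are hypothetical), the dominant component taken as the "potentially healthy" group, and a reference interval taken as the central 95% of the values assigned to that component.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_gmm_1d(values, n_iter=200):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    ordered = sorted(values)
    n = len(ordered)
    # Crude initialisation (an assumption): lower and upper deciles as means,
    # the overall standard deviation for both components, equal weights.
    means = [ordered[n // 10], ordered[(9 * n) // 10]]
    overall_mean = sum(values) / n
    overall_sd = math.sqrt(sum((v - overall_mean) ** 2 for v in values) / n)
    sds = [overall_sd, overall_sd]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each value.
        resp = []
        for v in values:
            p = [weights[k] * normal_pdf(v, means[k], sds[k]) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: update weights, means and standard deviations.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)
    return weights, means, sds, resp


def percentile(sorted_vals, q):
    """Linearly interpolated percentile, q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


# Synthetic, hypothetical "laboratory records": a dominant healthy-like group
# mixed with a smaller, wider pathological-like group.
rng = random.Random(0)
healthy = [rng.gauss(40.0, 5.0) for _ in range(800)]
pathological = [rng.gauss(80.0, 15.0) for _ in range(200)]
values = healthy + pathological

weights, means, sds, resp = fit_gmm_1d(values)

# Assumption: the component with the largest weight is the healthy-like one.
main = max(range(2), key=lambda k: weights[k])
selected = sorted(v for v, r in zip(values, resp) if r[main] > 0.5)

# Reference interval as the central 95% of the selected values.
ref_low = percentile(selected, 0.025)
ref_high = percentile(selected, 0.975)
print("dominant component mean:", round(means[main], 2))
print("reference interval:", round(ref_low, 2), "to", round(ref_high, 2))
```

Because each component is modelled as a Gaussian, a skewed analyte would be fitted poorly by a sketch like this, which is the limitation of GMM noted above; a transformation before fitting is one common workaround, although whether it was used here is not stated.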