Churová Vendula, Vyškovský Roman, Maršálová Kateřina, Kudláček David, Schwarz Daniel
Faculty of Medicine, Masaryk University, Brno, Czech Republic.
Institute of Biostatistics and Analyses, Ltd, Brno, Czech Republic.
JMIR Med Inform. 2021 May 7;9(5):e27172. doi: 10.2196/27172.
Statistical analysis, which has become an integral part of evidence-based medicine, relies heavily on data quality, which is of critical importance in modern clinical research. Input data are at risk not only of being falsified or fabricated but also of being mishandled by investigators.
The urgent need to assure the highest possible data quality has led to the implementation of various auditing strategies designed to monitor clinical trials and detect the errors of different origins that frequently occur in the field. The objective of this study was to describe a machine learning-based algorithm that detects anomalous patterns in data arising from carelessness, systematic error, or the intentional entry of fabricated values.
An electronic data capture (EDC) system used for data management in clinical registries is presented, including its architecture and data structure. This EDC system features a machine learning-based algorithm designed to detect anomalous patterns in quantitative data. The detection algorithm combines clustering with a series of 7 distance metrics that serve to determine the strength of an anomaly. Thresholds on the individual metrics and combinations of the metrics were used in the detection process, and the detection performance was evaluated and validated in experiments involving simulated anomalous data and real-world data.
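The idea of scoring records by their distance from a cluster and flagging those beyond a threshold can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes a cluster has already been formed, uses the component-wise median as the cluster center, and flags a record when its distance exceeds a multiple `k` of the median distance. The function names, the center choice, and the threshold rule are all assumptions made for the example. Only two of the metrics named in this abstract (Manhattan and Canberra) are shown; Mahalanobis is omitted because it additionally requires a covariance estimate.

```python
from statistics import median


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def canberra(a, b):
    """Sum of |x - y| / (|x| + |y|); terms with a zero denominator count as 0."""
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom:
            total += abs(x - y) / denom
    return total


def cluster_center(records):
    """Component-wise median of the records (an assumed, outlier-robust center)."""
    return [median(column) for column in zip(*records)]


def anomaly_flags(records, metric, k=3.0):
    """Flag records whose distance to the center exceeds k times the median distance.

    The cutoff rule is hypothetical; the abstract does not state how thresholds were set.
    """
    center = cluster_center(records)
    distances = [metric(r, center) for r in records]
    cutoff = k * median(distances)
    return [d > cutoff for d in distances]


# Toy quantitative records: five similar rows and one implausible row.
records = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.0, 2.2], [1.2, 1.8], [10.0, 20.0]]
print(anomaly_flags(records, manhattan))  # only the last record is flagged
print(anomaly_flags(records, canberra))   # only the last record is flagged
```

The ratio of a record's distance to the cutoff could serve as a rough "strength" of the anomaly, although the abstract does not say how strength was defined in the study.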
Five different clinical registries related to neuroscience were presented, all of them running in the given EDC system. Two of the registries were selected for the evaluation experiments and also served to validate the detection performance on an independent data set. The best-performing combination of distance metrics was that of Canberra, Manhattan, and Mahalanobis, whereas the Cosine and Chebyshev metrics were excluded from further analysis because they had the lowest performance when used as single distance metric-based classifiers.
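The abstract does not say how the per-metric decisions were combined. One plausible rule, shown here purely as an assumption, is a majority vote: a record is reported as anomalous when at least `min_votes` of the single-metric classifiers flag it. The function name `combine_flags` and the example flags are hypothetical.

```python
def combine_flags(flags_by_metric, min_votes=2):
    """Combine per-metric boolean flags record by record using a vote count.

    flags_by_metric maps a metric name to a list of booleans, one per record.
    The voting rule is an illustrative assumption, not the study's stated method.
    """
    per_record_votes = zip(*flags_by_metric.values())
    return [sum(votes) >= min_votes for votes in per_record_votes]


# Made-up flags for four records from three single-metric classifiers.
flags = {
    "canberra":    [False, True, True,  False],
    "manhattan":   [False, True, False, False],
    "mahalanobis": [False, True, True,  True],
}
print(combine_flags(flags))  # [False, True, True, False]
```

Raising `min_votes` to 3 would require all three metrics to agree, which would trade sensitivity for fewer false alarms.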
The experimental results demonstrate that the algorithm is universal in nature and, as such, may be implemented in other EDC systems, and that it is capable of detecting anomalous data with a sensitivity exceeding 85%.
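For reference, sensitivity is the fraction of truly anomalous records that the detector flags, TP / (TP + FN). The short sketch below computes it from ground-truth and predicted labels; the function name and the label vectors are made up for illustration and are not data from the study.

```python
def sensitivity(y_true, y_pred):
    """Return TP / (TP + FN), where a truthy label marks an anomalous record."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp + fn == 0:
        raise ValueError("y_true contains no anomalous records")
    return tp / (tp + fn)


# Illustrative labels only: 4 simulated anomalies, 3 of them detected.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
print(sensitivity(y_true, y_pred))  # 0.75
```

With simulated anomalies, as used in the evaluation experiments, the ground-truth labels are known by construction, which is what makes this measure computable.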