Reinecke Ines, Siebel Joscha, Fuhrmann Saskia, Fischer Andreas, Sedlmayr Martin, Weidner Jens, Bathelt Franziska
Institute for Medical Informatics and Biometry, Carl Gustav Carus Faculty of Medicine, Technische Universität Dresden, Dresden, Germany.
Center for Evidence-Based Healthcare, Carl Gustav Carus Faculty of Medicine, Technische Universität Dresden, Dresden, Germany.
JMIR Med Inform. 2023 Jan 25;11:e40312. doi: 10.2196/40312.
Digitization offers a multitude of opportunities to gain insights into current diagnostics and therapies from retrospective data. In this context, real-world data and their accessibility are of increasing importance to support unbiased and reliable research on big data. However, routinely collected data are not readily usable for research owing to the unstructured nature of health care systems and a lack of interoperability between these systems. This challenge is evident in drug data.
This study aimed to present an approach that identifies and increases the structuredness of drug data while ensuring standardization according to Anatomical Therapeutic Chemical (ATC) classification.
Our approach was based on available drug prescriptions and a drug catalog and consisted of 4 steps. First, we performed an initial analysis of the structuredness of local drug data to define a point of comparison for the effectiveness of the overall approach. Second, we applied 3 algorithms to unstructured data that translated text into ATC codes based on string comparisons in terms of ingredients and product names and performed similarity comparisons based on Levenshtein distance. Third, we validated the results of the 3 algorithms with expert knowledge based on the 1000 most frequently used prescription texts. Fourth, we performed a final validation to determine the increased degree of structuredness.
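The matching idea in the second step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the 3 algorithms are defined or ordered, so the cascade below (ingredient match, then product-name match, then Levenshtein similarity), the tiny `CATALOG`, the function names, and the `max_distance` threshold of 2 are all assumptions chosen for illustration.

```python
from typing import Optional

# Hypothetical mini-catalog for illustration: (product name, ingredient, ATC code).
CATALOG = [
    ("Ben-u-ron", "paracetamol", "N02BE01"),
    ("Nurofen", "ibuprofen", "M01AE01"),
    ("Beloc-Zok", "metoprolol", "C07AB02"),
]


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def match_exact(text: str, field_index: int) -> Optional[str]:
    """Return the ATC code if a catalog term (ingredient or name) is a token of the text."""
    tokens = text.lower().split()
    for entry in CATALOG:
        if entry[field_index].lower() in tokens:
            return entry[2]
    return None


def match_similar(text: str, max_distance: int = 2) -> Optional[str]:
    """Return the ATC code of the closest catalog term within max_distance edits."""
    best = None  # (distance, atc) of the best candidate so far
    for token in text.lower().split():
        for name, ingredient, atc in CATALOG:
            for term in (name.lower(), ingredient):
                d = levenshtein(token, term)
                if d <= max_distance and (best is None or d < best[0]):
                    best = (d, atc)
    return best[1] if best else None


def to_atc(text: str) -> Optional[str]:
    """Assumed cascade: ingredient match, then product name, then similarity."""
    return match_exact(text, 1) or match_exact(text, 0) or match_similar(text)
```

For example, `to_atc("Metoprlol 47,5 mg")` falls through both exact matchers and is resolved by the similarity step, because the misspelled token is one edit away from the ingredient name. A permissive distance threshold raises coverage at the cost of false matches, which is the kind of trade-off that expert validation of the most frequent prescription texts is meant to expose.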
Initially, 47.73% (n=843,980) of 1,768,153 drug prescriptions were classified as structured. With the application of the 3 algorithms, we were able to increase the degree of structuredness to 85.18% (n=1,506,059) based on the 1000 most frequent medication prescriptions. In this regard, the combination of algorithms 1, 2, and 3 resulted in a correctness level of 100% (with 57,264 ATC codes identified), algorithms 1 and 3 resulted in 99.6% (with 152,404 codes identified), and algorithms 1 and 2 resulted in 95.9% (with 39,472 codes identified).
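The reported proportions can be checked directly from the counts given above. The short sketch below uses only those figures; the variable names are illustrative.

```python
# Counts as reported in the results.
total_prescriptions = 1_768_153
structured_before = 843_980
structured_after = 1_506_059

# Degree of structuredness before and after applying the algorithms, in percent.
share_before = round(100 * structured_before / total_prescriptions, 2)
share_after = round(100 * structured_after / total_prescriptions, 2)

# Prescriptions that became structured through the approach.
newly_structured = structured_after - structured_before

print(share_before, share_after, newly_structured)
```

Both rounded shares agree with the percentages stated in the text, and the difference between the two counts corresponds to 662,079 additional prescriptions classified as structured.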
As the first analysis step of our approach shows, merely making a product catalog available for selection during documentation is not sufficient to generate structured data. Our 4-step approach mitigates this problem and automatically and reliably increases structuredness. Similarity matching shows promising results, particularly for entries with no connection to a product catalog. However, how to further improve the correctness of such a similarity matching algorithm needs to be investigated in future work.