Pallier Karine, Prot Olivier, Naldi Simone, Silva Francisco, Denis Thierry, Giry Olivier, Leobon Sophie, Deluche Elise, Tubiana-Mathieu Nicole
Centre de Coordination en Cancérologie de la Haute-Vienne - 3C87, CHU de Limoges, Limoges, France.
Univ. Limoges, CNRS, XLIM, UMR 7252, Limoges, France.
Cancer Inform. 2023 May 19;22:11769351231172609. doi: 10.1177/11769351231172609. eCollection 2023.
The Regional Basis of Solid Tumor (RBST), a clinical data warehouse, centralizes information related to cancer patient care in 5 health establishments in 2 French departments.
To develop algorithms matching heterogeneous data to "real" patients and "real" tumors with respect to patient identification (PI) and tumor identification (TI).
A Neo4j graph database, programmed in Java, was used to build the RBST with data from approximately 20 000 patients. The PI algorithm, which used the Levenshtein distance, was based on the regulatory criteria for identifying a patient. The TI algorithm was built on 6 characteristics: tumor location and laterality, date of diagnosis, histology, and primary and metastatic status. Because the collected data were heterogeneous in both form and semantics, repositories (organ, synonym, and histology repositories) had to be created. The TI algorithm used the Dice coefficient to match tumors.
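The two similarity measures named above can be sketched as follows. This is a minimal Python illustration, not the RBST implementation (which the abstract says was written in Java). The function names and the use of character bigrams as the sets compared by the Dice coefficient are assumptions made for the example; the abstract does not say which items the TI algorithm compares.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def bigrams(text: str) -> set:
    """Set of adjacent character pairs (an assumed tokenization)."""
    t = text.lower()
    return {t[i:i + 2] for i in range(len(t) - 1)}


def dice_coefficient(x: set, y: set) -> float:
    """Dice coefficient 2|X ∩ Y| / (|X| + |Y|), between 0 and 1."""
    if not x and not y:
        return 1.0  # convention chosen here: two empty sets are identical
    return 2 * len(x & y) / (len(x) + len(y))
```

For instance, `levenshtein("dupont", "dupond")` is 1 (a single substitution), and `dice_coefficient(bigrams("night"), bigrams("nacht"))` is 0.25, since the two words share only the bigram "ht" out of four bigrams each.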
Patients were matched if there was complete agreement on given name, surname, sex, and day/month/year of birth. These parameters were assigned weights of 28%, 28%, 21%, and 23% (18% for year, 2.5% for month, and 2.5% for day), respectively. The PI algorithm had a sensitivity of 99.69% (95% confidence interval [CI] [98.89%, 99.96%]) and a specificity of 100% (95% CI [99.72%, 100%]). The TI algorithm used the repositories; weights were assigned to the diagnosis date and associated organ (37.5% each), laterality (16%), histology (5%), and metastatic status (4%). The TI algorithm had a sensitivity of 71% (95% CI [62.68%, 78.25%]) and a specificity of 100% (95% CI [94.31%, 100%]).
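The reported weights can be read as a weighted agreement score, sketched below in Python. Only the weight values come from the abstract. The field names, the normalization step, the exact-equality comparison per field, and the helper functions are hypothetical; the abstract does not describe how the Levenshtein distance or the Dice coefficient feed into each field's contribution.

```python
from datetime import date

# Weights (in percent) as reported in the abstract.
PI_WEIGHTS = {"given_name": 28.0, "surname": 28.0, "sex": 21.0,
              "birth_year": 18.0, "birth_month": 2.5, "birth_day": 2.5}
TI_WEIGHTS = {"diagnosis_date": 37.5, "organ": 37.5, "laterality": 16.0,
              "histology": 5.0, "metastatic_status": 4.0}


def patient_fields(given_name: str, surname: str, sex: str, birth: date) -> dict:
    """Hypothetical normalization: trim and case-fold text, split the birth date."""
    return {"given_name": given_name.strip().casefold(),
            "surname": surname.strip().casefold(),
            "sex": sex.strip().casefold(),
            "birth_year": birth.year,
            "birth_month": birth.month,
            "birth_day": birth.day}


def weighted_agreement(a: dict, b: dict, weights: dict) -> float:
    """Sum of the weights of the fields on which records a and b agree exactly."""
    return sum(w for field, w in weights.items() if a[field] == b[field])


def same_patient(a: dict, b: dict) -> bool:
    """Complete agreement on every identity field, i.e. a score of 100%."""
    return weighted_agreement(a, b, PI_WEIGHTS) == sum(PI_WEIGHTS.values())
```

Under these assumptions, two records that differ only in the day of birth score 97.5 and are not treated as the same patient, consistent with the complete-agreement rule stated above. The same `weighted_agreement` function accepts `TI_WEIGHTS` for tumor records keyed by the corresponding fields.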
The RBST encompasses 2 quality controls: PI and TI. It facilitates the implementation of transversal structuring and assessments of the performance of the provided care.