Safety Surveillance and Risk Management, Pfizer Inc, New York, New York, USA.
Safety Surveillance and Risk Management, Pfizer Inc, Shanghai, China.
Pharmacoepidemiol Drug Saf. 2023 Mar;32(3):387-391. doi: 10.1002/pds.5555. Epub 2022 Dec 9.
Literature reports of adverse drug events can be replicated across multiple companies, resulting in extreme duplication (defined as a majority of reports being duplicates) in the FDA Adverse Event Reporting System (FAERS) database, because such reports can escape the legacy duplicate detection algorithms routinely deployed on that data source. The literature reference field, added to FAERS in 2014, could potentially be used to identify replicated reports. However, FAERS does not enforce adherence to the Vancouver referencing convention, so the same article may be cited in different ways, which obscures duplicates. The objective of this analysis is to determine whether variations of the same literature reference observed in FAERS can be resolved with text normalization and fuzzy string matching.
We normalized the literature references recorded in the FAERS database through the first quarter of 2021 with a rule-based algorithm so that they better conform to the Vancouver convention. Levenshtein distance was then utilized to merge sufficiently similar normalized literature references together.
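The two steps described above (normalization, then merging by Levenshtein distance) can be illustrated with a minimal sketch. It is not the authors' algorithm: the normalization rules, the 0.9 similarity threshold, the greedy grouping strategy, and all function names are assumptions made for illustration, and the sketch compares whole reference strings rather than parsed fields such as the title. The sample references are invented placeholders, not real citations.

```python
import re
import unicodedata


def normalize_reference(ref: str) -> str:
    """Hypothetical normalization: fold accents and case, replace
    punctuation with spaces, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", ref)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def group_references(refs, threshold=0.9):
    """Greedy grouping: attach each reference to the first group whose
    representative is similar enough, else start a new group.
    The threshold value is an assumption, not taken from the study."""
    groups = []  # list of (normalized representative, original members)
    for ref in refs:
        norm = normalize_reference(ref)
        for representative, members in groups:
            if similarity(norm, representative) >= threshold:
                members.append(ref)
                break
        else:
            groups.append((norm, [ref]))
    return [members for _, members in groups]


if __name__ == "__main__":
    # Invented placeholder references, for illustration only.
    sample = [
        "Doe J, Roe A. Example title of a case report. J Example Med. 2019;13(2):100-5.",
        "DOE J, ROE A. Example title of a case report. J Example Med 2019; 13(2): 100-105",
        "Poe K. A different article about something else. Example Lancet. 2018;391:10-12.",
    ]
    for group in group_references(sample):
        print(len(group), group)
```

Greedy grouping against a single representative is order-dependent and quadratic in the worst case; for a dataset the size of FAERS, some form of blocking (for example, comparing only references that share a publication year or first author) would likely be needed before pairwise comparison.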
Normalization of literature references increased the percentage that could be parsed into author, title, and journal from 61.74% to 93.93%. About 98% of reference pairs within merged groups had a title Levenshtein similarity above the threshold. Among cases of extreme duplication, the share of duplicate reports ranged from 66% to 87%, with a median of 72%, and these cases often involved addictovigilance scenarios.
We have shown that normalized literature references can be merged via fuzzy string matching to improve enumeration of all the individual case safety reports that refer to the same article. Inclusion of the PubMed ID and adherence to the Vancouver convention could facilitate identification of duplicates in the FAERS dataset. Awareness of this phenomenon may improve disproportionality analysis, especially in areas such as addictovigilance.