Chakraborty Payal, Ning Xia, McNeill Mary, Kline David M, Shoben Abigail B, Miller William C, Norris Turner Abigail
Ohio Department of Health, Columbus, OH.
Division of Public Health Sciences, Department of Biostatistics and Data Science, Wake Forest School of Medicine, Winston-Salem, NC.
Sex Transm Dis. 2025 Mar 1;52(3):146-153. doi: 10.1097/OLQ.0000000000002091. Epub 2024 Oct 31.
Developments in natural language processing and unsupervised machine learning methodologies (e.g., clustering) have given researchers new tools to analyze both structured and unstructured health data. We applied these methods to 2019 Ohio disease intervention specialist (DIS) syphilis records to determine whether they can uncover patterns of co-occurrence among individual characteristics, risk factors, and clinical characteristics of syphilis that have not yet been reported in the literature.
The 2019 disease intervention specialist syphilis records (n = 1996) contain both structured data (categorical and numerical variables) and unstructured notes. In the structured data, we examined case demographics, syphilis risk factors, and clinical characteristics of syphilis. For the unstructured text, we applied TF-IDF (term frequency multiplied by inverse document frequency) weights, a common way to convert text into numerical representations. We performed agglomerative clustering with cosine similarity using the CLUTO software.
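The text-weighting and clustering steps described above can be illustrated with a minimal sketch. It is not the CLUTO implementation: the toy notes are invented, the function names (`tfidf_vectors`, `cosine`, `agglomerate`) are hypothetical, the IDF is taken as log(N / df), and group-average merging is an assumption, since the abstract does not state which merging criterion was used.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def agglomerate(vectors, k):
    """Start with singletons and repeatedly merge the two clusters with the
    highest average pairwise cosine similarity until k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]

    def avg_sim(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        _, p, q = max(
            (avg_sim(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

# Invented toy notes, for illustration only (not study data).
notes = [
    "rash palms treated penicillin",
    "rash palms penicillin follow-up",
    "partner notified phone interview",
    "partner notified interview declined",
]
vectors = tfidf_vectors(notes)
clusters = agglomerate(vectors, 2)
```

In this toy input the first two notes share no terms with the last two, so their cosine similarity is zero and the two pairs end up in separate clusters. Real records would also need the structured variables encoded alongside the text weights, which this sketch leaves out.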
The cluster analysis yielded 6 clusters of syphilis cases based on patterns in the structured and unstructured data. The average internal similarities were much higher than the average external similarities, indicating that the clusters were well formed. The factors underlying 3 of the clusters related to patterns of missing data. The factors underlying the other 3 clusters were sexual behaviors and partnerships. Notably, 1 of these 3 clusters consisted of individuals who reported oral sex with male or anonymous partners while intoxicated, and another comprised mainly males who have sex with females.
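The comparison of internal and external similarity can be sketched as follows. This is one plausible reading of those quantities, not CLUTO's exact computation: internal similarity is taken as the mean pairwise cosine similarity among a cluster's members, and external similarity as the mean cosine similarity between members and non-members. The vectors and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def internal_similarity(vectors, members):
    """Mean cosine similarity over all distinct pairs inside one cluster."""
    pairs = [(i, j) for i in members for j in members if i < j]
    if not pairs:
        return 1.0  # a singleton is trivially similar to itself
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

def external_similarity(vectors, members):
    """Mean cosine similarity between cluster members and all other records."""
    outside = [j for j in range(len(vectors)) if j not in members]
    if not members or not outside:
        return 0.0
    total = sum(cosine(vectors[i], vectors[j]) for i in members for j in outside)
    return total / (len(members) * len(outside))

# Invented feature vectors and cluster assignment, for illustration only.
vectors = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
]
clusters = [[0, 1], [2, 3]]
summary = [
    (internal_similarity(vectors, c), external_similarity(vectors, c))
    for c in clusters
]
```

A cluster whose first value sits well above its second is tight relative to the rest of the data. As the conclusion of this abstract notes, that is a mathematical property and does not by itself imply that the cluster is epidemiologically informative.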
Our analysis produced clusters that were well formed mathematically, but it did not reveal epidemiological information about syphilis risk factors or transmission that was not already known.