Hao Tianyong, Rusanov Alexander, Boland Mary Regina, Weng Chunhua
Department of Biomedical Informatics, Columbia University, New York, NY, United States.
Department of Anesthesiology, Columbia University, New York, NY, United States.
J Biomed Inform. 2014 Dec;52:112-20. doi: 10.1016/j.jbi.2014.01.009. Epub 2014 Feb 1.
To automatically identify and cluster clinical trials with similar eligibility features.
Using the public repository ClinicalTrials.gov as the data source, we extracted semantic features from the eligibility criteria text of all clinical trials and constructed a trial-feature matrix. We then calculated pairwise similarities between all clinical trials based on their eligibility features. Taking each trial in turn as a center, we identified the trials whose similarity to that center was greater than or equal to a predefined threshold and constructed a center-based cluster from them. Finally, we identified unique trial sets, that is, sets with distinct trial membership, from the center-based clusters by disregarding their structural information.
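The clustering steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the similarity measure, so Jaccard similarity over feature sets is assumed here, and "disregarding structural information" is read as ignoring which trial served as the center. The trial identifiers, feature strings, and function names are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two feature sets (an assumed measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def center_based_clusters(features: dict, threshold: float) -> dict:
    """Map each center trial to the other trials at or above the threshold."""
    clusters = {}
    for center, center_feats in features.items():
        members = {
            other
            for other, feats in features.items()
            if other != center and jaccard(center_feats, feats) >= threshold
        }
        if members:  # keep only centers that attract at least one trial
            clusters[center] = members
    return clusters


def unique_trial_sets(clusters: dict) -> set:
    """Drop the center/member distinction and keep distinct membership sets."""
    return {frozenset(members | {center}) for center, members in clusters.items()}


# Hypothetical toy data: trial ID -> set of eligibility features.
trial_features = {
    "T1": {"diabetes", "age>=18", "hba1c"},
    "T2": {"diabetes", "age>=18", "hba1c"},
    "T3": {"diabetes", "age>=18", "pregnancy"},
    "T4": {"asthma", "age<12"},
}

clusters = center_based_clusters(trial_features, threshold=0.9)
trial_sets = unique_trial_sets(clusters)
```

In this toy example T1 and T2 each form a center-based cluster containing the other, and both clusters collapse into a single unique trial set once the center role is ignored. Comparing every trial with every other trial is quadratic in the number of trials, so a run at the scale of ClinicalTrials.gov would likely need a sparse matrix representation or an index over shared features.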
From the 145,745 clinical trials on ClinicalTrials.gov, we extracted 5,508,491 semantic features. Of these, 459,936 were unique and 160,951 were shared by at least one pair of trials. By crowdsourcing the cluster evaluation on Amazon Mechanical Turk (MTurk), we identified 0.9 as the optimal similarity threshold. Using this threshold, we generated 8,806 center-based clusters. Evaluation of a sample of the clusters by MTurk resulted in a mean score of 4.331 ± 0.796 on a scale of 1-5 (5 indicating "strongly agree that the trials in the cluster are similar").
We contribute an automated approach to clustering clinical trials with similar eligibility features. This approach may be useful for investigating knowledge reuse patterns in clinical trial eligibility criteria designs and for improving clinical trial recruitment. We also contribute an effective crowdsourcing method for evaluating informatics interventions.