Himmel Wolfgang, Reincke Ulrich, Michelmann Hans Wilhelm
Department of General Practice/Family Medicine, University of Göttingen, Humboldtallee 38, 37070 Göttingen, Germany.
J Med Internet Res. 2009 Jul 22;11(3):e25. doi: 10.2196/jmir.1123.
Both healthy and sick people increasingly use electronic media to obtain medical information and advice. For example, Internet users may send requests to Web-based expert forums, or so-called "ask the doctor" services.
To automatically classify lay requests to an Internet medical expert forum using a combination of different text-mining strategies.
We first manually classified a sample of 988 requests directed to an involuntary childlessness forum on the German website "Rund ums Baby" ("Everything about Babies") into one or more of 38 categories belonging to two dimensions ("subject matter" and "expectations"). After creating start and synonym lists, we calculated the average Cramer's V statistic for the association of each word with each category. We also used principal component analysis and singular value decomposition as further text-mining strategies. With these measures, we trained regression models and, on the basis of the best regression models, determined for each request the probability of belonging to each of the 38 categories, with a cutoff of 50%. Recall and precision in a test sample were calculated as measures of quality for the automatic classification.
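As an illustration of the word-category association measure named above, the following minimal sketch computes Cramer's V from a contingency table of word presence against category membership. The function name `cramers_v` and the example counts are hypothetical and are not taken from the study; the sketch uses the uncorrected chi-square statistic, which may differ from the software settings the authors used.

```python
import math

def cramers_v(table):
    """Cramer's V for an r x c contingency table given as a list of rows.

    Hypothetical helper for illustration: V = sqrt(chi2 / (n * (min(r, c) - 1))),
    using the chi-square statistic without continuity correction.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    k = min(len(row_totals), len(col_totals)) - 1
    # Degenerate tables (empty margins or a single row/column) carry no association.
    if n == 0 or k < 1 or 0 in row_totals or 0 in col_totals:
        return 0.0
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))

# Invented counts: rows = word present / absent, columns = in category / not in category.
example_table = [[30, 10],
                 [10, 50]]
v = cramers_v(example_table)
```

For a 2 x 2 table such as this one, the value equals the absolute phi coefficient, so words that occur almost only inside (or almost only outside) a category score close to 1.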
According to the manual classification of 988 documents, 102 (10%) documents fell into the category "in vitro fertilization (IVF)," 81 (8%) into the category "ovulation," 79 (8%) into "cycle," and 57 (6%) into "semen analysis." These were the four most frequent categories in the subject matter dimension (consisting of 32 categories). The expectation dimension comprised six categories; we classified 533 documents (54%) as "general information" and 351 (36%) as a wish for "treatment recommendations." The generation of indicator variables based on the chi-square analysis and Cramer's V proved to be the best approach for automatic classification in about half of the categories. In combination with the two other approaches, 100% precision and 100% recall were realized in 18 (47%) out of the 38 categories in the test sample. For 35 (92%) categories, precision and recall were better than 80%. For some categories, the input variables (ie, "words") also included variables from other categories, most often with a negative sign. For example, absence of words predictive for "menstruation" was a strong indicator for the category "pregnancy test."
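The precision and recall figures reported above follow from thresholding each request's predicted category probability at 50%. The sketch below shows that calculation for a single category under stated assumptions: the function name `precision_recall`, the treatment of a probability of exactly 0.5 as a positive prediction, and the example values are all hypothetical and not taken from the study.

```python
def precision_recall(probabilities, labels, cutoff=0.5):
    """Precision and recall for one category at a probability cutoff.

    Assumption for illustration: a request is assigned to the category
    when its predicted probability is at least `cutoff`.
    """
    predicted = [p >= cutoff for p in probabilities]
    tp = sum(1 for pred, y in zip(predicted, labels) if pred and y)
    fp = sum(1 for pred, y in zip(predicted, labels) if pred and not y)
    fn = sum(1 for pred, y in zip(predicted, labels) if not pred and y)
    # Return 0.0 when a ratio is undefined (no predicted or no true positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Invented model outputs and manual labels (1 = request belongs to the category).
probs = [0.9, 0.7, 0.4, 0.2, 0.6]
truth = [1, 1, 1, 0, 0]
precision, recall = precision_recall(probs, truth)
```

Here two of the three requests assigned to the category are correct and two of the three true members are found, so both measures come out at two thirds.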
Our approach suggests a way of automatically classifying and analyzing unstructured information in Internet expert forums. The technique can perform a preliminary categorization of new requests and help Internet medical experts to better handle the mass of information and to give professional feedback.