Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis), Wright State University, Dayton, OH 45435, USA.
J Biomed Inform. 2013 Dec;46(6):985-97. doi: 10.1016/j.jbi.2013.07.007. Epub 2013 Jul 25.
The role of social media in biomedical knowledge mining, including clinical, medical and healthcare informatics, prescription drug abuse epidemiology and drug pharmacology, has become increasingly significant in recent years. Social media offers opportunities for people to share opinions and experiences freely in online communities, which may contribute information beyond the knowledge of domain professionals. This paper describes the development of a novel semantic web platform called PREDOSE (PREscription Drug abuse Online Surveillance and Epidemiology), which is designed to facilitate the epidemiologic study of prescription (and related) drug abuse practices using social media. PREDOSE uses web forum posts and domain knowledge, modeled in a manually created Drug Abuse Ontology (DAO, pronounced "dow"), to facilitate the extraction of semantic information from User Generated Content (UGC) through a combination of lexical, pattern-based and semantics-based techniques. In a previous study, PREDOSE was used to obtain the datasets from which new knowledge in drug abuse research was derived. Here, we report on various platform enhancements: an updated DAO, new components for relationship and triple extraction, and tools for content analysis, trend detection and emerging pattern exploration. Given these enhancements, PREDOSE is now better equipped to impact drug abuse research by alleviating traditionally labor-intensive content analysis tasks.
Using custom web crawlers that scrape UGC from publicly available web forums, PREDOSE first automates the collection of web-based social media content for subsequent semantic annotation. The annotation scheme is modeled in the DAO, and includes domain-specific knowledge such as prescription (and related) drugs, methods of preparation, side effects, and routes of administration. The DAO is also used to help recognize three types of data: (1) entities, (2) relationships and (3) triples. PREDOSE then uses a combination of lexical and semantics-based techniques to extract entities and relationships from the scraped content, and a top-down approach for triple extraction that uses patterns expressed in the DAO. In addition, PREDOSE uses publicly available lexicons to identify initial sentiment expressions in text, and then a probabilistic optimization algorithm (from related research) to extract the final sentiment expressions. Together, these techniques enable the capture of fine-grained semantic information, which facilitates search, trend analysis and overall content analysis of social media content on prescription drug abuse. Moreover, extracted data are made available to domain experts for the creation of training and test sets for use in evaluating and refining the information extraction techniques.
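The lexical entity spotting and top-down, pattern-driven triple extraction described above can be illustrated with a minimal sketch. This is not the PREDOSE implementation: the mini-lexicon, the class names, the predicates, and the functions `spot_entities` and `extract_triples` are all hypothetical stand-ins for what the DAO would supply.

```python
import re
from itertools import product

# Hypothetical mini-lexicon standing in for DAO concepts. The slang terms and
# class names are illustrative assumptions, not taken from the real DAO.
LEXICON = {
    "oxycodone": ("Oxycodone", "Drug"),
    "oxy": ("Oxycodone", "Drug"),
    "loperamide": ("Loperamide", "Drug"),
    "nausea": ("Nausea", "SideEffect"),
    "snorting": ("Insufflation", "RouteOfAdministration"),
}

# Hypothetical schema-level patterns: (subject class, predicate, object class).
TRIPLE_PATTERNS = [
    ("Drug", "causes", "SideEffect"),
    ("Drug", "administered_via", "RouteOfAdministration"),
]


def spot_entities(sentence):
    """Return unique (concept, class) pairs for lexicon terms in the sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    found = []
    for token in tokens:
        entry = LEXICON.get(token)
        if entry is not None and entry not in found:
            found.append(entry)
    return found


def extract_triples(sentence):
    """Top-down: fill each schema pattern with co-occurring entities."""
    entities = spot_entities(sentence)
    triples = []
    for subj_cls, predicate, obj_cls in TRIPLE_PATTERNS:
        subjects = [c for c, cls in entities if cls == subj_cls]
        objects = [c for c, cls in entities if cls == obj_cls]
        for subj, obj in product(subjects, objects):
            triples.append((subj, predicate, obj))
    return triples


if __name__ == "__main__":
    for triple in extract_triples("Snorting oxy gave me bad nausea"):
        print(triple)
```

Because this sketch pairs every co-occurring subject and object that fits a pattern, it over-generates candidate triples; a real system would need additional lexical or syntactic evidence to filter them.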
A recent evaluation of the information extraction techniques applied in the PREDOSE platform indicates 85% precision and 72% recall in entity identification, on a manually created gold standard dataset. In another study, PREDOSE achieved 36% precision in relationship identification and 33% precision in triple extraction, through manual evaluation by domain experts. Given the complexity of the relationship and triple extraction tasks and the abstruse nature of social media texts, we interpret these as favorable initial results. Extracted semantic information is currently in use in an online discovery support system, by prescription drug abuse researchers at the Center for Interventions, Treatment and Addictions Research (CITAR) at Wright State University.
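For readers less familiar with the metrics, precision and recall against a gold standard follow the standard set-based definitions, sketched below. The function `precision_recall` and the example annotations are made up for illustration and are not data from the PREDOSE evaluation.

```python
def precision_recall(predicted, gold):
    """Compute set-based precision and recall of predictions against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: fraction of predicted items that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of gold items that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up (post id, entity) annotations, purely illustrative.
    gold = {(1, "Oxycodone"), (1, "Nausea"), (2, "Loperamide"), (3, "Insufflation")}
    predicted = {(1, "Oxycodone"), (1, "Nausea"), (2, "Loperamide"), (3, "Heroin")}
    p, r = precision_recall(predicted, gold)
    print(f"precision={p:.2f} recall={r:.2f}")
```

Note that the relationship and triple results are reported as precision only, from manual expert review; recall would additionally require an exhaustive gold standard of relationships and triples.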
To our knowledge, no comprehensive platform for entity, relationship, triple and sentiment extraction from such abstruse texts has previously been developed for drug abuse research. PREDOSE has already demonstrated the importance of mining social media by providing data from which new findings in drug abuse research were uncovered. Given the recent platform enhancements, including the refined DAO, components for relationship and triple extraction, and tools for content, trend and emerging pattern analysis, PREDOSE is expected to play a significant role in advancing drug abuse epidemiology in the future.