Biomedical Science and Engineering Center, Health Data Sciences Institute, Oak Ridge National Laboratory, One Bethel Valley Road, Oak Ridge, TN 37830, USA.
Bioinformatics. 2014 Jan 1;30(1):104-14. doi: 10.1093/bioinformatics/btt571. Epub 2013 Sep 29.
Life stories of diseased and healthy individuals are abundantly available on the Internet. Collecting and mining such online content can offer many valuable insights into patients' physical and emotional states throughout the pre-diagnosis, diagnosis, treatment and post-treatment stages of the disease compared with those of healthy subjects. However, such content is widely dispersed across the web. Using traditional query-based search engines to manually collect relevant materials is rather labor intensive and often incomplete due to resource constraints in terms of human query composition and result parsing efforts. The alternative option, blindly crawling the whole web, has proven inefficient and unaffordable for e-health researchers.
We propose a user-oriented web crawler that adaptively acquires user-desired content on the Internet to meet the specific online data source acquisition needs of e-health researchers. Experimental results on two cancer-related case studies show that the new crawler can substantially accelerate the acquisition of highly relevant online content compared with the existing state-of-the-art adaptive web crawling technology. For the breast cancer case study using the full training set, the new method maintains a cumulative precision between 74.7 and 79.4% from 5 h of execution through the end of the 20-h crawling session, compared with between 32.8 and 37.0% for the peer method over the same period. For the lung cancer case study using the full training set, the new method maintains a cumulative precision between 56.7 and 61.2% over the same 5-h to 20-h window, compared with between 29.3 and 32.4% for the peer method. Using the reduced training set, the cumulative precision of our method in the breast cancer case study is between 44.6 and 54.9%, whereas that of the peer method is between 24.3 and 26.3%; in the lung cancer case study, the cumulative precision of our method is between 35.7 and 46.7%, versus between 24.1 and 29.6% for the peer method. In all four settings, our method is consistently more precise than the peer method in discovering and acquiring user-desired online content for e-health research.
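To make the cumulative precision figures above concrete, the following is a minimal sketch of how such a metric could be computed from a crawl log. It assumes, as the abstract does not spell out, that cumulative precision at time t is the fraction of all pages fetched up to t that are labeled relevant. The function name `cumulative_precision`, the log format and the example values are hypothetical and are not taken from the authors' implementation.

```python
from typing import Iterable, List, Tuple


def cumulative_precision(
    crawl_log: Iterable[Tuple[float, bool]],
    checkpoints_h: Iterable[float],
) -> List[Tuple[float, float]]:
    """Return (checkpoint, precision) pairs for a crawl session.

    crawl_log holds one (hours_since_start, is_relevant) entry per fetched
    page. Assumed definition: precision at a checkpoint is
    relevant pages fetched so far / all pages fetched so far.
    """
    log = sorted(crawl_log)  # order entries by fetch time
    results = []
    relevant = 0
    total = 0
    i = 0
    for t in sorted(checkpoints_h):
        # Consume every page fetched at or before this checkpoint.
        while i < len(log) and log[i][0] <= t:
            relevant += int(log[i][1])
            total += 1
            i += 1
        # Report 0.0 when nothing has been fetched yet.
        results.append((t, relevant / total if total else 0.0))
    return results


if __name__ == "__main__":
    # Hypothetical four-page log, for illustration only.
    demo_log = [(0.5, True), (1.0, False), (2.0, True), (3.0, True)]
    for hour, precision in cumulative_precision(demo_log, [1, 3]):
        print(f"{hour} h: {precision:.2%}")
```

Evaluating the function at checkpoints from 5 h to 20 h would yield a range of values of the kind reported above, under the stated assumption about how the metric is defined.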
The implementation of our user-oriented web crawler is freely available to non-commercial users via the following Web site: http://bsec.ornl.gov/AdaptiveCrawler.shtml. The Web site provides a step-by-step guide on how to execute the web crawler implementation. In addition, the Web site provides the two study datasets including manually labeled ground truth, initial seeds and the crawling results reported in this article.