A user-oriented web crawler for selectively acquiring online content in e-health research.

Affiliation

Biomedical Science and Engineering Center, Health Data Sciences Institute, Oak Ridge National Laboratory, One Bethel Valley Road, Oak Ridge, TN 37830, USA.

Publication information

Bioinformatics. 2014 Jan 1;30(1):104-14. doi: 10.1093/bioinformatics/btt571. Epub 2013 Sep 29.

DOI:10.1093/bioinformatics/btt571
PMID:24078710
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3866553/
Abstract

MOTIVATION

Life stories of diseased and healthy individuals are abundantly available on the Internet. Collecting and mining such online content can offer many valuable insights into patients' physical and emotional states throughout the pre-diagnosis, diagnosis, treatment and post-treatment stages of the disease compared with those of healthy subjects. However, such content is widely dispersed across the web. Using traditional query-based search engines to manually collect relevant materials is rather labor intensive and often incomplete due to resource constraints in terms of human query composition and result parsing efforts. The alternative option, blindly crawling the whole web, has proven inefficient and unaffordable for e-health researchers.

RESULTS

We propose a user-oriented web crawler that adaptively acquires user-desired content on the Internet to meet the specific online data source acquisition needs of e-health researchers. Experimental results on two cancer-related case studies show that the new crawler can substantially accelerate the acquisition of highly relevant online content compared with the existing state-of-the-art adaptive web crawling technology. For the breast cancer case study using the full training set, the new method achieves a cumulative precision between 74.7 and 79.4% after 5 h of execution till the end of the 20-h long crawling session as compared with the cumulative precision between 32.8 and 37.0% using the peer method for the same time period. For the lung cancer case study using the full training set, the new method achieves a cumulative precision between 56.7 and 61.2% after 5 h of execution till the end of the 20-h long crawling session as compared with the cumulative precision between 29.3 and 32.4% using the peer method. Using the reduced training set in the breast cancer case study, the cumulative precision of our method is between 44.6 and 54.9%, whereas the cumulative precision of the peer method is between 24.3 and 26.3%; for the lung cancer case study using the reduced training set, the cumulative precisions of our method and the peer method are, respectively, between 35.7 and 46.7% versus between 24.1 and 29.6%. These numbers clearly show a consistently superior accuracy of our method in discovering and acquiring user-desired online content for e-health research.
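
The abstract reports cumulative precision without defining it. A minimal sketch of one common reading follows, assuming cumulative precision after the k-th fetched page is the fraction of the first k pages judged relevant. The function name `cumulative_precision` and the toy `crawl_log` are hypothetical illustrations, not part of the authors' implementation or data.

```python
from itertools import accumulate


def cumulative_precision(relevance_flags):
    """Return the running fraction of relevant pages after each fetched page.

    relevance_flags: iterable of truthy/falsy values in crawl order, one per
    fetched page (assumed to come from manual labels or a classifier).
    """
    # Running count of relevant pages seen so far.
    hits = accumulate(int(bool(flag)) for flag in relevance_flags)
    # Divide each running count by the number of pages fetched so far.
    return [h / n for n, h in enumerate(hits, start=1)]


# Toy crawl order: relevant, irrelevant, relevant, relevant, irrelevant.
crawl_log = [True, False, True, True, False]
curve = cumulative_precision(crawl_log)
print(curve)
```

Under this reading, a range such as "between 74.7 and 79.4%" would correspond to the minimum and maximum of the curve over the portion of the session from hour 5 to hour 20.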

AVAILABILITY AND IMPLEMENTATION

The implementation of our user-oriented web crawler is freely available to non-commercial users via the following Web site: http://bsec.ornl.gov/AdaptiveCrawler.shtml. The Web site provides a step-by-step guide on how to execute the web crawler implementation. In addition, the Web site provides the two study datasets including manually labeled ground truth, initial seeds and the crawling results reported in this article.
