Recall and bias of retrieving gene expression microarray datasets through PubMed identifiers.

Authors

Piwowar Heather, Chapman Wendy

Affiliation

University of Pittsburgh.

Publication

J Biomed Discov Collab. 2010 Mar 28;5:7-20.

PMID: 20349403
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2990274/

Abstract

BACKGROUND

The ability to locate publicly available gene expression microarray datasets effectively and efficiently facilitates the reuse of these potentially valuable resources. Centralized biomedical databases allow users to query dataset metadata descriptions, but these annotations are often too sparse and diverse to allow complex and accurate queries. In this study we examined the ability of PubMed article identifiers to locate publicly available gene expression microarray datasets, and investigated whether the retrieved datasets were representative of publicly available datasets found through statements of data sharing in the associated research articles.
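
To make the idea of locating datasets by PubMed identifier concrete, here is a minimal Python sketch. It is not the authors' script: it assumes the NCBI E-utilities ELink endpoint is used to map a PMID to linked GEO DataSets (`gds`) records, and the function names and the expected XML layout are illustrative assumptions. ArrayExpress is not covered.

```python
from typing import List
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# NCBI E-utilities ELink endpoint (assumed here as the lookup mechanism).
EUTILS_ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def build_elink_url(pmid: str) -> str:
    """Build an ELink URL asking for GEO DataSets records linked to a PMID."""
    params = {"dbfrom": "pubmed", "db": "gds", "id": pmid}
    return f"{EUTILS_ELINK}?{urlencode(params)}"


def parse_linked_ids(xml_text: str) -> List[str]:
    """Extract linked record IDs from an ELink XML response.

    Assumes the usual layout eLinkResult/LinkSet/LinkSetDb/Link/Id;
    the query PMID itself sits under IdList and is not matched.
    """
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(".//LinkSetDb/Link/Id") if el.text]


def fetch_linked_gds_ids(pmid: str) -> List[str]:
    """Network call (hypothetical helper); an empty list means no linked record."""
    with urlopen(build_elink_url(pmid)) as response:
        return parse_linked_ids(response.read().decode("utf-8"))
```

A dataset whose submitter never filled in the citation field would come back with no link under this approach, which is the kind of gap in recall that the study measures.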

RESULTS

In a recent article, Ochsner and colleagues identified 397 studies that had generated gene expression microarray data. Their search of the full text of each publication for statements of data sharing revealed 203 publicly available datasets, including 179 in the Gene Expression Omnibus (GEO) or ArrayExpress databases. Our scripted search of GEO and ArrayExpress for PubMed identifiers of the same 397 studies returned 160 datasets, including six not found by the original search for data sharing statements. As a proportion of datasets found by either method, the search for data sharing statements identified 91.4% of the 209 publicly available datasets, compared to only 76.6% found by our search carried out using PubMed identifiers. Searching GEO or ArrayExpress alone retrieved 63.2% and 46.9% of all available datasets, respectively. There was no difference in the type of datasets found by PubMed identifier searches in terms of research theme or the technology used. However, the studies identified were more likely to have larger sample sizes, were more frequently cited, and published in higher impact journals.
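
The recall figure for the PubMed identifier search can be checked from the counts stated above. The short Python sketch below assumes the denominator of 209 is the 203 datasets found through data sharing statements plus the 6 found only by the identifier search; the function name is illustrative.

```python
def recall_percent(found: int, total: int) -> float:
    """Recall as a percentage of all known datasets, rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * found / total, 1)


# Counts taken from the abstract.
FOUND_BY_STATEMENTS = 203   # datasets found via data sharing statements
FOUND_ONLY_BY_PMID = 6      # extra datasets found only by the PMID search
FOUND_BY_PMID_SEARCH = 160  # datasets returned by the scripted PMID search

# Assumed denominator: datasets found by either method.
TOTAL_AVAILABLE = FOUND_BY_STATEMENTS + FOUND_ONLY_BY_PMID

print(recall_percent(FOUND_BY_PMID_SEARCH, TOTAL_AVAILABLE))  # 76.6
```

The abstract does not give the underlying counts for the other percentages, so they are not recomputed here.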

CONCLUSIONS

Searching database entries using PubMed identifiers can identify the majority of publicly available datasets, but caution is required when this method is used to collect data for policy evaluation since studies in low impact journals are disproportionately excluded. We urge authors of all datasets to complete the citation fields for their dataset submissions once publication details are known, thereby ensuring their work has maximum visibility and can contribute to subsequent studies.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/153b/2990274/9527e3131912/Jbiomeddiscovcollab-05-e02-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/153b/2990274/64e85042a402/Jbiomeddiscovcollab-05-e02-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/153b/2990274/8e16cdfaef33/Jbiomeddiscovcollab-05-e02-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/153b/2990274/3a14613b1343/Jbiomeddiscovcollab-05-e02-g004.jpg

Cited by

1
A systematic review of non-coding RNA genes with differential expression profiles associated with autism spectrum disorders.
PLoS One. 2023 Jun 15;18(6):e0287131. doi: 10.1371/journal.pone.0287131. eCollection 2023.
2
NeuroRDF: semantic integration of highly curated data to prioritize biomarker candidates in Alzheimer's disease.
J Biomed Semantics. 2016 Jul 8;7:45. doi: 10.1186/s13326-016-0079-8.
3
NeuroTransDB: highly curated and structured transcriptomic metadata for neurodegenerative diseases.
Database (Oxford). 2015 Oct 16;2015. doi: 10.1093/database/bav099. Print 2015.
4
The dawn of open access to phylogenetic data.
PLoS One. 2014 Oct 24;9(10):e110268. doi: 10.1371/journal.pone.0110268. eCollection 2014.
5
Data reuse and the open data citation advantage.
PeerJ. 2013 Oct 1;1:e175. doi: 10.7717/peerj.175. eCollection 2013.
6
Who shares? Who doesn't? Factors associated with openly archiving raw research data.
PLoS One. 2011;6(7):e18657. doi: 10.1371/journal.pone.0018657. Epub 2011 Jul 13.
7
Analysis of microarray data from the macaque corpus luteum; the search for common themes in primate luteal regression.
Mol Hum Reprod. 2011 Mar;17(3):143-51. doi: 10.1093/molehr/gaq080. Epub 2010 Sep 20.

References

1
Toward a public toxicogenomics capability for supporting predictive toxicology: survey of current resources and chemical indexing of experiments in GEO and ArrayExpress.
Toxicol Sci. 2009 Jun;109(2):358-71. doi: 10.1093/toxsci/kfp061. Epub 2009 Mar 30.
2
Much room for improvement in deposition rates of expression microarray datasets.
Nat Methods. 2008 Dec;5(12):991. doi: 10.1038/nmeth1208-991.
3
Methodologies for extracting functional pharmacogenomic experiments from international repository.
AMIA Annu Symp Proc. 2007 Oct 11;2007:463-7.
4
Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project.
Nat Biotechnol. 2008 Aug;26(8):889-96. doi: 10.1038/nbt.1411.
5
BioLit: integrating biological literature with databases.
Nucleic Acids Res. 2008 Jul 1;36(Web Server issue):W385-9. doi: 10.1093/nar/gkn317. Epub 2008 May 31.
6
Enabling integrative genomic analysis of high-impact human diseases through text mining.
Pac Symp Biocomput. 2008:580-91.
7
Annotation and query of tissue microarray data using the NCI Thesaurus.
BMC Bioinformatics. 2007 Aug 8;8:296. doi: 10.1186/1471-2105-8-296.
8
Compete, collaborate, compel.
Nat Genet. 2007 Aug;39(8):931. doi: 10.1038/ng0807-931.
9
BioText Search Engine: beyond abstract search.
Bioinformatics. 2007 Aug 15;23(16):2196-7. doi: 10.1093/bioinformatics/btm301. Epub 2007 Jun 1.
10
Sharing detailed research data is associated with increased citation rate.
PLoS One. 2007 Mar 21;2(3):e308. doi: 10.1371/journal.pone.0000308.