
Forming Big Datasets through Latent Class Concatenation of Imperfectly Matched Databases Features.

Affiliations

Battelle Center for Mathematical Medicine, Abigail Wexner Research Institute, Nationwide Children's Hospital, Columbus, OH 43215, USA.

Department of Pediatrics, College of Medicine, The Ohio State University, Columbus, OH 43215, USA.

Publication information

Genes (Basel). 2019 Sep 19;10(9):727. doi: 10.3390/genes10090727.

DOI:10.3390/genes10090727
PMID:31546899
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6771148/
Abstract

Informatics researchers often need to combine data from many different sources to increase statistical power and study subtle or complicated effects. Perfect overlap of measurements across academic studies is rare since virtually every dataset is collected for a unique purpose and without coordination across parties not-at-hand (i.e., informatics researchers in the future). Thus, incomplete concordance of measurements across datasets poses a major challenge for researchers seeking to combine public databases. In any given field, some measurements are fairly standard, but every organization collecting data makes unique decisions on instruments, protocols, and methods of processing the data. This typically precludes literal concatenation of the raw data since constituent cohorts do not have the same measurements (i.e., columns of data). When measurements across datasets are similar prima facie, there is a desire to combine the data to increase power, but mixing non-identical measurements could greatly reduce the sensitivity of the downstream analysis. Here, we discuss a statistical method that is applicable when certain patterns of missing data are found; namely, it is possible to combine datasets that measure the same underlying constructs (or latent traits) when there is only partial overlap of measurements across the constituent datasets. Our method, ROSETTA, empirically derives a set of common latent trait metrics for each related measurement domain using a novel variation of factor analysis to ensure equivalence across the constituent datasets. The advantage of combining datasets this way is the simplicity, statistical power, and modeling flexibility of a single joint analysis of all the data. Three simulation studies show the performance of ROSETTA on datasets with only partially overlapping measurements (i.e., systematically missing information), benchmarked to a condition of perfectly overlapped data (i.e., full information). The first study examined a range of correlations, while the second study was modeled after the observed correlations in a well-characterized clinical, behavioral cohort. Both studies consistently show significant correlations >0.94, often >0.96, indicating the robustness of the method and validating the general approach. The third study varied within- and between-domain correlations and compared ROSETTA to multiple imputation and meta-analysis as two commonly used methods that ostensibly solve the same data integration problem. We provide one alternative to meta-analysis and multiple imputation by developing a method that statistically equates similar but distinct manifest metrics into a set of empirically derived metrics that can be used for analysis across all datasets.
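The benchmark idea in the abstract (latent trait scores from partially overlapping measurements compared against scores from fully overlapping data) can be illustrated with a small simulation. The sketch below is not ROSETTA, whose factor-analysis variation is not specified in this record; it swaps in a plain one-factor model fitted by EM that simply skips unobserved cells. The function names (`center`, `scores`, `fit_one_factor`), the loadings of 0.8, the noise scale, the sample size, and the choice of which items each dataset lacks are all assumptions made for illustration.

```python
import numpy as np

def center(X):
    """Center each column on its observed mean; unobserved cells become 0."""
    obs = ~np.isnan(X)
    Xc = np.where(obs, X - np.nanmean(X, axis=0), 0.0)
    return Xc, obs

def scores(Xc, obs, lam, psi):
    """Posterior mean and variance of the latent trait, using observed items only."""
    v = 1.0 / (1.0 + (obs * (lam**2 / psi)).sum(axis=1))
    m = v * (Xc * (lam / psi)).sum(axis=1)
    return m, v

def fit_one_factor(X, n_iter=200):
    """EM for x_j = lam_j * f + e_j with f ~ N(0, 1); NaN cells are ignored."""
    Xc, obs = center(X)
    lam = np.full(X.shape[1], 0.5)
    psi = np.ones(X.shape[1])
    for _ in range(n_iter):
        m, v = scores(Xc, obs, lam, psi)          # E-step
        ef2 = (m**2 + v)[:, None]                 # E[f^2] per row
        s_ff = (obs * ef2).sum(axis=0)            # summed over rows observing item j
        s_xf = (Xc * m[:, None]).sum(axis=0)
        s_xx = (Xc**2).sum(axis=0)
        lam = s_xf / s_ff                         # M-step: loadings
        psi = (s_xx - lam * s_xf) / obs.sum(axis=0)  # M-step: unique variances
    m, _ = scores(Xc, obs, lam, psi)
    return lam, psi, m

# Simulated data (assumed values): one latent trait, six manifest items.
rng = np.random.default_rng(0)
n, p = 4000, 6
f = rng.standard_normal(n)
X_full = f[:, None] * np.full(p, 0.8) + 0.6 * rng.standard_normal((n, p))

# Hypothetical dataset A lacks items 4-5; dataset B lacks items 0-1.
X_part = X_full.copy()
X_part[: n // 2, 4:] = np.nan
X_part[n // 2 :, :2] = np.nan

lam_full, _, score_full = fit_one_factor(X_full)
lam_part, _, score_part = fit_one_factor(X_part)

# Agreement between partial-information and full-information trait scores.
r = np.corrcoef(score_part, score_full)[0, 1]
print(f"correlation of partial vs. full scores: {r:.3f}")
```

Because both datasets share items 2 and 3, the single-factor model remains identified even though items 0-1 and 4-5 are never observed together. The correlation printed here depends entirely on the assumed loadings and noise and should not be read as a reproduction of the figures reported in the abstract.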

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/88e5/6771148/42cdb720dfb7/genes-10-00727-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/88e5/6771148/f941e461f4a8/genes-10-00727-g002.jpg
