Forming Big Datasets through Latent Class Concatenation of Imperfectly Matched Databases Features.

Affiliations

Battelle Center for Mathematical Medicine, Abigail Wexner Research Institute, Nationwide Children's Hospital, Columbus, OH 43215, USA.

Department of Pediatrics, College of Medicine, The Ohio State University, Columbus, OH 43215, USA.

Publication information

Genes (Basel). 2019 Sep 19;10(9):727. doi: 10.3390/genes10090727.

Abstract

Informatics researchers often need to combine data from many different sources to increase statistical power and study subtle or complicated effects. Perfect overlap of measurements across academic studies is rare, since virtually every dataset is collected for a unique purpose and without coordination with parties not at hand (i.e., informatics researchers in the future). Thus, incomplete concordance of measurements across datasets poses a major challenge for researchers seeking to combine public databases. In any given field, some measurements are fairly standard, but every organization collecting data makes unique decisions on instruments, protocols, and methods of processing the data. This typically precludes literal concatenation of the raw data, since constituent cohorts do not have the same measurements (i.e., columns of data). When measurements across datasets are similar prima facie, there is a desire to combine the data to increase power, but mixing non-identical measurements could greatly reduce the sensitivity of the downstream analysis. Here, we discuss a statistical method that is applicable when certain patterns of missing data are found; namely, it is possible to combine datasets that measure the same underlying constructs (or latent traits) when there is only partial overlap of measurements across the constituent datasets. Our method, ROSETTA, empirically derives a set of common latent trait metrics for each related measurement domain using a novel variation of factor analysis to ensure equivalence across the constituent datasets. The advantage of combining datasets this way is the simplicity, statistical power, and modeling flexibility of a single joint analysis of all the data. Three simulation studies show the performance of ROSETTA on datasets with only partially overlapping measurements (i.e., systematically missing information), benchmarked to a condition of perfectly overlapped data (i.e., full information). The first study examined a range of correlations, while the second study was modeled after the observed correlations in a well-characterized clinical, behavioral cohort. Both studies consistently show significant correlations >0.94, often >0.96, indicating the robustness of the method and validating the general approach. The third study varied within-domain and between-domain correlations and compared ROSETTA to multiple imputation and meta-analysis, two commonly used methods that ostensibly solve the same data integration problem. We provide one alternative to meta-analysis and multiple imputation by developing a method that statistically equates similar but distinct manifest metrics into a set of empirically derived metrics that can be used for analysis across all datasets.
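The simulation design described in the abstract (partially overlapping measurements benchmarked against full information) can be illustrated with a toy example. The sketch below is not the ROSETTA algorithm, whose details are not given here. It assumes a single latent trait, six standardized measures, and two cohorts that share only measures 3 and 4. The function names, the loadings, and the least-squares one-factor fit are all illustrative assumptions.

```python
import numpy as np

# Hypothetical loadings of six measures on one latent trait (assumption).
TRUE_LOADINGS = np.array([0.8, 0.7, 0.75, 0.65, 0.7, 0.8])

def simulate_cohorts(n=2000, seed=0):
    """Two cohorts of size n; A lacks measures 5-6, B lacks measures 1-2."""
    rng = np.random.default_rng(seed)
    f = rng.standard_normal(2 * n)                       # latent trait
    noise = rng.standard_normal((2 * n, 6)) * np.sqrt(1 - TRUE_LOADINGS**2)
    x_full = f[:, None] * TRUE_LOADINGS + noise          # full-information data
    x = x_full.copy()
    x[:n, 4:] = np.nan                                   # cohort A
    x[n:, :2] = np.nan                                   # cohort B
    return x, x_full, f

def pairwise_corr(x):
    """Pairwise-complete correlations; NaN where two measures never co-occur."""
    p = x.shape[1]
    r = np.full((p, p), np.nan)
    for i in range(p):
        for j in range(p):
            ok = ~np.isnan(x[:, i]) & ~np.isnan(x[:, j])
            if ok.sum() > 2:
                r[i, j] = np.corrcoef(x[ok, i], x[ok, j])[0, 1]
    return r

def fit_one_factor(r, n_iter=200):
    """Least-squares one-factor loadings using only the observed
    off-diagonal correlations (r_ij is modeled as lam_i * lam_j)."""
    p = r.shape[0]
    mask = ~np.isnan(r) & ~np.eye(p, dtype=bool)
    lam = np.full(p, 0.5)
    for _ in range(n_iter):
        for i in range(p):
            m = mask[i]
            lam[i] = (r[i, m] @ lam[m]) / np.sum(lam[m] ** 2)
            lam[i] = np.clip(lam[i], 0.05, 0.95)         # keep uniqueness > 0
    return lam

def factor_scores(x, lam):
    """Regression-method score E[f | observed measures] for each row."""
    z = (x - np.nanmean(x, axis=0)) / np.nanstd(x, axis=0)
    scores = np.empty(len(z))
    for k, row in enumerate(z):
        o = ~np.isnan(row)
        l = lam[o]
        sigma = np.outer(l, l) + np.diag(1 - l**2)       # implied covariance
        scores[k] = l @ np.linalg.solve(sigma, row[o])
    return scores

x, x_full, f = simulate_cohorts()
lam_partial = fit_one_factor(pairwise_corr(x))
lam_full = fit_one_factor(pairwise_corr(x_full))
agreement = np.corrcoef(factor_scores(x, lam_partial),
                        factor_scores(x_full, lam_full))[0, 1]
print("estimated loadings (partial overlap):", np.round(lam_partial, 2))
print("correlation with full-information scores:", round(agreement, 3))
```

The shared measures act as anchors: they tie the loadings of the cohort-specific measures to a common scale, so both cohorts receive scores on the same latent metric even though some pairs of measures are never observed together. Real cohorts would also differ in means and variances, which this sketch ignores.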

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/88e5/6771148/f941e461f4a8/genes-10-00727-g002.jpg