Distribution based nearest neighbor imputation for truncated high dimensional data with applications to pre-clinical and clinical metabolomics studies.

Author information

Shah Jasmit S, Rai Shesh N, DeFilippis Andrew P, Hill Bradford G, Bhatnagar Aruni, Brock Guy N

Affiliations

Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY, 40202, USA.

Department of Medicine, Division of Cardiovascular Medicine, Diabetes and Obesity Center, University of Louisville, Louisville, KY, 40202, USA.

Publication information

BMC Bioinformatics. 2017 Feb 20;18(1):114. doi: 10.1186/s12859-017-1547-6.

Abstract

BACKGROUND

High-throughput metabolomics makes it possible to measure the relative abundances of numerous metabolites in biological samples, which is useful in many areas of biomedical research. However, missing values (MVs) are common in metabolomics datasets and can arise for both technical and biological reasons. Typically, such MVs are replaced with a minimum value, which can change the results of downstream analyses.

RESULTS

Here we present a modified version of the K-nearest neighbor (KNN) approach that accounts for truncation at the minimum value, i.e., KNN truncation (KNN-TN). We compare imputation results based on KNN-TN with results from other KNN approaches, namely KNN based on correlation (KNN-CR) and KNN based on Euclidean distance (KNN-EU). Our approach assumes that the data follow a truncated normal distribution with the truncation point at the limit of detection (LOD). The effectiveness of each approach was assessed using the root mean square error (RMSE) and, for the influence on downstream statistical testing, the metabolite list concordance index (MLCI). Through extensive simulation studies and application to three real data sets, we show that KNN-TN has lower RMSE values than the other two KNN procedures as well as simpler imputation methods based on substituting missing values with the metabolite mean, zero values, or the LOD. MLCI values for KNN-TN and KNN-EU were roughly equivalent, and superior to those of the other four methods in most cases.
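To illustrate why the truncated normal assumption matters, the following is a minimal sketch, not the authors' implementation. It assumes a single simulated metabolite whose values below a hypothetical LOD are unobserved, and it estimates the mean and standard deviation of the underlying normal distribution by maximizing the left-truncated normal likelihood with a simple compass (pattern) search. The function names, the simulated parameters, and the choice of optimizer are all assumptions made for illustration; the abstract does not specify how the parameters are estimated or how they enter the KNN step.

```python
import math
import random


def truncated_normal_loglik(mu, sigma, n, s1, s2, lod):
    """Log-likelihood of n values observed above `lod` under a normal(mu, sigma)
    left-truncated at `lod`. s1 and s2 are the sum and sum of squares of the values."""
    tail = 0.5 * math.erfc((lod - mu) / (sigma * math.sqrt(2.0)))  # P(X > lod)
    if tail <= 0.0:
        return -math.inf  # parameters under which observing anything is impossible
    sum_sq_dev = s2 - 2.0 * mu * s1 + n * mu * mu  # sum of (x - mu)^2
    return (-n * (math.log(sigma) + 0.5 * math.log(2.0 * math.pi) + math.log(tail))
            - 0.5 * sum_sq_dev / (sigma * sigma))


def fit_truncated_normal(observed, lod, tol=1e-4, max_iter=100_000):
    """Hypothetical helper: maximize the truncated-normal likelihood over
    (mu, log sigma) with an axis-aligned compass search."""
    n = len(observed)
    s1 = sum(observed)
    s2 = sum(x * x for x in observed)
    # Start from the naive (biased) estimates computed on the observed values.
    mu = s1 / n
    log_sigma = 0.5 * math.log((s2 - n * mu * mu) / (n - 1))
    best = truncated_normal_loglik(mu, math.exp(log_sigma), n, s1, s2, lod)
    step = 0.5
    for _ in range(max_iter):
        if step < tol:
            break
        for d_mu, d_ls in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            cand = truncated_normal_loglik(
                mu + d_mu, math.exp(log_sigma + d_ls), n, s1, s2, lod)
            if cand > best:  # accept the first improving move
                mu, log_sigma, best = mu + d_mu, log_sigma + d_ls, cand
                break
        else:
            step /= 2.0  # no move improved the likelihood: refine the step
    return mu, math.exp(log_sigma)


# Simulated example (all numbers are illustrative assumptions).
random.seed(0)
true_mu, true_sigma, lod = 10.0, 2.0, 9.0
full = [random.gauss(true_mu, true_sigma) for _ in range(5000)]
observed = [x for x in full if x > lod]  # values below the LOD are missing

naive_mu = sum(observed) / len(observed)  # ignores truncation, biased upward
mu_hat, sigma_hat = fit_truncated_normal(observed, lod)

print(f"naive mean of observed values: {naive_mu:.3f}")
print(f"truncated-normal estimates: mu={mu_hat:.3f}, sigma={sigma_hat:.3f}")
```

In this sketch the naive mean of the observed values overestimates the true mean because the lower tail is missing, whereas the truncated-normal estimate corrects for the truncation. Estimates of this kind could then be used to standardize each metabolite before computing neighbor distances, although that step is not shown here.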

CONCLUSION

Our findings demonstrate that KNN-TN generally performs better than KNN-CR and KNN-EU in imputing the missing values of the different datasets when missingness arises from a combination of missing at random and an LOD. The results shown in this study are from the field of metabolomics, but the method could be applied to any high-throughput technology that has missing values due to an LOD.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/43c8/5319174/a82c648d5391/12859_2017_1547_Fig1_HTML.jpg
