Comparison of methods for the detection of outliers and associated biomarkers in mislabeled omics data.

Affiliations

Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan City, 030001, Shanxi, China.

Department of Health Statistics, School of Public Health and Management, Binzhou Medical University, Yantai City, 264003, Shandong, China.

Publication information

BMC Bioinformatics. 2020 Aug 14;21(1):357. doi: 10.1186/s12859-020-03653-9.

DOI:10.1186/s12859-020-03653-9
PMID:32795265
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7646480/
Abstract

BACKGROUND

Previous studies have reported that labeling errors are not uncommon in omics data. Potential outliers may severely undermine the correct classification of patients and the identification of reliable biomarkers for a particular disease. Three methods have been proposed to address the problem: sparse label-noise-robust logistic regression (Rlogreg), robust elastic net based on the least trimmed square (enetLTS), and Ensemble. Ensemble is an ensemble classification approach that combines distinct feature selection and modeling strategies. The accuracy of biomarker selection and outlier detection of these methods needs to be evaluated and compared so that the appropriate method can be chosen.

RESULTS

The accuracy of variable selection, outlier identification, and prediction of the three methods (Ensemble, enetLTS, Rlogreg) was compared on simulated datasets and an RNA-seq dataset. On simulated datasets, Ensemble had the highest variable selection accuracy, as measured by a comprehensive index, and the lowest false discovery rate among the three methods. When the sample size was large and the proportion of outliers was ≤5%, the positive selection rate of Ensemble was similar to that of enetLTS. However, when the proportion of outliers was 10% or 15%, Ensemble missed some variables that affected the response variables. Overall, enetLTS had the best outlier detection accuracy, with false positive rates <0.05 and high sensitivity, and it still performed well when the proportion of outliers was relatively large. With 1% or 2% outliers, Ensemble showed high outlier detection accuracy, but with higher proportions of outliers it missed many mislabeled samples. Rlogreg and Ensemble were less accurate in identifying outliers than enetLTS. The prediction accuracy of enetLTS was better than that of Rlogreg. Running Ensemble on a subset of data after removing the outliers identified by enetLTS improved the variable selection accuracy of Ensemble.
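
The sensitivity, false positive rate, and false discovery rate mentioned in these results are standard confusion-matrix rates. The following is a minimal sketch of how they could be computed when the truly mislabeled samples (or truly relevant variables) are known, as in a simulation. The function name `detection_metrics` and its arguments are hypothetical, and the paper's "comprehensive index" is not defined in the abstract, so it is not reproduced here.

```python
import math


def detection_metrics(true_items, flagged_items, n_total):
    """Confusion-matrix rates for a detection task.

    true_items:    indices that are truly outliers (or truly relevant variables)
    flagged_items: indices the method flagged (or selected)
    n_total:       total number of samples (or candidate variables)
    """
    true_items = set(true_items)
    flagged_items = set(flagged_items)

    tp = len(true_items & flagged_items)   # correctly flagged
    fp = len(flagged_items - true_items)   # flagged but not true
    fn = len(true_items - flagged_items)   # true but missed
    tn = n_total - tp - fp - fn            # correctly left alone

    # Sensitivity: share of true items that were found.
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    # False positive rate: share of non-true items that were flagged.
    false_positive_rate = fp / (fp + tn) if (fp + tn) else math.nan
    # False discovery rate: share of flagged items that are wrong.
    # Assumption: defined as 0.0 when nothing is flagged.
    false_discovery_rate = fp / (tp + fp) if (tp + fp) else 0.0

    return {
        "sensitivity": sensitivity,
        "false_positive_rate": false_positive_rate,
        "false_discovery_rate": false_discovery_rate,
    }


if __name__ == "__main__":
    # Illustrative numbers only: 100 samples, 10 truly mislabeled,
    # 9 flagged by some method, 8 of them correctly.
    truth = range(10)
    flagged = list(range(8)) + [50]
    print(detection_metrics(truth, flagged, n_total=100))
```

The same function applies to variable selection by passing variable indices instead of sample indices; the sensitivity then corresponds to the share of truly relevant variables that were selected.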

CONCLUSIONS

When the proportion of outliers is ≤5%, Ensemble can be used for variable selection. When the proportion of outliers is > 5%, Ensemble can be used for variable selection on a subset after removing outliers identified by enetLTS. For outlier identification, enetLTS is the recommended method. In practice, the proportion of outliers can be estimated according to the inaccuracy of the diagnostic methods used.
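
The recommendation above amounts to a simple decision rule. A minimal sketch of that rule follows, assuming the outlier proportion has already been estimated. The function `select_biomarkers` and the callables `detect_outliers` and `select_variables` are hypothetical placeholders standing in for enetLTS and Ensemble; they are not the interfaces of any real package.

```python
from typing import Callable, List, Sequence


def select_biomarkers(
    samples: Sequence,
    labels: Sequence,
    outlier_proportion: float,
    detect_outliers: Callable[[Sequence, Sequence], Sequence[int]],
    select_variables: Callable[[Sequence, Sequence], List],
    threshold: float = 0.05,
) -> List:
    """Apply the two-branch workflow suggested in the conclusions.

    detect_outliers:  placeholder for an enetLTS-style detector that
                      returns the indices of suspected mislabeled samples.
    select_variables: placeholder for an Ensemble-style selector that
                      returns the chosen variables.
    """
    # At or below the threshold: run variable selection on all samples.
    if outlier_proportion <= threshold:
        return select_variables(samples, labels)

    # Above the threshold: drop the flagged samples first,
    # then run variable selection on the remaining subset.
    flagged = set(detect_outliers(samples, labels))
    keep = [i for i in range(len(labels)) if i not in flagged]
    clean_samples = [samples[i] for i in keep]
    clean_labels = [labels[i] for i in keep]
    return select_variables(clean_samples, clean_labels)
```

In practice the two callables would wrap whatever implementations of the methods are available; the 0.05 default mirrors the 5% cut-off stated in the conclusions.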