BatchI: Batch effect Identification in high-throughput screening data using a dynamic programming algorithm.

Affiliations

Institute of Automatic Control, Silesian University of Technology, Gliwice, Poland.

Department of Internal Medicine, Yale School of Medicine, Yale University, New Haven, CT, USA.

Publication information

Bioinformatics. 2019 Jun 1;35(11):1885-1892. doi: 10.1093/bioinformatics/bty900.

DOI:10.1093/bioinformatics/bty900

PMID:30357412

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6546123/

Abstract

MOTIVATION

In contemporary biological experiments, bias that interferes with measurements requires careful handling. Batch effects are an important source of bias in high-throughput biological experiments, and diverse methods for removing them have been established. These include various normalization techniques, yet many require knowledge of the number of batches and the assignment of samples to batches. Only a few can deal with identifying a batch effect of unknown structure. For this reason, a novel batch identification algorithm based on dynamic programming is introduced for omics data that can be ordered on a timescale.

RESULTS

The BatchI algorithm partitions a series of high-throughput experiment samples into sub-series corresponding to estimated batches. Dynamic programming is used to split the data so that dispersion between batches is maximal while dispersion within batches remains minimal. The procedure has been tested on a number of available datasets with and without prior information about batch partitioning. Datasets with a priori identified batches were split in agreement with the known batches, as measured by the weighted average Dice Index. Batch effect correction is justified by higher intra-group correlation. In the datasets without batch annotation, the identified batch divisions improve parameters and the quality of biological information, as shown by a literature study and Information Content. The outcome of the algorithm serves as a starting point for correction methods. It has been demonstrated that omitting the essential step of batch effect control may waste valuable potential discoveries.
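To make the partitioning idea concrete, the following is a minimal Python sketch of splitting a time-ordered series into contiguous segments by dynamic programming. It is not the BatchI implementation (which is an R package). It assumes the number of batches `k` is given, and it uses the within-segment sum of squared deviations as the dispersion measure; the abstract does not specify BatchI's actual criterion or how it chooses the number of batches. The names `segment_cost_table` and `split_into_batches` and the `min_size` parameter are hypothetical. Because the total sum of squares is fixed for a dataset, minimizing the within-segment sum also maximizes the between-segment sum.

```python
import numpy as np


def segment_cost_table(X):
    """Return cost[i, j]: within-segment sum of squares for samples i..j-1."""
    n, p = X.shape
    # Prefix sums give each segment's sum and sum of squares in O(p) time.
    s1 = np.vstack([np.zeros(p), np.cumsum(X, axis=0)])
    s2 = np.vstack([np.zeros(p), np.cumsum(X ** 2, axis=0)])
    cost = np.full((n + 1, n + 1), np.inf)
    for i in range(n):
        for j in range(i + 1, n + 1):
            seg_sum = s1[j] - s1[i]
            seg_sq = s2[j] - s2[i]
            # sum((x - mean)^2) = sum(x^2) - (sum x)^2 / m, summed over features
            cost[i, j] = float(np.sum(seg_sq - seg_sum ** 2 / (j - i)))
    return cost


def split_into_batches(X, k, min_size=1):
    """Split time-ordered samples (rows of X) into k contiguous batches.

    Assumption: k is known in advance. Returns (labels, total_within_ss).
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat a 1-D series as a single feature
    n = X.shape[0]
    if k < 1 or min_size < 1 or k * min_size > n:
        raise ValueError("need k >= 1, min_size >= 1 and k * min_size <= n")

    cost = segment_cost_table(X)
    # best[b, j]: minimal cost of splitting the first j samples into b batches
    best = np.full((k + 1, n + 1), np.inf)
    prev = np.zeros((k + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for b in range(1, k + 1):
        for j in range(b * min_size, n + 1):
            # i is the start index of the last batch, which covers i..j-1
            for i in range((b - 1) * min_size, j - min_size + 1):
                c = best[b - 1, i] + cost[i, j]
                if c < best[b, j]:
                    best[b, j] = c
                    prev[b, j] = i

    # Backtrack from the full series to recover the batch boundaries.
    labels = np.empty(n, dtype=int)
    j = n
    for b in range(k, 0, -1):
        i = prev[b, j]
        labels[i:j] = b - 1
        j = i
    return labels, float(best[k, n])
```

Calling `split_into_batches(series, k=3)` on a series with three clearly shifted levels should return labels that change exactly at the level shifts. The cost table makes this sketch quadratic in memory and cubic-like in time for large `k`, so it is suitable only for illustration on modest sample counts.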

AVAILABILITY AND IMPLEMENTATION

The implementation is available within the BatchI R package at http://zaed.aei.polsl.pl/index.php/pl/111-software.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Similar articles

Detecting hidden batch factors through data-adaptive adjustment for biological effects.

Bioinformatics. 2018 Apr 1;34(7):1141-1147. doi: 10.1093/bioinformatics/btx635.

Batch effect detection and correction in RNA-seq data using machine-learning-based automated assessment of quality.

BMC Bioinformatics. 2022 Jul 14;23(Suppl 6):279. doi: 10.1186/s12859-022-04775-y.

MultiBaC: an R package to remove batch effects in multi-omic experiments.

Bioinformatics. 2022 Apr 28;38(9):2657-2658. doi: 10.1093/bioinformatics/btac132.

Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.

Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.

SCIBER: a simple method for removing batch effects from single-cell RNA-sequencing data.

Bioinformatics. 2023 Jan 1;39(1). doi: 10.1093/bioinformatics/btac819.

Batch alignment via retention orders for preprocessing large-scale multi-batch LC-MS experiments.

Bioinformatics. 2022 Aug 2;38(15):3759-3767. doi: 10.1093/bioinformatics/btac407.

Blind estimation and correction of microarray batch effect.

PLoS One. 2020 Apr 9;15(4):e0231446. doi: 10.1371/journal.pone.0231446. eCollection 2020.

Propensity scores as a novel method to guide sample allocation and minimize batch effects during the design of high throughput experiments.

BMC Bioinformatics. 2023 Mar 7;24(1):86. doi: 10.1186/s12859-023-05202-6.

GaMRed-Adaptive Filtering of High-Throughput Biological Data.

IEEE/ACM Trans Comput Biol Bioinform. 2020 Jan-Feb;17(1):149-157. doi: 10.1109/TCBB.2018.2858825. Epub 2018 Jul 23.

Cited by

A review of deep learning models for the prediction of chromatin interactions with DNA and epigenomic profiles.

Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbae651.

Deep centroid: a general deep cascade classifier for biomedical omics data classification.

Bioinformatics. 2024 Feb 1;40(2). doi: 10.1093/bioinformatics/btae039.

Evaluation of zero counts to better understand the discrepancies between bulk and single-cell RNA-Seq platforms.

Comput Struct Biotechnol J. 2023 Sep 29;21:4663-4674. doi: 10.1016/j.csbj.2023.09.035. eCollection 2023.

Unbiased comparison and modularization identify time-related transcriptomic reprogramming in exercised rat cartilage: Integrated data mining and experimental validation.

Front Physiol. 2022 Sep 15;13:974266. doi: 10.3389/fphys.2022.974266. eCollection 2022.

Local data commons: the sleeping beauty in the community of data commons.

BMC Bioinformatics. 2022 Sep 23;23(Suppl 12):386. doi: 10.1186/s12859-022-04922-5.

Perspectives for better batch effect correction in mass-spectrometry-based proteomics.

Comput Struct Biotechnol J. 2022 Aug 12;20:4369-4375. doi: 10.1016/j.csbj.2022.08.022. eCollection 2022.

Machine learning model for predicting Major Depressive Disorder using RNA-Seq data: optimization of classification approach.

Cogn Neurodyn. 2022 Apr;16(2):443-453. doi: 10.1007/s11571-021-09724-8. Epub 2021 Sep 22.

Translational precision medicine: an industry perspective.

J Transl Med. 2021 Jun 5;19(1):245. doi: 10.1186/s12967-021-02910-6.

Addressing Fairness, Bias, and Appropriate Use of Artificial Intelligence and Machine Learning in Global Health.

Front Artif Intell. 2021 Apr 15;3:561802. doi: 10.3389/frai.2020.561802. eCollection 2020.

Biological Perspectives of RNA-Sequencing Experimental Design.

Methods Mol Biol. 2021;2243:327-337. doi: 10.1007/978-1-0716-1103-6_17.
