Paul G. Allen School of Computer Science and Engineering.
Department of Electrical and Computer Engineering.
Bioinformatics. 2021 May 1;37(4):439-447. doi: 10.1093/bioinformatics/btaa830.
Successful science often involves not only performing experiments well, but also choosing well among many possible experiments. In a hypothesis generation setting, choosing an experiment well means choosing an experiment whose results are interesting or novel. In this work, we formalize this selection procedure in the context of genomics and epigenomics data generation. Specifically, we consider the task faced by a scientific consortium such as the National Institutes of Health ENCODE Consortium, whose goal is to characterize all of the functional elements in the human genome. Given a list of possible cell types or tissue types ('biosamples') and a list of possible high-throughput sequencing assays, where at least one experiment has been performed in each biosample and for each assay, we ask 'Which experiments should ENCODE perform next?'
We demonstrate how to represent this task as a submodular optimization problem, where the goal is to choose a panel of experiments that maximizes the facility location function. A key aspect of our approach is that we use imputed data, rather than experimental data, to directly answer the posed question. We find that, across several evaluations, our method chooses a panel of experiments that spans a diversity of biochemical activity. Finally, we propose two modifications of the facility location function, including a novel submodular-supermodular function, that allow domain knowledge or constraints to be incorporated into the optimization procedure.
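To make the optimization concrete, the following is a minimal sketch of greedy maximization of the facility location function, f(S) = sum over i of the maximum similarity between experiment i and any selected experiment in S. It is an illustration under stated assumptions, not the kiwano implementation: the function names `facility_location` and `greedy_select` are hypothetical, the similarity matrix is assumed to be square with non-negative entries, and plain greedy selection is used here as the standard approach for monotone submodular objectives.

```python
import numpy as np


def facility_location(similarity, selected):
    """Facility location value: each experiment is credited with its
    highest similarity to any selected experiment."""
    if len(selected) == 0:
        return 0.0
    return float(similarity[:, selected].max(axis=1).sum())


def greedy_select(similarity, k):
    """Greedily pick k column indices, each time adding the candidate
    with the largest marginal gain in the facility location value.

    Assumes (for this sketch) a square matrix of non-negative
    similarities, where entry [i, j] compares experiments i and j.
    """
    n = similarity.shape[0]
    best_cover = np.zeros(n)  # best similarity to the selected set so far
    selected = []
    for _ in range(k):
        # Marginal gain of candidate j: total improvement over best_cover.
        gains = np.maximum(similarity - best_cover[:, None], 0.0).sum(axis=0)
        if selected:
            gains[selected] = -np.inf  # never pick the same experiment twice
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, similarity[:, j])
    return selected


# Toy similarity matrix (made-up values): two groups of similar experiments.
toy_similarity = np.array([
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.8],
    [0.0, 0.0, 0.8, 1.0],
])
toy_panel = greedy_select(toy_similarity, k=2)
print(toy_panel, facility_location(toy_similarity, toy_panel))
```

On the toy matrix, the second pick comes from the group not yet covered by the first, which is the diversity-seeking behavior the objective is meant to encourage.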
Our method is available as a Python package at https://github.com/jmschrei/kiwano and can be installed using the command pip install kiwano. The source code used here and the similarity matrix can be found at http://doi.org/10.5281/zenodo.3708538.
Supplementary data are available at Bioinformatics online.