Joshi Parnal, Banerjee Sagnik, Hu Xiao, Khade Pranav M, Friedberg Iddo
Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011, USA.
Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, USA.
Bioinformatics. 2023 Jan 1;39(1). doi: 10.1093/bioinformatics/btad048.
Advances in sequencing technologies have led to a surge in genomic data, yet the functions of many of the gene products encoded in these data remain unknown. In-depth, targeted experiments that determine the functions of these gene products are crucial and routinely performed, but they cannot keep pace with the inflow of novel genomic data. To address this gap, high-throughput experiments investigate a large number of genes in a single study. The annotations generated by these experiments are generally biased towards a small subset of less informative Gene Ontology (GO) terms. Identifying and removing biases from protein function annotation databases is important because biases distort our understanding of protein function by giving a poor picture of the annotation landscape. Additionally, as machine learning methods for predicting protein function become increasingly prevalent, it is essential that they are trained on unbiased datasets. It is therefore crucial not only to be aware of biases but also to remove them judiciously from annotation datasets.
We introduce GOThresher, a Python tool that identifies and removes biases in function annotations from protein function annotation databases.
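To make the kind of bias described above concrete, the following is a minimal sketch and not GOThresher's actual interface or algorithm. It assumes that a single reference annotating many proteins is a reasonable proxy for a high-throughput study, and it uses a simple frequency-based information content that ignores the GO hierarchy. The toy annotations, the reference labels, the function names, and the `max_proteins` cutoff are all hypothetical.

```python
from collections import defaultdict
from math import log2

# Toy annotations as (protein, GO term, reference). All values are made up
# for illustration; the reference labels are placeholders, not real PMIDs.
annotations = [
    ("P1", "GO:0005515", "PMID:A"),
    ("P2", "GO:0005515", "PMID:A"),
    ("P3", "GO:0005515", "PMID:A"),
    ("P4", "GO:0005515", "PMID:A"),
    ("P1", "GO:0004672", "PMID:B"),
    ("P5", "GO:0003677", "PMID:C"),
]


def proteins_per_reference(rows):
    """Map each reference to the set of proteins it annotates."""
    proteins = defaultdict(set)
    for protein, _term, ref in rows:
        proteins[ref].add(protein)
    return proteins


def drop_high_throughput(rows, max_proteins):
    """Keep annotations whose reference annotates at most max_proteins proteins.

    Assumption: a reference annotating many proteins stands in for a
    high-throughput study. The cutoff is a hypothetical parameter.
    """
    counts = proteins_per_reference(rows)
    return [row for row in rows if len(counts[row[2]]) <= max_proteins]


def term_information(rows):
    """Frequency-based information content, -log2(p(term)).

    Simplification: annotations are not propagated up the GO hierarchy.
    """
    counts = defaultdict(int)
    for _protein, term, _ref in rows:
        counts[term] += 1
    total = len(rows)
    return {term: -log2(n / total) for term, n in counts.items()}


filtered = drop_high_throughput(annotations, max_proteins=3)
ic_before = term_information(annotations)
ic_after = term_information(filtered)
```

In this toy set, one reference contributes four of the six annotations, all to the same term, so that term has the lowest information content before filtering. After the cutoff is applied, only the two annotations from the small-scale references remain.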
GOThresher is written in Python and released via PyPI (https://pypi.org/project/gothresher/) and on the Bioconda Anaconda channel (https://anaconda.org/bioconda/gothresher). The source code is hosted on GitHub (https://github.com/FriedbergLab/GOThresher) and distributed under the GPL 3.0 license.
Supplementary data are available at Bioinformatics online.