Phan An, Joshi Parnal, Kadelka Claus, Friedberg Iddo
Program in Bioinformatics and Computational Biology, Iowa State University, Ames, IA, United States.
Department of Mathematics, Iowa State University, Ames, IA, United States.
Database (Oxford). 2025 May 7;2025. doi: 10.1093/database/baaf036.
The resources required to study gene function are limited, especially given the number of genes in the human genome and the complexity of their functions. Genes are therefore prioritized for experimental study on many grounds, including, but not limited to, perceived biomedical importance (for example, disease-associated genes) and relevance to understanding biological processes (for example, cell signalling pathways). At the same time, most genes are unstudied or under-characterized, which hampers our understanding of their function and of their potential effects on human health and wellness. Understanding the disparity in function annotation is a necessary first step toward quantifying how much functional knowledge has been gained about the human genome, and toward guidelines for targeting future studies of human genes more effectively. Here, we present a comprehensive longitudinal analysis of the human proteome using data analysis tools from economics and information theory. Specifically, we view the human proteome as a population of proteins within a knowledge economy: we treat the quantified knowledge of a protein's function as the analogue of wealth, and we examine the distribution of information across the proteome in the same way that the distribution of wealth is studied in societies. Our results show a highly skewed distribution of information about human proteins over the last decade, with persistently high inequality in the annotations assigned to proteins. Additionally, we examine the correlation between knowledge of protein function, as captured in databases, and interest in proteins, as reflected by mentions in the scientific literature. We show a large gap between knowledge and interest and dissect the factors leading to this gap.
In conclusion, our study shows that research efforts should be redirected to less studied proteins to mitigate the disparity among human proteins both in databases and literature.
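The "knowledge as wealth" analogy can be illustrated with a standard inequality measure from economics. The abstract does not name the specific statistics used, so the sketch below assumes the Gini coefficient as one plausible choice; the function name `gini` and the per-protein scores are hypothetical illustrations, not data or code from the study.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 means perfect equality,
    values near 1 mean a few items hold almost everything."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # convention chosen here: empty or all-zero input is "equal"
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n


# Hypothetical per-protein annotation scores (for example, summed
# information content of each protein's annotations); not real data.
toy_scores = [0.0, 0.0, 0.5, 1.0, 1.5, 2.0, 40.0]
print(f"Gini of toy annotation scores: {gini(toy_scores):.3f}")
```

Computing such a coefficient on successive annotation snapshots would give a longitudinal view of whether the skew is narrowing or persisting, under the assumption that each protein's knowledge can be reduced to a single non-negative score.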