

Finding biomarkers in non-model species: literature mining of transcription factors involved in bovine embryo development.

Affiliations

INRA, SenS, UR1326, IFRIS, Champs-sur-Marne, F-77420, France.

Sector of Computational Proteomics, Institute of Cytology and Genetics, 10 Lavrentyev Ave, Novosibirsk, 630090, Russia.

Publication information

BioData Min. 2012 Aug 29;5(1):12. doi: 10.1186/1756-0381-5-12.

Abstract

BACKGROUND

Because biological processes in well-known model organisms differ in specific ways from those in Bos taurus, the organism under study, a good way to describe gene regulation in ruminant embryos would be a species-specific analysis of species closely related to cattle, such as sheep and pig. However, as highlighted by a recent report, gene dictionaries for pig are smaller than those for cattle, which risks reducing the gene resources available for mining (the same holds for sheep dictionaries). Bioinformatics approaches that integrate the available information on gene function in model organisms, while taking their specificity into account, are thus needed. Beyond these closely related and biologically relevant species, far more is known about (i) trophoblast proliferation and differentiation and (ii) embryogenesis in human and mouse, which provides opportunities for reconstructing proliferation and/or differentiation processes in other mammalian embryos, including ruminants. The necessary knowledge can be obtained partly from (i) stem cell or cancer research, which supplies information on the molecular agents and molecular interactions at work in cell proliferation, and (ii) mouse embryogenesis, which supplies information on embryo differentiation. However, the total number of publications across all these topics and species is large, and processing them manually would be tedious and time-consuming. We therefore used text mining for automated text analysis and automated knowledge extraction. To evaluate the quality of this "mining", we took advantage of studies that reported gene expression profiles during the elongation of bovine embryos and defined a list of transcription factors (TF, n = 64) that we used as a biological "gold standard". If successful, the "mining" approach should identify all of them, as well as novel ones.

METHODS

To gain knowledge of molecular-genetic regulation in a non-model organism, we propose an approach based on literature mining and score-based ranking of data from model organisms. This approach was applied to identify novel transcription factors acting during bovine blastocyst elongation, a process that is not observed in rodents and primates. Searching through human and mouse corpora, we identified numerous bovine homologs, 11 to 14% of which were transcription factors, including the gold standard TF as well as novel TF potentially important for gene regulation in ruminant embryo development. The scripts of the workflow are written in Perl and are available on request. They accept input data from a variety of databases and can address any kind of biological question, once the data have been prepared according to keywords for the topic and species under study; we can provide a data sample to illustrate the use and functionality of the workflow.

RESULTS

To do so, we created a workflow that allowed pipeline processing of literature data and biological data extracted from Web of Science (WoS) or PubMed, as well as from Gene Expression Omnibus (GEO), Gene Ontology (GO), UniProt, HomoloGene, TcoF-DB and TFe (TF encyclopedia). First, the human and mouse homologs of the bovine proteins were selected, filtered against the text corpora and ranked by score functions. The score functions were based on gene name frequencies in the corpora. Then, transcription factors were identified using TcoF-DB and double-checked using TFe to characterise TF groups and families. Thus, within a search space of 18,670 bovine homologs, 489 were identified as transcription factors. Of these, 243 were absent from the high-throughput data available at the time of the study. They therefore remain putative TF acting during bovine embryo elongation, but might be retrieved from a recent RNA sequencing dataset (Mamo et al., 2012). Among the 246 TF that appeared to be expressed in bovine elongating tissues, we restricted our interpretation to those occurring within a list of 50 top-ranked genes. Of the transcription factors identified therein, about half belonged to the gold standard (ASCL2, c-FOS, ETS2, GATA3, HAND1) and the others did not (ESR1, HES1, ID2, NANOG, PHB2, TP53, STAT3).
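
The filter-rank-annotate steps described above can be illustrated with a minimal sketch. It is written in Python for illustration only (the authors' scripts are in Perl and are not reproduced here). The scoring rule (number of corpus documents mentioning a gene name), the function names `score_genes` and `rank_and_flag`, and the toy corpus and TF catalogue are all assumptions made for this example; the abstract states only that the score functions were based on gene name frequencies.

```python
import re
from collections import Counter


def score_genes(gene_names, corpus):
    """Count, for each gene name, how many documents mention it.

    Assumed scoring rule: document frequency with case-insensitive,
    whole-token matching. The real score functions may differ.
    """
    counts = Counter()
    for doc in corpus:
        for gene in gene_names:
            # Reject matches embedded in a longer alphanumeric or hyphenated token.
            pattern = r"(?<![A-Za-z0-9-])" + re.escape(gene) + r"(?![A-Za-z0-9-])"
            if re.search(pattern, doc, flags=re.IGNORECASE):
                counts[gene] += 1
    return counts


def rank_and_flag(gene_names, corpus, tf_catalogue, top_n=50):
    """Filter genes by the corpus, rank them by score, and flag catalogued TF."""
    counts = score_genes(gene_names, corpus)
    # Keep only genes found in the corpus; break score ties alphabetically.
    ranked = sorted(
        (g for g in gene_names if counts[g] > 0),
        key=lambda g: (-counts[g], g),
    )
    return [(g, counts[g], g in tf_catalogue) for g in ranked[:top_n]]


# Toy inputs (invented placeholders, not data from the study).
toy_corpus = [
    "toy abstract mentioning GATA3 and HAND1",
    "toy abstract mentioning GATA3 only",
    "toy abstract mentioning ACTB and gata3",
]
toy_genes = ["GATA3", "HAND1", "ACTB", "ESR1"]
toy_tf_catalogue = {"GATA3", "HAND1", "ESR1"}  # stand-in for a TcoF-DB lookup

for gene, score, is_tf in rank_and_flag(toy_genes, toy_corpus, toy_tf_catalogue):
    print(gene, score, "TF" if is_tf else "not TF")
```

In this sketch, a gene that never occurs in the corpus (here ESR1) is dropped at the filtering step even though it is in the TF catalogue, which mirrors the order of operations described above: corpus filtering first, TF annotation afterwards.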

CONCLUSIONS

A workflow for finding transcription factors acting in bovine elongation was developed. The model assumed that proteins sharing the same protein domains in closely related species have the same functions, even if they are regulated differently among species or involved in somewhat different pathways. Under this assumption, we merged information on different mammalian species from different databases (literature and biology) and proposed 489 TF as potential participants in embryo proliferation and differentiation, with (i) a recall of 95% with regard to a biological gold standard defined in 2011 and (ii) an extension of more than 3 times the gold standard of TF detected so far in elongating tissues. The validity of the workflow was supported by the biologists' manual expert review of the results. The workflow can serve as a new kind of bioinformatics tool for working on fused data sources and can thus be useful in studies of a wide range of biological processes.
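
For readers who want to check the evaluation figures, the sketch below shows how recall against a gold standard is conventionally computed and reworks the arithmetic from the counts quoted in the abstract. The `recall` helper and the toy sets are assumptions for illustration. Reading the "more than 3 times" extension as the ratio of expressed TF to gold standard size is one plausible interpretation, not something the abstract states explicitly.

```python
def recall(predicted, gold_standard):
    """Fraction of gold standard items that appear among the predictions."""
    gold = set(gold_standard)
    if not gold:
        raise ValueError("gold standard must not be empty")
    return len(gold & set(predicted)) / len(gold)


# Toy example (invented labels): 3 of 4 gold items are recovered.
toy_recall = recall({"A", "B", "C", "X"}, {"A", "B", "C", "D"})
print(toy_recall)

# Counts quoted in the abstract.
total_tf = 489    # bovine homologs identified as TF
absent_tf = 243   # TF absent from the high-throughput data
gold_size = 64    # size of the biological gold standard

expressed_tf = total_tf - absent_tf   # 246, as stated in the abstract
fold = expressed_tf / gold_size       # assumed reading of the "extension"
print(expressed_tf, round(fold, 2))
```

Under that reading, the extension is about 3.8-fold. A recall of 95% on a gold standard of 64 TF would correspond to roughly 61 recovered TF; this count is an arithmetic inference, since the abstract does not give it.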


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/58c4/3563503/b8961136aceb/1756-0381-5-12-1.jpg
