Chin Wee Loong, Lassmann Timo
National Centre for Asbestos Related Diseases, QEII Medical Centre, Nedlands, WA 6009, Australia.
Department of Medical Oncology, Sir Charles Gairdner Hospital, Hospital Ave, Nedlands, WA 6009, Australia.
Bioinformatics. 2024 Dec 26;41(1). doi: 10.1093/bioinformatics/btae759.
Over the last two decades, transcriptomics has become a standard technique in biomedical research. We now have large databases of RNA-seq data, accompanied by valuable metadata detailing scientific objectives and the experimental procedures used. The metadata is crucial in understanding and replicating published studies, but so far has been underutilized in helping researchers to discover existing datasets.
We present SampleExplorer, a tool that lets researchers search for relevant data using both text and gene set queries. SampleExplorer embeds sample metadata and uses a transformer-based language model to retrieve similar datasets. Extensive benchmarking (see Supplementary Materials and Methods) using the ARCHS4 database demonstrates that SampleExplorer effectively retrieves biologically relevant samples from large-scale transcriptomic data. It improves sample and dataset identification across diverse experimental contexts, helping researchers reuse existing transcriptomic data from large public repositories for potential replication or verification studies.
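The general idea of embedding-based metadata retrieval can be illustrated with a minimal sketch. This is not SampleExplorer's implementation or API: the function names, sample identifiers, and metadata strings are hypothetical, and a normalised bag-of-words vector stands in for the transformer-based embedding so that the example runs with NumPy alone.

```python
import numpy as np


def build_vocab(texts):
    """Map every distinct lowercase token in the corpus to a column index."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def embed(text, vocab):
    """Toy stand-in for a transformer encoder (assumption): unit-length token counts."""
    vec = np.zeros(len(vocab))
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def rank_samples(query, metadata, top_k=2):
    """Rank samples by cosine similarity between query and metadata embeddings."""
    ids = list(metadata)
    vocab = build_vocab(metadata.values())
    matrix = np.stack([embed(metadata[i], vocab) for i in ids])
    # Vectors are unit length (or all zero), so the dot product is the cosine similarity.
    scores = matrix @ embed(query, vocab)
    order = np.argsort(-scores, kind="stable")[:top_k]
    return [(ids[i], float(scores[i])) for i in order]


# Hypothetical sample metadata, invented purely for illustration.
metadata = {
    "sample_A": "lung tumour biopsy treated with anti-PD-1 immunotherapy",
    "sample_B": "mouse liver tissue under high fat diet",
    "sample_C": "mesothelioma cell line treated with cisplatin chemotherapy",
}
results = rank_samples("mesothelioma chemotherapy response", metadata)
```

In this toy setting only exact token matches contribute to the score. A transformer-based embedding is intended to also place semantically related descriptions close together, which a token-count vector cannot do.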
Availability and implementation: SampleExplorer is a Python package compatible with Python versions 3.9 to 3.11 and can be installed from the Python Package Index (PyPI). The codebase and documentation are accessible at https://github.com/wlchin/SampleExplorer. Supplementary data (Supplementary Materials and Methods) provide detailed methodological information, including an algorithmic description of the retrieval process and data preparation steps.