Text analysis framework for identifying mutations among non-small cell lung cancer patients from laboratory data.

Author information

Yusuf Amman, Boyne Devon J, O'Sullivan Dylan E, Brenner Darren R, Cheung Winson Y, Mirza Imran, Jarada Tamer N

Affiliations

Department of Oncology, University of Calgary, Calgary, AB, T2N 4N2, Canada.

Department of Community Health Sciences, University of Calgary, Calgary, AB, T2N 4Z6, Canada.

Publication information

BMC Med Res Methodol. 2024 Mar 11;24(1):63. doi: 10.1186/s12874-024-02192-8.

DOI:10.1186/s12874-024-02192-8

PMID:38468224

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10926579/

Abstract

BACKGROUND

Laboratory data can be of great value to research aimed at reducing the incidence of cancer, prolonging survival, and improving outcomes. Data is characterized by the information it carries and the format in which it is held. Data captured in Alberta's biomarker laboratory repository is free text, cluttered and uncurated. This format limits the data's utility and hinders broader adoption and research development. Text analysis for information extraction from unstructured data can change this and lead to more complete analyses. Previous work on extracting relevant information from free-text, unstructured data has employed Natural Language Processing (NLP), Machine Learning (ML), rule-based Information Extraction (IE) methods, or a hybrid of these.

METHODS

In our study, text analysis was performed on Alberta Precision Laboratories data consisting of 95,854 entries from the Southern Alberta Dataset (SAD) and 6944 entries from the Northern Alberta Dataset (NAD). The data covers all of Alberta and is completely population-based. Our proposed framework is built around rule-based IE methods. It combines lexical and syntax analysis to achieve deterministic extraction from biomarker laboratory data (i.e., Epidermal Growth Factor Receptor (EGFR) test results). Lexical analysis comprises data cleaning and pre-processing, conversion of Rich Text Format text into readable plain text, and normalization and tokenization of the text. The framework then passes the text to the syntax analysis stage, which applies the rule-based method to extract the relevant data: rule-based patterns in the test results are identified, and a context-free grammar then generates the information extraction rules. Finally, the results are linked with the Alberta Cancer Registry to support real-world cancer research studies.
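The two stages described above (lexical analysis followed by grammar-driven syntax analysis) could be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not give the report wording, the vocabulary, or the grammar, so the productions, the filler-word list, and the function names (`normalize`, `tokenize`, `parse_result`, `extract`) are all assumptions. The input is also assumed to be plain text already, so the Rich Text Format conversion step is omitted.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical vocabulary; the framework's real rules are not in the abstract.
FILLER = {"mutation", "mutations", "result", "status", "is", "was", "for", "the", "test"}
POINT_VARIANTS = {"l858r", "t790m"}


def normalize(text: str) -> str:
    """Lexical step: lowercase and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def tokenize(text: str) -> List[str]:
    """Split normalized text into tokens, dropping filler words."""
    return [tok for tok in normalize(text).split() if tok not in FILLER]


def parse_status(tokens: List[str], pos: int) -> Tuple[Optional[str], int]:
    """STATUS -> 'positive' | 'detected' | 'negative' | 'not' 'detected'"""
    if tokens[pos:pos + 2] == ["not", "detected"]:
        return "negative", pos + 2
    if pos < len(tokens) and tokens[pos] in ("positive", "detected"):
        return "positive", pos + 1
    if pos < len(tokens) and tokens[pos] == "negative":
        return "negative", pos + 1
    return None, pos


def parse_variant(tokens: List[str], pos: int) -> Tuple[Optional[str], int]:
    """VARIANT -> 'exon' NUMBER ('deletion' | 'insertion') | POINT_VARIANT | empty"""
    if (len(tokens) - pos >= 3 and tokens[pos] == "exon"
            and tokens[pos + 1].isdigit()
            and tokens[pos + 2] in ("deletion", "insertion")):
        return f"exon {tokens[pos + 1]} {tokens[pos + 2]}", pos + 3
    if pos < len(tokens) and tokens[pos] in POINT_VARIANTS:
        return tokens[pos].upper(), pos + 1
    return None, pos  # the variant is optional


def parse_result(tokens: List[str]) -> Optional[dict]:
    """RESULT -> 'egfr' STATUS VARIANT; returns None when no rule matches."""
    if not tokens or tokens[0] != "egfr":
        return None
    status, pos = parse_status(tokens, 1)
    if status is None:
        return None
    variant, pos = parse_variant(tokens, pos)
    if pos != len(tokens):
        return None  # reject text the grammar does not fully cover
    return {"gene": "EGFR", "status": status, "variant": variant}


def extract(text: str) -> Optional[dict]:
    """Run the lexical stage, then the syntax stage, on one report line."""
    return parse_result(tokenize(text))


if __name__ == "__main__":
    # Invented example strings, not taken from the Alberta datasets.
    for line in ("EGFR: Positive for Exon 19 deletion.",
                 "EGFR mutation: NOT DETECTED",
                 "EGFR: insufficient tissue"):
        print(line, "->", extract(line))
```

Rejecting any input that the grammar does not fully cover keeps the extraction deterministic: a line either matches a rule and yields a structured record, or it is left for manual review rather than guessed at.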

RESULTS

Of the 5512 SAD entries and 5017 NAD entries retained after filtering for EGFR, the framework yielded 5129 and 3388 extracted EGFR test results, respectively. An accuracy of 97.5% was achieved on a random sample of 362 tests.
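To put these counts in proportion, the short calculation below derives the share of EGFR-filtered entries for which a result was extracted. The percentages are computed here from the reported counts and are not stated in the abstract; the count of correct tests in the validation sample is likewise only an approximation inferred from the reported 97.5%.

```python
# Counts as reported in the abstract.
filtered = {"SAD": 5512, "NAD": 5017}   # entries filtered for EGFR
extracted = {"SAD": 5129, "NAD": 3388}  # extracted EGFR test results

# Derived (not reported): extraction yield per dataset, in percent.
yield_pct = {
    name: round(100 * extracted[name] / filtered[name], 1)
    for name in filtered
}

# Derived (not reported): approximate number of correct tests in the sample.
sample_size = 362
accuracy = 0.975
approx_correct = round(sample_size * accuracy)

print(yield_pct)       # {'SAD': 93.1, 'NAD': 67.5}
print(approx_correct)  # 353
```

The abstract does not explain why the yield differs between the two datasets, so the figures should be read only as a restatement of the reported counts.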

CONCLUSIONS

We present a text analysis framework for extracting specific information from unstructured clinical data. The proposed framework successfully extracted relevant information from EGFR test results.
