Yusuf Amman, Boyne Devon J, O'Sullivan Dylan E, Brenner Darren R, Cheung Winson Y, Mirza Imran, Jarada Tamer N
Department of Oncology, University of Calgary, Calgary, AB, T2N 4N2, Canada.
Department of Community Health Sciences, University of Calgary, Calgary, AB, T2N 4Z6, Canada.
BMC Med Res Methodol. 2024 Mar 11;24(1):63. doi: 10.1186/s12874-024-02192-8.
Laboratory data can provide great value to research aimed at reducing the incidence of cancer, prolonging survival, and improving outcomes. Data are characterized by the information they carry and the format they hold. Data captured in Alberta's biomarker laboratory repository are free text, cluttered, and disorganized. This format limits their utility and hinders broader adoption and research development. Text analysis for information extraction from unstructured data can change this and lead to more complete analyses. Previous work on extracting relevant information from free-text, unstructured data has employed Natural Language Processing (NLP), Machine Learning (ML), rule-based Information Extraction (IE) methods, or a hybrid of these.
In our study, text analysis was performed on Alberta Precision Laboratories data consisting of 95,854 entries from the Southern Alberta Dataset (SAD) and 6944 entries from the Northern Alberta Dataset (NAD). The data cover all of Alberta and are fully population-based. Our proposed framework is built around rule-based IE methods. It incorporates lexical and syntax analyses to achieve deterministic extraction of data from biomarker laboratory data (i.e., Epidermal Growth Factor Receptor (EGFR) test results). Lexical analysis comprises data cleaning and pre-processing, conversion of Rich Text Format text into readable plain text, and normalization and tokenization of the text. The framework then passes the text to the syntax analysis stage, which applies the rule-based method to extract relevant data. Rule-based patterns in the test results are identified, and a Context-Free Grammar then generates the information extraction rules. Finally, the results are linked with the Alberta Cancer Registry to support real-world cancer research studies.
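The two stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the sample report string, the function names (`rtf_to_text`, `tokenize`, `parse_result`), the token vocabulary, and the toy grammar are all assumptions made for illustration, and the RTF stripping is deliberately crude.

```python
import re

def rtf_to_text(rtf: str) -> str:
    """Lexical step (rough): drop RTF control words, then group braces."""
    text = re.sub(r"\\[a-zA-Z]+-?\d* ?", " ", rtf)  # e.g. \rtf1, \par, \b0
    return re.sub(r"[{}]", " ", text)

def tokenize(text: str) -> list[str]:
    """Lexical step: normalize to lower case and split into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def parse_result(tokens: list[str]):
    """Syntax step: recursive-descent parse of a hypothetical toy grammar.

    RESULT -> 'egfr' ['mutation'] STATUS
    STATUS -> 'not' 'detected' | 'detected' | 'positive' | 'negative'
    Returns a structured record, or None if the tokens do not match.
    """
    pos = 0

    def accept(word: str) -> bool:
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == word:
            pos += 1
            return True
        return False

    if not accept("egfr"):
        return None
    accept("mutation")  # optional token
    if accept("not"):
        status = "negative" if accept("detected") else None
    elif accept("detected") or accept("positive"):
        status = "positive"
    elif accept("negative"):
        status = "negative"
    else:
        status = None
    if status is None or pos != len(tokens):
        return None  # reject partial or trailing matches (deterministic)
    return {"gene": "EGFR", "status": status}

# Hypothetical RTF-encoded report line, invented for this example.
sample = r"{\rtf1\ansi EGFR mutation: \b NOT detected\b0 .}"
print(parse_result(tokenize(rtf_to_text(sample))))
```

Because every rule either matches the full token sequence or rejects it, the same input always produces the same output, which is the sense in which a rule-based extraction is deterministic. Real reports would need a far richer grammar and a proper RTF parser.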
Of the original 5512 entries in SAD and 5017 entries in NAD that were filtered for EGFR, the framework yielded 5129 and 3388 extracted EGFR test results, respectively. An accuracy of 97.5% was achieved on a random sample of 362 tests.
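As a quick arithmetic check on the reported counts, the short Python sketch below computes the share of EGFR-filtered entries for which a result was extracted. The percentages and the implied number of correct tests are derived here from the figures above and are not values reported by the authors; the names `reported` and `extraction_share` are illustrative.

```python
# Counts taken from the abstract: (entries filtered for EGFR, results extracted).
reported = {"SAD": (5512, 5129), "NAD": (5017, 3388)}

def extraction_share(filtered: int, extracted: int) -> float:
    """Percentage of filtered entries that produced an extracted result."""
    return 100 * extracted / filtered

shares = {name: round(extraction_share(*counts), 2) for name, counts in reported.items()}
print(shares)  # {'SAD': 93.05, 'NAD': 67.53}

# The reported 97.5% accuracy on 362 sampled tests is consistent with
# about 353 correct extractions (an inference, not a reported figure).
sample_size = 362
implied_correct = round(0.975 * sample_size)
print(implied_correct, round(100 * implied_correct / sample_size, 1))  # 353 97.5
```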
We present a text analysis framework for extracting specific information from unstructured clinical data. The framework successfully extracted relevant information from EGFR test results.