Narayanan Adhithya, Topaloglu Umit, Laurini Javier A, Diaz-Garelli Franck
University of North Carolina at Chapel Hill, Chapel Hill, NC.
Wake Forest Baptist Medical Center, Winston-Salem, NC.
AMIA Jt Summits Transl Sci Proc. 2020 May 30;2020:440-448. eCollection 2020.
Precision oncology research seeks to derive knowledge from existing data. Current efforts aim to integrate clinical and genomic data across cancer centers to enable impactful secondary use. However, the reliability of integrated data depends on the curation method used and on how systematically it is applied. In practice, data integration and mapping are often done manually, even though crucial data such as oncological diagnoses (DX) vary in accuracy and specificity. We hypothesized that mapping text-form cancer DX to a standardized terminology (OncoTree) could be automated using existing methods, e.g., natural language processing (NLP) modules and application programming interfaces (APIs). Our best-performing pipeline prototype was effective but limited by the state of API development: it accurately mapped 96.2% of the textual DX dataset to the NCI Thesaurus (NCIt), but only 44.2% through NCIt to OncoTree. These results suggest that the pipeline model could be a viable way to automate data curation. Such techniques may become increasingly reliable with further development.
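The two-stage mapping described in the abstract (free-text DX to NCIt, then NCIt to OncoTree) can be illustrated with a minimal sketch. This is not the authors' pipeline: the lookup tables, the placeholder codes, and the function names (normalize, map_to_ncit, map_to_oncotree, mapping_rates) are all hypothetical stand-ins for the NLP modules and terminology APIs a real system would call. The sketch also reports only how many diagnoses receive a mapping at each stage, whereas the abstract reports mapping accuracy, which would additionally require a manually curated reference set.

```python
from typing import List, Optional, Tuple

# Hypothetical in-memory tables standing in for NLP/API lookups.
# All codes are illustrative placeholders, not real NCIt or OncoTree codes.
TEXT_TO_NCIT = {
    "invasive ductal carcinoma": "NCIT_A",
    "lung adenocarcinoma": "NCIT_B",
    "glioblastoma": "NCIT_C",
}
NCIT_TO_ONCOTREE = {
    "NCIT_A": "ONCOTREE_X",
    "NCIT_B": "ONCOTREE_Y",
    # "NCIT_C" is deliberately absent to mimic a gap in the second stage.
}


def normalize(dx_text: str) -> str:
    """Lowercase the diagnosis text and collapse runs of whitespace."""
    return " ".join(dx_text.lower().split())


def map_to_ncit(dx_text: str) -> Optional[str]:
    """Stage 1: free-text diagnosis to a (placeholder) NCIt code."""
    return TEXT_TO_NCIT.get(normalize(dx_text))


def map_to_oncotree(ncit_code: Optional[str]) -> Optional[str]:
    """Stage 2: (placeholder) NCIt code to a (placeholder) OncoTree code."""
    if ncit_code is None:
        return None
    return NCIT_TO_ONCOTREE.get(ncit_code)


def mapping_rates(dx_texts: List[str]) -> Tuple[float, float]:
    """Return the fraction of diagnoses mapped at stage 1 and end to end."""
    total = len(dx_texts)
    if total == 0:
        return 0.0, 0.0
    ncit_codes = [map_to_ncit(text) for text in dx_texts]
    ncit_hits = sum(code is not None for code in ncit_codes)
    oncotree_hits = sum(map_to_oncotree(code) is not None for code in ncit_codes)
    return ncit_hits / total, oncotree_hits / total


if __name__ == "__main__":
    sample = [
        "Invasive  Ductal Carcinoma",
        "lung adenocarcinoma",
        "glioblastoma",
        "unknown primary",
    ]
    stage1, end_to_end = mapping_rates(sample)
    print(f"mapped to NCIt: {stage1:.1%}; mapped through to OncoTree: {end_to_end:.1%}")
```

Reporting the two rates separately mirrors the abstract's framing, in which a high first-stage rate can coexist with a much lower end-to-end rate when the second-stage crosswalk is incomplete.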