JULIE Lab, Friedrich Schiller University Jena, Germany.
Intelligence and Informatics in Medicine, Medical Center rechts der Isar, Technical University Munich, Germany.
Stud Health Technol Inform. 2022 Aug 17;296:66-72. doi: 10.3233/SHTI220805.
We describe the creation of GRASCCO, a novel German-language corpus composed of some 60 clinical documents with more than 43,000 tokens. GRASCCO is a synthetic corpus resulting from a series of alienation steps that obfuscate privacy-sensitive information contained in real clinical documents, the true origin of all GRASCCO texts. It is therefore publicly shareable without any legal restrictions. We also explore whether this corpus still represents common clinical language use by comparing it with a real (non-shareable) clinical corpus we developed as a contribution to the Medical Informatics Initiative in Germany (MII) within the SMITH consortium. We find evidence that such a claim can indeed be made.