The German NLP text corpus by the Guideline Program in Oncology
JULIE Lab, Friedrich Schiller University of Jena
The Jena University Language & Information Engineering Lab (JULIE Lab) focusses on automatic text analysis and its applications. The group works in multiple domains, languages and genres, including biomedical, clinical, ecological and economic texts as well as the digital humanities. While the emphasis is on building real-world natural language systems, the group also values a methodologically sound basis in terms of language technology tools and resources. The JULIE Lab members are heavily engaged in developing syntactic, semantic and discourse processors, as well as domain-specific knowledge resources (ontologies).
Over the years, the JULIE Lab team has gained a considerable international reputation and strong external visibility, and it is involved in various national and international research projects and collaborations.
Hasso Plattner Institute for Digital Engineering gGmbH
The Hasso Plattner Institute (HPI) is part of the Digital Engineering Faculty of the University of Potsdam, Germany, and has international subsidiaries. It was established in 1998 with a focus on educating students in digital engineering. The group of Dr. Schapranow has for years combined experience in digital health and in-memory database technology. The group has also contributed to numerous national and international research projects, such as the BMBF-funded Medical Informatics Initiative in the HiGHmed consortium.
In a joint effort, the Hasso Plattner Institute for Digital Engineering gGmbH in Potsdam, the Friedrich Schiller University of Jena, and the Guideline Program in Oncology publish the first open German medical text corpus for researchers based on clinical practice guidelines: the German Guideline Program in Oncology NLP Corpus (GGPONC).
Today, most Natural Language Processing (NLP) tools are developed for and trained primarily on English texts. The German text corpus is therefore a valuable contribution to research in medical informatics and computational linguistics. It enables research on methods for medical text analysis using real German scientific text, so that researchers can evaluate and improve their NLP concepts and methods on German text.
In contrast to existing medical corpora, GGPONC does not contain any personal data that would require special data protection. Therefore, we can make GGPONC freely accessible to researchers.
GGPONC is based on the semi-structured data that also form the basis of the guideline app of the German Cancer Society. As a result, it contains high-quality, curated scientific text documents. In addition to the plain text data, we provide extensive metadata, e.g. relating to evidence levels or literature references.
For all texts, we automatically created annotations of different entity classes of the Unified Medical Language System (UMLS). A subset of these annotations has been reviewed by human experts and is provided as a so-called gold standard for the validation of tools.
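As an illustration of how a gold standard can be used to validate a tool, the following minimal sketch compares a tool's annotations with expert-reviewed ones and computes precision, recall and F1 under exact span matching. It does not reflect the actual GGPONC file format or an official evaluation procedure: the `Span` type, the `evaluate` function, the document IDs, offsets and label names are all hypothetical and chosen only for the example.

```python
from typing import Dict, NamedTuple, Set


class Span(NamedTuple):
    """One annotation: a labelled character range in a document (hypothetical layout)."""
    doc_id: str
    start: int
    end: int
    label: str


def evaluate(gold: Set[Span], predicted: Set[Span]) -> Dict[str, float]:
    """Exact-match precision, recall and F1 of predicted spans against gold spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up annotations; the labels are illustrative, not actual corpus classes.
gold = {
    Span("doc1", 0, 10, "Disorder"),
    Span("doc1", 20, 31, "Chemical"),
    Span("doc2", 5, 12, "Procedure"),
}
predicted = {
    Span("doc1", 0, 10, "Disorder"),    # matches the gold standard
    Span("doc1", 20, 31, "Procedure"),  # right span, wrong label
    Span("doc2", 5, 12, "Procedure"),   # matches the gold standard
    Span("doc2", 40, 48, "Disorder"),   # not in the gold standard
}

scores = evaluate(gold, predicted)
print(scores)
```

Exact matching is the strictest choice: a span with correct boundaries but the wrong class counts as an error. Depending on the research question, a more lenient scheme such as overlap-based matching may be more appropriate.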
V. 1.0, 15.07.2020:
To request access to the GGPONC text corpus, please send an email with a brief description of your research project and intended use of the corpus to: leitlinienprogramm[at]krebsgesellschaft.de