The Jena University Language & Information Engineering Lab (JULIE Lab) focusses on automatic text analysis and its applications. The group works in multiple domains, languages and genres, including biomedical, clinical, ecological and economic texts as well as the digital humanities. While the emphasis is on building real-world natural language systems, the group also values a methodologically sound basis in terms of language technology tools and resources. The JULIE Lab members are heavily engaged in developing syntactic, semantic and discourse processors, as well as domain-specific knowledge resources (ontologies).
Over the years, the JULIE Lab team has gained a considerable international reputation and strong external visibility, and it is involved in various national and international research projects and collaborations.
The Hasso Plattner Institute (HPI) is part of the Digital Engineering Faculty of the University of Potsdam, Germany, and has international subsidiaries. It was established in 1998 with a focus on educating students in digital engineering. The group of Dr. Schapranow brings together many years of experience in digital health and in-memory database technology. Furthermore, the group has contributed to numerous national and international research projects, such as the BMBF-funded Medical Informatics Initiative in the HiGHmed consortium.
In a joint effort, the Hasso Plattner Institute for Digital Engineering gGmbH in Potsdam, the Friedrich Schiller University of Jena, and the Guideline Program in Oncology publish the first open German medical text corpus for researchers, based on clinical practice guidelines.
The majority of Natural Language Processing (NLP) tools are developed with and trained primarily on English texts. Hence, the German text corpus is a valuable contribution supporting research in medical informatics and computational linguistics. It enables research on methods for medical text analysis using real German scientific text. Researchers can thus evaluate and improve their concepts and methods for NLP on German text.
In contrast to existing medical corpora, GGPONC does not contain any personal data that would require special data protection. Therefore, we can make GGPONC freely accessible to researchers.
GGPONC is based on the semi-structured data that also form the basis of the guideline app of the German Cancer Society. As a result, it contains high-quality, curated scientific text documents. In addition to the plain text data, we provide extensive metadata, e.g. relating to evidence levels or literature references.
For all texts, medical experts have manually created annotations for the entity classes Findings, Substances and Procedures (inspired by the SNOMED CT concept model). We have obtained more than 200,000 entity annotations, which can be used to train ML models. Baseline models are provided together with version 2.0 of the corpus.
V. 2.0, 01.06.2022:
V. 1.0, 15.07.2020:
To get access to the GGPONC text corpus, please send a request by email with a brief description of your research project and intended use of the corpus to: leitlinienprogramm[at]krebsgesellschaft.de
Florian Borchert, Christina Lohr, Luise Modersohn, Jonas Witt, Thomas Langer, Markus Follmann, Matthias Gietzelt, Bert Arnrich, Udo Hahn and Matthieu-P. Schapranow. GGPONC 2.0 - The German Clinical Guideline Corpus for Oncology: Curation Workflow, Annotation Policy, Baseline NER Taggers. LREC 2022 — Proceedings of the 13th International Conference on Language Resources and Evaluation. Marseille, France, 2022 (download Poster)
Florian Borchert, Laura Meister, Thomas Langer, Markus Follmann, Bert Arnrich, and Matthieu-P. Schapranow. Controversial Trials First: Identifying Disagreement Between Clinical Guidelines and New Evidence, Proceedings of the AMIA Annual Symposium, pp. 237-246, San Diego, USA, 2021 (Distinguished Paper Award)
GGPONC is part of: