The Jena University Language & Information Engineering Lab (JULIE Lab) focuses on automatic text analysis and its applications. The group works in multiple domains, languages and genres, including biomedical, clinical, ecological and economic texts as well as the digital humanities. While the emphasis is on building real-world natural language systems, the group also values a methodologically sound basis in terms of language technology tools and resources. The JULIE Lab members are heavily engaged in developing syntactic, semantic and discourse processors, as well as domain-specific knowledge resources (ontologies).
Over the years, the JULIE Lab team has gained a considerable international reputation and strong external visibility, and it is involved in various national and international research projects and collaborations.
The Hasso Plattner Institute (HPI), which has international subsidiaries, is part of the Digital Engineering Faculty of the University of Potsdam, Germany. It was established in 1998 with a focus on educating students in digital engineering. The group of Dr. Schapranow has years of combined experience in digital health and in-memory database technology. Furthermore, the group has contributed to numerous national and international research projects, such as the BMBF-funded Medical Informatics Initiative in the HiGHmed consortium.
In a joint cooperation, the Hasso Plattner Institute for Digital Engineering gGmbH in Potsdam, the Friedrich Schiller University of Jena, and the Guideline Program in Oncology publish the first open German medical text corpus for researchers based on clinical practice guidelines.
Nowadays, the majority of Natural Language Processing (NLP) tools are trained on and developed primarily with English texts. Hence, the German text corpus is a valuable contribution supporting research in medical informatics and computational linguistics. It enables research on methods for medical text analysis using real German scientific text, so researchers can evaluate and improve their NLP concepts and methods on German text.
In contrast to existing medical corpora, GGPONC does not contain any personal data that would require special data protection. Therefore, we can make GGPONC freely accessible to researchers.
GGPONC is based on the semi-structured data that also form the basis of the guideline app of the German Cancer Society. As a result, it contains high-quality curated scientific text documents. In addition to the plain text data, we provide extensive metadata, e.g. relating to evidence levels or literature references.
For all texts, medical experts have manually created annotations for the entity classes Findings, Substances and Procedures (inspired by the SNOMED CT concept model). We have obtained more than 200,000 entity annotations, which can be used to train machine learning (ML) models. Baseline models are provided together with version 2.0 of the corpus.
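As an illustration of how such entity annotations are commonly prepared for training a named entity recognition model, the following minimal sketch converts character-offset spans into token-level BIO tags. It makes several assumptions that are not taken from the corpus documentation: the `(start, end, label)` span format, the label strings, the simple regex tokenizer, and the helper names `find_span` and `spans_to_bio` are all hypothetical, and the example sentence is invented rather than taken from GGPONC.

```python
import re
from typing import List, Tuple

# Hypothetical span format: (start_char, end_char, label).
# The actual GGPONC file format may differ.
Span = Tuple[int, int, str]


def tokenize(text: str) -> List[Tuple[int, int, str]]:
    """Split into word tokens and single punctuation marks, keeping offsets."""
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def find_span(text: str, phrase: str, label: str) -> Span:
    """Build a span for the first occurrence of `phrase` (illustration only)."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def spans_to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Tag each token B-<label>, I-<label> or O depending on span coverage."""
    tagged = []
    for tok_start, tok_end, tok in tokenize(text):
        tag = "O"
        for span_start, span_end, label in spans:
            # A token belongs to a span if it lies fully inside it.
            if tok_start >= span_start and tok_end <= span_end:
                prefix = "B" if tok_start == span_start else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((tok, tag))
    return tagged


# Invented example sentence, not taken from the corpus.
text = "Bei akuter Blutung erhielt die Patientin Cisplatin nach der Operation."
spans = [
    find_span(text, "akuter Blutung", "Finding"),
    find_span(text, "Cisplatin", "Substance"),
    find_span(text, "Operation", "Procedure"),
]

for token, tag in spans_to_bio(text, spans):
    print(f"{token}\t{tag}")
```

The resulting token and tag pairs are in the shape that most sequence-labelling toolkits expect as training input; nested or overlapping annotations would need a more careful scheme than this first-match rule.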
V. 2.0, 1 June 2022:
- > 10,000 text segments
- > 1.8 million tokens
- > 200,000 annotated entities
- around 46,000 literature references
- 30 guidelines:
- Stomach cancer
- Actinic keratosis and SCC of the skin
- Prevention of cervical cancer
- Oesophageal cancer
- Bladder cancer
- Prevention of skin cancer
- Malignant melanoma
- Oral cavity cancer
- Breast cancer
- Pancreatic cancer
- Hodgkin lymphoma
- Prostate cancer
- Chronic lymphocytic leukemia
- Colorectal cancer
- Endometrial cancer
- Hepatocellular cancer
- Renal cell cancer
- Supportive therapy
- Cervical cancer
- Malignant ovarian tumors
- Lung cancer
- Testicular tumors
- Laryngeal cancer
- Palliative medicine
- New (since V 1.0):
- Complementary medicine
- Penile cancer
- Anal cancer
- Follicular lymphoma
- Adult soft tissue sarcomas
V. 1.0, 15 July 2020:
- > 8,000 text segments
- > 1.3 million tokens
- around 40,000 literature references
- 25 guidelines
Access & Download
In order to get access to the GGPONC text corpus, please send a request via email with a brief description of your research project and intended use of the corpus to: leitlinienprogramm[at]krebsgesellschaft.de
- GGPONC may be used for non-commercial research activities only.
- GGPONC may not be distributed by the corpus users to any other third party (including any project collaborators). All prospective users of the corpus must apply for access individually.
- The corpus is protected by copyright in all its parts. Any use beyond the limits of copyright law is not permitted without the written permission of the German Guideline Program in Oncology (GGPO). No part of the corpus may be reproduced in any form without prior written permission of the GGPO.
- GGPONC is provided free of charge.
- The corpus comes with absolutely no warranties, including (but not limited to) any warranty regarding the correctness of the information provided in the text corpus itself. The latest version of the clinical guidelines used for the corpus can be found at: https://www.leitlinienprogramm-onkologie.de/english-language/
- Contributions which are based on the corpus must cite the following publication:
- Florian Borchert, Christina Lohr, Luise Modersohn, Jonas Witt, Thomas Langer, Markus Follmann, Matthias Gietzelt, Bert Arnrich, Udo Hahn, and Matthieu-P. Schapranow. 2022. GGPONC 2.0 - The German Clinical Guideline Corpus for Oncology: Curation Workflow, Annotation Policy, Baseline NER Taggers. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3650–3660, Marseille, France. European Language Resources Association.
Florian Borchert, Laura Meister, Thomas Langer, Markus Follmann, Bert Arnrich, and Matthieu-P. Schapranow. Controversial Trials First: Identifying Disagreement Between Clinical Guidelines and New Evidence, Proceedings of the AMIA Annual Symposium, pp. 237-246, San Diego, USA, 2021 (Distinguished Paper Award)
GGPONC is part of: