German Guideline Program in Oncology NLP Corpus (GGPONC)

The German NLP text corpus by the Guideline Program in Oncology

Project Partners

JULIE Lab, Friedrich Schiller University of Jena

The Jena University Language & Information Engineering Lab (JULIE Lab) focuses on automatic text analysis and its applications. The group works in multiple domains, languages and genres, including biomedical, clinical, ecological and economic texts as well as the digital humanities. While the emphasis is on building real-world natural language systems, the group also values a methodologically sound basis in terms of language technology tools and resources. The JULIE Lab members are heavily engaged in developing syntactic, semantic and discourse processors, as well as domain-specific knowledge resources (ontologies).

Over the years, the JULIE Lab team has gained a considerable international reputation and strong external visibility, and it is involved in various national and international research projects and collaborations.

Hasso Plattner Institute for Digital Engineering gGmbH

The Hasso Plattner Institute (HPI) is part of the Digital Engineering Faculty of the University of Potsdam, Germany, and has international subsidiaries. It was established in 1998 with a focus on educating students in digital engineering. The group of Dr. Schapranow has for years combined experience in digital health and in-memory database technology. Furthermore, the group has contributed to numerous national and international research projects, such as the BMBF-funded Medical Informatics Initiative as part of the HiGHmed consortium.

Motivation

In a joint effort, the Hasso Plattner Institute for Digital Engineering gGmbH in Potsdam, the Friedrich Schiller University of Jena, and the Guideline Program in Oncology publish the first open German medical text corpus based on clinical practice guidelines for use by researchers.

Nowadays, the majority of Natural Language Processing (NLP) tools are developed with and trained on English texts. Hence, this German text corpus is a valuable contribution supporting research in medical informatics and computational linguistics. It enables research on methods for medical text analysis using real German scientific text. Thus, researchers can evaluate and improve their concepts and methods for Natural Language Processing on German text.

In contrast to existing medical corpora, GGPONC does not contain any personal data that would require special data protection. Therefore, we can make GGPONC freely accessible to researchers.

Description

GGPONC is based on the semi-structured data that also form the basis of the guideline app of the German Cancer Society. As a result, it contains high-quality, curated scientific text documents. In addition to the plain text data, we provide extensive metadata, e.g. relating to evidence levels or literature references.

For all texts, medical experts have manually created annotations for the entity classes Findings, Substances and Procedures (inspired by the SNOMED CT concept model). We have obtained more than 200,000 entity annotations, which can be used to train ML models. Baseline models are provided together with version 2.0 of the corpus.

Versions

V. 2.0, 1 June 2022:

  • > 10,000 text segments
  • > 1.8 million tokens
  • > 200,000 annotated entities
  • around 46,000 literature references
  • 30 guidelines:
    • Stomach cancer
    • Actinic keratosis and squamous cell carcinoma (SCC) of the skin
    • Prevention of cervical cancer
    • Oesophageal cancer
    • Bladder cancer
    • Prevention of skin cancer
    • Malignant melanoma
    • Oral cavity cancer
    • Psycho-oncology
    • Breast cancer
    • Pancreatic cancer
    • Hodgkin lymphoma
    • Prostate cancer
    • Chronic lymphocytic leukemia
    • Colorectal cancer
    • Endometrial cancer
    • Hepatocellular cancer
    • Renal cell cancer
    • Supportive therapy
    • Cervical cancer
    • Malignant ovarian tumors
    • Lung cancer
    • Testicular tumors
    • Laryngeal cancer
    • Palliative medicine
    • New (since V 1.0):
      • Complementary medicine
      • Penile cancer
      • Anal cancer
      • Follicular lymphoma
      • Adult soft tissue sarcomas

V. 1.0, 15 July 2020:

  • > 8,000 text segments
  • > 1.3 million tokens
  • around 40,000 literature references
  • 25 guidelines

Access & Download

To get access to the GGPONC text corpus, please send a request by email, including a brief description of your research project and the intended use of the corpus, to: leitlinienprogramm[at]krebsgesellschaft.de

Terms of Use

  • GGPONC may be used for non-commercial research activities only.
  • GGPONC may not be distributed by the corpus users to any other third party (including any project collaborators). All prospective users of the corpus must apply for access individually.
  • The copyright of the corpus is protected in all parts. Any use beyond what copyright law permits is not allowed and is illegal without the written permission of the German Guideline Program in Oncology (GGPO). No part of the corpus may be reproduced in any form without prior written permission of the GGPO.
  • GGPONC is provided free of charge.
  • The corpus comes with absolutely no warranties, including (but not limited to) any warranty regarding the correctness of the information provided in the text corpus itself. The latest version of the clinical guidelines used for the corpus can be found at: https://www.leitlinienprogramm-onkologie.de/english-language/
  • Contributions which are based on the corpus must cite the following publication:

Further publications:

Florian Borchert, Laura Meister, Thomas Langer, Markus Follmann, Bert Arnrich, and Matthieu-P. Schapranow. Controversial Trials First: Identifying Disagreement Between Clinical Guidelines and New Evidence, Proceedings of the AMIA Annual Symposium, pp. 237-246, San Diego, USA, 2021 (Distinguished Paper Award)

see also papers citing GGPONC 1.0 or 2.0

GGPONC is part of: