Corpus description

A corpus for the task is built from ScienceDirect open access publications and is freely available to participants, with no copyright agreement to sign. It consists of 500 journal articles evenly distributed among the domains Computer Science, Material Sciences and Physics.

Three types of documents are provided: plain text documents with sampled paragraphs, brat .ann standoff documents with annotations for those paragraphs, and .xml documents with the original full article text. Of the 500 documents, 350 form the training data, 50 are kept for development and 100 are kept for testing.
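To illustrate how a plain text document and its .ann file fit together, the following is a minimal sketch of reading text-bound annotations in the general brat standoff format, where each such line has the form ID, tab, label and character offsets, tab, covered text. The sample text, the label name "Term" and the helper names Entity and parse_ann are hypothetical and chosen only for illustration; they are not taken from the corpus or its utility scripts.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    """One text-bound annotation (a brat 'T' line)."""
    id: str
    label: str
    start: int
    end: int
    text: str


def parse_ann(content: str) -> List[Entity]:
    """Collect text-bound annotations from the content of a .ann file.

    Lines that do not start with 'T' (relations, notes, ...) are skipped.
    """
    entities = []
    for line in content.splitlines():
        if not line.startswith("T"):
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # malformed line, ignore it
        ann_id, header, surface = parts[0], parts[1], parts[2]
        fields = header.split(" ")
        label = fields[0]
        # Discontinuous spans look like "0 5;10 15"; keep the outer bounds.
        fragments = " ".join(fields[1:]).split(";")
        start = int(fragments[0].split()[0])
        end = int(fragments[-1].split()[1])
        entities.append(Entity(ann_id, label, start, end, surface))
    return entities


# Hypothetical sample data, not taken from the corpus.
sample_text = "Graphene is a material."
sample_ann = (
    "T1\tTerm 0 8\tGraphene\n"
    "T2\tTerm 14 22\tmaterial\n"
    "R1\tRelated Arg1:T1 Arg2:T2\n"
)

entities = parse_ann(sample_ann)
for entity in entities:
    # The offsets index into the plain text document.
    assert sample_text[entity.start:entity.end] == entity.text
```

The offsets are character positions in the matching plain text document, so slicing the text with them is a quick consistency check on any pair of files.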


Training data, development data, unlabelled testing data, labelled testing data, evaluation scripts, utility scripts [v2], annotation guidelines and configuration files are now available for download. Please refer to the README file in each .zip file for instructions.