This corpus is intended for the study of multi-label text classifiers. It consists of scientific papers in the field of High Energy Physics (HEP) obtained from the CDS document server of the European Organization for Nuclear Research (CERN). The corpus is divided into three subsets (called partitions), and each partition consists of two files: one containing the metadata record of each paper (with information such as the abstract, the authors and, of course, the classes or keywords) in compressed XML format, and another containing a plain-text version of each complete paper, generated from the PDF available in the CERN databases (tar + gzip format). Classes are marked by the XML tag KEYWORD. They are the labels manually assigned from the DESY thesaurus. You can get more information about the DESY thesaurus.
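As a rough illustration of how the labels might be read from a metadata file, here is a minimal Python sketch. Only the KEYWORD tag comes from the description above. The record element name (`record`), the use of gzip as the compression format for the metadata, and the sample keywords are assumptions made for the example, so adjust them to the actual files.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def keywords_per_record(xml_source, record_tag="record", label_tag="KEYWORD"):
    """Return one list of label strings per record element.

    record_tag is a hypothetical element name (an assumption);
    label_tag is the KEYWORD tag named in the corpus description.
    """
    labels = []
    # Stream the file so that a large partition is never fully loaded.
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == record_tag:
            found = [(kw.text or "").strip() for kw in elem.iter(label_tag)]
            labels.append([k for k in found if k])
            elem.clear()  # release the children of the finished record
    return labels


# Invented in-memory stand-in for a gzip-compressed metadata file.
sample = b"""<collection>
  <record><title>A</title><KEYWORD>string model</KEYWORD><KEYWORD>supersymmetry</KEYWORD></record>
  <record><title>B</title><KEYWORD>black hole</KEYWORD></record>
</collection>"""

with gzip.open(io.BytesIO(gzip.compress(sample)), "rb") as fh:
    result = keywords_per_record(fh)

print(result)
```

Each inner list is the label set of one paper, which is the usual starting point for building a multi-label target matrix.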
Partition hepth: 18,114 Theoretical Physics documents (metadata: 5.3 MB; papers: 226 MB)
Partition hepex: 2,599 Experimental Physics documents (metadata: 1.6 MB; papers: 28 MB)
Partition astroph: 2,716 Astrophysics documents (metadata: 1.1 MB; papers: 29 MB)
Updated on 23.04.2007: Thanks to Ioannis Katakis, from Aristotle University of Thessaloniki (Greece), for fixing some problems in the supplied XML.
How to reference
This corpus has been prepared by Arturo Montejo Ráez with metadata supplied by Jens Vigen and the CDS Support Team.