INEX Wikipedia Corpus

An INEX Wikipedia Corpus is a Wikipedia data snapshot in which the Wikipedia corpus is marked up in XML with document structure and named entities.



    • The INEX XML Wikipedia collection is a marked-up version of the Wikipedia corpus. The mark-up includes named entities and document structure such as sections, tables and hyperlinks. The classification and clustering tasks use a 144,625-document subset of the INEX 2010 collection that has been pre-processed to provide several representations of the documents. These representations are available as vector space models of terms, frequent bi-grams, XML tags, trees, links and named entities. The collection is also available in XML format and in text-only format.


    • QUOTE: We propose an XML corpus based on Wikipedia. This corpus can be used in a large variety of XML IR tasks such as ad-hoc retrieval, categorization, clustering or the Structure Mapping task. This corpus will be used for both INEX 2007 and the XML Document Mining Challenge. You can find a description of the corpus in this article (published in SIGIR Forum).
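
The document representations mentioned above for the classification and clustering tasks (terms, bi-grams, XML tags, links) can be illustrated with a small sketch. This is not the INEX pre-processing pipeline: the sample document, its element names (`article`, `section`, `p`, `link`), the `target` attribute and the `represent` function are all hypothetical, chosen only to show how several views might be derived from one marked-up article. It counts all bi-grams rather than selecting frequent ones, since frequency filtering would need the whole collection.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical toy document; element names are illustrative, not the INEX schema.
SAMPLE = """<article>
  <title>Information retrieval</title>
  <section>
    <p>Information retrieval finds relevant documents.
       See <link target="Search engine">search engine</link>.</p>
  </section>
</article>"""


def represent(xml_text):
    """Derive several simple representations from one XML document."""
    root = ET.fromstring(xml_text)

    # Text-only view: concatenate all text nodes, then tokenize crudely.
    text = " ".join(root.itertext())
    tokens = re.findall(r"[a-z0-9]+", text.lower())

    # Term vector: raw term frequencies.
    terms = Counter(tokens)

    # Bi-gram vector: adjacent token pairs (pairs may span element boundaries).
    bigrams = Counter(zip(tokens, tokens[1:]))

    # Structural view: how often each XML tag occurs.
    tags = Counter(element.tag for element in root.iter())

    # Link view: targets of the (assumed) <link> elements.
    links = [element.get("target") for element in root.iter("link")]

    return {"terms": terms, "bigrams": bigrams, "tags": tags, "links": links}


if __name__ == "__main__":
    for name, value in represent(SAMPLE).items():
        print(name, value)
```

Each view is sparse, so a `Counter` (or a plain list for links) is enough here; a real collection of this size would more likely use sparse matrices with a shared vocabulary.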