KDDCup08 Proposal


This page contains a working draft of a proposal for KDDCup 2008 which must be sent to kdd08chairs@gmail.com by October 30, 2007.

Draft Submission: KDDCup08


At a high level, the proposal is for an information extraction task within the biomedical research domain. Highlights of the proposal are that:

  1. The task will directly contribute to solving an important real-world problem. Specifically, the extracted data would be added to a real-world database, and this growth will help researchers discover additional knowledge about organisms such as bacteria.
  2. The task will spur research because it is difficult but eminently solvable. Recent experiments on the data suggest the ability to attain precision and recall performance in the 40% range; domain experts can attain more than twice this performance.
  3. The data will be used in future research because it is public and will remain public.
  4. The task will be the first to test whether a semi-supervised approach can beat a supervised one. This is made possible by providing a large pool of unlabelled data.
  5. The semi-supervised data set is large: over 2GB of data.
  6. The sought semantic relation is a ternary relation. In other tasks the relation is a binary relation (e.g. Company/Headquarters).
  7. The sought semantic relation is highly inter-sentential. In other tasks the relations are simple and found within a single sentence (that is, they are intra-sentential).


As background, biomedical researchers do not typically add their findings to a structured database. Their data instead remains locked up in their published papers until a biomedical research curator extracts it and adds it to a database. For protein localization, the master database is PSortDB (http://db.psort.org/). The database, however, contains only a minuscule fraction of the known localizations: approximately 1,000 of the tens of thousands of localizations that are estimated to be in the literature. Extending it by the current means, hiring postdoctoral researchers to read through research papers, is very costly.

Data set description

The data is composed of ~400 positive cases and ~2200 negative cases of the ternary semantic relation drawn from a corpus of ~850 PubMed abstracts. The number of positive and negative cases may double before the data is released.
An additional ~200,000 abstracts (~2GB) will also be made available for researchers who propose semi-supervised learning approaches. This data will be made available on DVD as a password-protected file.

Description of the competition tasks

Task 1 - Relation Detection

Task: For each abstract, identify all cases of the desired ternary semantic relation.

    • Challenge: 1) ternary relations are novel; 2) intersentential relations are novel.
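
To make the two challenges concrete, the following is a minimal sketch of how one case of the ternary relation might be represented. The class names, the `sentence_index` field, and the example values are illustrative assumptions and do not describe the format of the released data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """One entity mention inside an abstract (hypothetical structure)."""
    text: str
    sentence_index: int  # assumed: 0-based index of the sentence holding the mention


@dataclass(frozen=True)
class LocalizationCase:
    """One case of the ternary relation: (organism, protein, location)."""
    organism: Mention
    protein: Mention
    location: Mention

    def sentence_span(self) -> int:
        """Number of sentences from the first mention to the last, inclusive."""
        indices = [
            self.organism.sentence_index,
            self.protein.sentence_index,
            self.location.sentence_index,
        ]
        return max(indices) - min(indices) + 1

    def is_intersentential(self) -> bool:
        """True when the three mentions do not all sit in one sentence."""
        return self.sentence_span() > 1


# Illustrative values only; this case is not drawn from the data set.
example = LocalizationCase(
    organism=Mention("Escherichia coli", sentence_index=0),
    protein=Mention("OmpA", sentence_index=1),
    location=Mention("outer membrane", sentence_index=2),
)
print(example.sentence_span(), example.is_intersentential())
```

A binary, single-sentence relation would need only two mentions that share one sentence index; here three arguments have to be linked, possibly across sentence boundaries.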

Task 2 - Relation Classification

Task: For each detected relation, classify the implied certainty of the case. The labels associated with each case are: “experimentally validated”, “hypothesized”, and “assumed”. Data that is based on first-hand observation is significantly more valuable than hypothesized data.

    • Challenge: Categorizing a relation by its implied certainty will require the model to use contextual information from elsewhere in the document, for example whether the word "experiment" appears somewhere before the relation mention.
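
As a rough illustration of the kind of contextual feature described above, the sketch below checks whether a cue word occurs anywhere before the relation mention. The function name, the whitespace tokenization, and the prefix matching on "experiment" are assumptions made for this example; participants would be free to design their own features.

```python
def cue_precedes_mention(tokens, mention_start, cue_stems=("experiment",)):
    """Return True if any token before the mention starts with a cue stem.

    tokens        -- the document as a list of token strings
    mention_start -- index of the first token of the relation mention
    cue_stems     -- lower-case prefixes treated as certainty cues (assumed)
    """
    preceding = [token.lower() for token in tokens[:mention_start]]
    return any(
        token.startswith(stem) for token in preceding for stem in cue_stems
    )


# Illustrative sentence only; it is not taken from the data set.
tokens = "In our experiments , OmpA localized to the outer membrane .".split()
print(cue_precedes_mention(tokens, mention_start=4))
```

Such a binary feature could be one input among many to a classifier over the three certainty labels.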

Task 3 - Entity Mention Grounding to an Ontology

Task: For each of the three entities (organism, protein, location), identify the unique identifier in the provided database.

    • Challenge: The protein entity will be the most challenging one to ground because there are many records with the same name.

Evaluation measures

  • Tasks 1 and 2 will be evaluated based on F-measure performance. Ties will be broken based on the precision on the top 10% of predictions.
  • Task 3 will be evaluated on classification accuracy.
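
The measures above can be sketched as follows. This is a minimal sketch under stated assumptions: the F-measure is taken to be the balanced F1 score, each predicted case is a hashable tuple, each prediction carries a confidence score used for ranking, and the size of the top 10% is rounded up. None of these details is fixed by the proposal.

```python
import math


def precision_recall_f1(predicted, gold):
    """Precision, recall, and balanced F1 over two collections of cases."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def precision_at_top_fraction(scored_predictions, gold, fraction=0.10):
    """Precision over the highest-scoring fraction of predictions.

    scored_predictions -- iterable of (case, confidence) pairs (assumed format)
    """
    ranked = sorted(scored_predictions, key=lambda pair: pair[1], reverse=True)
    if not ranked:
        return 0.0
    top_k = max(1, math.ceil(len(ranked) * fraction))
    gold = set(gold)
    top_cases = [case for case, _ in ranked[:top_k]]
    return sum(1 for case in top_cases if case in gold) / len(top_cases)


# Hypothetical cases written as (organism, protein, location) tuples.
gold = {("org1", "protA", "loc1"), ("org1", "protB", "loc2")}
scored = [
    (("org1", "protA", "loc1"), 0.9),
    (("org1", "protC", "loc1"), 0.4),
]
print(precision_recall_f1([case for case, _ in scored], gold))
print(precision_at_top_fraction(scored, gold))
```

The tie-breaker rewards systems whose most confident predictions are correct, which matters when curators can review only a short list.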

Methods to facilitate participation

Extensive steps will be taken to facilitate participation by a broad range of data miners. These include:

  1. The text data is already marked up using state-of-the-art natural language processing techniques such as:
    1. Part-of-speech tagging
    2. Syntactic parse-tree tagging
    3. Semantic role labeling (of predicate arguments)
    4. Word sense disambiguation (against WordNet and UMLS)
  2. The main sources of biomedical research domain information will be provided in an accessible format:
    1. NCBI (organism data)
    2. UniProt (protein data)

Proposed deadlines

  • Feb 1, 2008, Call for participation announced
  • Feb 1, 2008, Registration for KDD Cup opens
  • Apr 1, 2008, Training data released
  • May 1, 2008, Test data released
  • May 15, 2008, Submissions closed
  • Jul 1, 2008, Notification of competition results
  • Jul 20, 2008, Winner (draft) papers due
  • Jul 25, 2008, Feedback to authors, if any
  • Jul 31, 2008, Final camera-ready paper due
  • Aug 6, 2008, Presentation slides due
  • Aug 24, 2008, KDDCup08/IE Workshop
  • Aug 24, 2008, Awards presented


Gabor Melli

  • email: kddcup08@gabormelli.com
  • address: Simon Fraser University, 8888 University Drive, Burnaby, BC, CANADA, V5A1S6
  • phone: 206-369-3582
  • bio: Gabor Melli is president of PredictionWorks, a data mining consulting company, and is a PhD candidate at Simon Fraser University, where his research is on information extraction. Gabor is an active member of the data mining community. He is the Information Director of ACM's Special Interest Group on Knowledge Discovery and Data Mining and has been a member of the KDD conference's program committee since 2004. He has published several research articles, and most recently led SFU's team during the Document Understanding Conference tasks.

Marko Grobelnik

  •  ?

Sunita Sarawagi

  •  ?

Lyle H. Ungar

  •  ?


Fiona Brinkman

  • Fiona Brinkman is an Associate Professor in the Department of Molecular Biology and Biochemistry at Simon Fraser University.

Amos Bairoch?

William Cohen?

Draft Submission: Workshop on Next-Generation Information Extraction

Given the recent successful combination of a workshop with KDDCup-07, we are considering holding a workshop on information extraction to run concurrently with presentations about the approaches used by the winning participants.

    • To be developed


  • As an alternative to a workshop, there may be an opportunity to join forces with the BioKDD workshop.