Google Research's WikiLinks Dataset

From GM-RKB

Google Research's WikiLinks Dataset is an Annotated Dataset of in-links to Wikipedia Entity Pages.



References

2013

  • (Orr, Subramanya & Pereira, 2013) ⇒ Dave Orr, Amar Subramanya, and Fernando Pereira. (2013). “Learning from Big Data: 40 Million Entities in Context.” In: Google Research Blog.
    • When someone mentions Mercury, are they talking about the planet, the god, the car, the element, Freddie, or one of some 89 other possibilities? This problem is called disambiguation (a word that is itself ambiguous), and while it’s necessary for communication, and humans are amazingly good at it (when was the last time you confused a fruit with a giant tech company?), computers need help.

      To provide that help, we are releasing the Wikilinks Corpus: 40 million total disambiguated mentions within over 10 million web pages -- over 100 times bigger than the next largest corpus (about 100,000 documents, see the table below for mention and entity counts). The mentions are found by looking for links to Wikipedia pages where the anchor text of the link closely matches the title of the target Wikipedia page. If we think of each page on Wikipedia as an entity (an idea we’ve discussed before), then the anchor text can be thought of as a mention of the corresponding entity.
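The quoted passage says a mention is a link to a Wikipedia page whose anchor text closely matches the target page title. The following is a minimal sketch of that idea only. The source does not say how "closely matches" was computed, so the normalization (lower-casing, dropping a trailing parenthetical such as "(planet)"), the use of `difflib.SequenceMatcher`, the 0.8 threshold, and the function names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from urllib.parse import unquote, urlparse


def title_from_url(url):
    """Return the page title encoded in a /wiki/ URL, or None if there is none."""
    path = urlparse(url).path
    if "/wiki/" not in path:
        return None
    title = unquote(path.split("/wiki/", 1)[1])
    return title.replace("_", " ")


def normalize(text):
    """Lower-case, drop a trailing parenthetical, and collapse whitespace."""
    text = text.split(" (", 1)[0]
    return " ".join(text.lower().split())


def is_mention(anchor, url, threshold=0.8):
    """Hypothetical test: does the anchor text closely match the page title?"""
    title = title_from_url(url)
    if title is None:
        return False
    similarity = SequenceMatcher(None, normalize(anchor), normalize(title)).ratio()
    return similarity >= threshold


print(is_mention("Mercury", "https://en.wikipedia.org/wiki/Mercury_(planet)"))
print(is_mention("click here", "https://en.wikipedia.org/wiki/Mercury_(element)"))
```

Under these assumptions, the anchor "Mercury" counts as a mention of the entity behind `Mercury_(planet)`, while a generic anchor such as "click here" does not.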


2012


$ awk '{print $1}' data-00003-of-00010 | sort | uniq -c
 2,175,992
 4,049,295 MENTION
10,911,908 TOKEN
 1,087,996 URL
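The tally above counts lines by their first field, with blank lines forming the unlabeled first row. A rough Python equivalent of the `awk | sort | uniq -c` pipeline is sketched below. The function name and the sample lines are hypothetical. Only the record labels (URL, MENTION, TOKEN, and blank) come from the output above, and the field layout within each sample line is an assumption.

```python
from collections import Counter


def count_record_types(lines):
    """Count lines by their first whitespace-separated field ('' for blank lines)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        counts[fields[0] if fields else ""] += 1
    return counts


# Illustrative lines only; not taken from the real data file.
sample = [
    "URL\thttp://example.com/a",
    "MENTION\tMercury\t12\thttp://en.wikipedia.org/wiki/Mercury_(planet)",
    "TOKEN\tplanet\t40",
    "",
    "",
]

for label, n in sorted(count_record_types(sample).items()):
    print(f"{n:>10,} {label}")
```

To run it on a shard, pass an open file handle instead of the sample list, for example `count_record_types(open("data-00003-of-00010", encoding="utf-8"))`. In the counts above, the blank-line total is exactly twice the URL total (2 × 1,087,996 = 2,175,992), which would be consistent with two blank lines per URL record, although the source does not state this.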