Document Representation


A Document Representation is a knowledge representation of a document.




  • (W3C, 2007A) ⇒ "5 HTML Document Representation" Retrieved: 2017-07-01
    • QUOTE: In this chapter, we discuss how HTML documents are represented on a computer and over the Internet.

      The section on the document character set addresses the issue of what abstract characters may be part of an HTML document. Characters include the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.

      The section on character encodings addresses the issue of how those characters may be represented in a file or when transferred over the Internet. As some character encodings cannot directly represent all characters an author may want to include in a document, HTML offers other mechanisms, called character references, for referring to any character.

      Since there are a great number of characters throughout human languages, and a great variety of ways to represent those characters, proper care must be taken so that documents may be understood by user agents around the world.
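The character-reference mechanism described above can be illustrated with a short sketch (my own example, not from the W3C text): an ASCII-only encoding cannot represent the Chinese character meaning "water" directly, but a numeric character reference can stand in for it.

```python
# Sketch: representing a character outside the file's encoding via a
# numeric character reference, then recovering it as a user agent would.
import html

water = "\u6c34"  # the Chinese character meaning "water" (U+6C34)

# ASCII cannot encode this character directly; the xmlcharrefreplace
# error handler substitutes a numeric character reference instead.
ascii_doc = water.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(ascii_doc)                           # &#27700;

# Decoding the reference recovers the original abstract character.
print(html.unescape(ascii_doc) == water)   # True
```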


  • (Cambridge University Press, 2009) ⇒ "Document representations and measures of relatedness in vector spaces" Updated: 2009-04-07
    • QUOTE: As in Chapter 6, we represent documents as vectors in [math]\mathbb{R}^{\vert V\vert}[/math] in this chapter. To illustrate properties of document vectors in vector classification, we will render these vectors as points in a plane as in the example in Figure 14.1. In reality, document vectors are length-normalized unit vectors that point to the surface of a hypersphere. We can view the 2D planes in our figures as projections onto a plane of the surface of a (hyper-)sphere as shown in Figure 14.2. Distances on the surface of the sphere and on the projection plane are approximately the same as long as we restrict ourselves to small areas of the surface and choose an appropriate projection (Exercise 14.1).

      Decisions of many vector space classifiers are based on a notion of distance, e.g., when computing the nearest neighbors in kNN classification. We will use Euclidean distance in this chapter as the underlying distance measure. We observed earlier (Exercise 6.4.4, page [*]) that there is a direct correspondence between cosine similarity and Euclidean distance for length-normalized vectors. In vector space classification, it rarely matters whether the relatedness of two documents is expressed in terms of similarity or distance.

      However, in addition to documents, centroids or averages of vectors also play an important role in vector space classification. Centroids are not length-normalized. For unnormalized vectors, dot product, cosine similarity and Euclidean distance all have different behavior in general (Exercise 14.8). We will be mostly concerned with small local regions when computing the similarity between a document and a centroid, and the smaller the region the more similar the behavior of the three measures is.
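The correspondence the excerpt mentions can be checked numerically: for length-normalized vectors, ||x − y||² = 2 − 2·cos(x, y), so cosine similarity and Euclidean distance induce the same ordering. A minimal sketch (the vector values are made up for illustration):

```python
# Sketch: cosine similarity vs. Euclidean distance for unit vectors,
# and the fact that a centroid of unit vectors is not unit-length.
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d1 = normalize([3.0, 1.0, 0.0])   # two length-normalized "document vectors"
d2 = normalize([2.0, 2.0, 1.0])

cos = dot(d1, d2)                 # cosine similarity (dot product of unit vectors)
dist = euclidean(d1, d2)
print(abs(dist ** 2 - (2 - 2 * cos)) < 1e-12)   # True: the identity holds

# The centroid of the two unit vectors lies inside the hypersphere:
centroid = [(a + b) / 2 for a, b in zip(d1, d2)]
print(math.sqrt(dot(centroid, centroid)))        # < 1.0 unless d1 == d2
```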


  • (Arguello et al., 2008) ⇒ Arguello, J., Elsas, J. L., Callan, J., & Carbonell, J. G. (2008). Document Representation and Query Expansion Models for Blog Recommendation. ICWSM, 2008(0), 1.
    • ABSTRACT: We explore several different document representation models and two query expansion models for the task of recommending blogs to a user in response to a query. Blog relevance ranking differs from traditional document ranking in ad-hoc information retrieval in several ways: (1) the unit of output (the blog) is composed of a collection of documents (the blog posts) rather than a single document, (2) the query represents an ongoing – and typically multifaceted – interest in the topic rather than a passing ad-hoc information need and (3) due to the propensity of spam, splogs, and tangential comments, the blogosphere is particularly challenging to use as a source for high-quality query expansion terms. We address these differences at the document representation level, by comparing retrieval models that view either the blog or its constituent posts as the atomic units of retrieval, and at the query expansion level, by making novel use of the links and anchor text in Wikipedia to expand a user’s initial query. We develop two complementary models of blog retrieval that perform at comparable levels of precision and recall. We also show consistent and significant improvement across all models using our Wikipedia expansion strategy.
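The representation-level contrast in the abstract — treating the whole blog or its individual posts as the atomic unit of retrieval — can be sketched with a toy example. The scoring function, aggregation, and data below are hypothetical illustrations of the general idea, not the paper's actual models:

```python
# Toy sketch: blog-level vs. post-level retrieval for a blog made of posts.

def score(text, query):
    """Toy relevance score: total frequency of query terms in the text."""
    words = text.lower().split()
    return sum(words.count(q) for q in query.lower().split())

blog_posts = [
    "sourdough starter feeding schedule",
    "my sourdough bread failed again",
    "travel notes from lisbon",
]
query = "sourdough bread"

# Blog-level model: concatenate all posts into one atomic document.
blog_score = score(" ".join(blog_posts), query)

# Post-level model: score posts individually, then aggregate (here: max).
post_score = max(score(p, query) for p in blog_posts)

print(blog_score, post_score)   # 3 2
```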


  • (Ranzato & Szummer, 2008) ⇒ Ranzato, M. A., & Szummer, M. (2008, July). Semi-supervised learning of compact document representations with deep networks. In: Proceedings of the 25th International Conference on Machine Learning (pp. 792-799). ACM.
    • ABSTRACT: Finding good representations of text documents is crucial in information retrieval and classification systems. Today the most popular document representation is based on a vector of word counts in the document. This representation neither captures dependencies between related words, nor handles synonyms or polysemous words. In this paper, we propose an algorithm to learn text document representations based on semi-supervised autoencoders that are stacked to form a deep network. The model can be trained efficiently on partially labeled corpora, producing very compact representations of documents, while retaining as much class information and joint word statistics as possible. We show that it is advantageous to exploit even a few labeled samples during training.
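The word-count representation the abstract criticizes, and its blindness to synonyms, can be shown in a few lines (a generic bag-of-words sketch with made-up documents, not the paper's code):

```python
# Sketch: word-count ("bag of words") vectors. Synonyms like "movie" and
# "film" occupy different coordinates, so the representation cannot see
# that the first two documents are near-paraphrases.
from collections import Counter

docs = ["the movie was great", "the film was great", "stock prices fell"]
vocab = sorted({w for d in docs for w in d.split()})

def count_vector(doc):
    c = Counter(doc.split())
    return [c[w] for w in vocab]

vecs = [count_vector(d) for d in docs]
for d, v in zip(docs, vecs):
    print(d, v)
```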



  • (Strzalkowski, 1994) ⇒ Strzalkowski, T. (1994, March). Document representation in natural language text retrieval. In: Proceedings of the workshop on Human Language Technology (pp. 364-369). Association for Computational Linguistics.
    • ABSTRACT: In information retrieval, the content of a document may be represented as a collection of terms: words, stems, phrases, or other units derived or inferred from the text of the document. These terms are usually weighted to indicate their importance within the document which can then be viewed as a vector in a N-dimensional space. In this paper we demonstrate that a proper term weighting is at least as important as their selection, and that different types of terms (e.g., words, phrases, names), and terms derived by different means (e.g., statistical, linguistic) must be treated differently for a maximum benefit in retrieval. We report some observations made during and after the second Text REtrieval Conference (TREC-2).
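The abstract's point that weighting matters as much as term selection can be illustrated with one widely used scheme, tf-idf (a generic sketch with made-up documents; the paper's own weighting schemes are not reproduced here):

```python
# Sketch: tf-idf weighting. A term appearing in every document carries no
# discriminating information and receives weight 0, regardless of how
# often it occurs in the document itself.
import math
from collections import Counter

docs = [
    "retrieval of text documents",
    "text classification of documents",
    "weighting of terms in retrieval",
]
N = len(docs)
# Document frequency: number of documents each term appears in.
df = Counter(w for d in docs for w in set(d.split()))

def tfidf(doc):
    tf = Counter(doc.split())
    return {w: tf[w] * math.log(N / df[w]) for w in tf}

weights = tfidf(docs[0])
# "of" occurs in every document, so its idf (and weight) is 0;
# the remaining, more distinctive terms get positive weight.
print(weights)
```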