- (Belkin & Croft, 1992) ⇒ Nicholas J. Belkin, and W. Bruce Croft. (1992). “Information Filtering and Information Retrieval: Two Sides of the Same Coin?” In: Communications of the ACM Journal, 35(12). doi:10.1145/138859.138861
Information filtering systems are designed for unstructured or semistructured data, as opposed to database applications, which use very structured data. The systems also deal primarily with textual information, but they may also entail images, voice, video or other data types that are part of multimedia information systems. Information filtering systems also involve a large amount of data and streams of incoming data, whether broadcast from a remote source or sent directly by other sources. Filtering is based on descriptions of individual or group information preferences, or profiles, that typically represent long-term interests. Filtering also implies removal of data from an incoming stream rather than finding data in the stream; users see only the data that is extracted. Models of information retrieval and filtering, and lessons for filtering from retrieval research are presented.
Information filtering is a name used to describe a variety of processes involving the delivery of information to people who need it. Although this term is appearing quite often in popular and technical articles describing applications such as electronic mail, multimedia distributed systems, and electronic office documents, the distinction between filtering and related processes such as retrieval, routing, categorization, and extraction is often not clear. It is only by making that distinction, however, that the specific research issues associated with filtering can be identified and addressed.
• An information filtering system is an information system designed for unstructured or semistructured data. This contrasts with a typical database application that involves very structured data, such as employee records. The notion of structure being used here is not only that the data conforms to a format such as a record type description, but also that the fields of the records consist of simple data types with well-defined meanings. It is possible, for example, to define a database type for a complex document, such as a journal article, but the meaning of the text, figure and table components of that type are much less well-defined than a typical component of an employee record type, such as the salary. Email messages are an example of semistructured data in that they have well-defined header fields and an unstructured text body.
• Information filtering systems deal primarily with textual information. In fact, unstructured data is often used as a synonym for textual data. It is, however, more general than that and should include other types of data such as images, voice, and video that are part of multimedia information systems. None of these data types are handled well by conventional database systems, and all have meanings that are difficult to represent.
• Filtering applications typically involve streams of incoming data, either being broadcast by remote sources (such as newswire services), or sent directly by other sources (email). Filtering has also been used to describe the process of accessing and retrieving information from remote databases, in which case the incoming data is the result of the database searches. This scenario is also used by the developers of systems that generate "intelligent agents" for searching remote, heterogeneous databases.
• Filtering is often meant to imply the removal of data from an incoming stream, rather than finding data in that stream. In the first case, the users of the system see what is left after the data is removed; in the latter case, they see the data that is extracted. A common example of the first approach is an email filter designed to remove "junk" mail. Note that this means profiles may not only express what people want, but also what they do not want.
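As a rough illustration of the features listed above, the following Python sketch filters a stream of semistructured messages against a long-term profile. The `Message` and `Profile` classes, the example term sets, and the simple word-overlap test are all assumptions made for this example; the article does not prescribe any particular matching method.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, Set


@dataclass
class Message:
    # Semistructured: well-defined header fields plus an unstructured text body.
    sender: str
    subject: str
    body: str


@dataclass
class Profile:
    # Hypothetical long-term profile: terms the user wants and terms the user rejects.
    wanted: Set[str] = field(default_factory=set)
    unwanted: Set[str] = field(default_factory=set)


def terms(message: Message) -> Set[str]:
    """Lowercased words from subject and body (a deliberately naive representation)."""
    return set((message.subject + " " + message.body).lower().split())


def filter_by_removal(stream: Iterable[Message], profile: Profile) -> Iterator[Message]:
    """Removal view: drop messages that match unwanted terms and pass on what is left."""
    for message in stream:
        if not (terms(message) & profile.unwanted):
            yield message


def filter_by_selection(stream: Iterable[Message], profile: Profile) -> Iterator[Message]:
    """Selection view: pass on only the messages that match wanted terms."""
    for message in stream:
        if terms(message) & profile.wanted:
            yield message


stream = [
    Message("newswire", "Retrieval conference announced", "Papers on text retrieval are invited"),
    Message("promo", "Win a prize", "Claim your free lottery prize today"),
    Message("colleague", "Lunch", "Are you free at noon"),
]
profile = Profile(wanted={"retrieval"}, unwanted={"lottery"})

kept = list(filter_by_removal(stream, profile))        # everything except the junk message
selected = list(filter_by_selection(stream, profile))  # only the message of interest
```

The two functions show why a profile may need to record both what people want and what they do not want: the removal view and the selection view deliver different subsets of the same stream.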
This list of features suggests that information filtering is a well-defined and unique process. On closer examination, however, many of these features are virtually the same as those found in a variety of other text-based information systems. Text routing, for example, involves sending relevant incoming data to individuals or groups. This process is essentially identical to filtering. Categorization systems are designed to attach one or more predefined categories to incoming objects (this is done by newswire services, for example). The major difference from filtering in this case is the static nature of the categories, when compared to profiles. Extraction systems are somewhat different in that they emphasize the extraction of facts from the text of incoming objects, with the determination of which objects are relevant being a secondary issue. Information retrieval systems share many of the features of information filtering. Indeed, Selective Dissemination of Information (SDI), one of the original functions of information retrieval systems, appears to be identical to most information filtering applications.
A deeper understanding of the differences between filtering and other text-based processes, together with a definition of the research issues involved, requires a more detailed comparison. This comparison, which is the subject of this article, will be based on models of information retrieval developed over the past 20 years of research in this field. We will develop a similar model for information filtering, and compare these models to define research issues. By clarifying the similarities and differences between filtering and retrieval, developers of filtering systems should be able to benefit from the results obtained in related retrieval experiments.
Models of Information Retrieval and Filtering
General Concepts of Information Retrieval and Information Filtering
Information retrieval (IR) has been characterized in a variety of ways, ranging from a description of its goals, to relatively abstract models of its components and processes. Although not all of these characterizations have been in agreement with one another, they all tend to share some commonalities. Usually, an IR system is considered to have the function of "leading the user to those documents that will best enable him/her to satisfy his/her need for information". Somewhat more generally, "the goal of an information [retrieval] system is for the user to obtain information from the knowledge resource which helps her/him in problem management". Such functions, or goals, of IR have been described in models of the type shown in Figure 1. This model indicates basic entities and processes in the IR situation.
In this model, a person with some goals and intentions related to, for instance, a work task, finds that these goals cannot be attained because the person's resources or knowledge are somehow inadequate. A characteristic of such a "problematic situation" is an anomalous state of knowledge (ASK) or information need, which prompts the person to engage in active information-seeking behavior, such as submitting a query to an IR system. The query, which must be expressed in a language understood by the system, is a representation of the information need. This is shown on the right-hand side of Figure 1. Due to the inherent difficulty of representing ASKs, the query in an IR system is always regarded as approximate and imperfect.
On the other side of Figure 1, the focus of attention is the information resources that the user of the IR system will eventually access. Here, the model considers the producers or authors of texts*; the groupings of texts into collections (e.g., databases); the representation of texts; and, the organization of these representations into databases of text surrogates. The process of representing the meaning of texts in a form more amenable to processing by computer (sometimes called indexing) is of central importance in IR. A typical surrogate would consist of a set of index terms or keywords.
* We use text here as a general term that could also include multimedia objects.
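The indexing step described above, in which each text is represented by a surrogate made of index terms, can be sketched in Python as follows. The stopword list, the `index_text` and `rank` helpers, and the overlap-count score are assumptions chosen for brevity; they are not the representation or comparison technique of any particular IR system.

```python
from typing import Dict, List, Set

# Hypothetical, very small stopword list used only for this illustration.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is"}


def index_text(text: str) -> Set[str]:
    """Build a surrogate: the set of index terms left after removing stopwords."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def rank(query: str, surrogates: Dict[str, Set[str]]) -> List[str]:
    """Order text ids by the number of index terms they share with the query."""
    query_terms = index_text(query)  # the query is itself only an approximate representation
    scores = {tid: len(query_terms & s) for tid, s in surrogates.items()}
    matching = [tid for tid, score in scores.items() if score > 0]
    # Highest overlap first; ties are broken by id so the order is deterministic.
    return sorted(matching, key=lambda tid: (-scores[tid], tid))


collection = {
    "t1": "Indexing of text for retrieval.",
    "t2": "Employee salary records in a database.",
    "t3": "Retrieval and filtering of text streams.",
}
surrogates = {tid: index_text(text) for tid, text in collection.items()}
ranking = rank("filtering text retrieval", surrogates)
```

Here the query and the texts are reduced to the same kind of representation, so the comparison operates on surrogates rather than on the texts themselves.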
The comparison of a query and surrogates, or, in some cases, direct interaction between the user and the ...