- (Brin, 1998) ⇒ S. Brin. (1998). “Extracting patterns and relations from the World-Wide Web.” In: Proceedings of the 1998 International Workshop on Web and Databases (WebDB’98).
- (Xia, 2006) ⇒ L. Xia. (2006). “Adaptive Relationship Extraction by Machine Learning.” Master's Thesis, Sheffield University.
- 2.3.4 DIPRE
- DIPRE (Dual Iterative Pattern Expansion), proposed by Brin (1998), is a semi-supervised relation extraction system that aims to discover relationships published on the World-Wide Web, such as books and their authors (<author, title>). The approach starts from a handful of seed pairs of the given relation. It then searches a vast number of web pages for extraction patterns in which the seed pairs appear. It uses these learned extraction patterns to discover more examples, and the process can be repeated. As a semi-supervised approach, it finds extraction patterns without any annotated training data. The algorithm can be described in five steps as follows:
1. R0 ← Seed
A small set of trusted relation instances R0 representing the target relation is provided
by a human. In Brin’s experiment, only five relations consisting of books with authors were provided as seeds.
2. O ← Occurrences (R0, D)
The relation tuples R0 from step 1 are searched for in D, where D is the collection of
all web pages.
3. P ← GeneratePatterns (O)
Generalized patterns are formed from the set of occurrences identified in step 2. Brin
notes that over-generalization could result in a large number of bad patterns being
generated at this stage. He also notes that the higher the coverage of the patterns, the
better the result.
4. Use the patterns generated in step 3 to extract more relations from D.
5. If no new relations can be learned from D, stop. Otherwise, go to step 2.
The algorithm is illustrated in Figure 2.3.1 1.
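The five steps above can be sketched as a small loop. The following is a minimal toy illustration, not Brin's implementation: the in-memory `docs` dictionary stands in for the web collection D, and the function names, the 10-character context window, the fixed "title then author" order, and the rule that a pattern needs at least two supporting occurrences are all assumptions made for this example.

```python
import re
from os.path import commonprefix  # character-wise common prefix of strings


def find_occurrences(relations, docs, ctx=10):
    """Step 2: locate each known (author, title) pair in the documents."""
    occurrences = []
    for url, text in docs.items():
        for author, title in relations:
            # Assumption: the title precedes the author ("title by author").
            regex = re.escape(title) + r"(.{1,20}?)" + re.escape(author)
            for m in re.finditer(regex, text):
                prefix = text[max(0, m.start() - ctx):m.start()]
                suffix = text[m.end():m.end() + ctx]
                occurrences.append((url, prefix, m.group(1), suffix))
    return occurrences


def generate_patterns(occurrences):
    """Step 3: generalize occurrences that share the same middle string."""
    groups = {}
    for url, prefix, middle, suffix in occurrences:
        groups.setdefault(middle, []).append((url, prefix, suffix))
    patterns = []
    for middle, group in groups.items():
        urls, prefixes, suffixes = zip(*group)
        # Longest common suffix of the prefixes, via reversed strings.
        prefix = commonprefix([p[::-1] for p in prefixes])[::-1]
        suffix = commonprefix(list(suffixes))
        # Assumed guard against over-general patterns: require support
        # from two occurrences and non-empty context on both sides.
        if len(group) < 2 or not prefix or not suffix:
            continue
        patterns.append((commonprefix(list(urls)), prefix, middle, suffix))
    return patterns


def extract(patterns, docs):
    """Step 4: apply each (URL prefix, text pattern) to find new pairs."""
    found = set()
    for url_prefix, prefix, middle, suffix in patterns:
        regex = (re.escape(prefix) + r"(.+?)" + re.escape(middle)
                 + r"(.+?)" + re.escape(suffix))
        for url, text in docs.items():
            if url.startswith(url_prefix):
                for title, author in re.findall(regex, text):
                    found.add((author, title))
    return found


def dipre(seeds, docs, max_iters=5):
    """Steps 1-5: iterate until no new relation instances are learned."""
    relations = set(seeds)
    for _ in range(max_iters):
        occurrences = find_occurrences(relations, docs)
        patterns = generate_patterns(occurrences)
        new = extract(patterns, docs) - relations
        if not new:
            break
        relations |= new
    return relations


# Hypothetical toy corpus; URLs and page contents are invented.
docs = {
    "http://example.org/books/a.html":
        "<li>Dune by Frank Herbert (1965)</li>"
        "<li>Solaris by Stanislaw Lem (1961)</li>",
    "http://example.org/books/b.html":
        "<li>Neuromancer by William Gibson (1984)</li>"
        "<li>Ubik by Philip K. Dick (1969)</li>",
    "http://other.example.com/x.html":
        "<li>Review by Some Blogger (1999)</li>",
}
seeds = {("Frank Herbert", "Dune"), ("William Gibson", "Neuromancer")}
books = dipre(seeds, docs)
print(sorted(books))
```

On this toy corpus the two seed pairs yield one pattern, which recovers the two unseeded pairs on the same site; the page on the other host matches the text pattern but is excluded by the URL prefix. The second iteration learns nothing new, so the loop stops.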
- Chapter 2 Literature Review
- Figure 2.3.1 1 DIPRE 
- "Pattern generation is the most critical step in DIPRE, as it has a direct impact on the quality of the extraction. DIPRE defines a pattern in two parts, the ‘URL pattern’ and the ‘text pattern’; for example, “www.sff.net/locus/c.*” is a URL pattern and “<li>title by author (” is a text pattern. Brin points out that step 3 is the weakest link in the algorithm: bogus patterns may be generated there and subsequently degrade the quality of the next iteration. Step 3 must therefore be carefully designed to minimize this degradation. However, this problem is not addressed in Brin’s paper or in his implementation. DIPRE was evaluated on a total of 24 million web pages