2006 AdaptiveREbyML thesis
- (Xia, 2006) ⇒ Lei Xia. (2006). “Adaptive Relationship Extraction by Machine Learning.” Master's Thesis, University of Sheffield.
- "A large amount of information is available on the web and in newspapers. It is very valuable to have a system that can automatically extract the information a user is interested in and store it in a structured database for data mining. This thesis investigates adaptive methods for the relationship extraction task from natural language text. A weakly supervised method called Snowball, proposed by Agichtein, is selected for implementation in this thesis. The implemented system extracts relationships from natural language text chosen from a collection of newswires. The approach implemented in the project addresses problems such as the sharp degradation of performance in weakly supervised information extraction approaches and the difficulty of porting supervised systems to different extraction domains."
- "This project includes designing, implementing, and testing the Snowball approach for extracting the Organization-Headquarters relationship from a collection of newspapers. The thesis defines the relationship extraction task at the single-sentence level only. Our report includes a literature review of a number of successful relation extraction approaches, which helps to understand the task and justify our selection of algorithm. The algorithm, system design decisions, and implementation are presented in detail. The performance of the system is measured in terms of precision, recall and F-measure. Finally, a conclusion of this project is presented with a discussion of future work."
- DIPRE (Dual Iterative Pattern Relation Expansion), proposed by Brin, is a semi-supervised relation extraction system that aims to discover relationships, such as books and their authors (<author, title>), published on the World Wide Web. The approach starts from a handful of seed pairs of the given relation. It then searches a vast number of web pages for extraction patterns in which the seed pairs appear. It uses these learned extraction patterns to discover more examples, and the process can be repeated. As a semi-supervised approach, it finds extraction patterns without any annotated training data. The algorithm can be described in five steps as follows:
1. R0 ← Seed
A small set of trusted relation instances R0 representing the target relation is provided
by a human. In Brin’s experiment, only five relations, each consisting of a book with its author, were provided.
2. O ← Occurrences(R0, D)
The relation tuples R0 from step 1 are searched for in D, where D is the collection of
all web pages.
3. P ← GeneratePatterns(O)
Form generalized patterns based on the set of occurrences identified in step 2. Brin
notes that overgeneralization could result in a large number of bad patterns being
generated at this stage. He also notes that the higher the coverage of the patterns, the
better the result.
4. Use the patterns generated in step 3 to extract more relations from D; these relations become the new seeds.
5. If no more new relations can be learned from D, stop. Otherwise, go to step 2.
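The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Brin's implementation: the corpus is a small in-memory list of strings rather than web pages, a pattern is reduced to a hypothetical (prefix, middle, suffix) triple of literal text with no URL component, and the helper names (`find_occurrences`, `generate_patterns`, `apply_patterns`, `dipre`) and the `ctx` and `min_support` parameters are invented for this example.

```python
import re
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]          # (author, title)
Pattern = Tuple[str, str, str]  # (prefix, middle, suffix); simplified, no URL part


def common_prefix(strings: List[str]) -> str:
    """Longest string that every element starts with."""
    prefix = strings[0]
    for s in strings[1:]:
        while not s.startswith(prefix):
            prefix = prefix[:-1]
    return prefix


def common_suffix(strings: List[str]) -> str:
    """Longest string that every element ends with."""
    return common_prefix([s[::-1] for s in strings])[::-1]


def find_occurrences(pairs: Iterable[Pair], docs: List[str], ctx: int = 4) -> List[Pattern]:
    """Step 2: find 'title ... author' in the corpus and record the surrounding text."""
    occurrences = []
    for author, title in pairs:
        rx = re.compile(
            r"(.{0,%d})" % ctx + re.escape(title) + r"(.*?)"
            + re.escape(author) + r"(.{0,%d})" % ctx
        )
        for doc in docs:
            occurrences.extend(rx.findall(doc))
    return occurrences


def generate_patterns(occurrences: List[Pattern], min_support: int = 2) -> Set[Pattern]:
    """Step 3: group occurrences by middle text and keep the context they share."""
    patterns = set()
    for middle in {occ[1] for occ in occurrences}:
        group = [occ for occ in occurrences if occ[1] == middle]
        if len(group) < min_support:
            continue  # too little evidence to generalize
        prefix = common_suffix([occ[0] for occ in group])
        suffix = common_prefix([occ[2] for occ in group])
        if prefix and suffix:  # guard against overly general patterns
            patterns.add((prefix, middle, suffix))
    return patterns


def apply_patterns(patterns: Set[Pattern], docs: List[str]) -> Set[Pair]:
    """Step 4: use the patterns to extract (author, title) pairs from the corpus."""
    found = set()
    for prefix, middle, suffix in patterns:
        rx = re.compile(
            re.escape(prefix) + r"(.+?)" + re.escape(middle) + r"(.+?)" + re.escape(suffix)
        )
        for doc in docs:
            for title, author in rx.findall(doc):
                found.add((author, title))
    return found


def dipre(seeds: Set[Pair], docs: List[str], max_iter: int = 5) -> Set[Pair]:
    relations = set(seeds)                                # step 1
    for _ in range(max_iter):
        occurrences = find_occurrences(relations, docs)   # step 2
        patterns = generate_patterns(occurrences)         # step 3
        new = apply_patterns(patterns, docs) - relations  # step 4
        if not new:                                       # step 5
            break
        relations |= new
    return relations


# Toy corpus and seeds, invented for illustration only.
docs = [
    "<li>Foundation by Isaac Asimov (1951)</li>",
    "<li>American Gods by Neil Gaiman (2001)</li>",
    "<li>Neuromancer by William Gibson (1984)</li>",
    "<li>Dune by Frank Herbert (1965)</li>",
    "Hyperion was written by Dan Simmons.",
]
seeds = {("Isaac Asimov", "Foundation"), ("Neil Gaiman", "American Gods")}
result = dipre(seeds, docs)
print(sorted(result))
```

On this toy corpus the two seeds yield a single text pattern, and the last document is not matched because it lacks the learned context. The `min_support` threshold and the non-empty prefix and suffix check are one simple guard against the overgeneralization risk noted in step 3; a real system would need a more careful pattern-quality test.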
The algorithm is illustrated in Figure 2.3.1 1.
Chapter 2 Literature Review
- Figure 2.3.1 1 DIPRE 
- "The pattern generation is the most critical step in DIPRE, which has a direct impact on the quality of the extraction. DIPRE defines a pattern as two parts, the ‘URL pattern’ and the ‘text pattern’; for example, “www.sff.net/locus/c.*” is a URL pattern and “<li>title by author (” is a text pattern. Brin points out that step 3 is the weakest link in this algorithm, where bogus patterns may be generated and subsequently affect the quality of the next iteration. Therefore, step 3 must be carefully designed to minimize the amount of performance degradation. However, this problem is not addressed in Brin’s paper or in his implementation. DIPRE was evaluated on a total of 24 million web pages
|2006 AdaptiveREbyML thesis||Lei Xia||Adaptive Relationship Extraction by Machine Learning||http://www.dcs.shef.ac.uk/intranet/teaching/projects/archive/msc2006/abs/acp05lx.htm||2006|