2008 CreatingRelationDataFromUnstrData

From GM-RKB

Subject Headings: Entity Mention Normalization Algorithm

Notes

Cited By

Quotes

Abstract

In order for agents to act on behalf of users, they will have to retrieve and integrate vast amounts of textual data on the World Wide Web. However, much of the useful data on the Web is neither grammatical nor formally structured, making querying difficult. Examples of these types of data sources are online classifieds like Craigslist and auction item listings like eBay. We call this unstructured, ungrammatical data “posts.” The unstructured nature of posts makes query and integration difficult because the attributes are embedded within the text. Also, these attributes do not conform to standardized values, which prevents queries based on a common attribute value. The schema is unknown and the values may vary dramatically, making accurate search difficult. Creating relational data for easy querying requires that we define a schema for the embedded attributes and extract values from the posts while standardizing these values. Traditional information extraction (IE) is inadequate to perform this task because it relies on clues from the data, such as structure or natural language, neither of which are found in posts. Furthermore, traditional information extraction does not incorporate data cleaning, which is necessary to accurately query and integrate the source. The two-step approach described in this paper creates relational data sets from unstructured and ungrammatical text by addressing both issues. To do this, we require a set of known entities called a “reference set.” The first step aligns each post to each member of each reference set. This allows our algorithm to define a schema over the post and include standard values for the attributes defined by this schema. The second step performs information extraction for the attributes, including attributes not easily represented by reference sets, such as a price. In this manner we create a relational structure over previously unstructured data, supporting deep and accurate queries over the data as well as standard values for integration. Our experimental results show that our technique matches the posts to the reference set accurately and efficiently and outperforms state-of-the-art extraction systems on the extraction task from posts.
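To make the two-step approach concrete, here is a minimal Python sketch, not the paper's actual Phoebus implementation: step one aligns a post to its best-matching reference-set record, and step two uses that record's clean attribute values to label tokens in the post. The toy reference set, the single difflib similarity measure, and the 0.8 threshold are all illustrative assumptions; Phoebus itself learns a matching function over many field- and record-level similarity features.

```python
from difflib import SequenceMatcher

# Toy reference set of clean (make, model) tuples -- hypothetical data.
REFERENCE_SET = [
    {"make": "Honda", "model": "Civic"},
    {"make": "Ford", "model": "Focus"},
]

def sim(a: str, b: str) -> float:
    """One simple string similarity; Phoebus combines many such measures."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_post(post: str) -> dict:
    """Step 1 (record linkage): align the post to the best reference record."""
    return max(REFERENCE_SET,
               key=lambda rec: sum(sim(post, v) for v in rec.values()))

def extract(post: str, record: dict) -> dict:
    """Step 2 (extraction): label post tokens using the matched record's
    standard values, yielding normalized attribute values as a side effect."""
    labels = {}
    for attr, value in record.items():
        for token in post.split():
            if sim(token, value) > 0.8:          # threshold is an assumption
                labels[attr] = (token, value)    # raw token, clean value
    return labels

post = "93 hnda civic runs great $1800 obo"
rec = match_post(post)
print(rec, extract(post, rec))   # matches Honda Civic despite the typo
```

Note that the price ($1800) is untouched here; as the abstract notes, attributes like price are not easily represented by reference sets and are handled by the extraction step proper.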

6. Related Work

Our research is driven by the principle that the cost of annotating documents for the Semantic Web should be free, that is, automatic and invisible to users (Hendler, 2001). Many researchers have followed this path, attempting to automatically mark up documents for the Semantic Web, as proposed here (Cimiano, Handschuh, & Staab, 2004; Dingli, Ciravegna, & Wilks, 2003; Handschuh, Staab, & Ciravegna, 2002; Vargas-Vera, Motta, Domingue, Lanzoni, Stutt, & Ciravegna, 2002). However, these systems rely on lexical information, such as part-of-speech tagging or shallow natural language processing, to do their extraction/annotation (e.g., Amilcare; Ciravegna, 2001). This is not an option when the data is ungrammatical, like the post data. In a similar vein, systems such as ADEL (Lerman, Gazen, Minton, & Knoblock, 2004) rely on structure to identify and annotate records in Web pages. Again, the failure of posts to exhibit structure makes this approach inappropriate. So, while there is a fair amount of work on automatic labeling, there is little emphasis on techniques that can label text that is both unstructured and ungrammatical.

Although the idea of record linkage is not new (Fellegi & Sunter, 1969) and is well studied even now (Bilenko & Mooney, 2003), most current research focuses on matching one set of records to another set of records based on their decomposed attributes. There is little work on matching data sets where one record is a single string composed of the other data set’s attributes, as is the case with posts and reference sets. The WHIRL system (Cohen, 2000) allows for record linkage without decomposed attributes, but as shown in Section 4.1, Phoebus outperforms WHIRL, since WHIRL relies solely on the vector-based cosine similarity between the attributes, while Phoebus exploits a larger set of features to represent both field- and record-level similarity. We note with interest the EROCS system (Chakaravarthy, Gupta, Roy, & Mohania, 2006), in which the authors tackle the problem of linking full-text documents with relational databases. The technique involves filtering out all non-nouns from the text and then finding the matches in the database. This is an intriguing approach; interesting future work would involve performing a similar filtering for larger documents and then applying the Phoebus algorithm to match the remaining nouns to reference sets.
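The contrast with WHIRL can be sketched in a few lines of Python. WHIRL's match score is a single vector-space cosine between the post and a reference tuple, whereas a Phoebus-style matcher assembles a richer feature vector, record-level plus per-field similarities, for a learned classifier (an SVM, in the paper) to weigh. The plain token-count cosine below is a simplification; the real systems use TF/IDF weighting and many more similarity measures.

```python
import math
from collections import Counter

def cosine(a_tokens: list, b_tokens: list) -> float:
    """Vector-space cosine over raw token counts (a WHIRL-style score)."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def feature_vector(post: str, record: dict) -> list:
    """Phoebus-style representation: one record-level score plus one score
    per reference-set field, so a classifier can weigh fields differently."""
    post_toks = post.lower().split()
    feats = [cosine(post_toks, " ".join(record.values()).lower().split())]
    for value in record.values():
        feats.append(cosine(post_toks, value.lower().split()))
    return feats
```

A cosine-only matcher must rank on the first feature alone; a learned matcher can, for example, trust a strong model-name match even when the rest of the post is noise.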

Using the reference set’s attributes as normalized values is similar to the idea of data cleaning. However, most data cleaning algorithms assume tuple-to-tuple transformations (Lee et al., 1999; Chaudhuri et al., 2003). That is, some function maps the attributes of one tuple to the attributes of another. This approach would not work on ungrammatical and unstructured data, where all attributes are embedded within the post, which maps to a set of attributes from the reference set. Although our work describes a technique for information extraction, many IE methods, such as Conditional Random Fields (CRFs), assume at least some structure in the extracted attributes to do the extraction. As our extraction experiments show, Phoebus outperforms such methods, including the Simple Tagger implementation of Conditional Random Fields (McCallum, 2002). Other IE approaches, such as Datamold (Borkar, Deshmukh, & Sarawagi, 2001) and CRAM (Agichtein & Ganti, 2004), segment whole records (like bibliographies) into attributes, with little structural assumption. In fact, CRAM even uses reference sets to aid its extraction. However, both systems require that every token of a record receive a label, which is not possible with posts, which are filled with irrelevant, “junk” tokens. Along the lines of CRAM and Datamold, the work of Bellare and McCallum (2007) uses a reference set to train a CRF to extract data, which is similar to our PhoebusCRF implementation. However, there are two differences between PhoebusCRF and their approach. First, Bellare and McCallum (2007) mention that reference-set records are matched using simple heuristics, but it is unclear how this is done. In our work, matching is done explicitly and accurately through record linkage. Second, their work only uses the records from the reference set to label tokens for training an extraction module, while PhoebusCRF uses the actual values from the matching reference-set record to produce useful features for extraction and annotation.
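The extra signal that PhoebusCRF-style extraction gets from a matched reference-set record can be sketched as per-token features. The feature names and the example record below are hypothetical, and the resulting feature dictionaries would be handed to whatever sequence labeler (e.g., a CRF toolkit) one uses.

```python
def token_features(post: str, matched_record: dict) -> list:
    """Per-token feature dicts: generic lexical cues plus membership tests
    against the *values* of the matched reference-set record."""
    features = []
    for tok in post.lower().split():
        f = {"token": tok, "is_digit": tok.isdigit(), "has_dollar": "$" in tok}
        for attr, value in matched_record.items():
            # hypothetical feature: does this token occur in the clean value?
            f["in_" + attr] = tok in value.lower().split()
        features.append(f)
    return features

# e.g. token_features("$120 nite at the westin near lax",
#                     {"hotel_name": "Westin", "area": "LAX"})
```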

Another IE approach similar to ours performs named entity recognition using “Semi-CRFs” with a dictionary component (Cohen & Sarawagi, 2004), which functions like a reference set. However, in their work the dictionaries are defined as lists of single-attribute entities, so finding an entity in the dictionary is a look-up task. Our reference sets are relational data, so finding the match becomes a record linkage task. Further, their work on Semi-CRFs (Cohen & Sarawagi, 2004) focuses on the task of labeling segments of tokens with a uniform label, which is especially useful for named entity recognition. In the case of posts, however, Phoebus needs to relax such a restriction because in some cases such segments will be interrupted, as when the area appears in the middle of a hotel-name segment. So, unlike their work, Phoebus makes no assumptions about the structure of posts. Recently, Semi-CRFs have been extended to use database records in the task of integrating unstructured data with relational databases (Mansuri & Sarawagi, 2006). This work is similar to ours in that it links unstructured data, such as paper citations, with relational databases, such as reference sets of authors and venues. The difference is that we view this as a record linkage task, namely finding the right reference-set tuple to match. In their paper, even though they use matches from the database to aid extraction, they view the linkage task as an extraction procedure followed by a matching task. Lastly, we are not the first to consider structured SVMs for information extraction. Previous work used structured SVMs to perform Named Entity Recognition (Tsochantaridis et al., 2005), but their extraction task does not use reference sets.

Our method of aiding information extraction with outside information (in the form of reference sets) is similar to the work on ontology-based information extraction (Embley, Campbell, Jiang, Liddle, Ng, Quass, & Smith, 1999). Later versions of their work even discuss using ontology-based information extraction as a means to semantically annotate unstructured data such as car classifieds (Ding, Embley, & Liddle, 2006). However, in contrast to our work, the information extraction is performed by a keyword look-up into the ontology, along with structural and contextual rules to aid the labeling. The ontology itself contains keyword misspellings and abbreviations, so that the look-up can be performed in the presence of noisy data. We believe the ontology-based extraction approach is less scalable than a record-linkage-style matching task because creating and maintaining the ontology requires extensive data engineering in order to encompass all possible common spelling mistakes and abbreviations. Further, if new data is added to the ontology, additional data engineering must be performed. In our work, we can simply add new tuples to our reference set. Lastly, in contrast to our work, this ontology-based work assumes that contextual and structural rules will apply, making an assumption about the data being extracted from. In our work, we make no such assumptions about the structure of the text we are extracting from.

Yet another interesting approach to information extraction using ontologies is the Textpresso system, which extracts data from biological text (Müller, Kenny, & Sternberg, 2004). This system uses a regular-expression-based keyword look-up to label tokens in text according to the ontology. Once all tokens are labeled, Textpresso can perform “fact extraction” by extracting sequences of labeled tokens that fit a particular pattern, such as gene-allele reference associations. Although this system again uses a reference set for extraction, it differs from ours in that it does a keyword look-up into the lexicon.
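A Textpresso-style pipeline can be illustrated in miniature: regex look-ups label tokens with ontology categories, then “fact extraction” pulls out label sequences matching a pattern. The two-category lexicon and the gene-allele pattern below are invented for illustration.

```python
import re

# Hypothetical miniature lexicon: ontology category -> regular expressions.
LEXICON = {
    "gene":   [re.compile(r"^unc-\d+$", re.IGNORECASE)],
    "allele": [re.compile(r"^e\d+$", re.IGNORECASE)],
}

def label_tokens(text: str) -> list:
    """Keyword/regex look-up: tag each token with its ontology category."""
    return [(tok, next((cat for cat, pats in LEXICON.items()
                        if any(p.match(tok) for p in pats)), None))
            for tok in text.split()]

def extract_facts(labeled: list, pattern=("gene", "allele")) -> list:
    """Fact extraction: token runs whose label sequence matches the pattern."""
    toks = [t for t, _ in labeled]
    cats = [c for _, c in labeled]
    n = len(pattern)
    return [tuple(toks[i:i + n]) for i in range(len(cats) - n + 1)
            if tuple(cats[i:i + n]) == pattern]

print(extract_facts(label_tokens("the gene unc-13 e51 mutant was rescued")))
# [('unc-13', 'e51')]
```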

In recent work on learning efficient blocking schemes, Bilenko et al. (2006) developed a system for learning disjunctive normal form blocking schemes. However, they learn their schemes using a graphical set-covering algorithm, while we use a version of the Sequential Covering Algorithm (SCA). There are also similarities between our BSL algorithm and work on mining association rules from transaction data (Agrawal, Imielinski, & Swami, 1993). Both algorithms discover propositional rules, and both use multiple passes over a data set to discover their rules. However, despite these similarities, the techniques really solve different problems. BSL generates a set of candidate matches with a minimal number of false positives. To do this, BSL learns conjunctions that are maximally specific (eliminating many false positives) and unions them together as a single disjunctive rule (to cover the different true positives). Since the conjunctions are maximally specific, BSL uses SCA underneath, which learns rules in a depth-first, general-to-specific manner (Mitchell, 1997). The work on mining association rules (Agrawal et al., 1993), on the other hand, looks for actual patterns in the data that represent internal relationships. There may be many such relationships that could be discovered, so that approach covers the data in a breadth-first fashion, selecting the set of rules at each iteration and extending them by appending a new possible item to each.
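A compressed sketch of the sequential-covering flavor of BSL follows; the predicate representation and the greedy specificity heuristic are simplifications, and in the paper the predicates are (method, attribute) pairs learned from labeled candidate matches. Each conjunction is grown general-to-specific to prune false positives, and conjunctions are unioned into a disjunctive blocking scheme until every known true match is covered.

```python
def learn_conjunction(preds, pos, neg):
    """Grow one maximally specific conjunction (general-to-specific, as in
    SCA): repeatedly add the predicate that prunes the most remaining false
    positives while still covering at least one remaining true match."""
    conj, preds = [], list(preds)
    while neg and preds:
        best, best_pruned = None, 0
        for p in preds:
            if not any(p(a, b) for a, b in pos):
                continue                     # adding p would lose all positives
            pruned = sum(not p(a, b) for a, b in neg)
            if pruned > best_pruned:
                best, best_pruned = p, pruned
        if best is None:
            break                            # nothing left that helps
        conj.append(best)
        preds.remove(best)
        pos = [x for x in pos if best(*x)]
        neg = [x for x in neg if best(*x)]
    return conj, pos                         # the rule and the positives it keeps

def bsl(preds, pos, neg):
    """Sequential covering: union conjunctions into a disjunctive scheme
    until all true matches are covered (or no further rule helps)."""
    scheme, uncovered = [], list(pos)
    while uncovered:
        conj, covered = learn_conjunction(preds, uncovered, neg)
        if not conj or not covered:
            break
        scheme.append(conj)                  # one disjunct of the final scheme
        uncovered = [x for x in uncovered if x not in covered]
    return scheme  # a pair is a candidate if all predicates of any disjunct hold
```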

References

  • Agichtein, E., & Ganti, V. (2004). Mining reference tables for automatic text segmentation. In: Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining, pp. 20–29. ACM Press.
  • Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 207–216. ACM Press.
  • Baxter, R., Christen, P., & Churches, T. (2003). A comparison of fast blocking methods for record linkage. In: Proceedings of the 9th ACM SIGKDD Workshop on Data Cleaning, Record Linkage, and Object Identification, pp. 25–27.
  • Bellare, K., & McCallum, A. (2007). Learning extractors from unlabeled text using relevant databases. In: Proceedings of the AAAI Workshop on Information Integration on the Web, pp. 10–16.
  • Bilenko, M., Kamath, B., & Mooney, R. J. (2006). Adaptive blocking: Learning to scale up record linkage and clustering. In: Proceedings of the 6th IEEE International Conference on Data Mining, pp. 87–96.
  • Bilenko, M., & Mooney, R. J. (2003). Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the 9th ACM International Conference on Knowledge Discovery and Data Mining, pp. 39–48. ACM Press.
  • Borkar, V., Deshmukh, K., & Sarawagi, S. (2001). Automatic segmentation of text into structured records. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 175–186. ACM Press.
  • Califf, M. E., & Mooney, R. J. (1999). Relational learning of pattern-match rules for information extraction. In: Proceedings of the 16th National Conference on Artificial Intelligence, pp. 328–334.
  • Chakaravarthy, V. T., Gupta, H., Roy, P., & Mohania, M. (2006). Efficiently linking text documents with relevant structured information. In: Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB 2006).
  • Chaudhuri, S., Ganjam, K., Ganti, V., & Motwani, R. (2003). Robust and efficient fuzzy match for online data cleaning. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 313–324. ACM Press.
  • Cimiano, P., Handschuh, S., & Staab, S. (2004). Towards the self-annotating web. In: Proceedings of the 13th International Conference on World Wide Web, pp. 462–471. ACM Press.
  • Ciravegna, F. (2001). Adaptive information extraction from text by rule induction and generalisation. In: Proceedings of the 17th International Joint Conference on Artificial Intelligence, pp. 1251–1256.
  • Cohen, W., & Sarawagi, S. (2004). Exploiting dictionaries in named entity extraction: combining semi-markov extraction processes and data integration methods. In: Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining, pp. 89–98, Seattle, Washington. ACM Press.
  • Cohen, W. W. (2000). Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems, 18 (3), 288–321.
  • Cohen, W. W., Ravikumar, P., & Fienberg, S. E. (2003). A comparison of string metrics for matching names and records. In: Proceedings of the ACM SIGKDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pp. 13–18.
  • Crescenzi, V., Mecca, G., & Merialdo, P. (2001). Roadrunner: Towards automatic data extraction from large web sites. In: Proceedings of 27th International Conference on Very Large Data Bases, pp. 109–118. VLDB Endowment.
  • Ding, Y., Embley, D. W., & Liddle, S. W. (2006). Automatic creation and simplified querying of semantic web content: An approach based on information-extraction ontologies. In: Proceedings of the Asian Semantic Web Conference, pp. 400–414.
  • Dingli, A., Ciravegna, F., & Wilks, Y. (2003). Automatic semantic annotation using unsupervised information extraction and integration. In: Proceedings of the K-CAP Workshop on Knowledge Markup and Semantic Annotation.
  • Elfeky, M. G., Verykios, V. S., & Elmagarmid, A. K. (2002). TAILOR: A record linkage toolbox. In: Proceedings of 18th International Conference on Data Engineering, pp. 17–28.
  • Embley, D. W., Campbell, D. M., Jiang, Y. S., Liddle, S. W., Ng, Y.-K., Quass, D., & Smith, R. D. (1999). Conceptual-model-based data extraction from multiple-record web pages. Data Knowledge Engineering, 31 (3), 227–251.
  • Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64, 1183–1210.
  • Handschuh, S., Staab, S., & Ciravegna, F. (2002). S-cream - semi-automatic creation of metadata. In: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management, pp. 165–184. Springer Verlag.
  • Hendler, J. (2001). Agents and the semantic web. IEEE Intelligent Systems, 16 (2), 30–37.
  • Hernandez, M. A., & Stolfo, S. J. (1998). Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2 (1), 9–37.
  • Jaro, M. A. (1989). Advances in record-linkage methodology as applied to matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association, 84, 414–420.
  • Joachims, T. (1999). Advances in Kernel Methods - Support Vector Learning, chap. 11: Making large-Scale SVM Learning Practical. MIT-Press.
  • Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning, pp. 282–289. Morgan Kaufmann.
  • Lee, M.-L., Ling, T. W., Lu, H., & Ko, Y. T. (1999). Cleansing data for mining and warehousing. In: Proceedings of the 10th International Conference on Database and Expert Systems Applications, pp. 751–760. Springer-Verlag.
  • Lerman, K., Gazen, C., Minton, S., & Knoblock, C. A. (2004). Populating the semantic web. In: Proceedings of the AAAI Workshop on Advances in Text Extraction and Mining.
  • Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. English translation in Soviet Physics Doklady, 10 (8), 707–710.
  • Mansuri, I. R., & Sarawagi, S. (2006). Integrating unstructured data into relational databases. In: Proceedings of the International Conference on Data Engineering, p. 29. IEEE Computer Society.
  • McCallum, A. (2002). Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.
  • McCallum, A., Nigam, K., & Ungar, L. H. (2000). Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 169–178.
  • Michalowski, M., Thakkar, S., & Knoblock, C. A. (2005). Automatically utilizing secondary sources to align information across sources. In AI Magazine, Special Issue on Semantic Integration, Vol. 26, pp. 33–45.
  • Michelson, M., & Knoblock, C. A. (2005). Semantic annotation of unstructured and ungrammatical text. In: Proceedings of the 19th International Joint Conference on Artificial Intelligence, pp. 1091–1098.
  • Michelson, M., & Knoblock, C. A. (2006). Learning blocking schemes for record linkage. In: Proceedings of the 21st National Conference on Artificial Intelligence.
  • Michelson, M., & Knoblock, C. A. (2007). Unsupervised information extraction from unstructured, ungrammatical data sources on the world wide web. International Journal of Document Analysis and Recognition (IJDAR), Special Issue on Noisy Text Analytics.
  • Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York.
  • Müller, H.-M., Kenny, E. E., & Sternberg, P. W. (2004). Textpresso: An ontology-based information retrieval and extraction system for biological literature. PLoS Biology, 2 (11).
  • Muslea, I., Minton, S., & Knoblock, C. A. (2001). Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems, 4 (1/2), 93–114.
  • Newcombe, H. B. (1967). Record linkage: The design of efficient systems for linking records into individual and family histories. American Journal of Human Genetics, 19 (3), 335–359.
  • Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14 (3), 130–137.
  • Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147, 195–197.
  • Soderland, S. (1999). Learning information extraction rules for semi-structured and free text. Machine Learning, 34 (1-3), 233–272.
  • Thakkar, S., Ambite, J. L., & Knoblock, C. A. (2004). A data integration approach to automatically composing and optimizing web services. In: Proceedings of the ICAPS Workshop on Planning and Scheduling for Web and Grid Services.
  • Tsochantaridis, I., Hofmann, T., Joachims, T., & Altun, Y. (2004). Support vector machine learning for interdependent and structured output spaces. In: Proceedings of the 21st International Conference on Machine Learning, p. 104. ACM Press.
  • Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6, 1453–1484.
  • Vargas-Vera, M., Motta, E., Domingue, J., Lanzoni, M., Stutt, A., & Ciravegna, F. (2002). MnM: Ontology driven semi-automatic and automatic support for semantic markup. In: Proceedings of the 13th International Conference on Knowledge Engineering and Management, pp. 213–221.
  • Wellner, B., McCallum, A., Peng, F., & Hay, M. (2004). An integrated, conditional model of information extraction and coreference with application to citation matching. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 593–601.
  • Winkler, W. E., & Thibaudeau, Y. (1991). An application of the Fellegi-Sunter model of record linkage to the 1990 U.S. Decennial Census. Tech. rep., Statistical Research Report Series RR91/09, U.S. Bureau of the Census.
  • Zhai, C., & Lafferty, J. (2001). A study of smoothing methods for language models applied to ad hoc information retrieval. In: Proceedings of the 24th ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 334–342. ACM Press.


Michelson, Matthew, and Craig A. Knoblock. (2008). “Creating Relational Data from Unstructured and Ungrammatical Data Sources.” In: Journal of Artificial Intelligence Research (JAIR), 31. https://www.aaai.org/Papers/JAIR/Vol31/JAIR-3116.pdf