2015 IncrementalKnowledgeBaseConstru

(Shin et al., 2015) ⇒ Jaeho Shin, Sen Wu, Feiran Wang, Christopher De Sa, Ce Zhang, and Christopher Ré. (2015). “Incremental Knowledge Base Construction Using DeepDive.” In: Proceedings of the VLDB Endowment Journal, 8(11). doi:10.14778/2809974.2809991

Subject Headings: DeepDive System.

Notes

Cited By

Quotes

Abstract

Populating a database with unstructured information is a long-standing problem in industry and research that encompasses problems of extraction, cleaning, and integration. Recent names used for this problem include dealing with dark data and knowledge base construction (KBC). In this work, we describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems, and we present techniques to make the KBC process more efficient. We observe that the KBC process is iterative, and we develop techniques to incrementally produce inference results for KBC systems. We propose two methods for incremental inference, based respectively on sampling and variational techniques. We also study the tradeoff space of these methods and develop a simple rule-based optimizer. DeepDive includes all of these contributions, and we evaluate DeepDive on five KBC systems, showing that it can speed up KBC inference tasks by up to two orders of magnitude with negligible impact on quality.

References

1. U. A. Acar, A. Ihler, R. Mettu, and O. Sümer. Adaptive Inference on General Graphical Models. In UAI, 2008.
2. C. Andrieu, N. De Freitas, A. Doucet, and M. I. Jordan. An Introduction to MCMC for Machine Learning. Machine Learning, 2003.
3. G. Angeli, S. Gupta, M. Jose, C. D. Manning, C. Ré, J. Tibshirani, J. Y. Wu, S. Wu, and C. Zhang. Stanford's 2014 Slot Filling Systems. TAC KBP, 2014.
4. Onureena Banerjee, Laurent El Ghaoui, Alexandre D'Aspremont, Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian Or Binary Data, The Journal of Machine Learning Research, 9, p.485-516, 6/1/2008
5. J. Betteridge, A. Carlson, S. A. Hong, E. R. Hruschka Jr, E. L. Law, T. M. Mitchell, and S. H. Wang. Toward Never Ending Language Learning. In AAAI Spring Symposium, 2009.
6. Sergey Brin, Extracting Patterns and Relations from the World Wide Web, Selected Papers from the International Workshop on The World Wide Web and Databases, p.172-183, March 27-28, 1998
7. E. Brown, E. Epstein, J. W. Murdock, and T.-H. Fin. Tools and Methods for Building Watson. IBM Research Report, 2013.
8. A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr, and T. M. Mitchell. Toward An Architecture for Never-ending Language Learning. In AAAI, 2010.
9. Fei Chen, AnHai Doan, Jun Yang, Raghu Ramakrishnan, Efficient Information Extraction over Evolving Text Data, Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, p.943-952, April 07-12, 2008 doi:10.1109/ICDE.2008.4497503
10. Fei Chen, Xixuan Feng, Christopher Ré, Min Wang, Optimizing Statistical Information Extraction Programs over Evolving Text, Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, p.870-881, April 01-05, 2012 doi:10.1109/ICDE.2012.60
11. Yang Chen, Daisy Zhe Wang, Knowledge Expansion over Probabilistic Knowledge Bases, Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, June 22-27, 2014, Snowbird, Utah, USA doi:10.1145/2588555.2610516
12. Arthur L. Delcher, Adam J. Grove, Simon Kasif, Judea Pearl, Logarithmic-time Updates and Queries in Probabilistic Networks, Journal of Artificial Intelligence Research, v.4 n.1, p.37-59, January 1996
13. Pedro Domingos, Daniel Lowd, Markov Logic: An Interface Layer for Artificial Intelligence, Morgan and Claypool Publishers, 2009
14. Xin Luna Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Kevin Murphy, Shaohua Sun, Wei Zhang, From Data Fusion to Knowledge Fusion, Proceedings of the VLDB Endowment, v.7 n.10, p.881-892, June 2014 doi:10.14778/2732951.2732962
15. Oren Etzioni, Michael Cafarella, Doug Downey, Stanley Kok, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, Alexander Yates, Web-scale Information Extraction in Knowitall: (preliminary Results), Proceedings of the 13th International Conference on World Wide Web, May 17-20, 2004, New York, NY, USA doi:10.1145/988672.988687
16. D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, N. Schlaefer, and C. Welty. Building Watson: An Overview of the DeepQA Project. AI Magazine, 2010.
17. Georg Gottlob, Christoph Koch, Robert Baumgartner, Marcus Herzog, Sergio Flesca, The Lixto Data Extraction Project: Back and Forth Between Theory and Practice, Proceedings of the Twenty-third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, June 14-16, 2004, Paris, France doi:10.1145/1055558.1055560
18. Ashish Gupta, Inderpal Singh Mumick, V. S. Subrahmanian, Maintaining Views Incrementally, ACM SIGMOD Record, v.22 n.2, p.157-166, June 1, 1993 doi:10.1145/170036.170066
19. Marti A. Hearst, Automatic Acquisition of Hyponyms from Large Text Corpora, Proceedings of the 14th Conference on Computational Linguistics, August 23-28, 1992, Nantes, France doi:10.3115/992133.992154
20. Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, Daniel S. Weld, Knowledge-based Weak Supervision for Information Extraction of Overlapping Relations, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, June 19-24, 2011, Portland, Oregon
21. Ravi Jampani, Fei Xu, Mingxi Wu, Luis Leopoldo Perez, Christopher Jermaine, Peter J. Haas, MCDB: A Monte Carlo Approach to Managing Uncertain Data, Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, June 09-12, 2008, Vancouver, Canada doi:10.1145/1376616.1376686
22. E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003.
23. Shangpu Jiang, Daniel Lowd, Dejing Dou, Learning to Refine An Automatically Extracted Knowledge Base Using Markov Logic, Proceedings of the 2012 IEEE 12th International Conference on Data Mining, p.912-917, December 10-13, 2012 doi:10.1109/ICDM.2012.156
24. M. Levent Koc, Christopher Ré, Incrementally Maintaining Classification Using An RDBMS, Proceedings of the VLDB Endowment, v.4 n.5, p.302-313, February 2011 doi:10.14778/1952376.1952380
25. Yunyao Li, Frederick R. Reiss, Laura Chiticariu, SystemT: A Declarative Information Extraction System, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Systems Demonstrations, p.109-114, June 21-21, 2011, Portland, Oregon
26. J. Madhavan, S. Jeffery, S. Cohen, X. Dong, D. Ko, C. Yu, and A. Halevy. Web-scale Data Integration: You Can Only Afford to Pay As You Go. In CIDR, 2007.
27. Mike Mintz, Steven Bills, Rion Snow, Dan Jurafsky, Distant Supervision for Relation Extraction Without Labeled Data, Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, August 02-07, 2009, Suntec, Singapore
28. Ndapandula Nakashole, Martin Theobald, Gerhard Weikum, Scalable Knowledge Harvesting with High Precision and High Recall, Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, February 09-12, 2011, Hong Kong, China doi:10.1145/1935826.1935869
29. A. Nath and P. Domingos. Efficient Belief Propagation for Utility Maximization and Repeated Inference. In AAAI, 2010.
30. Feng Niu, Christopher Ré, AnHai Doan, Jude Shavlik, Tuffy: Scaling Up Statistical Inference in Markov Logic Networks Using An RDBMS, Proceedings of the VLDB Endowment, v.4 n.6, p.373-384, March 2011 doi:10.14778/1978665.1978669
31. Feng Niu, Ce Zhang, Christopher Ré, Jude Shavlik, Elementary: Large-Scale Knowledge-Base Construction via Machine Learning and Statistical Inference, International Journal on Semantic Web & Information Systems, v.8 n.3, p.42-73, July 2012 doi:10.4018/jswis.2012070103
32. S. E. Peters, C. Zhang, M. Livny, and C. Ré. A Machine Reading System for Assembling Synthetic Paleontological Databases. PloS ONE, 2014.
33. P. D. Ravikumar, G. Raskutti, M. J. Wainwright, and B. Yu. Model Selection in Gaussian Graphical Models: High-dimensional Consistency of l₁-regularized MLE. In NIPS, 2008.
34. C. Ré, A. A. Sadeghian, Z. Shan, J. Shin, F. Wang, S. Wu, and C. Zhang. Feature Engineering for Knowledge Base Construction. IEEE Data Eng. Bull., 2014.
35. Christian P. Robert, George Casella, Monte Carlo Statistical Methods (Springer Texts in Statistics), Springer-Verlag New York, Inc., Secaucus, NJ, 2005
36. Warren Shen, AnHai Doan, Jeffrey F. Naughton, Raghu Ramakrishnan, Declarative Information Extraction Using Datalog with Embedded Extraction Predicates, Proceedings of the 33rd International Conference on Very Large Data Bases, September 23-27, 2007, Vienna, Austria
37. Dan Suciu, Dan Olteanu, Christopher Ré, Christoph Koch, Probabilistic Databases, Morgan & Claypool Publishers, 2011
38. M. J. Wainwright, M. I. Jordan, Log-determinant Relaxation for Approximate Inference in Discrete Markov Random Fields, IEEE Transactions on Signal Processing, v.54 n.6, p.2099-2109, June 2006 doi:10.1109/TSP.2006.874409
39. Martin J. Wainwright, Michael I. Jordan, Graphical Models, Exponential Families, and Variational Inference, Foundations and Trends® in Machine Learning, v.1 n.1-2, p.1-305, January 2008 doi:10.1561/2200000001
40. Gerhard Weikum, Martin Theobald, From Information to Knowledge: Harvesting Entities and Relationships from Web Sources, Proceedings of the Twenty-ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, June 06-11, 2010, Indianapolis, Indiana, USA doi:10.1145/1807085.1807097
41. M. Wick and A. McCallum. Query-aware MCMC. In NIPS, 2011.
42. Michael Wick, Andrew McCallum, Gerome Miklau, Scalable Probabilistic Databases with Factor Graphs and MCMC, Proceedings of the VLDB Endowment, v.3 n.1-2, September 2010 doi:10.14778/1920841.1920942
43. Limin Yao, Sebastian Riedel, Andrew McCallum, Collective Cross-document Relation Extraction Without Labelled Data, Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, p.1013-1023, October 09-11, 2010, Cambridge, Massachusetts
44. Ce Zhang, Christopher Ré, Towards High-throughput Gibbs Sampling at Scale: A Study Across Storage Managers, Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, June 22-27, 2013, New York, New York, USA doi:10.1145/2463676.2463702
45. Ce Zhang, Christopher Ré, DimmWitted: A Study of Main-memory Statistical Analytics, Proceedings of the VLDB Endowment, v.7 n.12, p.1283-1294, August 2014 doi:10.14778/2732977.2733001

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2015 IncrementalKnowledgeBaseConstru	Sen Wu, Ce Zhang, Jaeho Shin, Feiran Wang, Christopher De Sa, Christopher Ré			Incremental Knowledge Base Construction Using DeepDive				10.14778/2809974.2809991		2015
