2015 IncrementalKnowledgeBaseConstru


Subject Headings: DeepDive System.

Quotes

Abstract

Populating a database with unstructured information is a long-standing problem in industry and research that encompasses problems of extraction, cleaning, and integration. Recent names used for this problem include dealing with dark data and knowledge base construction (KBC). In this work, we describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems, and we present techniques to make the KBC process more efficient. We observe that the KBC process is iterative, and we develop techniques to incrementally produce inference results for KBC systems. We propose two methods for incremental inference, based respectively on sampling and variational techniques. We also study the tradeoff space of these methods and develop a simple rule-based optimizer. DeepDive includes all of these contributions, and we evaluate DeepDive on five KBC systems, showing that it can speed up KBC inference tasks by up to two orders of magnitude with negligible impact on quality.


Title: Incremental Knowledge Base Construction Using DeepDive
Authors: Sen Wu, Ce Zhang, Jaeho Shin, Feiran Wang, Christopher De Sa, Christopher Ré
Year: 2015
DOI: 10.14778/2809974.2809991