2013 SharkSQLandRichAnalyticsatScale

From GM-RKB
Jump to navigation Jump to search

Subject Headings: Shark System, Spark SQL.

Notes

Cited By

2015

Quotes

Abstract

Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g. iterative machine learning) at scale, and efficiently recovers from failures mid-query. This allows Shark to run SQL queries up to 100X faster than Apache Hive, and machine learning programs more than 100X faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a MapReduce-like execution engine, and the fine-grained fault tolerance properties that such engine provides. It extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL. The result is a system that matches the speedups reported for MPP analytic databases over MapReduce, while offering fault tolerance properties and complex analytics capabilities that they lack.

References

  • 1. https://github.com/cloudera/impala.
  • 2. http://hadoop.apache.org/.
  • 3. http://aws.amazon.com/elasticmapreduce/.
  • 4. Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin, HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads, Proceedings of the VLDB Endowment, v.2 n.1, August 2009
  • 5. Sameer Agarwal, Srikanth Kandula, Nicolas Bruno, Ming-Chuan Wu, Ion Stoica, Jingren Zhou, Re-optimizing Data-parallel Computing, Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, April 25-27, 2012, San Jose, CA
  • 6. Ganesh Ananthanarayanan, Ali Ghodsi, Andrew Wang, Dhruba Borthakur, Srikanth Kandula, Scott Shenker, Ion Stoica, PACMan: Coordinated Memory Caching for Parallel Jobs, Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, April 25-27, 2012, San Jose, CA
  • 7. Ron Avnur, Joseph M. Hellerstein, Eddies: Continuously Adaptive Query Processing, Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, p.261-272, May 15-18, 2000, Dallas, Texas, USA doi:10.1145/342009.335420
  • 8. Shivnath Babu, Towards Automatic Optimization of MapReduce Programs, Proceedings of the 1st ACM Symposium on Cloud Computing, June 10-11, 2010, Indianapolis, Indiana, USA doi:10.1145/1807128.1807150
  • 9. Alexander Behm, Vinayak R. Borkar, Michael J. Carey, Raman Grover, Chen Li, Nicola Onose, Rares Vernica, Alin Deutsch, Yannis Papakonstantinou, Vassilis J. Tsotras, ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-world Models, Distributed and Parallel Databases, v.29 n.3, p.185-216, June 2011 doi:10.1007/s10619-011-7082-y
  • 10. Vinayak Borkar, Michael Carey, Raman Grover, Nicola Onose, Rares Vernica, Hyracks: A Flexible and Extensible Foundation for Data-intensive Computing, Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, p.1151-1162, April 11-16, 2011 doi:10.1109/ICDE.2011.5767921
  • 11. Yingyi Bu, Bill Howe, Magdalena Balazinska, Michael D. Ernst, HaLoop: Efficient Iterative Data Processing on Large Clusters, Proceedings of the VLDB Endowment, v.3 n.1-2, September 2010
  • 12. Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, Jingren Zhou, SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets, Proceedings of the VLDB Endowment, v.1 n.2, August 2008
  • 13. B. Chattopadhyay,, Et Al. Tenzing a Sql Implementation on the Mapreduce Framework. PVLDB, 4(12):1318--1327, 2011.
  • 14. Songting Chen, Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce, Proceedings of the VLDB Endowment, v.3 n.1-2, September 2010
  • 15. C. Chu Et Al. Map-Reduce for Machine Learning on Multicore. Advances in Neural Information Processing Systems, 19:281, 2007.
  • 16. Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, Caleb Welton, MAD Skills: New Analysis Practices for Big Data, Proceedings of the VLDB Endowment, v.2 n.2, August 2009
  • 17. Jeffrey Dean, Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation, p.10-10, December 06-08, 2004, San Francisco, CA
  • 18. Xixuan Feng, Arun Kumar, Benjamin Recht, Christopher Ré, Towards a Unified Architecture for in-RDBMS Analytics, Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, May 20-24, 2012, Scottsdale, Arizona, USA doi:10.1145/2213836.2213874
  • 19. B. Guffler Et Al. Handling Data Skew in Mapreduce. In CLOSER'11.
  • 20. Alexander Hall, Olaf Bachmann, Robert Büssow, Silviu Gănceanu, Marc Nunkesser, Processing a Trillion Cells Per Mouse Click, Proceedings of the VLDB Endowment, v.5 n.11, p.1436-1446, July 2012
  • 21. Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica, Mesos: A Platform for Fine-grained Resource Sharing in the Data Center, Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, March 30-April 01, 2011, Boston, MA
  • 22. Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly, Dryad: Distributed Data-parallel Programs from Sequential Building Blocks, ACM SIGOPS Operating Systems Review, v.41 n.3, June 2007 doi:10.1145/1272998.1273005
  • 23. Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, Andrew Goldberg, Quincy: Fair Scheduling for Distributed Computing Clusters, Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, October 11-14, 2009, Big Sky, Montana, USA doi:10.1145/1629575.1629601
  • 24. Michael Isard, Yuan Yu, Distributed Data-parallel Computing Using a High-level Programming Language, Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, June 29-July 02, 2009, Providence, Rhode Island, USA doi:10.1145/1559845.1559962
  • 25. Navin Kabra, David J. DeWitt, Efficient Mid-query Re-optimization of Sub-optimal Query Execution Plans, ACM SIGMOD Record, v.27 n.2, p.106-117, June 1998 doi:10.1145/276305.276315
  • 26. YongChul Kwon, Magdalena Balazinska, Bill Howe, Jerome Rolia, SkewTune: Mitigating Skew in Mapreduce Applications, Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, May 20-24, 2012, Scottsdale, Arizona, USA doi:10.1145/2213836.2213840
  • 27. Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, Joseph M. Hellerstein, Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud, Proceedings of the VLDB Endowment, v.5 n.8, p.716-727, April 2012
  • 28. Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan Horn, Naty Leiser, Grzegorz Czajkowski, Pregel: A System for Large-scale Graph Processing, Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, June 06-10, 2010, Indianapolis, Indiana, USA doi:10.1145/1807167.1807184
  • 29. Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis, Dremel: Interactive Analysis of Web-scale Datasets, Proceedings of the VLDB Endowment, v.3 n.1-2, September 2010
  • 30. Kay Ousterhout, Aurojit Panda, Joshua Rosen, Shivaram Venkataraman, Reynold Xin, Sylvia Ratnasamy, Scott Shenker, Ion Stoica, The Case for Tiny Tasks in Compute Clusters, Proceedings of the 14th USENIX Conference on Hot Topics in Operating Systems, p.14-14, May 13-15, 2013, Santa Ana Pueblo, New Mexcio
  • 31. Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker, A Comparison of Approaches to Large-scale Data Analysis, Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, June 29-July 02, 2009, Providence, Rhode Island, USA doi:10.1145/1559845.1559865
  • 32. Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O'Neil, Pat O'Neil, Alex Rasin, Nga Tran, Stan Zdonik, C-store: A Column-oriented DBMS, Proceedings of the 31st International Conference on Very Large Data Bases, August 30-September 02, 2005, Trondheim, Norway
  • 33. Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, Alexander Rasin, MapReduce and Parallel DBMSs: Friends Or Foes?, Communications of the ACM, v.53 n.1, January 2010 doi:10.1145/1629175.1629197
  • 34. A. Thusoo Et Al. Hive-a Petabyte Scale Data Warehouse Using Hadoop. In ICDE, 2010.
  • 35. Transaction Processing Performance Council. TPC BENCHMARK H.
  • 36. Tolga Urhan, Michael J. Franklin, Laurent Amsaleg, Cost-based Query Scrambling for Initial Delays, ACM SIGMOD Record, v.27 n.2, p.130-141, June 1998 doi:10.1145/276305.276317
  • 37. C. Yang Et Al. Osprey: Implementing Mapreduce-style Fault Tolerance in a Shared-nothing Distributed Database. In ICDE, 2010.
  • 38. Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, Ion Stoica, Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling, Proceedings of the 5th European Conference on Computer Systems, April 13-16, 2010, Paris, France doi:10.1145/1755913.1755940
  • 39. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, Resilient Distributed Datasets: A Fault-tolerant Abstraction for in-memory Cluster Computing, Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, April 25-27, 2012, San Jose, CA

}};


 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2013 SharkSQLandRichAnalyticsatScaleIon Stoica
Matei Zaharia
Michael J. Franklin
Scott Shenker
Reynold S. Xin
Josh Rosen
Shark: SQL and Rich Analytics at Scale10.1145/2463676.24652882013