Dask Framework

A Dask Framework is a Parallel Computing Framework.



References

2021

  • https://docs.dask.org/en/latest/
    • QUOTE: Dask is a flexible library for parallel computing in Python.
    • Dask is composed of two parts:
      • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
      • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of dynamic task schedulers.
    • Dask emphasizes the following virtues:
      • Familiar: Provides parallelized NumPy array and Pandas DataFrame objects
      • Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.
      • Native: Enables distributed computing in pure Python with access to the PyData stack.
      • Fast: Operates with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms
      • Scales up: Runs resiliently on clusters with 1000s of cores
      • Scales down: Trivial to set up and run on a laptop in a single process
      • Responsive: Designed with interactive computing in mind, it provides rapid feedback and diagnostics to aid humans
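
    The task-scheduling half described above can be tried directly through dask.delayed. Below is a minimal sketch, assuming only a local Dask installation; the load/clean/summarize functions and file names are hypothetical stand-ins for real work:

      from dask import delayed

      @delayed
      def load(name):
          # Stand-in for real I/O (hypothetical).
          return list(range(10))

      @delayed
      def clean(data):
          return [v * 2 for v in data]

      @delayed
      def summarize(parts):
          return sum(sum(p) for p in parts)

      # Build the task graph lazily; nothing executes yet.
      parts = [clean(load(f"file-{i}")) for i in range(4)]
      total = summarize(parts)

      # Run the graph on the default (threaded) scheduler.
      print(total.compute())  # 360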
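
    The "Familiar" collections can likewise be exercised on a laptop. The sketch below, assuming dask is installed and using made-up data, shows the NumPy-like array and Pandas-like dataframe interfaces:

      import dask.array as da
      import dask.dataframe as dd
      import pandas as pd

      # Parallel NumPy-like array: 10,000 x 10,000 values in 1,000 x 1,000 chunks.
      x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
      y = (x + x.T).mean(axis=0)   # lazily builds a task graph
      print(y.compute())           # executes the graph in parallel

      # Parallel Pandas-like dataframe built from a small in-memory frame.
      pdf = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
      ddf = dd.from_pandas(pdf, npartitions=2)
      print(ddf.groupby("key").value.sum().compute())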

2020

  • https://medium.com/@prayankkul27/which-one-should-i-use-apache-spark-or-dask-22ad4a20ab77
    • QUOTE: ... Dask is smaller and lighter-weight compared to Spark. Dask has fewer features. Dask uses and couples with libraries like NumPy, Pandas, and Scikit-learn to gain high-level functionality.
      • Spark is written in Scala and supports various other languages such as R, Python, and Java, whereas Dask is written in Python and supports only Python.
      • Spark has its own ecosystem and is well integrated with other Apache projects, whereas Dask is a component of the larger Python ecosystem, with the main aim of enhancing and building on libraries like Pandas, NumPy, and Scikit-learn.
      • Spark is older and has become a dominant and well-trusted tool in the Big Data world, whereas Dask is younger and is an extension of the well-trusted NumPy/Pandas/Scikit-learn/Jupyter stack.
      • The Spark DataFrame has its own API and memory model, and Spark also implements a large subset of SQL queries, whereas Dask reuses the Pandas API and memory model and implements neither SQL nor a query optimizer.
      • For machine learning, Spark has MLlib, which fits naturally with Spark's map-reduce-style system, whereas Dask relies on and interoperates with existing popular machine learning and data science libraries like Scikit-Learn and XGBoost.
      • Spark does not support multi-dimensional array structures, whereas Dask offers the full functionality of the NumPy model.
      • Spark can process graph models using the GraphX library, whereas Dask has no library or model for graph processing.
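
    The machine-learning point in the comparison above can be made concrete: importing dask.distributed registers a "dask" joblib backend, so an unmodified scikit-learn grid search can fan its work out over a Dask cluster. A minimal sketch, assuming dask.distributed, joblib, and scikit-learn are installed (the dataset and parameter grid are illustrative):

      import joblib
      from dask.distributed import Client       # importing registers the "dask" joblib backend
      from sklearn.datasets import make_classification
      from sklearn.model_selection import GridSearchCV
      from sklearn.svm import SVC

      client = Client(processes=False)          # small in-process cluster for the demo

      X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
      search = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0], "kernel": ["rbf", "linear"]}, cv=3)

      # Route scikit-learn's internal joblib parallelism through the Dask scheduler.
      with joblib.parallel_backend("dask"):
          search.fit(X, y)

      print(search.best_params_)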