Spark RDD (Resilient Distributed Dataset) Data Structure

From GM-RKB

A Spark RDD (Resilient Distributed Dataset) Data Structure is a read-only, partitioned distributed data structure that serves as the core abstraction of Apache Spark (which can run on a cluster manager such as Hadoop YARN).



References

2017

  • https://spark.apache.org/docs/2.1.0/programming-guide.html
    • QUOTE: At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.

      A second abstraction in Spark is shared variables that can be used in parallel operations.
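The RDD description quoted above (a partitioned collection, built from an existing collection and then transformed) can be illustrated with a small pure-Python toy model. This is only a sketch under stated assumptions: `ToyRDD`, `parallelize`, and the round-robin split are hypothetical and are not Spark's API or implementation, and the partitions are processed sequentially rather than on a cluster.

```python
from typing import Callable, Iterable, Iterator, List


class ToyRDD:
    """Hypothetical stand-in for an RDD: fixed partitions plus a lazy per-partition function."""

    def __init__(self, partitions: List[list],
                 fn: Callable[[Iterator], Iterator] = iter):
        self._partitions = partitions  # source data, already split into chunks
        self._fn = fn                  # applied to a partition only when an action runs

    @classmethod
    def parallelize(cls, data: Iterable, num_partitions: int) -> "ToyRDD":
        items = list(data)
        # Round-robin split (an assumption of this sketch, not Spark's slicing rule).
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    def map(self, f: Callable) -> "ToyRDD":
        prev = self._fn
        # Returns a new ToyRDD; the original is never modified (read-only).
        return ToyRDD(self._partitions, lambda it: (f(x) for x in prev(it)))

    def filter(self, pred: Callable) -> "ToyRDD":
        prev = self._fn
        return ToyRDD(self._partitions, lambda it: (x for x in prev(it) if pred(x)))

    def collect(self) -> list:
        # Action: only here is the chained function evaluated, one partition at a time.
        out = []
        for part in self._partitions:
            out.extend(self._fn(iter(part)))
        return out


rdd = ToyRDD.parallelize(range(10), num_partitions=3)
squares_of_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
result = sorted(squares_of_evens.collect())
```

Each transformation returns a new object and leaves its parent untouched, which mirrors the read-only nature of RDDs. Nothing is computed until `collect()` is called.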

2016

  • https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
    • QUOTE: One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.

      You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
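The persistence behaviour quoted above (partitions kept in memory after the first action, and lost partitions recomputed from the original transformations) can be sketched with a pure-Python toy model. The class `CachedToyRDD`, the `lose_partition` helper, and the `compute_count` counter are hypothetical names introduced for illustration; this is not how Spark implements caching.

```python
from typing import Callable, Dict, List


class CachedToyRDD:
    """Hypothetical model: source partitions, one transformation, optional in-memory cache."""

    def __init__(self, partitions: List[list], transform: Callable):
        self._partitions = partitions   # source data split into partitions
        self._transform = transform     # the "lineage" used to (re)compute a partition
        self._persist = False
        self._cache: Dict[int, list] = {}
        self.compute_count = 0          # number of partition computations performed

    def persist(self) -> "CachedToyRDD":
        # Only marks the dataset; nothing is stored until an action computes it.
        self._persist = True
        return self

    def _get_partition(self, idx: int) -> list:
        if self._persist and idx in self._cache:
            return self._cache[idx]     # reuse the stored partition
        self.compute_count += 1
        result = [self._transform(x) for x in self._partitions[idx]]
        if self._persist:
            self._cache[idx] = result
        return result

    def collect(self) -> list:
        return [x for i in range(len(self._partitions))
                for x in self._get_partition(i)]

    def lose_partition(self, idx: int) -> None:
        # Simulates a node failure that drops one cached partition.
        self._cache.pop(idx, None)


rdd = CachedToyRDD([[1, 2], [3, 4]], lambda x: x * 10).persist()
first = rdd.collect()    # computes both partitions
second = rdd.collect()   # served entirely from the cache
rdd.lose_partition(1)
third = rdd.collect()    # recomputes only the lost partition
```

In this sketch the three actions return the same data while only three partition computations take place: two on the first action and one after the simulated loss.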

2015

  • https://spark.apache.org/docs/0.8.1/api/core/org/apache/spark/rdd/RDD.html
    • QUOTE: A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist. In addition, PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; DoubleRDDFunctions contains operations available only on RDDs of Doubles; and SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles. These operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)] through implicit conversions when you import org.apache.spark.SparkContext._.

      Internally, each RDD is characterized by five main properties:

      • A list of partitions
      • A function for computing each split
      • A list of dependencies on other RDDs.
      • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
      • Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

      All of the scheduling and execution in Spark is done based on these methods, allowing each RDD to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for reading data from a new storage system) by overriding these functions. Please refer to the Spark paper for more details on RDD internals.
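The five properties listed above, and the idea of a custom RDD defined by overriding them, can be sketched as a small pure-Python interface. All names here (`ToyRDD`, `RangeRDD`, `MappedRDD`, the host names, and the method names) are hypothetical and only loosely mirror the quoted description; this is not Spark's class hierarchy.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterator, List, Optional


class ToyRDD(ABC):
    """Hypothetical interface exposing the five properties from the quote."""

    @abstractmethod
    def partitions(self) -> List[int]:
        """1. A list of partitions (here: partition indices)."""

    @abstractmethod
    def compute(self, split: int) -> Iterator:
        """2. A function for computing each split."""

    def dependencies(self) -> List["ToyRDD"]:
        """3. A list of dependencies on other RDDs."""
        return []

    def partitioner(self) -> Optional[Callable[[object], int]]:
        """4. Optionally, a partitioner for key-value data."""
        return None

    def preferred_locations(self, split: int) -> List[str]:
        """5. Optionally, preferred locations for computing a split."""
        return []

    def collect(self) -> list:
        # A trivial sequential "scheduler" driven only by the methods above.
        return [x for s in self.partitions() for x in self.compute(s)]


class RangeRDD(ToyRDD):
    """Custom source: the integers [0, n) split into fixed-size chunks."""

    def __init__(self, n: int, chunk: int, hosts: List[str]):
        self.n, self.chunk, self.hosts = n, chunk, hosts

    def partitions(self) -> List[int]:
        return list(range((self.n + self.chunk - 1) // self.chunk))

    def compute(self, split: int) -> Iterator:
        start = split * self.chunk
        return iter(range(start, min(start + self.chunk, self.n)))

    def preferred_locations(self, split: int) -> List[str]:
        return [self.hosts[split % len(self.hosts)]]


class MappedRDD(ToyRDD):
    """Derived dataset: applies f to every element of its parent."""

    def __init__(self, parent: ToyRDD, f: Callable):
        self.parent, self.f = parent, f

    def partitions(self) -> List[int]:
        return self.parent.partitions()

    def compute(self, split: int) -> Iterator:
        return (self.f(x) for x in self.parent.compute(split))

    def dependencies(self) -> List[ToyRDD]:
        return [self.parent]

    def preferred_locations(self, split: int) -> List[str]:
        return self.parent.preferred_locations(split)


source = RangeRDD(n=7, chunk=3, hosts=["host-a", "host-b"])
doubled = MappedRDD(source, lambda x: x * 2)
collected = doubled.collect()
```

Because `collect()` relies only on `partitions()` and `compute()`, a new data source can be added by writing one subclass, which is the extension point the quoted passage describes.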



2012