Data Redistribution Across Partitions Operation

From GM-RKB

A Data Redistribution Across Partitions Operation is a distributed data structure operation that ...



References

2018

  • https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
    • QUOTE: Depending on how you look at Spark (programmer, devop, admin), an RDD is about the content (developer’s and data scientist’s perspective) or how it gets spread out over a cluster (performance), i.e. how many partitions an RDD represents.

      A partition (aka split) is a logical chunk of a large distributed data set.

      Spark manages data using partitions that helps parallelize distributed data processing with minimal network traffic for sending data between executors.
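
The quoted passage treats a partition as a logical chunk of a distributed data set. A minimal single-process sketch of what redistributing records across partitions could look like follows. It is not Spark's implementation: `redistribute`, `num_partitions`, and `key` are hypothetical names introduced here for illustration, and hash-modulo assignment is assumed as the placement rule.

```python
def redistribute(partitions, num_partitions, key=lambda record: record):
    """Reassign every record to one of `num_partitions` target partitions.

    Assumption: the target index is hash(key(record)) % num_partitions.
    This is an illustrative placement rule, not taken from the source.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    targets = [[] for _ in range(num_partitions)]
    # Walk the existing partitions in order and move each record
    # to the partition selected by its key.
    for partition in partitions:
        for record in partition:
            targets[hash(key(record)) % num_partitions].append(record)
    return targets


# A skewed layout: almost all records sit in the first partition.
skewed = [[0, 1, 2, 3, 4, 5, 6, 7], [8], []]
balanced = redistribute(skewed, 3)
# balanced == [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
```

The example uses small non-negative integers because Python hashes them to themselves, which keeps the result deterministic. In a real cluster every record that changes partition may also have to cross the network between executors, which this in-memory sketch does not model.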