Spark DataFrame Data Structure


A Spark DataFrame Data Structure is a distributed data frame data structure (a Spark data structure) that represents a distributed collection of data organized into named columns.



References

2016

  • http://www.kdnuggets.com/2016/02/apache-spark-rdd-dataframe-dataset.html/2
    • QUOTE: Spark 1.3 introduced a new DataFrame API as part of the Project Tungsten initiative which seeks to improve the performance and scalability of Spark. The DataFrame API introduces the concept of a schema to describe the data, allowing Spark to manage the schema and only pass data between nodes, in a much more efficient way than using Java serialization. There are also advantages when performing computations in a single process as Spark can serialize the data into off-heap storage in a binary format and then perform many transformations directly on this off-heap memory, avoiding the garbage-collection costs associated with constructing individual objects for each row in the data set. Because Spark understands the schema, there is no need to use Java serialization to encode the data.

      The DataFrame API is radically different from the RDD API because it is an API for building a relational query plan that Spark’s Catalyst optimizer can then execute. The API is natural for developers who are familiar with building query plans, but not natural for the majority of developers. The query plan can be built from SQL expressions in strings or from a more functional approach using a fluent-style API.

 SQL style:
 df.filter("age > 21");
 Expression builder style:
 df.filter(df.col("age").gt(21));
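      The following is a minimal, self-contained sketch (not taken from the quoted article) of the two filtering styles shown above. It assumes the SparkSession entry point from later Spark releases, a local master, and a small hypothetical people dataset with an explicitly declared schema:

 // Hypothetical sketch: build a DataFrame with an explicit schema, then
 // filter it with the SQL-string style and the expression-builder style.
 import org.apache.spark.sql.{Row, SparkSession}
 import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
 
 object FilterStylesExample {
   def main(args: Array[String]): Unit = {
     val spark = SparkSession.builder()
       .appName("FilterStylesExample")
       .master("local[*]")          // local mode, for illustration only
       .getOrCreate()
 
     // Explicit schema: Spark knows the column names and types up front.
     val schema = StructType(Seq(
       StructField("name", StringType, nullable = false),
       StructField("age", IntegerType, nullable = false)
     ))
 
     val rows = Seq(Row("Alice", 34), Row("Bob", 19), Row("Carol", 27))
     val df = spark.createDataFrame(
       spark.sparkContext.parallelize(rows), schema)
 
     // SQL-expression style: the predicate is parsed from a string.
     df.filter("age > 21").show()
 
     // Expression-builder (fluent) style: the predicate is built from Column objects.
     df.filter(df.col("age").gt(21)).show()
 
     spark.stop()
   }
 }

      Both calls produce the same relational query plan for the Catalyst optimizer; only the way the predicate is expressed differs.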

2015a

  • http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes
    • QUOTE: A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

      The DataFrame API is available in Scala, Java, Python, and R.
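      As a rough illustrative sketch (not from the Spark documentation itself), DataFrames can be constructed from several of the sources listed above. The file path, RDD contents, and JDBC connection details below are hypothetical placeholders, and the SparkSession entry point assumes Spark 2.x or later:

 // Hypothetical sketch: constructing DataFrames from different sources.
 import org.apache.spark.sql.SparkSession
 
 object DataFrameSourcesExample {
   def main(args: Array[String]): Unit = {
     val spark = SparkSession.builder()
       .appName("DataFrameSourcesExample")
       .master("local[*]")
       .getOrCreate()
     import spark.implicits._   // enables .toDF on local collections and RDDs
 
     // 1. From a structured data file (path is hypothetical).
     val fromJson = spark.read.json("examples/people.json")
 
     // 2. From an existing RDD, with column names supplied via toDF.
     val rdd = spark.sparkContext.parallelize(Seq(("Alice", 34), ("Bob", 19)))
     val fromRdd = rdd.toDF("name", "age")
 
     // 3. From an external database over JDBC (connection details are placeholders).
     // val fromJdbc = spark.read
     //   .format("jdbc")
     //   .option("url", "jdbc:postgresql://host:5432/db")
     //   .option("dbtable", "people")
     //   .load()
 
     fromJson.printSchema()
     fromRdd.show()
 
     spark.stop()
   }
 }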
