Spark Datasets API

From GM-RKB

A Spark Datasets API is a Spark interface for strongly-typed distributed collections (Datasets) that combines the benefits of RDDs (strong typing, the ability to use lambda functions) with the optimized execution engine of Spark SQL.



References

2016

 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.sql.SQLContext
 
 val conf = new SparkConf().setAppName("DatasetExample")
 val sc = new SparkContext(conf)
 val sqlContext = new SQLContext(sc)
 import sqlContext.implicits._
 // ScalaPerson and ScalaData are the example's sample case class and data helper
 val sampleData: Seq[ScalaPerson] = ScalaData.sampleData()
 val dataset = sqlContext.createDataset(sampleData)
 dataset.filter(_.age < 21)

2015

  • http://spark.apache.org/docs/latest/sql-programming-guide.html#datasets
    • QUOTE: A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.).
      The unified Dataset API can be used both in Scala and Java. Python does not yet have support for the Dataset API, but due to its dynamic nature many of the benefits are already available (i.e. you can access the field of a row by name naturally row.columnName). Full python support will be added in a future release.
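As the quote notes, Datasets are manipulated with the same functional transformations (map, flatMap, filter) used on ordinary Scala collections. The following plain-Scala sketch (no Spark dependency; the `Person` case class and sample data are hypothetical, not from the quoted example) illustrates the shape of those typed transformations:

```scala
// Hypothetical element type; in Spark, a case class lets the Dataset
// encoder derive a typed schema automatically.
case class Person(name: String, age: Int)

object DatasetStyleExample {
  def main(args: Array[String]): Unit = {
    // A Seq stands in for a Dataset here; the transformation calls
    // below have the same shape on a real Spark Dataset.
    val people = Seq(Person("Ana", 34), Person("Ben", 19), Person("Cy", 25))

    // filter with a typed lambda, as in dataset.filter(_.age < 21)
    val minors = people.filter(_.age < 21)

    // map to a projected field, as in dataset.map(_.name)
    val names = minors.map(_.name)

    println(names.mkString(","))  // prints "Ben"
  }
}
```

Because the lambdas are typed against `Person`, field access such as `_.age` is checked at compile time, which is the "strong typing" benefit the quote contrasts with untyped Row-based access.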