PySpark API
A PySpark API is a Spark API for Python code.
- Context:
- It can contain PySpark classes and modules, such as the following (see the sketch after this list):
pyspark.SparkContext: Main entry point for Spark functionality.
pyspark.RDD: A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
pyspark.sql module.
pyspark.sql.SQLContext: Main entry point for DataFrame and SQL functionality.
pyspark.sql.DataFrame: A distributed collection of data grouped into named columns.
pyspark.streaming module.
pyspark.streaming.StreamingContext: Main entry point for Spark Streaming functionality.
pyspark.streaming.DStream: A Discretized Stream (DStream), the basic abstraction in Spark Streaming.
pyspark.ml package.
pyspark.mllib package.
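A minimal sketch of these core entry points (the local-mode setup and example data are assumptions, not from the source):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[2]", "pyspark-api-sketch")    # pyspark.SparkContext: main entry point
sqlContext = SQLContext(sc)                            # pyspark.sql.SQLContext: DataFrame/SQL entry point

rdd = sc.parallelize([1, 2, 3, 4])                     # pyspark.RDD: resilient distributed dataset
print(rdd.map(lambda x: x * x).collect())              # [1, 4, 9, 16]

df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])   # pyspark.sql.DataFrame
df.show()

sc.stop()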
- Counter-Example(s):
- See: Spark SQL, Spark Library.
References
2016
- https://spark.apache.org/docs/0.9.0/python-programming-guide.html
- QUOTE: The Spark Python API (PySpark) exposes the Spark programming model to Python. To learn the basics of Spark, we recommend reading through the Scala programming guide first; it should be easy to follow even if you don’t know Scala. This guide will show how to use the Spark features described there in Python.
There are a few key differences between the Python and Scala APIs:
- Python is dynamically typed, so RDDs can hold objects of multiple types.
- PySpark does not yet support a few API calls, such as lookup and non-text input files, though these will be added in future releases.
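As a sketch of the first point above, a single RDD can hold values of several Python types (the data here is hypothetical):

from pyspark import SparkContext

sc = SparkContext("local", "mixed-type-rdd")
mixed = sc.parallelize([1, "two", 3.0, (4, "four"), {"five": 5}])
print(mixed.map(lambda x: type(x).__name__).collect())   # ['int', 'str', 'float', 'tuple', 'dict']
sc.stop()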
- http://spark.apache.org/docs/latest/api/python/
- QUOTE: Core classes:
pyspark.SparkContext: Main entry point for Spark functionality.
pyspark.RDD: A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
pyspark.streaming.StreamingContext: Main entry point for Spark Streaming functionality.
pyspark.streaming.DStream: A Discretized Stream (DStream), the basic abstraction in Spark Streaming.
pyspark.sql.SQLContext: Main entry point for DataFrame and SQL functionality.
pyspark.sql.DataFrame: A distributed collection of data grouped into named columns.
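A minimal sketch of the streaming entry points listed above (the socket source, host, and port are assumptions):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-sketch")
ssc = StreamingContext(sc, batchDuration=1)        # pyspark.streaming.StreamingContext

lines = ssc.socketTextStream("localhost", 9999)    # pyspark.streaming.DStream of text lines
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()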
- http://stackoverflow.com/a/37084862
- QUOTE: As of Spark 1.0, you should launch pyspark applications using spark-submit. While pyspark will launch the interactive shell, spark-submit allows you to easily launch a spark job on various cluster managers.
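A sketch of that workflow: a self-contained script (the file name wordcount.py and the word-count logic are hypothetical) launched with spark-submit rather than the interactive pyspark shell:

# Launch with, e.g.: spark-submit --master local[2] wordcount.py <input-path>
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="wordcount")
    counts = (sc.textFile(sys.argv[1])
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    for word, count in counts.collect():
        print(word, count)
    sc.stop()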
2015
- http://spark.apache.org/docs/latest/api/python/index.html
- pyspark package
- Subpackages
- Contents
- pyspark.sql module
- Module Context
- pyspark.sql.types module
- pyspark.sql.functions module
- pyspark.streaming module
- Module contents
- pyspark.streaming.kafka module
- pyspark.ml package
- Module Context
- pyspark.ml.feature module
- pyspark.ml.classification module
- pyspark.mllib package
- pyspark.mllib.classification module
- pyspark.mllib.clustering module
- pyspark.mllib.feature module
- pyspark.mllib.linalg module
- pyspark.mllib.random module
- pyspark.mllib.recommendation module
- pyspark.mllib.regression module
- pyspark.mllib.stat module
- pyspark.mllib.tree module
- pyspark.mllib.util module
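A short sketch combining the pyspark.ml.feature and pyspark.ml.classification modules from the listing above (the column names and toy training data are assumptions):

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

sc = SparkContext("local[2]", "ml-sketch")
sqlContext = SQLContext(sc)

training = sqlContext.createDataFrame(
    [(0, "spark is fast", 1.0), (1, "hadoop mapreduce", 0.0)],
    ["id", "text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")      # pyspark.ml.feature
hashingTF = HashingTF(inputCol="words", outputCol="features")  # pyspark.ml.feature
lr = LogisticRegression(maxIter=10)                            # pyspark.ml.classification

model = Pipeline(stages=[tokenizer, hashingTF, lr]).fit(training)
model.transform(training).select("text", "prediction").show()

sc.stop()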