Spark MLlib Pipeline API

From GM-RKB

A Spark MLlib Pipeline API is a Spark API for specifying an ML workflow as a sequence of Spark MLlib stages (Transformers and Estimators) that operate on DataFrames.



References

2017

  • http://github.com/apache/spark/blob/master/docs/ml-pipeline.md
    • QUOTE: ML Pipelines provide a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning pipelines.

       MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow. This section covers the key concepts introduced by the Pipelines API, where the pipeline concept is mostly inspired by the scikit-learn project.

      • DataFrame: This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
      • Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.
      • Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
      • Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
      • Parameter: All Transformers and Estimators now share a common API for specifying parameters.
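
The relationship between the Transformer, Estimator, and Pipeline concepts quoted above can be sketched in plain Python. This is a toy model for illustration only, not Spark's implementation: it does not use Spark, a list of dicts stands in for a DataFrame, and the stage classes (`Tokenizer`, `KeywordLearner`, `KeywordModel`) and column names are hypothetical. It shows one reading of the quoted definitions: fitting a pipeline fits each Estimator in order, and the result is itself a Transformer.

```python
from typing import Dict, List

# Assumption: a list of dicts stands in for a DataFrame (one dict per row).
Table = List[Dict[str, object]]


class Transformer:
    """Turns one table into another table."""
    def transform(self, table: Table) -> Table:
        raise NotImplementedError


class Estimator:
    """Can be fit on a table to produce a Transformer."""
    def fit(self, table: Table) -> Transformer:
        raise NotImplementedError


class Tokenizer(Transformer):
    """Hypothetical stage: adds a 'words' column from the 'text' column."""
    def transform(self, table: Table) -> Table:
        return [{**row, "words": str(row["text"]).lower().split()} for row in table]


class KeywordModel(Transformer):
    """Hypothetical model: predicts 1.0 if a row contains any learned keyword."""
    def __init__(self, keywords: set):
        self.keywords = keywords

    def transform(self, table: Table) -> Table:
        return [
            {**row, "prediction": 1.0 if self.keywords & set(row["words"]) else 0.0}
            for row in table
        ]


class KeywordLearner(Estimator):
    """Hypothetical learner: keeps words seen only in positively labeled rows."""
    def fit(self, table: Table) -> Transformer:
        positive, negative = set(), set()
        for row in table:
            (positive if row["label"] == 1.0 else negative).update(row["words"])
        return KeywordModel(positive - negative)


class PipelineModel(Transformer):
    """A fitted pipeline: applies each fitted stage in order."""
    def __init__(self, stages: List[Transformer]):
        self.stages = stages

    def transform(self, table: Table) -> Table:
        for stage in self.stages:
            table = stage.transform(table)
        return table


class Pipeline(Estimator):
    """Chains Transformers and Estimators; fitting yields a PipelineModel."""
    def __init__(self, stages: list):
        self.stages = stages

    def fit(self, table: Table) -> Transformer:
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(table)      # Estimator -> Transformer
            table = stage.transform(table)    # feed the output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


train = [
    {"text": "spark is fast", "label": 1.0},
    {"text": "the weather is nice", "label": 0.0},
]
test = [{"text": "Spark pipelines"}, {"text": "nice day"}]

model = Pipeline([Tokenizer(), KeywordLearner()]).fit(train)
predictions = [row["prediction"] for row in model.transform(test)]
```

In Spark itself the corresponding classes live in the `pyspark.ml` package (Python) and `org.apache.spark.ml` (Scala), and the stages operate on Spark SQL DataFrames rather than lists of dicts.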