Spark User-Defined Function (UDF)

From GM-RKB

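A Spark User-Defined Function (UDF) is a user-defined function that is defined within Apache Spark and applied to the columns of a Spark DataFrame or invoked from Spark SQL.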



References

2018

  • https://changhsinlee.com/pyspark-udf/
    • QUOTE: Here’s the problem: I have a Python function that iterates over my data, but going through each row in the dataframe takes several days. If I have a computing cluster with many nodes, how can I distribute this Python function in PySpark to speed up this process — maybe cut the total time down to less than a few hours — with the least amount of work?

      In other words, how do I turn a Python function into a Spark user defined function, or UDF? I’ll explain my solution here. ...

      Registering a UDF: PySpark UDFs work in a similar way as the pandas .map() and .apply() methods for pandas series and dataframes. If I have a function that can use values from a row in the dataframe as input, then I can map it to the entire dataframe. The only difference is that with PySpark UDFs I have to specify the output data type.
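The registration step described in the quote above can be sketched as follows. This is an illustration only and is not taken from the quoted post: the function `squared`, the helper `run_demo`, the column names `n` and `n_squared`, and the local-mode session settings are all assumptions chosen for the example. The Spark-dependent part is kept inside `run_demo` so that the plain Python function can be checked without a Spark installation.

```python
def squared(value):
    """Plain Python function: return the square of a number, or None for nulls."""
    # Spark passes SQL NULLs to Python UDFs as None, so handle them explicitly.
    if value is None:
        return None
    return value * value


def run_demo():
    """Wrap `squared` as a UDF and apply it to a small DataFrame (hypothetical data)."""
    # Imported locally so that `squared` stays usable where PySpark is not installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import IntegerType

    # Assumption: a local session is enough for the sketch; a cluster would differ.
    spark = SparkSession.builder.master("local[*]").appName("udf-sketch").getOrCreate()
    df = spark.createDataFrame([(1,), (2,), (3,)], ["n"])

    # The output data type is declared here; if omitted, udf() assumes StringType.
    squared_udf = udf(squared, IntegerType())

    # Apply the UDF column-wise, much like .apply() on a pandas Series.
    df.withColumn("n_squared", squared_udf(col("n"))).show()
    spark.stop()
```

Calling `run_demo()` on a machine with PySpark installed should display the original column alongside the computed one. Declaring `IntegerType()` matters: if the declared type does not match what the Python function returns, Spark typically yields nulls rather than raising an error.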

2017