Spark User Defined Function (UDF)

From GM-RKB

A Spark User Defined Function (UDF) is a user-defined function that is registered with Apache Spark and can be applied to columns in Spark SQL queries or DataFrame operations.



References

2017

2017vb

  • https://sigdelta.com/blog/scala-spark-udfs-in-python/
    • QUOTE: Many systems based on SQL, including Apache Spark, have User-Defined Functions (UDFs) support. While it is possible to create UDFs directly in Python, it brings a substantial burden on the efficiency of computations. It is because Spark’s internals are written in Java and Scala, thus, run in JVM; see the figure from PySpark’s Confluence page for details.

      Since Spark SQL is really a declarative interface, the actual computations take place mostly in JVM. But if we write and use UDFs in Python, the calls have to be made to Python interpreter, which is a separate process. Thus, there is considerable overhead of doing so, as visible on the above figure.

      The simplest solution to Python UDFs is to use the available functions, which are quite rich. These functions take and return Column, thus, they can be composed to create more complex functions.