Spark now offers predefined functions that can be used in dataframes, and it seems they are highly optimized. My original question was going to be on which is faster, but I did some testing myself and found the spark functions to be about 10 times faster at least in one instance. Does anyone know why this is so, and when would a udf be faster (only for instances that an identical spark function exists)?
Here is my testing code (ran on Databricks community ed):
# UDF vs Spark function from faker import Factory from pyspark.sql.functions import lit, concat fake = Factory.create() fake.seed(4321) # Each entry consists of last_name, first_name, ssn, job, and age (at least 1) from pyspark.sql import Row def fake_entry(): name = fake.name().split() return (name, name, fake.ssn(), fake.job(), abs(2016 - fake.date_time().year) + 1) # Create a helper function to call a function repeatedly def repeat(times, func, *args, **kwargs): for _ in xrange(times): yield func(*args, **kwargs) data = list(repeat(500000, fake_entry)) print len(data) data dataDF = sqlContext.createDataFrame(data, ('last_name', 'first_name', 'ssn', 'occupation', 'age')) dataDF.cache()
concat_s = udf(lambda s: s+ 's') udfData = dataDF.select(concat_s(dataDF.first_name).alias('name')) udfData.count()
spfData = dataDF.select(concat(dataDF.first_name, lit('s')).alias('name')) spfData.count()
Ran both multiple times, the udf usually took about 1.1 - 1.4 s, and the Spark
concat function always took under 0.15 s.