Asked  6 Months ago    Answers:  5   Viewed   242 times

I have a date pyspark dataframe with a string column in the format of MM-dd-yyyy and I am attempting to convert this into a date column.

I tried:

df.select(to_date(df.STRING_COLUMN).alias('new_date')).show()

and I get a string of nulls. Can anyone help?

 Answers

63

Update (1/10/2018):

For Spark 2.2+ the best way to do this is probably using the to_date or to_timestamp functions, which both support the format argument. From the docs:

>>> from pyspark.sql.functions import to_timestamp
>>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
>>> df.select(to_timestamp(df.t, 'yyyy-MM-dd HH:mm:ss').alias('dt')).collect()
[Row(dt=datetime.datetime(1997, 2, 28, 10, 30))]

Original Answer (for Spark < 2.2)

It is possible (preferrable?) to do this without a udf:

from pyspark.sql.functions import unix_timestamp, from_unixtime

df = spark.createDataFrame(
    [("11/25/1991",), ("11/24/1991",), ("11/30/1991",)], 
    ['date_str']
)

df2 = df.select(
    'date_str', 
    from_unixtime(unix_timestamp('date_str', 'MM/dd/yyy')).alias('date')
)

print(df2)
#DataFrame[date_str: string, date: timestamp]

df2.show(truncate=False)
#+----------+-------------------+
#|date_str  |date               |
#+----------+-------------------+
#|11/25/1991|1991-11-25 00:00:00|
#|11/24/1991|1991-11-24 00:00:00|
#|11/30/1991|1991-11-30 00:00:00|
#+----------+-------------------+
Tuesday, June 1, 2021
 
muncherelli
answered 6 Months ago
96

Pass it to the Date constructor.

> var date = new Date('2012-04-15T18:06:08-07:00')
> date
  Mon Apr 16 2012 04:06:08 GMT+0300 (EEST)

For more information about Date, check https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Date.

Thursday, July 22, 2021
 
footy
answered 5 Months ago
28

Your logic condition is wrong. IIUC, what you want is:

import pyspark.sql.functions as f

df.filter((f.col('d')<5))
    .filter(
        ((f.col('col1') != f.col('col3')) | 
         (f.col('col2') != f.col('col4')) & (f.col('col1') == f.col('col3')))
    )
    .show()

I broke the filter() step into 2 calls for readability, but you could equivalently do it in one line.

Output:

+----+----+----+----+---+
|col1|col2|col3|col4|  d|
+----+----+----+----+---+
|   A|  xx|   D|  vv|  4|
|   A|   x|   A|  xx|  3|
|   E| xxx|   B|  vv|  3|
|   F|xxxx|   F| vvv|  4|
|   G| xxx|   G|  xx|  4|
+----+----+----+----+---+
Sunday, August 1, 2021
 
Keat
answered 4 Months ago
74

Somehow pyspark is unable to load the http or https, one of my colleague found the answer for this so here is the solution,

before creating the spark context and sql context we need to load this two line of code

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.4.1 pyspark-shell'

after creating the sparkcontext and sqlcontext from sc = pyspark.SparkContext.getOrCreate and sqlContext = SQLContext(sc)

add the http or https url into the sc by using sc.addFile(url)

Data_XMLFile = sqlContext.read.format("xml").options(rowTag="anytaghere").load(pyspark.SparkFiles.get("*_public.xml")).coalesce(10).cache()

this solution worked for me

Tuesday, August 24, 2021
 
iammichael
answered 3 Months ago
24

Probably your default locale doesn't support English months in MMM. For example in Poland MMM supports "styczeń" but not "Jan" or "January"

To change this In SimpleDateFormat you need to set locale which supports months written in English, for example

new SimpleDateFormat("EEE MMM dd HH:mm:ss z yyyy", Locale.ENGLISH);
Saturday, August 28, 2021
 
bancer
answered 3 Months ago
Only authorized users can answer the question. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :  
Share