Asked  7 Months ago    Answers:  5   Viewed   34 times

I have a data frame. Let's call him bob:

> head(bob)
                 phenotype                         exclusion
GSM399350 3- 4- 8- 25- 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-
GSM399351 3- 4- 8- 25- 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-
GSM399352 3- 4- 8- 25- 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-
GSM399353 3- 4- 8- 25+ 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-
GSM399354 3- 4- 8- 25+ 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-
GSM399355 3- 4- 8- 25+ 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-

I'd like to concatenate the rows of this data frame (this will be another question). But look:

> class(bob$phenotype)
[1] "factor"

Bob's columns are factors. So, for example:

> as.character(head(bob))
[1] "c(3, 3, 3, 6, 6, 6)"       "c(3, 3, 3, 3, 3, 3)"      
[3] "c(29, 29, 29, 30, 30, 30)"

I don't begin to understand this, but I guess these are indices into the levels of the factors of the columns (of the court of king caractacus) of bob? Not what I need.

Strangely I can go through the columns of bob by hand, and do

bob$phenotype <- as.character(bob$phenotype)

which works fine. And, after some typing, I can get a data.frame whose columns are characters rather than factors. So my question is: how can I do this automatically? How do I convert a data.frame with factor columns into a data.frame with character columns without having to manually go through each column?

Bonus question: why does the manual approach work?

 Answers

76

Just following on from Matt and Dirk. If you want to recreate your existing data frame without changing the global option, you can rebuild it with an lapply call:

bob <- data.frame(lapply(bob, as.character), stringsAsFactors=FALSE)

This converts all variables to class "character". If you want to convert only the factors, see Marek's solution below.

As @hadley points out, the following is more concise.

bob[] <- lapply(bob, as.character)

In both cases, lapply outputs a list. In the second case, however, assigning into bob[] replaces the columns of the existing object rather than the object itself, so bob keeps its data.frame class. That removes the need to convert the list back into a data frame with data.frame(..., stringsAsFactors = FALSE).
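
As a rough illustration of the points above, here is a minimal sketch on a small made-up data frame. The data, the extra numeric column n, and the helper name is_fac are assumptions for the example, and the factor-only variant is just one common way to do it, not necessarily the solution referred to above.

```r
# Hypothetical data: two factor columns and one numeric column
bob <- data.frame(
  phenotype = factor(c("3- 4- 8- 25- 44+", "3- 4- 8- 25+ 44+")),
  exclusion = factor(c("11b- 11c- 19-", "11b- 11c- 19-")),
  n = c(1.5, 2.5)
)

# Convert every column to character; assigning into [] keeps the data.frame class
all_chr <- bob
all_chr[] <- lapply(all_chr, as.character)

# Convert only the factor columns and leave the others untouched
is_fac <- vapply(bob, is.factor, logical(1))
only_fac <- bob
only_fac[is_fac] <- lapply(only_fac[is_fac], as.character)

str(only_fac)
```

Note that the first variant also turns the numeric column n into character, which is why the factor-only form is usually the safer choice when the data frame has mixed column types.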

Tuesday, June 1, 2021
 
smiggle
answered 7 Months ago
49

You can do this with a combination of explode and pivot:

import pyspark.sql.functions as F

# explode to get "long" format
df=df.withColumn('exploded', F.explode('Q'))

# get the name and the value in separate columns
df=df.withColumn('name', F.col('exploded').getItem(0))
df=df.withColumn('value', F.col('exploded').getItem(1))

# now pivot
df.groupby('Id').pivot('name').agg(F.max('value')).na.fill(0)
Wednesday, August 11, 2021
 
tim_d
answered 4 Months ago
17

The reason is that you assigned the two-column matrix output by apply to a single new column, so the result is a matrix stored in one column. You can convert it back to a normal data.frame with

 do.call(data.frame, df)

A more straightforward method is to assign two columns. I use lapply instead of apply because the columns may be of different classes: apply coerces its input to a matrix, so with mixed classes every column ends up as 'character'. lapply, in contrast, returns its output as a list and preserves the class of each column.

df[paste0('new.letters', names(df)[2:3])] <- lapply(df[2:3], fun.split)
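
The behaviour described above can be reproduced on a small made-up data frame. This is only a sketch: the data are invented, and toupper stands in for the original fun.split, which is not shown here.

```r
# Hypothetical data frame with columns of different classes
df <- data.frame(id = 1:2, x = c(1.5, 2.5), y = c("a", "b"),
                 stringsAsFactors = FALSE)

# apply() coerces its input to a matrix, so mixed classes become character
m <- apply(df[2:3], 2, toupper)
is.matrix(m)
typeof(m)

# Assigning that matrix to one name stores a matrix inside a single column
df$new <- m
ncol(df)

# do.call(data.frame, df) expands the matrix column into ordinary columns
flat <- do.call(data.frame, df)
ncol(flat)

# lapply() works column by column, so each column keeps its own class
cls <- lapply(df[2:3], class)
```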
Friday, August 13, 2021
 
Baba
answered 4 Months ago
74

You're trying to compare a scalar with the entire series, which raises the ValueError you saw. A simple method would be to cast the boolean series to int:

In [84]:
df['viz'] = (df['viz'] !='n').astype(int)
df

Out[84]:
   viz  a1_count  a1_mean     a1_std
0    0         3        2   0.816497
1    1         0      NaN        NaN
2    0         2       51  50.000000

You can also use np.where:

In [86]:
df['viz'] = np.where(df['viz'] == 'n', 0, 1)
df

Out[86]:
   viz  a1_count  a1_mean     a1_std
0    0         3        2   0.816497
1    1         0      NaN        NaN
2    0         2       51  50.000000

Output from the boolean comparison:

In [89]:
df['viz'] !='n'

Out[89]:
0    False
1     True
2    False
Name: viz, dtype: bool

And then casting to int:

In [90]:
(df['viz'] !='n').astype(int)

Out[90]:
0    0
1    1
2    0
Name: viz, dtype: int32
Saturday, September 18, 2021
 
wavyGravy
answered 3 Months ago
75

Based on the OP's clarification, it could be

out <- reshape(dat[setdiff(names(dat), 'item_type')], idvar = c('person_id', 'gender'), direction = 'wide', timevar = 'item_id')
dim(out)
#[1]  2000 16006

out[1:3, c(1:3, 16000:16006)]
#   person_id gender item_trans.1 item_trans.15998 item_trans.15999 item_trans.16000 item_trans.16001 item_trans.16002 item_trans.16003
#1          1   MALE     5.091636               NA               NA               NA               NA               NA               NA
#32         2   MALE           NA               NA               NA               NA               NA               NA               NA
#64         3 FEMALE           NA               NA               NA               NA               NA               NA               NA
#   item_trans.16004
#1                NA
#32               NA
#64               NA
Saturday, December 4, 2021
 
Tushar Garg
answered 3 Days ago