Asked  6 Months ago    Answers:  5   Viewed   59 times

How to replace values in a Pandas series s via a dictionary d has been asked and re-asked many times.

The recommended method (1, 2, 3, 4) is to either use s.replace(d) or, occasionally, use s.map(d) if all your series values are found in the dictionary keys.

However, s.replace is often unreasonably slow: 5-10x slower than a simple list comprehension.

The alternative, s.map(d), performs well, but is only recommended when every value in the series has a key in the dictionary.
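
To make the difference concrete, here is a minimal sketch with a made-up series and dictionary (neither is from the benchmark). Values without a key come back as NaN from map, while replace leaves them untouched:

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3])
d = {0: 10, 1: 11}          # keys 2 and 3 are deliberately missing

mapped = s.map(d)           # unmapped values become NaN (dtype is upcast to float)
replaced = s.replace(d)     # unmapped values are left as they were

print(mapped.tolist())
print(replaced.tolist())
```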

Why is s.replace so slow and how can performance be improved?

import pandas as pd, numpy as np

df = pd.DataFrame({'A': np.random.randint(0, 1000, 1000000)})
lst = df['A'].values.tolist()

##### TEST 1 #####

d = {i: i+1 for i in range(1000)}

%timeit df['A'].replace(d)                          # 1.98s
%timeit [d[i] for i in lst]                         # 134ms

##### TEST 2 #####

d = {i: i+1 for i in range(10)}

%timeit df['A'].replace(d)                          # 20.1ms
%timeit [d.get(i, i) for i in lst]                  # 243ms

Note: This question is not marked as a duplicate because it is looking for specific advice on when to use different methods given different datasets. This is explicit in the answer and is an aspect not usually addressed in other questions.

 Answers

80

One trivial solution is to choose a method dependent on an estimate of how completely values are covered by dictionary keys.

General case

  • Use df['A'].map(d) if every value is mapped; or
  • Use df['A'].map(d).fillna(df['A']).astype(int) if more than ~5% of values are mapped.

Few values (e.g. < 5%) mapped by d

  • Use df['A'].replace(d)

The "crossover point" of ~5% is specific to the benchmarking below.

Interestingly, a simple list comprehension generally underperforms map in either scenario.
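
The rule of thumb above could be wrapped in a small helper. This is only a sketch: replace_with_dict and its threshold argument are hypothetical names, not part of pandas, and the 5% default simply mirrors the crossover seen in this benchmark. It also assumes the replacement values share the dtype of the series and that d contains no NaN values.

```python
import pandas as pd

def replace_with_dict(s, d, threshold=0.05):
    """Replace values of s using d, leaving unmapped values unchanged.

    `threshold` is a hypothetical tuning knob, not a pandas parameter.
    """
    # Fraction of values in s that have a key in d.
    covered = s.isin(list(d.keys())).mean()
    if covered < threshold:
        # Few values are mapped: replace was faster in the benchmark.
        return s.replace(d)
    # Many values are mapped: map, then restore the values that had no key.
    # Assumes the replacements have the same dtype as s.
    return s.map(d).fillna(s).astype(s.dtype)
```

Note that computing the coverage is itself a full pass over the data, so on large inputs it may be worth estimating it from a sample instead.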

Benchmarking

import pandas as pd, numpy as np

df = pd.DataFrame({'A': np.random.randint(0, 1000, 1000000)})
lst = df['A'].values.tolist()

##### TEST 1 - Full Map #####

d = {i: i+1 for i in range(1000)}

%timeit df['A'].replace(d)                          # 1.98s
%timeit df['A'].map(d)                              # 84.3ms
%timeit [d[i] for i in lst]                         # 134ms

##### TEST 2 - Partial Map #####

d = {i: i+1 for i in range(10)}

%timeit df['A'].replace(d)                          # 20.1ms
%timeit df['A'].map(d).fillna(df['A']).astype(int)  # 111ms
%timeit [d.get(i, i) for i in lst]                  # 243ms
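
%timeit is IPython magic. Outside IPython, something like the following sketch with the standard timeit module should give comparable measurements. The function name time_methods and the reduced default row count are my own choices, and the numbers will differ from those quoted above depending on machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

def time_methods(n_rows=100_000, n_keys=1000, repeats=3):
    """Return the best-of-`repeats` wall time in seconds for each approach."""
    rng = np.random.default_rng(0)
    s = pd.Series(rng.integers(0, 1000, n_rows))
    lst = s.tolist()
    d = {i: i + 1 for i in range(n_keys)}

    candidates = {
        "replace": lambda: s.replace(d),
        "map+fillna": lambda: s.map(d).fillna(s).astype(int),
        "list comprehension": lambda: [d.get(i, i) for i in lst],
    }
    return {
        name: min(timeit.repeat(fn, number=1, repeat=repeats))
        for name, fn in candidates.items()
    }

if __name__ == "__main__":
    for name, seconds in time_methods().items():
        print(f"{name:20s} {seconds * 1000:8.1f} ms")
```

Lowering n_keys (for example to 10) reproduces the partial-map scenario.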

Explanation

The reason why s.replace is so slow is that it does much more than simply map a dictionary. It deals with some edge cases and arguably rare situations, which typically merit more care in any case.

This is an excerpt from replace() in pandas\generic.py.

items = list(compat.iteritems(to_replace))
keys, values = zip(*items)
are_mappings = [is_dict_like(v) for v in values]

if any(are_mappings):
    # handling of nested dictionaries
else:
    to_replace, value = keys, values

return self.replace(to_replace, value, inplace=inplace,
                    limit=limit, regex=regex)

There appear to be many steps involved:

  • Converting dictionary to a list.
  • Iterating through list and checking for nested dictionaries.
  • Feeding an iterator of keys and values into a replace function.

This can be compared to much leaner code from map() in pandas\series.py:

if isinstance(arg, (dict, Series)):
    if isinstance(arg, dict):
        arg = self._constructor(arg, index=arg.keys())

    indexer = arg.index.get_indexer(values)
    new_values = algos.take_1d(arg._values, indexer)
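
To see why this is cheap, the same idea can be sketched with public APIs only. algos.take_1d is a pandas internal, so plain NumPy indexing stands in for it here, and the data is made up for illustration:

```python
import numpy as np
import pandas as pd

values = np.array([2, 0, 5, 1])        # data to translate
d = {0: 10, 1: 11, 2: 12}              # 5 has no key

lookup = pd.Series(d)                  # index holds the keys, values the replacements
# Position of each value among the keys, or -1 where there is no matching key.
indexer = lookup.index.get_indexer(values)

# Take replacements by position, then blank out the misses with NaN.
result = np.where(indexer == -1, np.nan, lookup.to_numpy()[indexer])
print(result)
```

The whole translation is one index lookup plus one positional take, with no per-key pass over the data.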
Tuesday, June 1, 2021
 
jerrygarciuh
answered 6 Months ago
73

You can perform this task by forming a |-separated string. This works because pd.Series.str.replace accepts regex:

Replace occurrences of pattern/regex in the Series/Index with some other string. Equivalent to str.replace() or re.sub().

This avoids the need to create a dictionary.

import pandas as pd

df = pd.DataFrame({'A': ['LOCAL TEST', 'TEST FOREIGN', 'ANOTHER HELLO', 'NOTHING']})

pattern = '|'.join(['LOCAL', 'FOREIGN', 'HELLO'])

# regex=True is required in pandas >= 2.0, where the default is regex=False
df['A'] = df['A'].str.replace(pattern, 'CORP', regex=True)

#               A
# 0     CORP TEST
# 1     TEST CORP
# 2  ANOTHER CORP
# 3       NOTHING
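
One caveat: if the search terms can contain regex metacharacters, joining them with | as-is will misbehave. A possible variation, with hypothetical terms chosen to show the problem, escapes each term first using re.escape:

```python
import re

import pandas as pd

df = pd.DataFrame({'A': ['LOCAL TEST', 'PRICE (USD)', 'C++ HELLO']})

# Hypothetical terms containing regex metacharacters: parentheses and plus signs.
terms = ['LOCAL', '(USD)', 'C++']
pattern = '|'.join(re.escape(t) for t in terms)

df['A'] = df['A'].str.replace(pattern, 'CORP', regex=True)
print(df)
```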
Wednesday, July 28, 2021
 
kinske
answered 4 Months ago
80

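pd.DataFrame objects don't have a map method that looks up a dictionary across several columns (DataFrame.map, added in pandas 2.1, applies a function elementwise). You can instead construct an index from two columns and use pd.Index.map with a function: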

df_a['deleted'] = df_a.set_index(['number', 'code']).index.map(d.get)
df_a['deleted'] = df_a['deleted'].fillna('none')

Compatibility note

For Pandas versions >0.25, you can use pd.Index.map directly with a dictionary, i.e. use d instead of d.get.

For prior versions, we use d.get instead of d because, unlike pd.Series.map, pd.Index.map does not accept a dictionary directly. But it can accept a function such as dict.get. Note also we split apart the fillna operation as pd.Index.map returns an array rather than a series.
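
For reference, a self-contained version of the above might look like this. The contents of df_a and d are invented for illustration; only the column names come from the snippet above.

```python
import pandas as pd

df_a = pd.DataFrame({'number': [1, 2, 3], 'code': ['a', 'b', 'c']})
d = {(1, 'a'): 'yes', (3, 'c'): 'no'}   # keyed by (number, code) tuples

# The MultiIndex built from both columns iterates as (number, code) tuples.
key_index = df_a.set_index(['number', 'code']).index

df_a['deleted'] = key_index.map(d.get)             # missing keys give None/NaN
df_a['deleted'] = df_a['deleted'].fillna('none')   # fill the misses
print(df_a)
```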

Saturday, August 14, 2021
 
maelgrove
answered 4 Months ago
49

You could simplify the function as shown:

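import numpy as np

def streaks(df, col):
    # +1 for positive values, -1 for negative values
    sign = np.sign(df[col])
    # running count within each run of consecutive equal signs
    s = sign.groupby((sign!=sign.shift()).cumsum()).cumsum()
    return df.assign(u_streak=s.where(s>0, 0.0), d_streak=s.where(s<0, 0.0).abs())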

Using it:

streaks(df, 'E')


First, compute the sign of each cell in the column under consideration using np.sign. This assigns +1 to positive numbers and -1 to negative ones.

Next, identify runs of adjacent values with the same sign (comparing each cell with the previous one) using sign!=sign.shift(), and take its cumulative sum, which serves as the grouping key.

Perform a groupby with this key and again take the cumulative sum within each group.

Finally, assign the positive cumsum values to u_streak and the absolute values of the negative ones to d_streak.
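
The intermediate values are easier to follow on a tiny made-up column (the numbers below are illustrative, not the data from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'E': [1, 2, -1, -3, -2, 4]})

sign = np.sign(df['E'])                     # 1, 1, -1, -1, -1, 1
run_id = (sign != sign.shift()).cumsum()    # 1, 1, 2, 2, 2, 3 (new id at each sign change)
s = sign.groupby(run_id).cumsum()           # 1, 2, -1, -2, -3, 1

u_streak = s.where(s > 0, 0.0)              # length of the current up streak, else 0
d_streak = s.where(s < 0, 0.0).abs()        # length of the current down streak, else 0
print(df.assign(u_streak=u_streak, d_streak=d_streak))
```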

Monday, August 23, 2021
 
Sidharth Mudgal
answered 3 Months ago
69

Use Series.str.replace and Series.astype

df = pd.Series(['2$-32$-4','123$-12','00123','44'])
df.str.replace(r'\$-', '0', regex=True).astype(float)

0    203204.0
1    123012.0
2       123.0
3        44.0
dtype: float64
Saturday, October 2, 2021
 
Anders Andersen
answered 2 Months ago