
This is a self-answered post. Below I outline a common problem in the NLP domain and propose a few performant methods to solve it.

Oftentimes the need arises to remove punctuation during text cleaning and pre-processing. Punctuation is defined as any character in string.punctuation:

>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

This is a common enough problem and has been asked before ad nauseam. The most idiomatic solution uses pandas str.replace. However, for situations which involve a lot of text, a more performant solution may need to be considered.

What are some good, performant alternatives to str.replace when dealing with hundreds of thousands of records?

 Answers


Setup

For the purpose of demonstration, let's consider this DataFrame.

import pandas as pd

df = pd.DataFrame({'text':['a..b?!??', '%hgh&12','abc123!!!', '$$$1234']})
df
        text
0   a..b?!??
1    %hgh&12
2  abc123!!!
3    $$$1234

Below, I list the alternatives, one by one, in increasing order of performance.

str.replace

This option is included to establish the default method as a benchmark for comparing other, more performant solutions.

This uses pandas in-built str.replace function which performs regex-based replacement.

df['text'] = df['text'].str.replace(r'[^\w\s]+', '', regex=True)

df
     text
0      ab
1   hgh12
2  abc123
3    1234

This is very easy to code, and is quite readable, but slow.


regex.sub

This involves using the sub function from the re library. Pre-compile a regex pattern for performance, and call regex.sub inside a list comprehension. If you can spare some memory, convert df['text'] to a list beforehand; you'll get a nice little performance boost out of it.

import re
p = re.compile(r'[^\w\s]+')
df['text'] = [p.sub('', x) for x in df['text'].tolist()]

df
     text
0      ab
1   hgh12
2  abc123
3    1234

Note: If your data has NaN values, this (as well as the next method below) will not work as is. See the section on "Other Considerations".


str.translate

Python's str.translate function is implemented in C, and is therefore very fast.

How this works is:

  1. First, join all your strings together to form one huge string, using a separator (a single character or a longer substring) of your choice. You must use a character/substring that you can guarantee does not appear anywhere in your data.
  2. Perform str.translate on the large string, removing punctuation (the separator from step 1 excluded).
  3. Split the string on the separator that was used to join in step 1. The resultant list must have the same length as your initial column.

In this example, we use the pipe | as the separator. If your data contains the pipe, you must choose another separator.

import string

punct = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~'   # `|` is not present here
transtab = str.maketrans(dict.fromkeys(punct, ''))

df['text'] = '|'.join(df['text'].tolist()).translate(transtab).split('|')

df
     text
0      ab
1   hgh12
2  abc123
3    1234

Performance

str.translate performs the best, by far. Note that the graph below includes another variant Series.str.translate from MaxU's answer.

(Interestingly, I reran this a second time, and the results are slightly different from before. During the second run, it seems re.sub was winning out over str.translate for really small amounts of data.)

[Benchmark plot: relative runtime vs. N on a log-log scale for pd_replace, re_sub, translate, and pd_translate.]

There is an inherent risk involved with using translate (particularly, the problem of automating the process of deciding which separator to use is non-trivial), but the trade-offs are worth the risk.
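One way to reduce that risk (a rough sketch; pick_separator is a hypothetical helper, not a library function) is to scan the data for a candidate character that never occurs:

# Hypothetical helper: find a join separator guaranteed absent from the data.
def pick_separator(strings, candidates='|\x00\x01\x02'):
    joined = ''.join(strings)
    for sep in candidates:
        if sep not in joined:
            return sep
    raise ValueError("No safe separator found among the candidates.")

sep = pick_separator(df['text'].tolist())   # '|' for the sample data above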


Other Considerations

Handling NaNs with list comprehension methods: Note that the re.sub and str.translate methods above will only work as long as your data does not have NaNs. When handling NaNs, you will have to determine the indices of non-null values and replace those only. Try something like this:

import numpy as np

df = pd.DataFrame({'text': [
    'a..b?!??', np.nan, '%hgh&12','abc123!!!', '$$$1234', np.nan]})

idx = np.flatnonzero(df['text'].notna())
col_idx = df.columns.get_loc('text')
df.iloc[idx,col_idx] = [
    p.sub('', x) for x in df.iloc[idx,col_idx].tolist()]

df
     text
0      ab
1     NaN
2   hgh12
3  abc123
4    1234
5     NaN

Dealing with DataFrames: If you are dealing with DataFrames where every column requires replacement, the procedure is simple:

v = pd.Series(df.values.ravel())
df[:] = translate(v).values.reshape(df.shape)

Or,

v = df.stack()
v[:] = translate(v)
df = v.unstack()

Note that the translate function used here is defined below, with the benchmarking code.

Every solution has tradeoffs, so deciding what solution best fits your needs will depend on what you're willing to sacrifice. Two very common considerations are performance (which we've already seen), and memory usage. str.translate is a memory-hungry solution, so use with caution.

Another consideration is the complexity of your regex. Sometimes, you may want to remove anything that is not alphanumeric or whitespace. Other times, you will need to retain certain characters, such as hyphens, colons, and sentence terminators [.!?]. Specifying these explicitly adds complexity to your regex, which may in turn impact the performance of these solutions. Make sure you test these solutions on your data before deciding what to use.
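As a rough sketch (the exact character class is an assumption; adapt it to your own requirements), a pattern that keeps hyphens, colons, and sentence terminators might look like this:

import re

# Keep word characters, whitespace, '.', '!', '?', ':' and '-'; strip everything else.
p = re.compile(r'[^\w\s.!?:-]+')
p.sub('', 'e.g.: re-use this, or not!')   # 'e.g.: re-use this or not!'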

Lastly, unicode characters will be removed with this solution. You may want to tweak your regex (if using a regex-based solution), or just go with str.translate otherwise.
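If keeping non-ASCII letters matters, one possible tweak (an assumption about your requirements, not part of the original solutions) is to build the character class from string.punctuation so that only ASCII punctuation is stripped:

import re
import string

# Only the characters in string.punctuation are removed; 'é' survives.
p = re.compile('[{}]+'.format(re.escape(string.punctuation)))
p.sub('', 'café!!!')   # 'café'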

For even more performance (for larger N), take a look at this answer by Paul Panzer.


Appendix

Functions

def pd_replace(df):
    return df.assign(text=df['text'].str.replace(r'[^\w\s]+', '', regex=True))


def re_sub(df):
    p = re.compile(r'[^\w\s]+')
    return df.assign(text=[p.sub('', x) for x in df['text'].tolist()])

def translate(df):
    punct = string.punctuation.replace('|', '')
    transtab = str.maketrans(dict.fromkeys(punct, ''))

    return df.assign(
        text='|'.join(df['text'].tolist()).translate(transtab).split('|')
    )

# MaxU's version (https://stackoverflow.com/a/50444659/4909087)
def pd_translate(df):
    punct = string.punctuation.replace('|', '')
    transtab = str.maketrans(dict.fromkeys(punct, ''))

    return df.assign(text=df['text'].str.translate(transtab))

Performance Benchmarking Code

from timeit import timeit

import re
import string

import pandas as pd
import matplotlib.pyplot as plt

res = pd.DataFrame(
       index=['pd_replace', 're_sub', 'translate', 'pd_translate'],
       columns=[10, 50, 100, 500, 1000, 5000, 10000, 50000],
       dtype=float
)

for f in res.index: 
    for c in res.columns:
        l = ['a..b?!??', '%hgh&12','abc123!!!', '$$$1234'] * c
        df = pd.DataFrame({'text' : l})
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=30)

ax = res.div(res.min()).T.plot(loglog=True) 
ax.set_xlabel("N"); 
ax.set_ylabel("time (relative)");

plt.show()
(answered by sohum)

Use HDF5. It beats writing flat files hands down, and you can query it. The docs are in the HDF5 (PyTables) section of the pandas IO documentation.

Here's a performance comparison vs SQL, updated to show SQL / HDF_fixed / HDF_table / CSV write and read performance.

The docs now include a performance section comparing the IO formats; see the pandas documentation.
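A minimal sketch of the round trip (the file name, key, and where clause are assumptions; format='table' requires PyTables and is what enables querying):

import pandas as pd

df = pd.DataFrame({'a': range(1000), 'text': ['foo'] * 1000})

# Write in the queryable 'table' format; data_columns allows where-clauses on 'a'.
df.to_hdf('store.h5', key='df', mode='w', format='table', data_columns=['a'])

# Read back only the rows matching the query, instead of the whole file.
sub = pd.read_hdf('store.h5', key='df', where='a < 100')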

(answered by mgierw)

NumPy's numpy.add.at and pandas.factorize

This is intended to be fast. However, I tried to organize it to be readable as well.
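The snippets below assume a DataFrame with name and color columns; here is a hypothetical setup (not from the original question) that is consistent with the outputs shown:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name':  ['John', 'John', 'Tom', 'Tom', 'Jerry', 'John', 'Tom'],
    'color': ['White', 'Blue', 'Blue', 'Blue', 'Black', 'White', 'White']
})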

i, r = pd.factorize(df.name)
j, c = pd.factorize(df.color)
n, m = len(r), len(c)

b = np.zeros((n, m), dtype=np.int64)

np.add.at(b, (i, j), 1)
pd.Series(c[b.argmax(1)], r)

John     White
Tom       Blue
Jerry    Black
dtype: object

groupby, size, and idxmax

df.groupby(['name', 'color']).size().unstack().idxmax(1)

name
Jerry    Black
John     White
Tom       Blue
dtype: object


Counter

¯\_(ツ)_/¯

from collections import Counter

df.groupby('name').color.apply(lambda c: Counter(c).most_common(1)[0][0])

name
Jerry    Black
John     White
Tom       Blue
Name: color, dtype: object
(answered by Lasse Edsvik)

The case argument is actually a convenience as an alternative to specifying flags=re.IGNORECASE. It has no bearing on replacement if the replacement is not regex-based.

So, when regex=True, these are your possible choices:

pd.Series('Jr. eng').str.replace(r'jr.', 'jr', regex=True, case=False)
# pd.Series('Jr. eng').str.replace(r'jr.', 'jr', case=False)

0    jr eng
dtype: object

Or,

pd.Series('Jr. eng').str.replace(r'jr.', 'jr', regex=True, flags=re.IGNORECASE)
# pd.Series('Jr. eng').str.replace(r'jr.', 'jr', flags=re.IGNORECASE)

0    jr eng
dtype: object

You can also get cheeky and bypass both keyword arguments by incorporating the case-insensitivity flag (?i) into the pattern itself. See:

pd.Series('Jr. eng').str.replace(r'(?i)jr.', 'jr', regex=True)
0    jr eng
dtype: object

Note
You will need to escape the period . in regex mode, because the unescaped dot is a meta-character with a different meaning (match any character). If you want to dynamically escape meta-chars in patterns, you can use re.escape.
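A quick sketch of re.escape in this context (the pattern and input are just illustrations):

import re
import pandas as pd

re.escape('jr.')   # 'jr\\.' -- the dot is now matched literally

pd.Series('Jr. eng').str.replace(re.escape('jr.'), 'jr', regex=True, case=False)
# 0    jr eng
# dtype: object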

For more information on flags and anchors, see this section of the docs and re HOWTO.


From the source code, it is clear that the "case" argument is ignored if regex=False. See

# Check whether repl is valid (GH 13438, GH 15055)
if not (is_string_like(repl) or callable(repl)):
    raise TypeError("repl must be a string or callable")

is_compiled_re = is_re(pat)
if regex:
    if is_compiled_re:
        if (case is not None) or (flags != 0):
            raise ValueError("case and flags cannot be set"
                             " when pat is a compiled regex")
    else:
        # not a compiled regex
        # set default case
        if case is None:
            case = True

        # add case flag, if provided
        if case is False:
            flags |= re.IGNORECASE
    if is_compiled_re or len(pat) > 1 or flags or callable(repl):
        n = n if n >= 0 else 0
        compiled = re.compile(pat, flags=flags)
        f = lambda x: compiled.sub(repl=repl, string=x, count=n)
    else:
        f = lambda x: x.replace(pat, repl, n)

You can see that the case argument is only checked inside the `if regex:` branch.

In other words, the only way to have case respected is to ensure regex=True, so that the replacement is regex-based.

(answered by Nick)

There are two solutions:

  1. The df.col.apply method is more straightforward but also a little bit slower:

    In [1]: import pandas as pd
    
    In [2]: import re
    
    In [3]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':['admin', 'aa', 'bb', 'c_admin_d', 'ee_admin']})
    
    In [4]: df
    Out[4]: 
       col1       col2
    0     1      admin
    1     2         aa
    2     3         bb
    3     4  c_admin_d
    4     5   ee_admin
    
    In [5]: r = re.compile(r'.*(admin).*')
    
    In [6]: df.col2.apply(lambda x: bool(r.match(x)))
    Out[6]: 
    0     True
    1    False
    2    False
    3     True
    4     True
    Name: col2, dtype: bool
    
    In [7]: %timeit -n 100000 df.col2.apply(lambda x: bool(r.match(x)))
    167 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

  2. The np.vectorize method requires importing numpy, but it's more efficient (about 4 times faster in my timeit test).

    In [1]: import numpy as np
    
    In [2]: import pandas as pd
    
    In [3]: import re
    
    In [4]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':['admin', 'aa', 'bb', 'c_admin_d', 'ee_admin']})
    
    In [5]: df
    Out[5]: 
       col1       col2
    0     1      admin
    1     2         aa
    2     3         bb
    3     4  c_admin_d
    4     5   ee_admin
    
    In [6]: r = re.compile(r'.*(admin).*')
    
    In [7]: regmatch = np.vectorize(lambda x: bool(r.match(x)))
    
    In [8]: regmatch(df.col2.values)
    Out[8]: array([ True, False, False,  True,  True])
    
    In [9]: %timeit -n 100000 regmatch(df.col2.values)
    43.4 µs ± 362 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

Since you have changed your question to check any cell, and are also concerned about time efficiency:

# if you want to check all columns no matter what `dtypes` they are
dfs = df.astype(str, copy=True, errors='raise')
regmatch(dfs.values) # This will return a 2-d array of booleans
regmatch(dfs.values).any() # For existence.

You can still use the df.applymap method, but again, it will be slower.

dfs = df.astype(str, copy=True, errors='raise')
r = re.compile(r'.*(admin).*')
dfs.applymap(lambda x: bool(r.match(x))) # This will return a dataframe of booleans.
dfs.applymap(lambda x: bool(r.match(x))).any().any() # For existence.
(answered by PLPeeters)