
I need to filter rows in a pandas DataFrame so that a specific string column contains at least one of a list of provided substrings. The substrings may contain unusual characters or regex metacharacters. The comparison should not involve regex and should be case-insensitive.

For example:

lst = ['kdSj;af-!?', 'aBC+dsfa?-', 'sdKaJg|dksaf-*']

I currently apply the mask like this:

import numpy as np

mask = np.logical_or.reduce([df[col].str.contains(i, regex=False, case=False) for i in lst])
df = df[mask]

My DataFrame is large (~1 million rows) and lst has length 100. Is there a more efficient way? For example, if the first item in lst is found, we should not have to test any subsequent strings for that row.

 Answers

12

If you're sticking to pure pandas, for both performance and practicality I think you should use regex for this task. However, you will need to escape any special characters in the substrings first to ensure that they are matched literally (and not used as regex metacharacters).

This is easy to do using re.escape:

>>> import re
>>> esc_lst = [re.escape(s) for s in lst]

These escaped substrings can then be joined using a regex pipe |. Each of the substrings can be checked against a string until one matches (or they have all been tested).

>>> pattern = '|'.join(esc_lst)

The masking stage then becomes a single low-level loop through the rows:

df[col].str.contains(pattern, case=False)
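
Putting the pieces together on the list from the question (a minimal self-contained sketch; the sample df contents here are made up for illustration):

import re
import pandas as pd

lst = ['kdSj;af-!?', 'aBC+dsfa?-', 'sdKaJg|dksaf-*']
df = pd.DataFrame({'col': ['xKDSJ;AF-!?x', 'no match here', 'abc+dsfa?-']})

# escape each substring, then join into one alternation pattern
pattern = '|'.join(re.escape(s) for s in lst)

# na=False treats missing values as non-matches instead of propagating NaN
mask = df['col'].str.contains(pattern, case=False, na=False)
df = df[mask]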

Here's a simple setup to get a sense of performance:

import re
from random import randint, seed

import pandas as pd

seed(321)

# 100 substrings of 5 characters
lst = [''.join([chr(randint(0, 256)) for _ in range(5)]) for _ in range(100)]

# 50000 strings of 20 characters
strings = [''.join([chr(randint(0, 256)) for _ in range(20)]) for _ in range(50000)]

col = pd.Series(strings)
esc_lst = [re.escape(s) for s in lst]
pattern = '|'.join(esc_lst)

The proposed method takes about 1 second (so maybe up to 20 seconds for 1 million rows):

%timeit col.str.contains(pattern, case=False)
1 loop, best of 3: 981 ms per loop

The method in the question took approximately 5 seconds using the same input data.

It's worth noting that these times are 'worst case' in the sense that there were no matches (so all substrings were checked). If there are matches, the timing will improve.

Tuesday, June 1, 2021
 
khaverim
answered 6 Months ago
37
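
This answer assumes a DataFrame with a 'year' column and a list-valued 'class' column. Here is a minimal frame consistent with the output tables below (an assumption, since the original data isn't shown):

import numpy as np
import pandas as pd

# hypothetical data, reverse-engineered from the result tables below
df = pd.DataFrame({
    'year': [2001, 2001, 2002, 2003],
    'class': [['A', 'B'], ['A', 'B', 'C'], ['A', 'B', 'C'], ['B', 'C']],
})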

Try unnesting (the helper function is defined at the end of this answer):

s = unnesting(df, ['class'])

Then we build a crosstab of year against class, which yields the same counts as the methods below:

pd.crosstab(s['year'], s['class'])

Method from sklearn

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(df['class']), columns=mlb.classes_, index=df.year).groupby(level=0).sum()
Out[293]: 
      A  B  C
year         
2001  2  2  1
2002  1  1  1
2003  0  1  1

Method of get_dummies

df.set_index('year')['class'].apply(','.join).str.get_dummies(sep=',').groupby(level=0).sum()
Out[297]: 
      A  B  C
year         
2001  2  2  1
2002  1  1  1
2003  0  1  1

def unnesting(df, explode):
    # repeat each row's index once per element of its list
    idx = df.index.repeat(df[explode[0]].str.len())
    # flatten each exploded column into one long column
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    # re-attach the remaining (non-exploded) columns
    return df1.join(df.drop(columns=explode), how='left')
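
On pandas 0.25 or newer, the built-in DataFrame.explode does the same single-column unnesting, so the helper isn't needed (a sketch, not part of the original answer):

# pandas >= 0.25: built-in equivalent of unnesting for one list column
s = df.explode('class')
pd.crosstab(s['year'], s['class'])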
Monday, August 23, 2021
 
newbStudent
answered 4 Months ago
57

You can use str to access the string methods for the column/Series and then slice the strings as normal:

>>> df = pd.DataFrame(['foo', 'bar', 'baz'], columns=['col1'])
>>> df
  col1
0  foo
1  bar
2  baz

>>> df.col1.str[1]
0    o
1    a
2    a
Name: col1, dtype: object

This str attribute also gives you access to a variety of very useful vectorised string methods, many of which are instantly recognisable from Python's own assortment of built-in string methods (split, replace, etc.).
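
For instance, on the same df (a quick illustrative sketch):

>>> df.col1.str.upper()
0    FOO
1    BAR
2    BAZ
Name: col1, dtype: object

>>> df.col1.str.replace('a', 'u')
0    foo
1    bur
2    buz
Name: col1, dtype: object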

Tuesday, August 24, 2021
 
Andrew
answered 3 Months ago
66

Yes it is possible. In the API docs you will find this information:

A filter is uniquely identified by its String representation which is parsed and matched by the MetaFilter.MetaMatcher to determine if the Meta is allowed or not.

The MetaFilter.DefaultMetaMatcher interprets the filter as a sequence of any name-value properties (separated by a space), prefixed by "+" for inclusion and "-" for exclusion. E.g.:

MetaFilter filter = new MetaFilter("+author Mauro -theme smoke testing +map *API -skip"); filter.allow(new Meta(asList("map someAPI")));

The use of the MetaFilter.GroovyMetaMatcher is triggered by the prefix "groovy:" and allows the filter to be interpreted as a Groovy expression.

MetaFilter filter = new MetaFilter("groovy: (a == '11' | a == '22') && b == '33'");

So if you play with the conditions, you should be able to customise your run configuration. Try this example:

mvn clean install -Djbehave.meta.filter="myCustomRunConf:(+product && +action)"

More info about the MetaFilter class is in the API docs: http://jbehave.org/reference/stable/javadoc/core/org/jbehave/core/embedder/MetaFilter.html

Monday, October 25, 2021
 
Rechlay
answered 1 Month ago
78

Another pandas solution with str.split, sum and value_counts:

s = pd.Series(['abc,def,ghi', 'ghi,abc'])

print(pd.Series(s.str.split(',').sum()).value_counts())
abc    2
ghi    2
def    1
dtype: int64

EDIT:

More efficient methods (Series.sum joins the per-row lists one concatenation at a time, which scales badly; the alternatives below avoid that):

import pandas as pd
s = pd.Series(['abc,def,ghi','ghi,abc'])
s = pd.concat([s]*10000).reset_index(drop=True)

In [17]: %timeit pd.Series(s.str.split(',').sum()).value_counts()
1 loops, best of 3: 3.1 s per loop

In [18]: %timeit s.str.split(',', expand=True).stack().value_counts()
10 loops, best of 3: 46.2 ms per loop

In [19]: %timeit pd.DataFrame([ x.split(',') for x in s.tolist() ]).stack().value_counts()
10 loops, best of 3: 22.2 ms per loop

In [20]: %timeit pd.Series([item for sublist in [ x.split(',') for x in s.tolist() ] for item in sublist]).value_counts()
100 loops, best of 3: 16.6 ms per loop
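
On pandas 0.25 or newer, Series.explode offers a comparably concise alternative (not timed in the original answer):

s.str.split(',').explode().value_counts()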
Sunday, November 28, 2021
 
mootymoots
answered 2 Days ago