# Pandas percentage of total with groupby

This is obviously simple, but as a numpy newbie I'm getting stuck.

I have a CSV file that contains 3 columns, the State, the Office ID, and the Sales for that office.

I want to calculate the percentage of sales per office in a given state (total of all percentages in each state is 100%).

``````
import numpy as np
import pandas as pd

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})

df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
``````

This returns:

``````                  sales
state office_id
AZ    2          839507
      4          373917
      6          347225
CA    1          798585
      3          890850
      5          454423
CO    1          819975
      3          202969
      5          614011
WA    2          163942
      4          369858
      6          959285
``````

I can't seem to figure out how to "reach up" to the `state` level of the `groupby` to total up the `sales` for the entire `state` to calculate the fraction.


Paul H's answer is right that you will have to make a second `groupby` object, but you can calculate the percentage in a simpler way -- just `groupby` the `state_office` and divide the `sales` column by its sum. Copying the beginning of Paul H's answer:

``````
# From Paul H
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
# Change: groupby state_office and divide by sum
state_pcts = state_office.groupby(level=0).apply(
    lambda x: 100 * x / float(x.sum()))
``````

Returns:

``````                     sales
state office_id
AZ    2          16.981365
      4          19.250033
      6          63.768601
CA    1          19.331879
      3          33.858747
      5          46.809373
CO    1          36.851857
      3          19.874290
      5          43.273852
WA    2          34.707233
      4          35.511259
      6          29.781508
``````
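
An alternative that avoids `apply`: `transform('sum')` broadcasts each state's total back onto that state's rows of `state_office`, so plain division gives the same percentages. A sketch using the same seeded data as above:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})

# transform('sum') returns a frame aligned with state_office, holding
# each state's total on every one of that state's rows
state_pcts = 100 * state_office / state_office.groupby(level=0).transform('sum')
```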
Tuesday, June 1, 2021


You may use

``````
df['col_b_PY'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+PY)\b")
df['col_c_LG'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+LG)\b")
``````

Or, to extract all matches and join them with a space:

``````
df['col_b_PY'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+PY)\b").unstack().apply(lambda x: ' '.join(x.dropna()), axis=1)
df['col_c_LG'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+LG)\b").unstack().apply(lambda x: ' '.join(x.dropna()), axis=1)
``````

Note you need to use a capturing group in the regex pattern so that `extract` can actually extract the text:

> Extract capture groups in the regex pat as columns in a DataFrame.

Note the `\b` word boundary is necessary to match `PY` / `LG` as a whole word.

Also, if you want to only start a match from a letter, you may revamp the pattern to

``````
r"([a-zA-Z][a-zA-Z'-]*\s+PY)\b"
r"([a-zA-Z][a-zA-Z'-]*\s+LG)\b"
   ^^^^^^^^          ^
``````

where `[a-zA-Z]` matches a letter and `[a-zA-Z'-]*` matches zero or more letters, apostrophes, or hyphens.
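
The difference between the two patterns can be seen with plain `re` on a made-up string (the sample text here is my own, chosen so that one token starts with an apostrophe):

```python
import re

text = "it's easy-going PY and 'quoted PY"

# Original pattern: the first character may be an apostrophe or hyphen
old = re.findall(r"([a-zA-Z'-]+\s+PY)\b", text)
# Revised pattern: the match must start with a letter
new = re.findall(r"([a-zA-Z][a-zA-Z'-]*\s+PY)\b", text)

# old captures the leading apostrophe ("'quoted PY"), new does not
```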

Python 3.7 with Pandas 0.24.2:

``````
pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 500)

df = pd.DataFrame({
    'col_a': ['Python PY is a general-purpose language LG',
              'Programming language LG in Python PY',
              'Its easier LG to understand  PY',
              'The syntax of the language LG is clean PY',
              'Python PY is a general purpose PY language LG']
})
df['col_b_PY'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+PY)\b").unstack().apply(lambda x: ' '.join(x.dropna()), axis=1)
df['col_c_LG'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+LG)\b").unstack().apply(lambda x: ' '.join(x.dropna()), axis=1)
``````

Output:

``````                                           col_a              col_b_PY     col_c_LG
0     Python PY is a general-purpose language LG             Python PY  language LG
1           Programming language LG in Python PY             Python PY  language LG
2                Its easier LG to understand  PY        understand  PY    easier LG
3      The syntax of the language LG is clean PY              clean PY  language LG
4  Python PY is a general purpose PY language LG  Python PY purpose PY  language LG
``````
Thursday, August 5, 2021


You could do something like the following:

``````
target_value = 15
df['max_duration'] = df.groupby('Date')['Duration'].transform('max')
(df.query('max_duration == Duration')
   .assign(dist=lambda df: np.abs(df['Value'] - target_value))
   .assign(min_dist=lambda df: df.groupby('Date')['dist'].transform('min'))
   .query('min_dist == dist')
   .loc[:, ['Date', 'ID']])
``````

Results:

``````        Date ID
4   1/1/2018  e
11  1/2/2018  e
``````
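
For reference, the chain above run end to end on a small hypothetical frame (the column names match the answer; the data itself is made up):

```python
import numpy as np
import pandas as pd

# Hypothetical data with the columns the answer assumes
df = pd.DataFrame({
    'Date': ['1/1/2018'] * 3 + ['1/2/2018'] * 3,
    'ID': ['a', 'b', 'e', 'c', 'd', 'e'],
    'Duration': [5, 9, 9, 4, 7, 7],
    'Value': [10, 20, 16, 30, 14, 15],
})

target_value = 15
# keep only the rows with the longest Duration per Date,
# then, among those, the row whose Value is closest to target_value
df['max_duration'] = df.groupby('Date')['Duration'].transform('max')
result = (df.query('max_duration == Duration')
            .assign(dist=lambda d: np.abs(d['Value'] - target_value))
            .assign(min_dist=lambda d: d.groupby('Date')['dist'].transform('min'))
            .query('min_dist == dist')
            .loc[:, ['Date', 'ID']])
```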
Saturday, August 28, 2021

``````
f = lambda x: x.rolling(2).sum().shift()
df['c'] = df.groupby('a').b.apply(f)

df
``````

Monday, October 11, 2021
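
A self-contained sketch of the rolling-sum idea above, on made-up data (`transform` is used here instead of `apply` so the result stays aligned with the original index regardless of pandas version):

```python
import pandas as pd

# Hypothetical frame: groups in 'a', values in 'b'
df = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'y', 'y'],
                   'b': [1, 2, 3, 4, 5, 6]})

# per group: sum of the previous two rows (NaN where fewer than two exist)
f = lambda x: x.rolling(2).sum().shift()
df['c'] = df.groupby('a')['b'].transform(f)
```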


Use `GroupBy.transform` with a lambda function, then `add_prefix` and `join` the result to the original:

``````f = lambda x: 100 * x / float(x.sum())
``````

Or specify new columns names:

``````
df[['perc_quantity', 'perc_count']] = (df.groupby(['Date', 'tank'])[['quantity', 'count']]
                                         .transform(f))
``````

``````
print (df)
         Date  tank  hose  quantity  count  set   flow  perc_quantity  perc_count
0  01-01-2018     1     1        20    100  211  12.32      33.333333        20.0
1  01-01-2018     1     2        20    200  111  22.32      33.333333        40.0
2  01-01-2018     1     3        20    200  123  42.32      33.333333        40.0
3  02-01-2018     1     1        10    100  211  12.32      33.333333        20.0
4  02-01-2018     1     2        10    200  111  22.32      33.333333        40.0
5  02-01-2018     1     3        10    200  123  42.32      33.333333        40.0
``````
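
The `add_prefix`/`join` variant mentioned at the top of this answer might look like the following sketch, on made-up data matching the columns shown above:

```python
import pandas as pd

# Hypothetical data mirroring the columns in the output above
df = pd.DataFrame({
    'Date': ['01-01-2018'] * 3 + ['02-01-2018'] * 3,
    'tank': [1] * 6,
    'quantity': [20, 20, 20, 10, 10, 10],
    'count': [100, 200, 200, 100, 200, 200],
})

f = lambda x: 100 * x / float(x.sum())
# transform applies f within each (Date, tank) group and keeps the index,
# add_prefix renames the new columns, join appends them to the original
df = df.join(df.groupby(['Date', 'tank'])[['quantity', 'count']]
               .transform(f)
               .add_prefix('perc_'))
```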
Friday, December 3, 2021