
This is obviously simple, but as a numpy newbie I'm getting stuck.

I have a CSV file that contains 3 columns, the State, the Office ID, and the Sales for that office.

I want to calculate the percentage of sales per office in a given state (total of all percentages in each state is 100%).

import numpy as np
import pandas as pd

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})

df.groupby(['state', 'office_id']).agg({'sales': 'sum'})

This returns:

                  sales
state office_id        
AZ    2          839507
      4          373917
      6          347225
CA    1          798585
      3          890850
      5          454423
CO    1          819975
      3          202969
      5          614011
WA    2          163942
      4          369858
      6          959285

I can't seem to figure out how to "reach up" to the state level of the groupby to total up the sales for the entire state to calculate the fraction.

 Answers


Paul H's answer is right that you will have to make a second groupby object, but you can calculate the percentage in a simpler way -- just groupby the state_office and divide the sales column by its sum. Copying the beginning of Paul H's answer:

# From Paul H
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
# Change: groupby state_office and divide by sum
state_pcts = state_office.groupby(level=0).apply(lambda x:
                                                 100 * x / float(x.sum()))

Returns:

                     sales
state office_id           
AZ    2          16.981365
      4          19.250033
      6          63.768601
CA    1          19.331879
      3          33.858747
      5          46.809373
CO    1          36.851857
      3          19.874290
      5          43.273852
WA    2          34.707233
      4          35.511259
      6          29.781508
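As a side note (a sketch, not from the original answers), the same percentages can be computed without `apply` at all: divide `state_office` by a per-state `transform('sum')`, which broadcasts each state's total back to its offices.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})

# transform('sum') returns a frame aligned with state_office, holding each
# state's total on every one of its rows, so the division lines up row by row
state_pcts = 100 * state_office / state_office.groupby(level=0).transform('sum')
```

Each state's percentages still sum to 100, and no Python-level lambda is involved.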
— JakeGR

You may use

df['col_b_PY'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+PY)\b")
df['col_c_LG'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+LG)\b")

Or, to extract all matches and join them with a space:

df['col_b_PY'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+PY)\b").unstack().apply(lambda x: ' '.join(x.dropna()), axis=1)
df['col_c_LG'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+LG)\b").unstack().apply(lambda x: ' '.join(x.dropna()), axis=1)

Note you need to use a capturing group in the regex pattern so that extract can actually extract the text:

Extract capture groups in the regex pat as columns in a DataFrame.

Note the \b word boundary is necessary to match PY / LG as a whole word.

Also, if you want to only start a match from a letter, you may revamp the pattern to

r"([a-zA-Z][a-zA-Z'-]*\s+PY)\b"
r"([a-zA-Z][a-zA-Z'-]*\s+LG)\b"
   ^^^^^^^^          ^

where [a-zA-Z] will match a letter and [a-zA-Z'-]* will match 0 or more letters, apostrophes or hyphens.
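A quick check of the difference between the two patterns, using plain `re` on a hypothetical string (not from the original answer):

```python
import re

# A string starting with an apostrophe/hyphen run, to show where each match begins
text = "'-odd PY token"

loose = re.compile(r"([a-zA-Z'-]+\s+PY)\b")            # may start on ' or -
strict = re.compile(r"([a-zA-Z][a-zA-Z'-]*\s+PY)\b")   # must start on a letter

print(loose.findall(text))   # ["'-odd PY"]
print(strict.findall(text))  # ['odd PY']
```

The strict pattern skips the leading `'-` and only begins matching at the first letter.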

Python 3.7 with Pandas 0.24.2:

pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 500)

df = pd.DataFrame({
    'col_a': ['Python PY is a general-purpose language LG',
             'Programming language LG in Python PY',
             'Its easier LG to understand  PY',
             'The syntax of the language LG is clean PY',
             'Python PY is a general purpose PY language LG']
    })
df['col_b_PY'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+PY)\b").unstack().apply(lambda x: ' '.join(x.dropna()), axis=1)
df['col_c_LG'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+LG)\b").unstack().apply(lambda x: ' '.join(x.dropna()), axis=1)

Output:

                                           col_a              col_b_PY     col_c_LG
0     Python PY is a general-purpose language LG             Python PY  language LG
1           Programming language LG in Python PY             Python PY  language LG
2                Its easier LG to understand  PY        understand  PY    easier LG
3      The syntax of the language LG is clean PY              clean PY  language LG
4  Python PY is a general purpose PY language LG  Python PY purpose PY  language LG
— Ujjawal Khare

You could do something like the following:

target_value = 15
df['max_duration'] = df.groupby('Date')['Duration'].transform('max')
(df.query('max_duration == Duration')
   .assign(dist=lambda df: np.abs(df['Value'] - target_value))
   .assign(min_dist=lambda df: df.groupby('Date')['dist'].transform('min'))
   .query('min_dist == dist')
   .loc[:, ['Date', 'ID']])

Results:

        Date ID
4   1/1/2018  e
11  1/2/2018  e
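A shorter variant of the same idea (a sketch with made-up data using the column names assumed above): filter to each date's max `Duration`, then pick the closest-to-target row with `idxmin` instead of a second `query`.

```python
import pandas as pd

target_value = 15
# Hypothetical sample frame matching the assumed columns
df = pd.DataFrame({'Date': ['1/1/2018'] * 3 + ['1/2/2018'] * 3,
                   'ID': list('abcdef'),
                   'Duration': [5, 9, 9, 4, 7, 7],
                   'Value': [20, 12, 16, 15, 30, 13]})

# Keep rows carrying each date's max Duration, compute distance to target
out = (df[df['Duration'] == df.groupby('Date')['Duration'].transform('max')]
         .assign(dist=lambda d: (d['Value'] - target_value).abs()))
# idxmin gives the index label of the smallest distance within each date
out = out.loc[out.groupby('Date')['dist'].idxmin(), ['Date', 'ID']]
print(out)
```

With this sample data the winners are `c` (Value 16, dist 1) and `f` (Value 13, dist 2).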
— Carlo Pellegrini
f = lambda x: x.rolling(2).sum().shift()
df['c'] = df.groupby('a').b.apply(f)

df

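The original output here was posted as an image. As a quick sketch with made-up data, the grouped rolling sum can be checked like this (using `transform`, which keeps the original index and so assigns cleanly across pandas versions):

```python
import pandas as pd

# Hypothetical sample frame with two groups in column 'a'
df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2],
                   'b': [1, 2, 3, 4, 5, 6]})

# Within each group: sum of the previous two values of 'b'
df['c'] = df.groupby('a')['b'].transform(lambda x: x.rolling(2).sum().shift())
print(df)
```

The first two rows of each group get NaN (not enough prior values), then each row's `c` is the sum of the two `b` values before it within its own group.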

— user1882705

Use GroupBy.transform with a lambda function, then add_prefix and join back to the original:

f = lambda x: 100 * x / float(x.sum())
df = df.join(df.groupby(['Date','tank'])[['quantity','count']]
               .transform(f).add_prefix('perc_'))

Or specify new columns names:

df[['perc_quantity','perc_count']] = (df.groupby(['Date','tank'])[['quantity','count']]
                                        .transform(f))

print (df)
         Date  tank  hose  quantity  count  set   flow  perc_quantity  
0  01-01-2018     1     1        20    100  211  12.32      33.333333   
1  01-01-2018     1     2        20    200  111  22.32      33.333333   
2  01-01-2018     1     3        20    200  123  42.32      33.333333   
3  02-01-2018     1     1        10    100  211  12.32      33.333333   
4  02-01-2018     1     2        10    200  111  22.32      33.333333   
5  02-01-2018     1     3        10    200  123  42.32      33.333333   

   perc_count  
0        20.0  
1        40.0  
2        40.0  
3        20.0  
4        40.0  
5        40.0  
— MadProgrammer