
Suppose I have a structured dataframe as follows:

df = pd.DataFrame({"A":['a','a','a','b','b'],
                   "B":[1]*5})

The A column has previously been sorted. I wish to find the first row index where df.A != 'a'. The end goal is to use this index to break the data frame into groups based on A.

Now I realise that there is a groupby functionality. However, the dataframe is quite large and this is a simplified toy example. Since A has already been sorted, it would be faster to just find the first index where df.A != 'a'. Therefore it is important that whatever method you use, the scanning stops once the first element is found.

 Answers


idxmax and argmax return the position of the maximal value, or the first such position if the maximal value occurs more than once. Since df.A.ne('a') produces a boolean Series whose maximal value is True, idxmax lands on the first row where A != 'a'.

Use idxmax on df.A.ne('a'):

df.A.ne('a').idxmax()

3

Or the NumPy equivalent:

(df.A.values != 'a').argmax()

3

Note, however, that both of the above build a full boolean mask first, so the whole column is scanned even though only the first True matters. If A has already been sorted, searchsorted avoids that with a binary search:

df.A.searchsorted('a', side='right')

array([3])

(Newer pandas versions return the scalar 3 here rather than array([3]).)

Or the NumPy equivalent:

df.A.values.searchsorted('a', side='right')

3
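To get to the stated end goal, here is a minimal sketch of using that index to break the frame apart, assuming (as stated) that A is already sorted; the data is the toy frame from the question:

import pandas as pd

df = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'],
                   "B": [1] * 5})

# Binary search for the end of the run of 'a' values; O(log n) on sorted data.
i = df.A.values.searchsorted('a', side='right')

# Use that index to break the frame into its two blocks.
first, rest = df.iloc[:i], df.iloc[i:]
print(first)  # rows where A == 'a'
print(rest)   # rows where A != 'a'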
Answered by borrible

This is by design, as described in the pandas documentation on flexible apply.

The apply function needs to know the shape of the returned data to figure out how it will be combined, so it calls the function (checkit in your case) twice on the first group.

Depending on your actual use case, you can replace the call to apply with aggregate, transform, or filter, as described in the pandas groupby documentation. These functions require the return value to be a particular shape, and so don't call the function twice.

However, if the function you are calling does not have side effects, it most likely does not matter that it is called twice on the first group.
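A minimal sketch of the behaviour described above, with a print side effect to make the extra call visible (checkit is the name from the original question; recent pandas versions no longer evaluate the first group twice, so you may see each group printed only once):

import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})

def checkit(vals):
    # The side effect reveals how many times each group is passed in.
    print("checkit called, first value:", vals.iloc[0])
    return vals.sum()

df.groupby("key")["val"].apply(checkit)  # may call checkit on the first group twice
df.groupby("key")["val"].agg("sum")      # shape is known up front, no extra call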

Answered by CBroe

You can use first_valid_index and then select with loc:

import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2, np.nan])
print(s)
0    NaN
1    2.0
2    NaN
dtype: float64

print(s.first_valid_index())
1

print(s.loc[s.first_valid_index()])
2.0

# If your Series contains ALL NaNs, you'll need to check as follows:

s = pd.Series([np.nan, np.nan, np.nan])
idx = s.first_valid_index()  # Will return None
first_valid_value = s.loc[idx] if idx is not None else None
print(first_valid_value)
None
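The same idea extends to a DataFrame; a small sketch applying first_valid_index column by column (the sample frame here is made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [np.nan, 2.0, np.nan],
                   "y": [np.nan, np.nan, 5.0]})

# Apply first_valid_index to each column's Series.
print(df.apply(lambda col: col.first_valid_index()))
x    1
y    2
dtype: int64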
Answered by Jeremy Pare

I think you need GroupBy.first:

df.groupby(["id", "id2"])["timestamp"].first()

Or drop_duplicates:

df.drop_duplicates(subset=['id','id2'])

For the same output:

df1 = df.groupby(["id", "id2"], as_index=False)["timestamp"].first()
print (df1)
   id id2            timestamp
0  10  a1  2017-07-12 13:37:00
1  10  a2  2017-07-12 19:00:00
2  11  a1  2017-07-12 13:37:00

df1 = df.drop_duplicates(subset=['id','id2'])[['id','id2','timestamp']]
print (df1)
   id id2            timestamp
0  10  a1  2017-07-12 13:37:00
1  10  a2  2017-07-12 19:00:00
2  11  a1  2017-07-12 13:37:00
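The original df isn't shown; the following sample frame is an assumption, reconstructed to match the printed output, so the snippets above can be run as-is. One difference worth knowing: GroupBy.first returns the first non-null value within each group, while drop_duplicates keeps the first row exactly as it appears.

import pandas as pd

# Assumed sample data, reconstructed from the output above.
df = pd.DataFrame({
    "id":  [10, 10, 10, 11],
    "id2": ["a1", "a1", "a2", "a1"],
    "timestamp": pd.to_datetime(["2017-07-12 13:37:00",
                                 "2017-07-12 14:00:00",
                                 "2017-07-12 19:00:00",
                                 "2017-07-12 13:37:00"]),
})

print(df.groupby(["id", "id2"], as_index=False)["timestamp"].first())
print(df.drop_duplicates(subset=["id", "id2"])[["id", "id2", "timestamp"]])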
Answered by JustSteveKing

You could do something like the following:

target_value = 15

# Flag each date's maximal Duration, keep only those rows, then among
# them keep the row whose Value is closest to the target.
df['max_duration'] = df.groupby('Date')['Duration'].transform('max')
(df.query('max_duration == Duration')
   .assign(dist=lambda d: np.abs(d['Value'] - target_value))
   .assign(min_dist=lambda d: d.groupby('Date')['dist'].transform('min'))
   .query('min_dist == dist')
   .loc[:, ['Date', 'ID']])

Results:

        Date ID
4   1/1/2018  e
11  1/2/2018  e
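This df isn't shown either; here is a hypothetical frame with the assumed Date, ID, Duration, and Value columns, just to make the chain above runnable (the values are made up, so only the shape of the result is meaningful):

import numpy as np
import pandas as pd

# Hypothetical data; column names assumed from the snippet above.
df = pd.DataFrame({
    "Date":     ["1/1/2018"] * 3 + ["1/2/2018"] * 3,
    "ID":       ["a", "b", "e", "c", "d", "e"],
    "Duration": [5, 9, 9, 4, 7, 7],
    "Value":    [30, 20, 16, 40, 14, 15],
})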
Answered by Carlo Pellegrini