Python - Group-by multiple columns with .mean() and .agg()

The name of the picture


Python - Group-by multiple columns with .mean() and .agg()



I want to group-by three columns, and then find the mean of a fourth numerical column for all rows which are duplicated across the first three columns. I can achieve this with the following function:


df2 = df.groupby(['col1', 'col2', 'col3'], as_index=False)['col4'].mean()



The problem is that I also want a fifth column which will aggregate for all rows grouped by the groupby function, which I don't know how to do on top of the previous function. For example:


df
index col1 col2 col3 col4 col5
0 Week_1 James John 1 when and why?
1 Week_1 James John 3 How?
2 Week_2 James John 2 Do you know when?
3 Week_2 Mark Jim 3 What time?
4 Week_2 Andrew Simon 1 How far is it?
5 Week_2 Andrew Simon 2 Are you going?


CURRENT(with above function):
index col1 col2 col3 col4
0 Week_1 James John 2
1 Week_2 James John 2
2 Week_2 Mark Jim 3
3 Week_2 Andrew Simon 1.5

DESIRED:
index col1 col2 col3 col4 col5
0 Week_1 James John 2 when and why?, How?
2 Week_2 James John 2 Do you know when?
3 Week_2 Mark Jim 3 What time?going?
4 Week_2 Andrew Simon 1.5 How far is it?, Are you going?



I've tried here and here, however the .mean() function I'm using complicates the process. Any help would be appreciated. (If possible, I would like to specify a custom delimiter separating the strings of col5 when aggregated).




2 Answers
2



You can define for each column aggregate function:


df2=df.groupby(['col1','col2','col3'], as_index=False).agg({'col4':'mean', 'col5':','.join})
print (df2)
col1 col2 col3 col4 col5
0 Week_1 James John 2.0 when and why?,How?
1 Week_2 Andrew Simon 1.5 How far is it?,Are you going?
2 Week_2 James John 2.0 Do you know when?
3 Week_2 Mark Jim 3.0 What time?



General solution is numeric columns aggregate by mean and other by join:


mean


join


f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else ', '.join(x)
df2 = df.groupby(['col1', 'col2', 'col3'], as_index=False).agg(f)
print (df2)

col1 col2 col3 col4 col5
0 Week_1 James John 2.0 when and why?, How?
1 Week_2 Andrew Simon 1.5 How far is it?, Are you going?
2 Week_2 James John 2.0 Do you know when?
3 Week_2 Mark Jim 3.0 What time?





Thanks mate, works like a charm. Have a good day, you've made mine a lot easier!
– Laurie Bamber
6 hours ago





@LaurieBamber - Thank you, for you too :)
– jezrael
6 hours ago


df = pd.DataFrame({
'col1':['a','a','b','b'],
'col2':[1,2,1,1],
'col3':['str1','str2','str3','str4']
})

result = df.groupby(['col1','col2'])['col3'].apply(lambda x:','.join(list(x)))






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

Stripe::AuthenticationError No API key provided. Set your API key using “Stripe.api_key = ”

CRM reporting Extension - SSRS instance is blank

Keycloak server returning user_not_found error when user is already imported with LDAP