Python - Group-by multiple columns with .mean() and .agg()

Multi tool use
Multi tool use
The name of the picture


Python - Group-by multiple columns with .mean() and .agg()



I want to group-by three columns, and then find the mean of a fourth numerical column for all rows which are duplicated across the first three columns. I can achieve this with the following function:


df2 = df.groupby(['col1', 'col2', 'col3'], as_index=False)['col4'].mean()



The problem is that I also want a fifth column which will aggregate for all rows grouped by the groupby function, which I don't know how to do on top of the previous function. For example:


df
index col1 col2 col3 col4 col5
0 Week_1 James John 1 when and why?
1 Week_1 James John 3 How?
2 Week_2 James John 2 Do you know when?
3 Week_2 Mark Jim 3 What time?
4 Week_2 Andrew Simon 1 How far is it?
5 Week_2 Andrew Simon 2 Are you going?


CURRENT(with above function):
index col1 col2 col3 col4
0 Week_1 James John 2
1 Week_2 James John 2
2 Week_2 Mark Jim 3
3 Week_2 Andrew Simon 1.5

DESIRED:
index col1 col2 col3 col4 col5
0 Week_1 James John 2 when and why?, How?
2 Week_2 James John 2 Do you know when?
3 Week_2 Mark Jim 3 What time?going?
4 Week_2 Andrew Simon 1.5 How far is it?, Are you going?



I've tried here and here, however the .mean() function I'm using complicates the process. Any help would be appreciated. (If possible, I would like to specify a custom delimiter separating the strings of col5 when aggregated).




2 Answers
2



You can define for each column aggregate function:


df2=df.groupby(['col1','col2','col3'], as_index=False).agg({'col4':'mean', 'col5':','.join})
print (df2)
col1 col2 col3 col4 col5
0 Week_1 James John 2.0 when and why?,How?
1 Week_2 Andrew Simon 1.5 How far is it?,Are you going?
2 Week_2 James John 2.0 Do you know when?
3 Week_2 Mark Jim 3.0 What time?



General solution is numeric columns aggregate by mean and other by join:


mean


join


f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else ', '.join(x)
df2 = df.groupby(['col1', 'col2', 'col3'], as_index=False).agg(f)
print (df2)

col1 col2 col3 col4 col5
0 Week_1 James John 2.0 when and why?, How?
1 Week_2 Andrew Simon 1.5 How far is it?, Are you going?
2 Week_2 James John 2.0 Do you know when?
3 Week_2 Mark Jim 3.0 What time?





Thanks mate, works like a charm. Have a good day, you've made mine a lot easier!
– Laurie Bamber
6 hours ago





@LaurieBamber - Thank you, for you too :)
– jezrael
6 hours ago


df = pd.DataFrame({
'col1':['a','a','b','b'],
'col2':[1,2,1,1],
'col3':['str1','str2','str3','str4']
})

result = df.groupby(['col1','col2'])['col3'].apply(lambda x:','.join(list(x)))






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

KRJMqi8Y9DyF0OPwy8,Ib,QWtW AWO4n6Zcp 9Nl07AKjVAP9PIBW3bnsm8M3jIQ NYEa4,O6DD1ywyq,0XlK,7RTRe0,1P1LJGmO kbLj0O
JJqKj 3Hv

Popular posts from this blog

Keycloak server returning user_not_found error when user is already imported with LDAP

PHP parse/syntax errors; and how to solve them?

415 Unsupported Media Type while sending json file over REST Template