<
>

pandas学习笔记(第五弹)

2020-07-29 06:45:07 来源:易采站长站 作者:

注:本教程为系列教程此章节接前面第一弹
跳转到专题地址
跳转到上一章
本章主要讲述分组的用法

16 分组聚合、过滤、转换
16.1 准备数据

# 设置最多显示8列数据
pd.options.display.max_columns = 8

collage_data = pd.read_csv("pandasLearnData/college.csv")
collage_data.head(5)

INSTNMCITYSTABBRHBCUPCTFLOANUG25ABVMD_EARN_WNE_P10GRAD_DEBT_MDN_SUPP
0Alabama A & M UniversityNormalAL1.00.82840.10493030033888
1University of Alabama at BirminghamBirminghamAL0.00.52140.24223970021941.5
2Amridge UniversityMontgomeryAL0.00.77950.85404010023370
3University of Alabama in HuntsvilleHuntsvilleAL0.00.45960.26404550024097
4Alabama State UniversityMontgomeryAL1.00.75540.12702660033118.5

5 rows × 27 columns

16.2 定义聚合

gb_STABBR = collage_data.groupby("STABBR")

16.3 聚合的属性原理
16.3.1 聚合类别
type(gb_STABBR)

pandas.core.groupby.generic.DataFrameGroupBy

16.3.2 聚合类型的所用方法和属性
for attr in dir(gb_STABBR):
print("" if attr.startswith("_") else attr+"t",end="")

CITYCURROPERDISTANCEONLYGRAD_DEBT_MDN_SUPPHBCUINSTNMMD_EARN_WNE_P10MENONLYPCTFLOANPCTPELLPPTUG_EFRELAFFILSATMTMIDSATVRMIDSTABBRUG25ABVUGDSUGDS_2MORUGDS_AIANUGDS_ASIANUGDS_BLACKUGDS_HISPUGDS_NHPIUGDS_NRAUGDS_UNKNUGDS_WHITEWOMENONLYaggaggregateallanyapplybackfillbfillboxplotcorrcorrwithcountcovcumcountcummaxcummincumprodcumsumdescribediffdtypesexpandingffillfillnafilterfirstget_groupgroupsheadhistidxmaxidxminindiceslastmadmaxmeanmedianminndimngroupngroupsnthnuniqueohlcpadpct_changepipeplotprodquantilerankresamplerollingsemshiftsizeskewstdsumtailtaketransformtshiftvar

16.3.3 查看分组数量
gb_STABBR.ngroups

59

16.3.4 返回分组的键值
# groups是返回一个字典 {"分组键":索引列表, ...}
gb_STABBR.groups.keys()

dict_keys(['AK', 'AL', 'AR', 'AS', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'FM', 'GA', 'GU', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO', 'MP', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR', 'PW', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VI', 'VT', 'WA', 'WI', 'WV', 'WY'])

16.3.5 根据键获取某一个分组
gb_STABBR.get_group("AK")

INSTNMCITYSTABBRHBCUPCTFLOANUG25ABVMD_EARN_WNE_P10GRAD_DEBT_MDN_SUPP
60University of Alaska AnchorageAnchorageAK0.00.26470.43864250019449.5
61Alaska Bible CollegePalmerAK0.00.28570.4286NaNPrivacySuppressed
62University of Alaska FairbanksFairbanksAK0.00.25500.45193620019355
63University of Alaska SoutheastJuneauAK0.00.19960.55503740016875
64Alaska Pacific UniversityAnchorageAK0.00.52970.49104700023250
65AVTEC-Alaska’s Institute of TechnologySewardAK0.00.06640.712733500PrivacySuppressed
66Charter College-AnchorageAnchorageAK0.00.75030.54723920013875
67Alaska Career CollegeAnchorageAK0.00.78600.5612287008994
5171Ilisagvik CollegeBarrowAK0.00.00000.649824900PrivacySuppressed
5417Alaska Christian CollegeSoldotnaAK0.00.67920.2264NaNPrivacySuppressed

10 rows × 27 columns

16.3.6 DataFrameGroupBy对象是一个可迭代对象

from collections import Iterable
isinstance(gb_STABBR,Iterable)

True

# 显示前3个分组
i = 0
for name,group in gb_STABBR:
i += 1
print("组名:",name,"tt类型:",type(group))
display(group.head(2))
if i >= 3:
break

组名: AK 类型:

INSTNMCITYSTABBRHBCUPCTFLOANUG25ABVMD_EARN_WNE_P10GRAD_DEBT_MDN_SUPP
60University of Alaska AnchorageAnchorageAK0.00.26470.43864250019449.5
61Alaska Bible CollegePalmerAK0.00.28570.4286NaNPrivacySuppressed

2 rows × 27 columns

组名: AL 类型:

INSTNMCITYSTABBRHBCUPCTFLOANUG25ABVMD_EARN_WNE_P10GRAD_DEBT_MDN_SUPP
0Alabama A & M UniversityNormalAL1.00.82840.10493030033888
1University of Alabama at BirminghamBirminghamAL0.00.52140.24223970021941.5

2 rows × 27 columns

组名: AR 类型:

INSTNMCITYSTABBRHBCUPCTFLOANUG25ABVMD_EARN_WNE_P10GRAD_DEBT_MDN_SUPP
128University of Arkansas at Little RockLittle RockAR0.00.47750.40623390021736
129University of Arkansas for Medical SciencesLittle RockAR0.00.61440.51336140012500

2 rows × 27 columns

16.3.7 head显示每个分组的头几行

# 前面一个head用于空值每个分组内显示前几行,后面表示总共显示前几行
gb_STABBR.head(2).head(5)

INSTNMCITYSTABBRHBCUPCTFLOANUG25ABVMD_EARN_WNE_P10GRAD_DEBT_MDN_SUPP
0Alabama A & M UniversityNormalAL1.00.82840.10493030033888
1University of Alabama at BirminghamBirminghamAL0.00.52140.24223970021941.5
43Prince Institute-SoutheastElmhurstIL0.00.93750.6569PrivacySuppressed20992
60University of Alaska AnchorageAnchorageAK0.00.26470.43864250019449.5
61Alaska Bible CollegePalmerAK0.00.28570.4286NaNPrivacySuppressed

5 rows × 27 columns

16.3.8 nth指定显示分组内的相应行

# 显示每个分组的第一行,最后一行
gb_STABBR.nth([1,-1]).head(5)

INSTNMCITYHBCUMENONLYPCTFLOANUG25ABVMD_EARN_WNE_P10GRAD_DEBT_MDN_SUPP
STABBR
AKAlaska Bible CollegePalmer0.00.00.28570.4286NaNPrivacySuppressed
AKAlaska Christian CollegeSoldotna0.00.00.67920.2264NaNPrivacySuppressed
ALUniversity of Alabama at BirminghamBirmingham0.00.00.52140.24223970021941.5
ALStrayer University-Huntsville CampusHuntsvilleNaNNaNNaNNaN4920036173.5
ARCareer Academy of Hair Design-FayettevilleFayettevilleNaNNaNNaNNaNNaN6365

5 rows × 26 columns

16.4 对分组对象使用聚合函数
16.4.1 agg函数传入聚合函数名称

# 统计每个每个州的本科生人数
gb_STABBR["UGDS"].agg("sum").head(5)

STABBR
AK 24932.0
AL 248298.0
AR 134820.0
AS 1276.0
AZ 520439.0
Name: UGDS, dtype: float64

16.4.2 直接使用统计函数
gb_STABBR["UGDS"].sum().head(5)

STABBR
AK 24932.0
AL 248298.0
AR 134820.0
AS 1276.0
AZ 520439.0
Name: UGDS, dtype: float64

16.4.3 agg传入多种聚合函数进行统计
gb_STABBR["UGDS"].agg(["mean",np.sum,"std"]).head(5)

meansumstd
STABBR
AK2493.20000024932.04051.726650
AL2789.865169248298.04657.877043
AR1644.146341134820.03142.774213
AS1276.0000001276.0NaN
AZ4130.468254520439.014893.640651

16.5 对多个字段进行分组
16.5.1 分组

#对州和城市进行分组
collage_data.groupby(["STABBR","CITY"]).head(2).head(5)

INSTNMCITYSTABBRHBCUPCTFLOANUG25ABVMD_EARN_WNE_P10GRAD_DEBT_MDN_SUPP
0Alabama A & M UniversityNormalAL1.00.82840.10493030033888
1University of Alabama at BirminghamBirminghamAL0.00.52140.24223970021941.5
2Amridge UniversityMontgomeryAL0.00.77950.85404010023370
3University of Alabama in HuntsvilleHuntsvilleAL0.00.45960.26404550024097
4Alabama State UniversityMontgomeryAL1.00.75540.12702660033118.5

5 rows × 27 columns

16.5.2 对多个字段分组并使用多种聚合方法

grouped = collage_data.groupby(["STABBR","CITY"]).agg({"INSTNM":"size","UGDS":["mean","std","sum"]})
grouped.head(5)

INSTNMUGDS
sizemeanstdsum
STABBRCITY
AKAnchorage44218.755922.37060516875.0
Barrow1109.00NaN109.0
Fairbanks15536.00NaN5536.0
Juneau11428.00NaN1428.0
Palmer127.00NaN27.0

16.6 消除多级索引
16.6.1 查看索引结构

# 可以看到列索引具有两级
grouped.columns

MultiIndex([('INSTNM', 'size'),
( 'UGDS', 'mean'),
( 'UGDS', 'std'),
( 'UGDS', 'sum')],
)

16.6.2 获取一级索引
columns_level1 = grouped.columns.get_level_values(0)
columns_level1

Index(['INSTNM', 'UGDS', 'UGDS', 'UGDS'], dtype='object')

16.6.3 获取二级索引
columns_level2 = grouped.columns.get_level_values(1)
columns_level2

Index(['size', 'mean', 'std', 'sum'], dtype='object')

16.6.4 合并索引
grouped.columns = columns_level1 + "_" + columns_level2
grouped.head(5)

INSTNM_sizeUGDS_meanUGDS_stdUGDS_sum
STABBRCITY
AKAnchorage44218.755922.37060516875.0
Barrow1109.00NaN109.0
Fairbanks15536.00NaN5536.0
Juneau11428.00NaN1428.0
Palmer127.00NaN27.0

16.6.5 行索引可直接利用reset_index清除

grouped.reset_index().head(5)

STABBRCITYINSTNM_sizeUGDS_meanUGDS_stdUGDS_sum
0AKAnchorage44218.755922.37060516875.0
1AKBarrow1109.00NaN109.0
2AKFairbanks15536.00NaN5536.0
3AKJuneau11428.00NaN1428.0
4AKPalmer127.00NaN27.0

16.6.6 在使用groupby方法时,传入参数不增加索引

# as_index:表示是否改变索引,sort:表示是否根据分组字段进行排序
collage_data.groupby(["STABBR","CITY"],as_index=False,sort=False)
.agg({"INSTNM":"size","UGDS":["mean","std","sum"]})
.head(5)

STABBRCITYINSTNMUGDS
sizemeanstdsum
0ALNormal14206.000000NaN4206.0
1ALBirmingham142236.4285713314.61128331310.0
2ALMontgomery111562.3000001717.28248215623.0
3ALHuntsville71511.5000002042.2586279069.0
4ALTuscaloosa311887.33333315666.54525835662.0

16.7 自定义聚合函数
16.7.1 定义聚合函数

def my_max(s):
max_value = 0
for i in s:
max_value = max_value if max_value > i and i != np.NAN else i
return s.max()

16.7.2 使用自定义聚合函数
collage_data.groupby("STABBR")[["UGDS"]].agg(my_max).head(5)

UGDS
STABBR
AK12865.0
AL29851.0
AR21405.0
AS1276.0
AZ151558.0

16.7.3 和自带的聚合函数混合使用

collage_data.groupby("STABBR")["UGDS"].agg(["mean","sum",my_max]).head(5)

meansummy_max
STABBR
AK2493.20000024932.012865.0
AL2789.865169248298.029851.0
AR1644.146341134820.021405.0
AS1276.0000001276.01276.0
AZ4130.468254520439.0151558.0

16.7.4 修改列名

# 默认就是显示该聚合函数的名称(__name__属性)
# 方法1: 修改__name__属性即可
my_max.__name__ = "max"
collage_data.groupby("STABBR")["UGDS"].agg(["mean","sum",my_max]).head(5)

meansummax
STABBR
AK2493.20000024932.012865.0
AL2789.865169248298.029851.0
AR1644.146341134820.021405.0
AS1276.0000001276.01276.0
AZ4130.468254520439.0151558.0

# 方法2: 使用DataFrame的rename方法修改列名称
# inplace 参数表示是否在操作的DataFrame中进行修改
collage_data.groupby("STABBR")["UGDS"].agg(["mean","sum",my_max]).rename({"my_max":"max"},inplace=False).head(5)

meansummax
STABBR
AK2493.20000024932.012865.0
AL2789.865169248298.029851.0
AR1644.146341134820.021405.0
AS1276.0000001276.01276.0
AZ4130.468254520439.0151558.0

16.8 用 args 和 *kwargs 自定义聚合函数
16.8.1 定义聚合函数

def  my_between_cnt(s,low,high):
return s.between(low,high).sum()

16.8.2 使用自定义聚合函数
collage_data.groupby("STABBR")["UGDS"].agg(my_between_cnt,10000,20000).head(5)

STABBR
AK 1.0
AL 4.0
AR 0.0
AS 0.0
AZ 4.0
Name: UGDS, dtype: float64

16.8.3 使用闭包简化代码
def init_agg_func(func,col_name,*args,**kwargs):
def wrapper(s):
return func(s,*args,**kwargs)
wrapper.__name__ = col_name
return wrapper

my_max = init_agg_func(np.max,"max")
between_10_20k = init_agg_func(my_between_cnt,"10k~20k",10000,20000)
between_20_30k = init_agg_func(my_between_cnt,"20k~30k",20000,30000)

# 统计每个州本科生人数的最大值,和在10k-20k,20k-30k之间的人数
collage_data.groupby("STABBR")["UGDS"].agg([my_max,between_10_20k,between_20_30k]).head(5)

max10k~20k20k~30k
STABBR
AK12865.01.00.0
AL29851.04.02.0
AR21405.00.01.0
AS1276.00.00.0
AZ151558.04.02.0

16.9 使用filter对分组进行过滤

# 对DataFrameGroupBy对象使用过滤时,会根据回调函数返回的布尔值决定是否保留该分组
# 显示每个州本科生人数和大于100万的州
collage_data.groupby("STABBR").filter(lambda x:x["UGDS"].sum() > 1000000)

INSTNMCITYSTABBRHBCUPCTFLOANUG25ABVMD_EARN_WNE_P10GRAD_DEBT_MDN_SUPP
192Academy of Art UniversitySan FranciscoCA0.00.55240.40433600035093
193ITT Technical Institute-Rancho CordovaRancho CordovaCA0.00.76670.72353880025827.5
194Academy of Chinese Culture and Health SciencesOaklandCA0.0NaNNaNNaNPrivacySuppressed
195The Academy of Radio and TV BroadcastingHuntington BeachCA0.01.00000.4545284009500
196Avalon School of Cosmetology-AlamedaAlamedaCA0.00.67680.3387216009860
7528WestMed College – MercedMercedCANaNNaNNaNNaN15623.5
7529Vantage CollegeEl PasoTXNaNNaNNaNNaN9500
7530SAE Institute of Technology San FranciscoEmeryvilleCANaNNaNNaNNaN9500
7533Bay Area Medical Academy – San Jose Satellite …San JoseCANaNNaNNaNNaNPrivacySuppressed
7534Excel Learning Center-San Antonio SouthSan AntonioTXNaNNaNNaNNaN12125

1245 rows × 27 columns


INSTNMCITYSTABBRHBCUPCTFLOANUG25ABVMD_EARN_WNE_P10GRAD_DEBT_MDN_SUPP
0Alabama A & M UniversityNormalAL1.00.82840.10493030033888
1University of Alabama at BirminghamBirminghamAL0.00.52140.24223970021941.5
2Amridge UniversityMontgomeryAL0.00.77950.85404010023370
3University of Alabama in HuntsvilleHuntsvilleAL0.00.45960.26404550024097
4Alabama State UniversityMontgomeryAL1.00.75540.12702660033118.5
7530SAE Institute of Technology San FranciscoEmeryvilleCANaNNaNNaNNaN9500
7531Rasmussen College – Overland ParkOverland ParkKSNaNNaNNaNNaN21163
7532National Personal Training Institute of ClevelandHighland HeightsOHNaNNaNNaNNaN6333
7533Bay Area Medical Academy – San Jose Satellite …San JoseCANaNNaNNaNNaNPrivacySuppressed
7534Excel Learning Center-San Antonio SouthSan AntonioTXNaNNaNNaNNaN12125

7535 rows × 27 columns

16.10 分组对象的 apply方法的使用

# 注意这里apply和DataFrame中的apply方法不一样,这里传入回调函数的是一个分组的DataFrame
# 你需要返回的是一个Series
# 求每个分组各个字段的平均值
collage_data.groupby("STABBR").apply(lambda df:df.mean()).head(5)

HBCUMENONLYWOMENONLYRELAFFILCURROPERPCTPELLPCTFLOANUG25ABV
STABBR
AK0.0000000.00.0000000.3000001.0000000.3945300.3816600.506240
AL0.1666670.00.0111110.2500000.9375000.6036210.5097340.387039
AR0.0487800.00.0000000.2093020.9651160.5814700.5055560.356059
AS0.0000000.00.0000000.0000001.0000000.7245000.0000000.177400
AZ0.0000000.00.0000000.0676690.8796990.5497920.5437020.480859

5 rows × 22 columns

16.11 用连续变量分组
16.11.1 cut概念

# 被划分成了六个片元(就是六个区间,将这些区间映射到传入的Series中)
my_cut = pd.cut(collage_data["UGDS"],bins=[-np.Inf,10000,20000,30000,40000,50000,np.Inf])
my_cut

0          (-inf, 10000.0]1       (10000.0, 20000.0]2          (-inf, 10000.0]3          (-inf, 10000.0]4          (-inf, 10000.0]               ...
7530 NaN
7531 NaN
7532 NaN
7533 NaN
7534 NaN
Name: UGDS, Length: 7535, dtype: category
Categories (6, interval[float64]): [(-inf, 10000.0] < (10000.0, 20000.0] < (20000.0, 30000.0] < (30000.0, 40000.0] < (40000.0, 50000.0] < (50000.0, inf]]

16.11.2 使用cut进行分组
collage_data.groupby(my_cut,as_index=True)[["UGDS","INSTNM"]].max()

UGDSINSTNM
UGDS
(-inf, 10000.0]9999.0eClips School of Cosmetology and Barbering
(10000.0, 20000.0]19979.0Youngstown State University
(20000.0, 30000.0]29977.0West Virginia University
(30000.0, 40000.0]39958.0Valencia College
(40000.0, 50000.0]49340.0Western Governors University
(50000.0, inf]151558.0University of Phoenix-Arizona

作者:尼克不可

暂时禁止评论

微信扫一扫

易采站长站微信账号