import numpy as np
import pandas as pd
Creating a DataFrame
In [2]:
df = pd.DataFrame({'col_a': np.arange(10),
                   'col_b': np.random.randn(10),
                   'col_c': np.random.choice(['A', 'B', 'C'], 10),
                   'col_d': np.random.choice([0, 1], 10)})
df.head(5)

# R code:
# df <- data.frame(col_a = 0:9,
#                  col_b = rnorm(10),
#                  col_c = sample(c('A', 'B', 'C'), size = 10, replace = TRUE),
#                  col_d = sample(c(0, 1), size = 10, replace = TRUE),
#                  stringsAsFactors = FALSE)
# head(df, 5)

Out[2]:
   col_a     col_b col_c  col_d
0      0  0.308520     C      1
1      1 -1.829450     B      1
2      2 -0.710135     C      0
3      3  1.354760     A      0
4      4 -0.581359     A      1
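Because col_b, col_c and col_d are drawn at random, the values shown in the outputs will differ on every run. A minimal sketch of a repeatable version follows, assuming NumPy's `default_rng` generator is acceptable in place of the legacy `np.random` functions; the seed 42 and the helper name `make_df` are arbitrary choices for illustration, not part of the original notebook. The rough R counterpart would be calling `set.seed()` before `rnorm` and `sample`.

```python
import numpy as np
import pandas as pd


def make_df(seed):
    """Build a frame with the same four columns as the example, from a fixed seed."""
    rng = np.random.default_rng(seed)  # hypothetical seed, chosen by the caller
    return pd.DataFrame({
        'col_a': np.arange(10),
        'col_b': rng.standard_normal(10),
        'col_c': rng.choice(['A', 'B', 'C'], 10),
        'col_d': rng.choice([0, 1], 10),
    })


df_seeded = make_df(42)
df_again = make_df(42)

# Two frames built from the same seed hold identical values.
print(df_seeded.equals(df_again))
```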
Getting the DataFrame dimensions
In [3]:
print(df.shape, df.shape[0], df.shape[1])

# R code:
# dim(df); nrow(df); ncol(df)

(10, 4) 10 4
Getting the DataFrame column names
In [4]:
df.columns

# R code:
# names(df)

Out[4]:
Index(['col_a', 'col_b', 'col_c', 'col_d'], dtype='object')
Selecting data
In [5]:
# Select the first 5 rows
df.iloc[:5]

# R code:
# df[1:5, ]

Out[5]:
   col_a     col_b col_c  col_d
0      0  0.308520     C      1
1      1 -1.829450     B      1
2      2 -0.710135     C      0
3      3  1.354760     A      0
4      4 -0.581359     A      1
In [6]:
# Select columns col_a and col_b
df[['col_a', 'col_b']]

# R code:
# df[, c('col_a', 'col_b')]

Out[6]:
   col_a     col_b
0      0  0.308520
1      1 -1.829450
2      2 -0.710135
3      3  1.354760
4      4 -0.581359
5      5  1.633542
6      6 -0.253950
7      7  1.799087
8      8  0.412991
9      9  0.374330
In [7]:
# Select the first 5 rows and the first 2 columns
df.iloc[:5, :2]

# R code:
# df[1:5, 1:2]

Out[7]:
   col_a     col_b
0      0  0.308520
1      1 -1.829450
2      2 -0.710135
3      3  1.354760
4      4 -0.581359
In [8]:
# Select a single value (scalar); note that Python indexes from 0, R from 1
df.iat[0, 1]

# R code:
# df[1, 2]

Out[8]:
0.3085196186883713
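The selections above are all position-based (`iloc`, `iat`). pandas also offers label-based counterparts (`loc`, `at`), and the two behave differently at slice ends. The sketch below uses a small hypothetical frame named `demo` with string row labels so that labels and positions cannot be confused.

```python
import pandas as pd

# Hypothetical frame with string row labels, so labels and positions differ.
demo = pd.DataFrame({'col_a': [10, 20, 30], 'col_b': [1.5, 2.5, 3.5]},
                    index=['x', 'y', 'z'])

by_position = demo.iloc[0:2]      # positions 0 and 1; the end is exclusive
by_label = demo.loc['x':'y']      # labels 'x' through 'y'; the end is inclusive

scalar_by_position = demo.iat[1, 1]     # row position 1, column position 1
scalar_by_label = demo.at['y', 'col_b']  # label-based counterpart of .iat

print(by_position.equals(by_label))
print(scalar_by_position, scalar_by_label)
```

The inclusive end of a `loc` slice is closer to how R's `1:5` behaves than the exclusive end of an `iloc` slice.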
Selecting data by condition
In [9]:
df[(df['col_a'] > 3) & (df['col_b'] < 0)]
# or
# df.query('col_a > 3 & col_b < 0')

# R code:
# df[df$col_a > 3 & df$col_b < 0, ]

Out[9]:
   col_a     col_b col_c  col_d
4      4 -0.581359     A      1
6      6 -0.253950     B      1

In [10]:
df[df['col_c'].isin(['A', 'B'])]

# R code:
# df[df$col_c %in% c('A', 'B'), ]

Out[10]:
   col_a     col_b col_c  col_d
1      1 -1.829450     B      1
3      3  1.354760     A      0
4      4 -0.581359     A      1
5      5  1.633542     B      1
6      6 -0.253950     B      1
7      7  1.799087     A      1
9      9  0.374330     A      0
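Two variations on conditional selection come up often: referring to a Python variable inside `query`, and negating a condition. The following is a small sketch on a hypothetical frame `demo`; the variable name `threshold` is an assumption for illustration. Inside `query`, a leading `@` refers to a variable in the calling scope, and `~` inverts a boolean mask (roughly R's `!`).

```python
import pandas as pd

# Hypothetical data, laid out like the columns used above.
demo = pd.DataFrame({
    'col_a': [0, 1, 2, 3, 4, 5],
    'col_b': [0.5, -1.0, 0.2, -0.3, -0.7, 1.1],
    'col_c': ['A', 'B', 'C', 'A', 'B', 'C'],
})

threshold = 2  # hypothetical cutoff

# Boolean-mask form and the equivalent query form using @threshold.
mask_result = demo[(demo['col_a'] > threshold) & (demo['col_b'] < 0)]
query_result = demo.query('col_a > @threshold and col_b < 0')

# Negation: rows whose col_c is NOT 'A' or 'B'.
not_ab = demo[~demo['col_c'].isin(['A', 'B'])]

print(mask_result)
print(not_ab)
```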
Adding a new column
In [11]:
df['col_e'] = df['col_a'] + df['col_b']
df

# R code:
# df$col_e <- df$col_a + df$col_b

Out[11]:
   col_a     col_b col_c  col_d     col_e
0      0  0.308520     C      1  0.308520
1      1 -1.829450     B      1 -0.829450
2      2 -0.710135     C      0  1.289865
3      3  1.354760     A      0  4.354760
4      4 -0.581359     A      1  3.418641
5      5  1.633542     B      1  6.633542
6      6 -0.253950     B      1  5.746050
7      7  1.799087     A      1  8.799087
8      8  0.412991     C      0  8.412991
9      9  0.374330     A      0  9.374330
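Assigning to `df['col_e']` modifies the frame in place. When the original should stay untouched, `DataFrame.assign` returns a new frame with the extra column instead. A minimal sketch on a hypothetical frame `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'col_a': [0, 1, 2], 'col_b': [0.5, -1.5, 2.0]})

# assign() returns a copy with the new column; demo itself is unchanged.
with_sum = demo.assign(col_e=demo['col_a'] + demo['col_b'])

# A callable receives the frame being assigned to, which helps in method chains.
with_lambda = demo.assign(col_e=lambda d: d['col_a'] + d['col_b'])

print(with_sum)
print('col_e' in demo.columns)
```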
Dropping columns
In [12]:
# Drop column col_e
df = df.drop(columns='col_e')
df

# R code:
# df <- df[, names(df) != 'col_e']

Out[12]:
   col_a     col_b col_c  col_d
0      0  0.308520     C      1
1      1 -1.829450     B      1
2      2 -0.710135     C      0
3      3  1.354760     A      0
4      4 -0.581359     A      1
5      5  1.633542     B      1
6      6 -0.253950     B      1
7      7  1.799087     A      1
8      8  0.412991     C      0
9      9  0.374330     A      0

In [13]:
# Drop the first column (returns a new DataFrame; df itself is unchanged)
df.drop(columns=df.columns[0])

# R code:
# df[, -1]

Out[13]:
      col_b col_c  col_d
0  0.308520     C      1
1 -1.829450     B      1
2 -0.710135     C      0
3  1.354760     A      0
4 -0.581359     A      1
5  1.633542     B      1
6 -0.253950     B      1
7  1.799087     A      1
8  0.412991     C      0
9  0.374330     A      0
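`drop` also accepts a list of column names, and by default it raises a `KeyError` if a name is missing. The sketch below, on a hypothetical frame `demo`, shows dropping several columns at once, tolerating a missing name with `errors='ignore'`, and the boolean-mask form that mirrors the R idiom.

```python
import pandas as pd

demo = pd.DataFrame({'col_a': [0, 1], 'col_b': [0.1, 0.2], 'col_c': ['A', 'B']})

# Drop several columns in one call.
dropped_two = demo.drop(columns=['col_a', 'col_c'])

# 'col_e' does not exist; errors='ignore' skips it instead of raising KeyError.
safe = demo.drop(columns=['col_e'], errors='ignore')

# Boolean mask over the column names, similar to df[, names(df) != 'col_b'] in R.
kept = demo.loc[:, demo.columns != 'col_b']

print(list(dropped_two.columns), list(kept.columns))
```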
Transposing
In [14]:
df.T

# R code:
# t(df)

Out[14]:
               0          1          2          3          4          5          6          7          8          9
col_a          0          1          2          3          4          5          6          7          8          9
col_b    0.30852   -1.82945  -0.710135    1.35476  -0.581359    1.63354   -0.25395    1.79909   0.412991    0.37433
col_c          C          B          C          A          A          B          B          A          C          A
col_d          1          1          0          0          1          1          1          1          0          0
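One side effect of transposing a frame with mixed column types is worth spelling out: each transposed column now holds values from several original columns, so pandas falls back to the generic `object` dtype (similarly, `t()` on a mixed R data frame yields a character matrix). A small sketch on a hypothetical frame `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'col_a': [0, 1], 'col_c': ['A', 'B']})

# Mixed integer and string columns: every transposed column becomes object.
transposed = demo.T

# A purely numeric selection keeps its numeric dtype after transposing.
numeric_t = demo[['col_a']].T

print(transposed.dtypes)
print(numeric_t.dtypes)
```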
Converting data types
In [15]:
df['col_a'].astype(str)

# R code:
# as.character(df$col_a)

Out[15]:
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
Name: col_a, dtype: object
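Going the other way, from strings back to numbers, `astype(int)` raises an error as soon as one value cannot be parsed. `pd.to_numeric` with `errors='coerce'` turns such values into `NaN` instead, which is close in spirit to R's `as.numeric` returning `NA`. A minimal sketch with hypothetical data:

```python
import pandas as pd

as_text = pd.Series(['0', '1', '2', 'n/a'])

# Unparseable entries become NaN, so the result is float64 rather than int.
as_number = pd.to_numeric(as_text, errors='coerce')

# When every value is clean, a plain astype round trip works.
round_trip = pd.Series([0, 1, 2]).astype(str).astype(int)

print(as_number)
print(round_trip.tolist())
```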
Converting to a categorical (pandas categories) / factor (R) type
In [16]:
pd.Categorical(df['col_c'])

# R code:
# factor(df$col_c)

Out[16]:
[C, B, C, A, A, B, B, A, C, A]
Categories (3, object): [A, B, C]
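`pd.Categorical` returns a standalone array. To keep the result as a column, `astype('category')` is a common alternative, and the `.cat` accessor then exposes the categories and their integer codes, comparable to `levels()` and `as.integer()` on an R factor (R codes start at 1, pandas codes at 0). A sketch with hypothetical data:

```python
import pandas as pd

col = pd.Series(['C', 'B', 'C', 'A'])

# Convert the Series itself; categories are sorted by default.
as_cat = col.astype('category')
categories = list(as_cat.cat.categories)
codes = as_cat.cat.codes.tolist()

# An explicit, ordered set of categories, similar to factor(x, levels = ..., ordered = TRUE).
ordered = pd.Categorical(col, categories=['C', 'B', 'A'], ordered=True)

print(categories, codes)
print(ordered)
```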
Summarizing data
Row-wise computation
In [17]:
df[['col_a', 'col_b']].sum(axis=1)

# R code:
# apply(df[, c('col_a', 'col_b')], 1, sum)

Out[17]:
0    0.308520
1   -0.829450
2    1.289865
3    4.354760
4    3.418641
5    6.633542
6    5.746050
7    8.799087
8    8.412991
9    9.374330
dtype: float64
Column-wise computation
In [18]:
df[['col_a', 'col_b']].mean(axis=0)

# R code:
# apply(df[, c('col_a', 'col_b')], 2, mean)

Out[18]:
col_a    4.500000
col_b    0.250834
dtype: float64

In [19]:
df[['col_a', 'col_b']].apply(lambda x: x.mean() + 10)

# R code:
# apply(df[, c('col_a', 'col_b')], 2, function(x) mean(x) + 10)

Out[19]:
col_a    14.500000
col_b    10.250834
dtype: float64
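The `axis` argument is the pandas counterpart of the `MARGIN` argument of R's `apply`: `axis=0` works down each column (like `MARGIN = 2`) and `axis=1` works across each row (like `MARGIN = 1`). The sketch below, on a hypothetical frame `demo`, also shows `agg`, which computes several column summaries in one call.

```python
import pandas as pd

demo = pd.DataFrame({'col_a': [0, 1, 2, 3], 'col_b': [1.0, 2.0, 3.0, 4.0]})

# Several column-wise summaries at once; rows of the result are the statistics.
summary = demo.agg(['mean', 'sum'])

# A custom row-wise function: the range (max minus min) of each row.
row_range = demo.apply(lambda row: row.max() - row.min(), axis=1)

print(summary)
print(row_range.tolist())
```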
Combining data
Combining columns
In [20]:
df2 = pd.DataFrame({'col_x': np.arange(10), 'col_y': np.arange(10)[::-1]})
df2

Out[20]:
   col_x  col_y
0      0      9
1      1      8
2      2      7
3      3      6
4      4      5
5      5      4
6      6      3
7      7      2
8      8      1
9      9      0

In [21]:
pd.concat([df, df2], axis=1)

# R code:
# cbind(df, df2)

Out[21]:
   col_a     col_b col_c  col_d  col_x  col_y
0      0  0.308520     C      1      0      9
1      1 -1.829450     B      1      1      8
2      2 -0.710135     C      0      2      7
3      3  1.354760     A      0      3      6
4      4 -0.581359     A      1      4      5
5      5  1.633542     B      1      5      4
6      6 -0.253950     B      1      6      3
7      7  1.799087     A      1      7      2
8      8  0.412991     C      0      8      1
9      9  0.374330     A      0      9      0
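One difference from `cbind` deserves a note: `pd.concat(..., axis=1)` matches rows by index label, not by position. In the example above both frames share the default index 0 to 9, so the two notions coincide. The sketch below uses hypothetical frames `left` and `right` whose indexes are in different orders to show the effect, and resets the index to get a purely positional result.

```python
import pandas as pd

left = pd.DataFrame({'col_a': [1, 2, 3]})                        # index 0, 1, 2
right = pd.DataFrame({'col_x': [10, 20, 30]}, index=[2, 1, 0])   # reversed labels

# Rows are matched on index labels, so label 0 pairs col_a=1 with col_x=30.
aligned = pd.concat([left, right], axis=1)

# Dropping the old labels first gives a positional, cbind-like result.
positional = pd.concat([left, right.reset_index(drop=True)], axis=1)

print(aligned)
print(positional)
```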
Combining rows
In [22]:
df3 = pd.DataFrame({'col_a': [-1, -2], 'col_b': [0, 1], 'col_c': ['B', 'C'], 'col_d': [1, 0]})
df3

Out[22]:
   col_a  col_b col_c  col_d
0     -1      0     B      1
1     -2      1     C      0

In [23]:
pd.concat([df, df3], axis=0, ignore_index=True)

# R code:
# rbind(df, df3)

Out[23]:
    col_a     col_b col_c  col_d
0       0  0.308520     C      1
1       1 -1.829450     B      1
2       2 -0.710135     C      0
3       3  1.354760     A      0
4       4 -0.581359     A      1
5       5  1.633542     B      1
6       6 -0.253950     B      1
7       7  1.799087     A      1
8       8  0.412991     C      0
9       9  0.374330     A      0
10     -1  0.000000     B      1
11     -2  1.000000     C      0
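When the frames being stacked do not have the same columns, `pd.concat` keeps the union of the columns and fills the gaps with `NaN`, whereas base R's `rbind` generally expects the column names to match. Passing `join='inner'` keeps only the shared columns. A sketch with hypothetical frames `top` and `bottom`:

```python
import pandas as pd

top = pd.DataFrame({'col_a': [1, 2], 'col_b': [0.1, 0.2]})
bottom = pd.DataFrame({'col_a': [3], 'col_z': ['new']})

# Union of columns; cells with no source value become NaN.
stacked = pd.concat([top, bottom], ignore_index=True)

# Intersection of columns only.
shared = pd.concat([top, bottom], join='inner', ignore_index=True)

print(stacked)
print(shared)
```

`ignore_index=True` renumbers the rows from 0, which avoids duplicate index labels in the result.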
Reposted from: http://eacjo.baihongyu.com/