pandas vs. base R
Published: 2019-06-24


import numpy as np
import pandas as pd

Creating a DataFrame

In [2]:
df = pd.DataFrame({'col_a': np.arange(10),
                   'col_b': np.random.randn(10),
                   'col_c': np.random.choice(['A', 'B', 'C'], 10),
                   'col_d': np.random.choice([0, 1], 10)})
df.head(5)
# R code:
# df <- data.frame(col_a = 0:9,
#                  col_b = rnorm(10),
#                  col_c = sample(c('A', 'B', 'C'), size = 10, replace = TRUE),
#                  col_d = sample(c(0, 1), size = 10, replace = TRUE),
#                  stringsAsFactors = FALSE)
# head(df, 5)

Out[2]:
   col_a      col_b  col_c  col_d
0      0   0.308520      C      1
1      1  -1.829450      B      1
2      2  -0.710135      C      0
3      3   1.354760      A      0
4      4  -0.581359      A      1
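Because the columns above are drawn at random, the numbers will differ on every run. A minimal sketch of a reproducible variant follows; the name `df_seeded` and the seed value 42 are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Seeded generator so the random columns are identical on every run.
# The seed value 42 is arbitrary (an assumption, not from the article).
rng = np.random.default_rng(42)

df_seeded = pd.DataFrame({
    'col_a': np.arange(10),
    'col_b': rng.standard_normal(10),
    'col_c': rng.choice(['A', 'B', 'C'], 10),
    'col_d': rng.choice([0, 1], 10),
})

print(df_seeded.dtypes)
```

In R, calling `set.seed()` before `rnorm()` and `sample()` plays the same role.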

Getting the DataFrame dimensions

In [3]:
print(df.shape, df.shape[0], df.shape[1])
# R code:
# dim(df); nrow(df); ncol(df)

(10, 4) 10 4
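As a small supplement, `shape` can be unpacked directly, and `len()` and `size` give the row count and the total cell count. The frame `df_dim` below is a hypothetical stand-in for the article's `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame used only for this illustration.
df_dim = pd.DataFrame({'col_a': np.arange(10), 'col_b': np.zeros(10)})

n_rows, n_cols = df_dim.shape   # unpack (rows, columns)
n_by_len = len(df_dim)          # number of rows, like nrow(df) in R
n_cells = df_dim.size           # rows * columns

print(n_rows, n_cols, n_by_len, n_cells)
```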

Getting the DataFrame column names

In [4]:
df.columns
# R code:
# names(df)

Out[4]:
Index(['col_a', 'col_b', 'col_c', 'col_d'], dtype='object')

Selecting data

In [5]:
# Select the first 5 rows
df.iloc[:5]
# R code:
# df[1:5, ]

Out[5]:
   col_a      col_b  col_c  col_d
0      0   0.308520      C      1
1      1  -1.829450      B      1
2      2  -0.710135      C      0
3      3   1.354760      A      0
4      4  -0.581359      A      1

In [6]:
# Select columns col_a and col_b
df[['col_a', 'col_b']]
# R code:
# df[, c('col_a', 'col_b')]

Out[6]:
   col_a      col_b
0      0   0.308520
1      1  -1.829450
2      2  -0.710135
3      3   1.354760
4      4  -0.581359
5      5   1.633542
6      6  -0.253950
7      7   1.799087
8      8   0.412991
9      9   0.374330

In [7]:
# Select the first 5 rows and the first 2 columns
df.iloc[:5, :2]
# R code:
# df[1:5, 1:2]

Out[7]:
   col_a      col_b
0      0   0.308520
1      1  -1.829450
2      2  -0.710135
3      3   1.354760
4      4  -0.581359

In [8]:
# Select a single value (scalar)
df.iat[0, 1]
# R code:
# df[1, 2]

Out[8]:
0.3085196186883713
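The selections above are all positional and 0-based, whereas R indexes from 1 and includes the end of a range. The following sketch contrasts positional access (`iloc`, `iat`) with label-based access (`loc`, `at`); the frame `df_idx` and its string row labels are hypothetical, chosen so that labels and positions differ.

```python
import pandas as pd

# Hypothetical frame with string row labels, so labels and positions differ.
df_idx = pd.DataFrame({'col_a': [10, 20, 30, 40],
                       'col_b': [1.5, 2.5, 3.5, 4.5]},
                      index=['w', 'x', 'y', 'z'])

first_two = df_idx.iloc[:2]            # positions 0 and 1 (end excluded)
by_label = df_idx.loc['w':'y']         # labels 'w' through 'y' (end included)
scalar_pos = df_idx.iat[0, 1]          # row position 0, column position 1
scalar_lab = df_idx.at['w', 'col_b']   # the same cell, addressed by label

print(first_two)
print(by_label)
print(scalar_pos, scalar_lab)
```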

Selecting data by condition

In [9]:
df[(df['col_a'] > 3) & (df['col_b'] < 0)]
# or
# df.query('col_a > 3 & col_b < 0')
# R code:
# df[df$col_a > 3 & df$col_b < 0, ]

Out[9]:
   col_a      col_b  col_c  col_d
4      4  -0.581359      A      1
6      6  -0.253950      B      1

In [10]:
df[df['col_c'].isin(['A', 'B'])]
# R code:
# df[df$col_c %in% c('A', 'B'), ]

Out[10]:
   col_a      col_b  col_c  col_d
1      1  -1.829450      B      1
3      3   1.354760      A      0
4      4  -0.581359      A      1
5      5   1.633542      B      1
6      6  -0.253950      B      1
7      7   1.799087      A      1
9      9   0.374330      A      0
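The parentheses around each comparison matter in pandas, because `&` binds more tightly than `>` and `<` in Python. A short sketch on a hypothetical frame `df_f` shows that the boolean mask and `query` forms select the same rows, and that `~` negates a mask in the way `!` does in R.

```python
import pandas as pd

# Hypothetical data; the column names mirror the article's df.
df_f = pd.DataFrame({'col_a': [1, 4, 5, 6],
                     'col_b': [0.5, -0.2, 0.3, -1.0],
                     'col_c': ['A', 'B', 'C', 'A']})

# Each comparison is parenthesized before combining with &.
mask = (df_f['col_a'] > 3) & (df_f['col_b'] < 0)
by_mask = df_f[mask]

# The same filter written as a query string.
by_query = df_f.query('col_a > 3 and col_b < 0')

# ~ inverts a boolean mask: rows whose col_c is neither 'A' nor 'B'.
not_ab = df_f[~df_f['col_c'].isin(['A', 'B'])]

print(by_mask)
print(not_ab)
```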

Adding a new column

In [11]:
df['col_e'] = df['col_a'] + df['col_b']
df
# R code:
# df$col_e <- df$col_a + df$col_b

Out[11]:
   col_a      col_b  col_c  col_d      col_e
0      0   0.308520      C      1   0.308520
1      1  -1.829450      B      1  -0.829450
2      2  -0.710135      C      0   1.289865
3      3   1.354760      A      0   4.354760
4      4  -0.581359      A      1   3.418641
5      5   1.633542      B      1   6.633542
6      6  -0.253950      B      1   5.746050
7      7   1.799087      A      1   8.799087
8      8   0.412991      C      0   8.412991
9      9   0.374330      A      0   9.374330
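Assigning to `df['col_e']` modifies the frame in place. If the original should stay untouched, `DataFrame.assign` returns a new frame instead. A minimal sketch, using a hypothetical frame `df_n`:

```python
import pandas as pd

# Hypothetical three-row frame for illustration.
df_n = pd.DataFrame({'col_a': [0, 1, 2], 'col_b': [0.5, -1.0, 2.0]})

# assign() returns a copy with the extra column; df_n itself is unchanged.
df_new = df_n.assign(col_e=df_n['col_a'] + df_n['col_b'])

print(df_new)
print(list(df_n.columns))
```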

Dropping columns

In [12]:
# Drop column col_e
df = df.drop(columns='col_e')
df
# R code:
# df <- df[, !names(df) == 'col_e']

Out[12]:
   col_a      col_b  col_c  col_d
0      0   0.308520      C      1
1      1  -1.829450      B      1
2      2  -0.710135      C      0
3      3   1.354760      A      0
4      4  -0.581359      A      1
5      5   1.633542      B      1
6      6  -0.253950      B      1
7      7   1.799087      A      1
8      8   0.412991      C      0
9      9   0.374330      A      0

In [13]:
# Drop the first column
df.drop(columns=df.columns[0])
# R code:
# df[, -1]

Out[13]:
       col_b  col_c  col_d
0   0.308520      C      1
1  -1.829450      B      1
2  -0.710135      C      0
3   1.354760      A      0
4  -0.581359      A      1
5   1.633542      B      1
6  -0.253950      B      1
7   1.799087      A      1
8   0.412991      C      0
9   0.374330      A      0
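Note that `drop` returns a new frame, which is why In [12] reassigns the result to `df` while In [13] leaves `df` unchanged. The sketch below, on a hypothetical frame `df_d`, also shows dropping several columns at once and tolerating a missing column name with `errors='ignore'`.

```python
import pandas as pd

# Hypothetical frame for illustration.
df_d = pd.DataFrame({'col_a': [0, 1],
                     'col_b': [0.1, 0.2],
                     'col_e': [0.1, 1.2]})

# Drop several columns in one call; df_d itself is not modified.
dropped = df_d.drop(columns=['col_b', 'col_e'])

# By default a missing name raises KeyError; errors='ignore' skips it.
safe = df_d.drop(columns='no_such_column', errors='ignore')

print(list(dropped.columns))
print(list(df_d.columns))
```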

Transposing

In [14]:
df.T
# R code:
# t(df)

Out[14]:
             0         1          2        3          4        5         6        7         8        9
col_a        0         1          2        3          4        5         6        7         8        9
col_b  0.30852  -1.82945  -0.710135  1.35476  -0.581359  1.63354  -0.25395  1.79909  0.412991  0.37433
col_c        C         B          C        A          A        B         B        A         C        A
col_d        1         1          0        0          1        1         1        1         0        0
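One side effect worth knowing: when the columns have mixed types, each transposed column has to hold values of different types, so pandas falls back to the generic `object` dtype (much as `t()` in R returns a character matrix for mixed data). A small sketch with a hypothetical frame `df_t`:

```python
import pandas as pd

# Hypothetical frame with one integer column and one string column.
df_t = pd.DataFrame({'col_a': [0, 1], 'col_c': ['C', 'B']})

transposed = df_t.T               # each new column mixes an int and a string
numeric_only = df_t[['col_a']].T  # a single numeric dtype survives transposition

print(transposed.dtypes)
print(numeric_only.dtypes)
```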

Converting data types

In [15]:
df['col_a'].astype(str)
# R code:
# as.character(df$col_a)

Out[15]:
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
Name: col_a, dtype: object

Converting to a categorical (pandas) / factor (R) type

In [16]:
pd.Categorical(df['col_c'])
# R code:
# factor(df$col_c)

Out[16]:
[C, B, C, A, A, B, B, A, C, A]
Categories (3, object): [A, B, C]
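To keep the categorical data inside a Series or DataFrame column, `astype('category')` is a common alternative to `pd.Categorical`. The sketch below uses a hypothetical Series `s` and inspects the categories and integer codes, which are 0-based, unlike the 1-based codes of an R factor.

```python
import pandas as pd

# Hypothetical series for illustration.
s = pd.Series(['C', 'B', 'C', 'A'])

cat = s.astype('category')
levels = list(cat.cat.categories)   # categories are sorted by default
codes = list(cat.cat.codes)         # 0-based position of each value's category

# An explicit category order, comparable to factor(x, levels = ..., ordered = TRUE).
ordered = pd.Categorical(s, categories=['C', 'B', 'A'], ordered=True)

print(levels, codes)
print(ordered)
```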

Summarizing data

Row-wise computation

In [17]:
df[['col_a', 'col_b']].sum(axis=1)
# R code:
# apply(df[, c('col_a', 'col_b')], 1, sum)

Out[17]:
0    0.308520
1   -0.829450
2    1.289865
3    4.354760
4    3.418641
5    6.633542
6    5.746050
7    8.799087
8    8.412991
9    9.374330
dtype: float64

Column-wise computation

In [18]:
df[['col_a', 'col_b']].mean(axis=0)
# R code:
# apply(df[, c('col_a', 'col_b')], 2, mean)

Out[18]:
col_a    4.500000
col_b    0.250834
dtype: float64

In [19]:
df[['col_a', 'col_b']].apply(lambda x: x.mean() + 10)
# R code:
# apply(df[, c('col_a', 'col_b')], 2, function(x) mean(x) + 10)

Out[19]:
col_a    14.500000
col_b    10.250834
dtype: float64
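The `axis` argument is easy to mix up with R's `MARGIN`: `axis=0` yields one result per column (like `MARGIN = 2`), and `axis=1` yields one result per row (like `MARGIN = 1`). A minimal sketch on a hypothetical frame `df_s`:

```python
import pandas as pd

# Hypothetical numeric frame for illustration.
df_s = pd.DataFrame({'col_a': [0, 1, 2], 'col_b': [1.0, 2.0, 3.0]})

col_means = df_s.mean(axis=0)   # one value per column
row_sums = df_s.sum(axis=1)     # one value per row

# apply() with axis=1 passes each row to the function as a Series.
row_range = df_s.apply(lambda r: r.max() - r.min(), axis=1)

print(col_means)
print(row_sums)
print(row_range)
```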

Combining data

Binding columns

In [20]:
df2 = pd.DataFrame({'col_x': np.arange(10),
                    'col_y': np.arange(10)[::-1]})
df2

Out[20]:
   col_x  col_y
0      0      9
1      1      8
2      2      7
3      3      6
4      4      5
5      5      4
6      6      3
7      7      2
8      8      1
9      9      0

In [21]:
pd.concat([df, df2], axis=1)
# R code:
# cbind(df, df2)

Out[21]:
   col_a      col_b  col_c  col_d  col_x  col_y
0      0   0.308520      C      1      0      9
1      1  -1.829450      B      1      1      8
2      2  -0.710135      C      0      2      7
3      3   1.354760      A      0      3      6
4      4  -0.581359      A      1      4      5
5      5   1.633542      B      1      5      4
6      6  -0.253950      B      1      6      3
7      7   1.799087      A      1      7      2
8      8   0.412991      C      0      8      1
9      9   0.374330      A      0      9      0
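One difference from `cbind` is worth a sketch: `pd.concat(..., axis=1)` matches rows by index label rather than by position. In the example above both frames share the default index 0 to 9, so the two behaviors coincide. The hypothetical frames `left` and `right` below have the same labels in a different order, which makes the difference visible; resetting the index restores positional binding.

```python
import pandas as pd

# Hypothetical frames whose index labels appear in opposite order.
left = pd.DataFrame({'col_a': [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({'col_x': [10, 20, 30]}, index=[2, 1, 0])

# Rows are matched on index label, not on position.
aligned = pd.concat([left, right], axis=1)

# Discarding the index first gives cbind-like positional binding.
positional = pd.concat([left.reset_index(drop=True),
                        right.reset_index(drop=True)], axis=1)

print(aligned)
print(positional)
```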

Binding rows

In [22]:
df3 = pd.DataFrame({'col_a': [-1, -2],
                    'col_b': [0, 1],
                    'col_c': ['B', 'C'],
                    'col_d': [1, 0]})
df3

Out[22]:
   col_a  col_b  col_c  col_d
0     -1      0      B      1
1     -2      1      C      0

In [23]:
pd.concat([df, df3], axis=0, ignore_index=True)
# R code:
# rbind(df, df3)

Out[23]:
    col_a      col_b  col_c  col_d
 0      0   0.308520      C      1
 1      1  -1.829450      B      1
 2      2  -0.710135      C      0
 3      3   1.354760      A      0
 4      4  -0.581359      A      1
 5      5   1.633542      B      1
 6      6  -0.253950      B      1
 7      7   1.799087      A      1
 8      8   0.412991      C      0
 9      9   0.374330      A      0
10     -1   0.000000      B      1
11     -2   1.000000      C      0
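The `ignore_index=True` argument is what renumbers the rows from 0 to 11 in the output above; without it the original labels are kept and can repeat. The sketch below, using hypothetical frames `top` and `bottom`, shows both behaviors, along with the integer-to-float upcast that turns `0` and `1` into `0.000000` and `1.000000` in `col_b`.

```python
import pandas as pd

# Hypothetical frames; bottom's col_b holds integers, top's holds floats.
top = pd.DataFrame({'col_a': [0, 1], 'col_b': [0.5, 1.5]})
bottom = pd.DataFrame({'col_a': [-1], 'col_b': [0]})

kept = pd.concat([top, bottom])                            # original labels kept
renumbered = pd.concat([top, bottom], ignore_index=True)   # fresh 0..n-1 index

print(list(kept.index))
print(list(renumbered.index))
print(renumbered['col_b'].dtype)
```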

Reposted from: http://eacjo.baihongyu.com/
