Advanced Pandas Tutorial: Window Operations - 天翼云


2023-03-24 10:31:38 Views: 159

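Introduction

In data analysis we often need to compute statistics over a range of consecutive values; such a range is called a window. Pandas provides the rolling method, which slides a window across the data and computes a statistic at each position.

This article looks at how windows are used with rolling.

Rolling windows

Suppose we have five numbers and want a rolling sum over every two consecutive values (the examples in this article assume import pandas as pd and import numpy as np):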

In [1]: s = pd.Series(range(5))

In [2]: s.rolling(window=2).sum()
Out[2]: 
0    NaN
1    1.0
2    3.0
3    5.0
4    7.0
dtype: float64
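
To make the arithmetic behind the result above explicit, here is a minimal sketch that rebuilds the two-element rolling sum by adding each value to its predecessor. The variable names manual and rolled are only illustrative.

```python
import pandas as pd

s = pd.Series(range(5))

# With window=2, each output is the current element plus the previous one.
# shift(1) moves every value down one position, leaving NaN at position 0.
manual = s + s.shift(1)
rolled = s.rolling(window=2).sum()

# Both approaches give NaN at position 0 and the pairwise sums afterwards.
pd.testing.assert_series_equal(manual, rolled)
print(rolled.tolist())
```

The first position is NaN because only one value is available there, which is fewer than the window size.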

A Rolling object can also be iterated over with a for loop:

In [3]: for window in s.rolling(window=2):
   ...:     print(window)
   ...: 
0    0
dtype: int64
0    0
1    1
dtype: int64
1    1
2    2
dtype: int64
2    2
3    3
dtype: int64
3    3
4    4
dtype: int64
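
As the output above shows, iteration also yields the incomplete first window. The following sketch, with illustrative variable names, collects the window lengths and then sums only the full windows, which matches the non-NaN values returned by sum().

```python
import pandas as pd

s = pd.Series(range(5))

# Iterating yields one sub-Series per position, including the short first one.
lengths = [len(window) for window in s.rolling(window=2)]

# Keeping only the full windows reproduces the non-NaN rolling sums.
full_sums = [
    float(window.sum())
    for window in s.rolling(window=2)
    if len(window) == 2
]

print(lengths)
print(full_sums)
```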

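pandas supports four kinds of window operations. They are defined as follows:

Concept	Method	Returned object	Supports time-based windows	Supports chained groupby
Fixed or sliding window	`rolling`	`Rolling`	Yes	Yes
Weighted, non-rectangular window provided by scipy.signal	`rolling`	`Window`	No	No
Window that accumulates values	`expanding`	`Expanding`	No	Yes
Accumulating, exponentially weighted window over values	`ewm`	`ExponentialMovingWindow`	No	Yes (as of version 1.2)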
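
The "chained groupby" column in the table means that the window method can be called directly on a groupby result, so that windows never cross group boundaries. A minimal sketch, assuming a small made-up DataFrame whose column names key and value are hypothetical:

```python
import pandas as pd

# Hypothetical data: two groups, "a" and "b".
df = pd.DataFrame(
    {"key": ["a", "a", "a", "b", "b"], "value": [1, 2, 3, 10, 20]}
)

# The window restarts inside each group; the result is indexed by
# (group key, original row label).
rolled = df.groupby("key")["value"].rolling(window=2).sum()
expanded = df.groupby("key")["value"].expanding().sum()

print(rolled)
print(expanded)
```

The first row of each group is NaN in rolled because the window does not reach back into the previous group.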

Here is an example of time-based rolling:

In [4]: s = pd.Series(range(5), index=pd.date_range('2020-01-01', periods=5, freq='1D'))

In [5]: s.rolling(window='2D').sum()
Out[5]: 
2020-01-01    0.0
2020-01-02    1.0
2020-01-03    3.0
2020-01-04    5.0
2020-01-05    7.0
Freq: D, dtype: float64
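
With evenly spaced daily data, a '2D' window and a window of two rows look almost identical. The difference shows up when timestamps are irregular. The sketch below uses a made-up index with a gap between 2020-01-03 and 2020-01-06; the names by_time and by_count are illustrative.

```python
import pandas as pd

# Hypothetical irregular timestamps with a three-day gap.
idx = pd.to_datetime(
    ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"]
)
s = pd.Series(range(5), index=idx)

# Time-based: each window covers the two days ending at the current timestamp.
by_time = s.rolling(window="2D").sum()

# Count-based: each window covers the current row and the previous row.
by_count = s.rolling(window=2).sum()

print(by_time)
print(by_count)
```

At 2020-01-06 the time-based window contains only that day's value, while the count-based window still reaches back to 2020-01-03.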

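min_periods sets the minimum number of non-NaN observations a window must contain to produce a value; windows with fewer valid observations yield NaN: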

In [8]: s = pd.Series([np.nan, 1, 2, np.nan, np.nan, 3])

In [9]: s.rolling(window=3, min_periods=1).sum()
Out[9]: 
0    NaN
1    1.0
2    3.0
3    3.0
4    2.0
5    3.0
dtype: float64

In [10]: s.rolling(window=3, min_periods=2).sum()
Out[10]: 
0    NaN
1    NaN
2    3.0
3    3.0
4    NaN
5    NaN
dtype: float64

# Equivalent to min_periods=3
In [11]: s.rolling(window=3, min_periods=None).sum()
Out[11]: 
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
dtype: float64
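
One way to check the three results above is to count the valid (non-NaN) observations in each window and compare that count with min_periods. This is only a sketch of the rule, not how pandas implements it; the name valid is illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1, 2, np.nan, np.nan, 3])

# Number of non-NaN observations in each window of size 3.
valid = s.notna().astype(int).rolling(window=3, min_periods=1).sum()

for min_periods in (1, 2, 3):
    result = s.rolling(window=3, min_periods=min_periods).sum()
    # The result is NaN exactly where the window holds too few valid values.
    assert result.isna().equals(valid < min_periods)

print(valid.tolist())
```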

Center window

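By default, each result is labeled at the right edge of its window. With window=5, for example, positions 0, 1, 2, and 3 are NaN because their windows contain fewer than 5 values.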

In [19]: s = pd.Series(range(10))

In [20]: s.rolling(window=5).mean()
Out[20]: 
0    NaN
1    NaN
2    NaN
3    NaN
4    2.0
5    3.0
6    4.0
7    5.0
8    6.0
9    7.0
dtype: float64

This can be changed: setting center=True labels each result at the center of its window instead:

In [21]: s.rolling(window=5, center=True).mean()
Out[21]: 
0    NaN
1    NaN
2    2.0
3    3.0
4    4.0
5    5.0
6    6.0
7    7.0
8    NaN
9    NaN
dtype: float64
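
For an odd window size, centering amounts to moving the right-aligned result back by half a window. The sketch below checks this for window=5, where the shift is two positions; the names right and centered are illustrative.

```python
import pandas as pd

s = pd.Series(range(10))

right = s.rolling(window=5).mean()
centered = s.rolling(window=5, center=True).mean()

# Shifting the right-aligned result up by (5 - 1) // 2 = 2 positions
# gives the centered result, with NaN at both ends.
pd.testing.assert_series_equal(centered, right.shift(-2))
print(centered.tolist())
```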

Weighted window

win_type selects the type of weighted window. Its value must be the name of a window function in scipy.signal.

A few examples:

In [47]: s = pd.Series(range(10))

In [48]: s.rolling(window=5).mean()
Out[48]: 
0    NaN
1    NaN
2    NaN
3    NaN
4    2.0
5    3.0
6    4.0
7    5.0
8    6.0
9    7.0
dtype: float64

In [49]: s.rolling(window=5, win_type="triang").mean()
Out[49]: 
0    NaN
1    NaN
2    NaN
3    NaN
4    2.0
5    3.0
6    4.0
7    5.0
8    6.0
9    7.0
dtype: float64

# Supplementary Scipy arguments passed in the aggregation function
In [50]: s.rolling(window=5, win_type="gaussian").mean(std=0.1)
Out[50]: 
0    NaN
1    NaN
2    NaN
3    NaN
4    2.0
5    3.0
6    4.0
7    5.0
8    6.0
9    7.0
dtype: float64
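
All three outputs above are identical because the input is linear and the weights are symmetric, so the weighted mean always lands on the middle value of the window. The sketch below illustrates this without SciPy by applying hand-written weights through rolling.apply. The weights [1, 2, 3, 2, 1] are an assumption: they are proportional to what scipy.signal.windows.triang(5) returns, and the helper weighted_mean is hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed weights, proportional to scipy.signal.windows.triang(5).
weights = np.array([1.0, 2.0, 3.0, 2.0, 1.0])


def weighted_mean(values):
    """Weighted mean of one full window, normalized by the weight sum."""
    return np.average(values, weights=weights)


linear = pd.Series(range(10), dtype="float64")
squares = linear ** 2

# Linear input: the weighted mean equals the plain mean.
linear_weighted = linear.rolling(window=5).apply(weighted_mean, raw=True)
linear_plain = linear.rolling(window=5).mean()

# Non-linear input: the weighting now changes the result.
squares_weighted = squares.rolling(window=5).apply(weighted_mean, raw=True)
squares_plain = squares.rolling(window=5).mean()

print(linear_weighted.iloc[4], linear_plain.iloc[4])
print(squares_weighted.iloc[4], squares_plain.iloc[4])
```

On the squared series the weighted mean is pulled toward the center of the window, which is the effect a weighted window is meant to have.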

Expanding window

An expanding window yields the value of an aggregation statistic computed over all data available up to that point.

In [51]: df = pd.DataFrame(range(5))

In [52]: df.rolling(window=len(df), min_periods=1).mean()
Out[52]: 
     0
0  0.0
1  0.5
2  1.0
3  1.5
4  2.0

In [53]: df.expanding(min_periods=1).mean()
Out[53]: 
     0
0  0.0
1  0.5
2  1.0
3  1.5
4  2.0
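
The expanding mean is simply the running total divided by the number of rows seen so far. A minimal sketch of that equivalence, with illustrative names counts and manual:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(range(5))

# Number of rows seen so far: 1, 2, 3, 4, 5.
counts = np.arange(1, len(df) + 1)

# Running total divided by the running count.
manual = df.cumsum().div(counts, axis=0)
expanded = df.expanding(min_periods=1).mean()

assert np.allclose(manual.to_numpy(), expanded.to_numpy())
print(expanded[0].tolist())
```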

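Exponentially weighted window

An exponentially weighted window is similar to an expanding window, but each earlier point is weighted exponentially relative to the current point.

The weighted average is computed as:

y_t=\frac{\sum_{i=0}^{t}w_ix_{t-i}}{\sum_{i=0}^{t}w_i}

where x_t is the input, y_t is the output, and w_i are the weights.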

EW has two modes. The first is adjust=True (the default), in which case the weights are:

w_i=(1-\alpha)^i

The second is adjust=False, in which case the output is computed recursively:

y_0=x_0

y_t=(1-\alpha)y_{t-1}+\alpha x_t

where 0<\alpha\leq 1.
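
Both modes can be checked by hand against ewm. The sketch below assumes the weights w_i = (1 - alpha)^i that pandas documents for adjust=True and uses the recursion for adjust=False. The input values, alpha = 0.5, and the helper names ewm_adjusted and ewm_recursive are all illustrative.

```python
import numpy as np
import pandas as pd

alpha = 0.5
x = pd.Series([1.0, 2.0, 4.0, 8.0])


def ewm_adjusted(values, alpha):
    """adjust=True: weighted average with weights (1 - alpha) ** i."""
    out = []
    for t in range(len(values)):
        w = (1 - alpha) ** np.arange(t + 1)  # w_0, ..., w_t
        window = values[t::-1]               # x_t, x_{t-1}, ..., x_0
        out.append(np.sum(w * window) / np.sum(w))
    return np.array(out)


def ewm_recursive(values, alpha):
    """adjust=False: y_0 = x_0, y_t = (1 - alpha) * y_{t-1} + alpha * x_t."""
    out = [values[0]]
    for t in range(1, len(values)):
        out.append((1 - alpha) * out[-1] + alpha * values[t])
    return np.array(out)


adjusted = ewm_adjusted(x.to_numpy(), alpha)
recursive = ewm_recursive(x.to_numpy(), alpha)

# Compare the hand-written versions with pandas.
assert np.allclose(adjusted, x.ewm(alpha=alpha, adjust=True).mean())
assert np.allclose(recursive, x.ewm(alpha=alpha, adjust=False).mean())
print(recursive.tolist())
```

The two modes agree on the first value and then diverge, because adjust=True renormalizes by the sum of the weights at every step.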
