在Python中进行数据统计时,有些数据我们可能需要统计每月或指时间范围的数据记录数,本文主要介绍Python pandas中通过时间计算统计每月数据记录数的方法,以及相关的示例代码。

示例数据:

Student_id	actvity_timestamp
1001 2019-09-05 08:26:12
1001 2019-09-06 09:26:12
1001 2019-09-21 10:11:01
1001 2019-10-24 11:44:01
1001 2019-10-25 11:31:01
1001 2019-10-26 12:13:01
1002 2019-09-11 12:21:01
1002 2019-09-12 13:11:01
1002 2019-11-23 16:22:01

实现输出:

Student_id

days_in_Sept

days_in_Oct

days_in_Nov

1001

3

3

0

1002

2

0

1

1、使用pd.crosstab()实现

crossTab()函数是用于统计分组频率的特殊透视表。计算两个(或多个)因子的简单交叉表。默认情况下,计算因子的频率表,除非传递值数组和聚合函数。

import pandas as pd
import numpy as np

df = pd.DataFrame.from_dict({
    "Student_id": [1001,1001,1001,1001,1001,1001,1002,1002,1002],
    "activity_timestamp": ["2019-09-05 08:26:12", "2019-09-06 09:26:12", "2019-09-21 10:11:01", "2019-10-24 11:44:01", "2019-10-25 11:31:01", "2019-10-26 12:13:01", "2019-09-11 12:21:01", "2019-09-12 13:11:01", "2019-11-23 16:22:01"]
})

months = pd.to_datetime(df.activity_timestamp).dt.strftime("%B")
print()
result = pd.crosstab(
    df.Student_id,
    months,
    values=pd.to_datetime(df.activity_timestamp).dt.date,
    aggfunc=pd.Series.nunique 
print(result)

相关文档

https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.strftime.html?highlight=strftime#pandas.Series.dt.strftime

2、使用df.pivot_table()实现

pivot_table()是创建一个电子表格样式的数据透视表作为DataFrame。透视表中的级别将存储在结果DataFrame的索引和列上的MultiIndex对象(分层索引)中。

import pandas as pd
import numpy as np


df = pd.DataFrame.from_dict({
    "Student_id": [1001,1001,1001,1001,1001,1001,1002,1002,1002],
    "actvity_timestamp": ["2019-09-05 08:26:12", "2019-09-06 09:26:12", "2019-09-21 10:11:01", "2019-10-24 11:44:01", "2019-10-25 11:31:01", "2019-10-26 12:13:01", "2019-09-11 12:21:01", "2019-09-12 13:11:01", "2019-11-23 16:22:01"]
})


df['actvity_timestamp'] = pd.to_datetime(df['actvity_timestamp']) # to datetime format 
df['activity_month'] = df['actvity_timestamp'].dt.strftime('%b')  # get month short name
df['activity_date'] = df['actvity_timestamp'].dt.date     # get activity dates

df_out = (df.pivot_table(index='Student_id',   # group under each student id
                         columns='activity_month',  # month short name as new columns
                         values='activity_date',    # aggregate on dates
                         aggfunc='nunique',    #activities on the same date counted once
                         fill_value=0)
            .rename_axis(columns=None)                          
         )
print(df_out)

相关文档:Python pandas.DataFrame.pivot_table函数方法的使用

推荐文档