top of page

Data Science in Drilling - Episode 10

Writer: Zeyu YanZeyu Yan

Date and Time Processing in Pandas


written by Zeyu Yan, Ph.D., Head of Data Science from Nvicta AI


Data Science in Drilling is a multi-episode series written by the technical team members in Nvicta AI. Nvicta AI is a startup company who helps drilling service companies increase their value offering by providing them with advanced AI and automation technologies and services. The goal of this Data Science in Drilling series is to provide both data engineers and drilling engineers an insight of the state-of-art techniques combining both drilling engineering and data science.


This episode is about date and time processing in Pandas. Enjoy!


Enjoying great knowledge is just like enjoying delicious lobster roll.



Introduction


Date and time are frequently used in Pandas. In this blog post, we will cover how to work with them in depth.


What We'll Cover Today
  1. What is Pandas Timestamp.

  2. How to process date in Pandas.

  3. How to process time in Pandas.

  4. How to work with time zones in Pandas.


Pandas Timestamp


pandas.Timestamp is the pandas equivalent of python’s Datetime and is interchangeable with it in most cases. It’s the type used for the entries that make up a DatetimeIndex, and other timeseries oriented data structures in pandas. A Timestamp object could be created through the following ways:

import pandas as pd

pd.to_datetime('2020-9-13')

Or

pd.Timestamp('2020-9-13')

Both of them return the same result:

Timestamp('2020-09-13 00:00:00')

The advantage of a Pandas Timestamp object is that it has series of useful attributes and methods, for example:

ts = pd.Timestamp('2020-9-13')

The name of the weekday and the name of the month can be retrieved as follows:

ts.day_name()
ts.month_name()

The results are:

Sunday
September

The day, month and year can be retrieved as follows:

ts.day
ts.month
ts.year

The results are:

13
9
2020

To see if the given date is the start or the end of the month:

ts.is_month_start
ts.is_month_end

The results are:

False
False

To see if the given date is the start or the end of the quarter:

ts.is_quarter_start
ts.is_quarter_end

The results are:

False
False

To see if the given year is a leap year:

ts.is_leap_year

The result is:

True

To retrieve the week number of the year for the given week:

ts.weekofyear

The result is:

37

For more available attributes and methods of a Pandas Timestamp object, please refer to the official documentation.


Date Processing in Pandas


Let’s first talk about how to work with date in Pandas. Define the following dummy DataFrame as follows:

import numpy as np

df = pd.DataFrame({
    'date': ['Dec-2020', 'Oct-2019', 'Feb-2021', 'Jan-2018'],
    'value': np.arange(4)
})
df

The dummy DataFrame is:


Let’s convert the date column from string to datetime:

df['date_dt'] = pd.to_datetime(df['date'])
df[['date', 'date_dt', 'value']]

The resulted DataFrame is:


Create an extra column which contains the week of the year (1 to 52) for the date in each row:

# Extract week of the year (1 to 52)
df['date_dt_woy'] = df['date_dt'].dt.week
df[['date_dt', 'date_dt_woy']]

The resulted DataFrame is:


Create an extra column which contains the month (1 to 12) for the date in each row:

# Extract month (1 to 12)
df['date_dt_month'] = df['date_dt'].dt.month
df[['date_dt', 'date_dt_month']]

The resulted DataFrame is:


Create an extra column which contains the quarter (1 to 4) for the date in each row:

# Extract quarter (1 to 4)
df['date_dt_quarter'] = df['date_dt'].dt.quarter
df[['date_dt', 'date_dt_quarter']]

The resulted DataFrame is:


To extract year:

# Extract year
df['date_dt_year'] = df['date_dt'].dt.year
df[['date_dt', 'date_dt_year']]

The resulted DataFrame is:


To extract day:

# Extract day
df['date_dt_day'] = df['date_dt'].dt.day
df[['date_dt', 'date_dt_day']]

The resulted DataFrame is:


The day of week (0 to 6) of a given date can be retrieved:

# Extract day of week (0 to 6)
df['date_dt_dow'] = df['date_dt'].dt.dayofweek
df[['date_dt', 'date_dt_dow']]

The resulted DataFrame is:


The name of the weekday (Monday through Sunday) of a given date can be retrieved:

df['date_dt_down'] = df['date_dt'].dt.day_name()
df[['date_dt', 'date_dt_down']]

The resulted DataFrame is:


This is how an extra column to check if the date in a given row is in weekend can be created:

df['is_in_weekend'] = df['date_dt_down'].apply(lambda x: 1 if x in ['Saturday', 'Sunday'] else 0)
df[['date_dt', 'date_dt_down', 'is_in_weekend']]

The resulted DataFrame is:


Let’s calculate the difference in days between each row and its previous row:

df['days_diff'] = df['date_dt'].diff()
df[['date_dt', 'days_diff']]

The resulted DataFrame is:


We can convert the difference in days to difference in months:

df['months_diff'] = np.round((
    df['days_diff'] / np.timedelta64(1, 'M')
), 0)
df[['date_dt', 'days_diff', 'months_diff']]

The resulted DataFrame is:


Time Processing in Pandas


Now let’s talk about how to work with time in Pandas. Define a dummy Series and convert it into a DataFrame:

date_series = pd.Series(
    pd.date_range('2021-10-12 10:12:00', periods=10, freq='H')
)
df = pd.DataFrame({
    'date': date_series
})
df

The resulted DataFrame is:


Retrieve hour, minute and second:

df['hour'] = df['date'].dt.hour
df['min'] = df['date'].dt.minute
df['sec'] = df['date'].dt.second
df

The resulted DataFrame is:


Time can also be retrieved as a whole:

df['time'] = df['date'].dt.time
df

The resulted DataFrame is:


Create two new dummy Series and combine them as a DataFrame:

date_series_1 = pd.Series(
    pd.date_range('2020-1-5 08:00:00', periods=7, freq='M')
)
date_series_2 = pd.Series(
    pd.date_range('2021-3-12 22:40:00', periods=7, freq='W')
)
df = pd.DataFrame({
    'start_date': date_series_1,
    'end_date': date_series_2
})
df

Here the pd.date_range function is an important function in Pandas for conveniently creating date ranges. 'M' stands for 1 month and 'W' stands for 1 week for frequencies in this case. If one wants the frequency to be 2 weeks for example, then freq='2W' can be conveniently used. The resulted DataFrame in this case is:


Let's calculate the time difference between the end date and the start date in seconds for each row:

df['diff_seconds'] = (df['end_date'] - df['start_date']) / np.timedelta64(1, 's')
df 

The resulted DataFrame is:


The time difference can also be calculated in minutes:

df['diff_minutes'] =  (df['end_date'] - df['start_date']) / np.timedelta64(1, 'm')
df

The resulted DataFrame is:


Working with Time Zones


Sometimes the datetime will contain time zones. Let's go through a quick example on how to work with time zones in Pandas. Define the following dummy DataFrame with time zones:

df = pd.DataFrame()
df['time'] = pd.concat([
    pd.Series(
        pd.date_range(start='2020-08-05 06:00:00', periods=3, freq='H', tz='Asia/Shanghai')
    ),
    pd.Series(
        pd.date_range(start='2021-05-20 13:20:00', periods=3, freq='H', tz='US/Central')
    )
], axis=0)
df

To check all the available time zones in Python, the following commands can be used:

import pytz
pytz.all_timezones

The dummy DataFrame is:


Convert the time in each row to UTC time and New York time:

df['time_utc'] = pd.to_datetime(df['time'], utc=True)
df['time_ny'] = df['time_utc'].dt.tz_convert('America/New_York')
df

The resulted DataFrame is:


Conclusions


In this article, we went through how to work with date and time in Pandas. Hope you enjoy it! More skills about Pandas and other interesting topics will be covered in the future episodes. Stay tuned!


Get in Touch


Thank you for reading! Please let us know if you like this series or if you have critiques. If this series was helpful to you, please follow us and share this series to your friends.


If you or your company needs any help on projects related to drilling automation and optimization, AI, and data science, please get in touch with us Nvicta AI. We are here to help. Cheers!

Comments


bottom of page