Python: working with dates and times

This blog post continues my journey of learning data science with Python through Datacamp. You can find detailed illustrations and runs of the following scenarios in my Kaggle workspace and on GitHub.

You'll probably never have a time machine, but how about a machine for analyzing time? As soon as time enters any analysis, things can get weird. It's easy to get tripped up on day and month boundaries, time zones, daylight saving time, and all sorts of other things that can confuse the unprepared. If you're going to do any kind of analysis involving time, you’ll want to use Python to sort it out. Working with data sets on hurricanes and bike trips, we’ll cover counting events, figuring out how much time has elapsed between events and plotting data over time. You'll work in both standard Python and in Pandas, and we'll touch on the dateutil library, the only timezone library endorsed by the official Python documentation. After this course, you'll confidently handle date and time data in any format like a champion.

<====================================================================================================================================>

Dates and Calendars

Hurricanes (also known as cyclones or typhoons) hit the U.S. state of Florida several times per year. To start off this course, you'll learn how to work with date objects in Python, starting with the dates of every hurricane to hit Florida since 1950. You'll learn how Python handles dates, common date operations, and the right way to format dates to avoid confusion.

___________________________________________________________________________________________________________________________________

How many hurricanes come early?

In this chapter, you will work with a list of the hurricanes that made landfall in Florida from 1950 to 2017. There were 235 in total. Check out the variable florida_hurricane_dates, which has all of these dates.

Atlantic hurricane season officially begins on June 1. How many hurricanes since 1950 have made landfall in Florida before the official start of hurricane season?

Instructions

100 XP

Complete the for loop to iterate through florida_hurricane_dates.

Complete the if statement to increment the counter (early_hurricanes) if the hurricane made landfall before June.

# Counter for how many before June 1

early_hurricanes = 0

# We loop over the dates

for hurricane in florida_hurricane_dates:

    # Check if the month is before June (month number 6)

    if hurricane.month < 6:

        early_hurricanes = early_hurricanes + 1

print(early_hurricanes)

___________________________________________________________________________________________________________________________________

Subtracting dates

Python date objects let us treat calendar dates as something similar to numbers: we can compare them, sort them, add, and even subtract them. This lets us do math with dates in a way that would be a pain to do by hand.

The 2007 Florida hurricane season was one of the busiest on record, with 8 hurricanes in one year. The first one hit on May 9th, 2007, and the last one hit on December 13th, 2007. How many days elapsed between the first and last hurricane in 2007?

Instructions

100 XP

Import date from datetime.

Create a date object for May 9th, 2007, and assign it to the start variable.

Create a date object for December 13th, 2007, and assign it to the end variable.

Subtract start from end, to print the number of days in the resulting timedelta object.

# Import date

from datetime import date

# Create a date object for May 9th, 2007

start = date(2007, 5, 9)

# Create a date object for December 13th, 2007

end = date(2007, 12, 13)

# Subtract the two dates and print the number of days

print((end - start).days)

Good job! One thing to note: be careful using this technique for historical dates hundreds of years in the past. Our calendar systems have changed over time, and not every date from then would be the same day and month today.
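To make the "dates behave a bit like numbers" idea concrete, here is a minimal sketch using the two 2007 dates from this exercise. The names season_length and midpoint are my own, chosen only for illustration.

```python
from datetime import date

start = date(2007, 5, 9)
end = date(2007, 12, 13)

# Subtracting two dates gives a timedelta
season_length = end - start

# Adding a timedelta to a date gives another date
midpoint = start + season_length / 2

# Dates compare like numbers: earlier dates are "smaller"
print(season_length.days)
print(midpoint.isoformat())
print(start < midpoint < end)
```

Comparison, addition and subtraction all work without any manual bookkeeping of month lengths.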

___________________________________________________________________________________________________________________________________

Counting events per calendar month

Hurricanes can make landfall in Florida throughout the year. As we've already discussed, some months are more hurricane-prone than others.

Using florida_hurricane_dates, let's see how hurricanes in Florida were distributed across months throughout the year.

We've created a dictionary called hurricanes_each_month to hold your counts and set the initial counts to zero. You will loop over the list of hurricanes, incrementing the correct month in hurricanes_each_month as you go, and then print the result.

Instructions

100 XP

Within the for loop:

Assign month to be the month of that hurricane.

Increment hurricanes_each_month for the relevant month by 1.

# A dictionary to count hurricanes per calendar month

hurricanes_each_month = {1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6:0,

7: 0, 8:0, 9:0, 10:0, 11:0, 12:0}

# Loop over all hurricanes

for hurricane in florida_hurricane_dates:

    # Pull out the month

    month = hurricane.month

    # Increment the count in your dictionary by one

    hurricanes_each_month[month] += 1

print(hurricanes_each_month)

<script.py> output:

    {1: 0, 2: 1, 3: 0, 4: 1, 5: 8, 6: 32, 7: 21, 8: 49, 9: 70, 10: 43, 11: 9, 12: 1}

Success! This illustrated a generally useful pattern for working with complex data: creating a dictionary, performing some operation on each element, and storing the results back in the dictionary.
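The same counting pattern can also be written with collections.Counter from the standard library. This is an alternative to the exercise's dictionary, not the course's solution, and sample_dates is a small hypothetical stand-in for florida_hurricane_dates.

```python
from collections import Counter
from datetime import date

# Hypothetical sample; the real florida_hurricane_dates list is not reproduced here
sample_dates = [
    date(1950, 8, 31),
    date(1992, 8, 26),
    date(2007, 5, 9),
    date(2017, 10, 29),
]

# Counter builds the month -> count mapping in one pass
counts = Counter(d.month for d in sample_dates)

# Months that never occur are absent from the Counter, but looking them up returns 0
for month in range(1, 13):
    print(month, counts[month])
```

The trade-off is that a Counter only stores months it has seen, so the explicit dictionary is still handy when you want every month to appear in the printed result.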

___________________________________________________________________________________________________________________________________

Putting a list of dates in order

Much like numbers and strings, date objects in Python can be put in order. Earlier dates come before later ones, and so we can sort a list of date objects from earliest to latest.

What if our Florida hurricane dates had been scrambled? We've gone ahead and shuffled them so they're in random order and saved the results as dates_scrambled. Your job is to put them back in chronological order, and then print the first and last dates from this sorted list.

Instructions 1/2

50 XP

Print the first and last dates in dates_scrambled.

# Print the first and last scrambled dates

print(dates_scrambled[0])

print(dates_scrambled[-1])

Sort dates_scrambled and save the results to dates_ordered.

Print the first and last dates in dates_ordered.

# Print the first and last scrambled dates

print(dates_scrambled[0])

print(dates_scrambled[-1])

# Put the dates in order

dates_ordered = sorted(dates_scrambled)

# Print the first and last ordered dates

print(dates_ordered[0])

print(dates_ordered[-1])

<script.py> output:

1988-08-04

2011-07-18

1950-08-31

2017-10-29

Excellent! You can use sorted() on several data types in Python, including sorting lists of numbers, lists of strings, or even lists of lists, which by default are compared on the first element.
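As a rough sketch of the "lists of lists" remark, the records below pair a date with a made-up label. The labels and the records variable are assumptions for illustration only.

```python
from datetime import date

# Hypothetical [date, label] records in scrambled order
records = [
    [date(2017, 10, 29), "third"],
    [date(1950, 8, 31), "first"],
    [date(1992, 8, 26), "second"],
]

# Inner lists are compared element by element, starting with the first (the date)
by_date = sorted(records)

# reverse=True gives latest-first order instead
latest_first = sorted(records, reverse=True)

print([label for _, label in by_date])
print([label for _, label in latest_first])
```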

___________________________________________________________________________________________________________________________________

Printing dates in a friendly format

Because people may want to see dates in many different formats, Python comes with very flexible functions for turning date objects into strings.

Let's see what event was recorded first in the Florida hurricane data set. In this exercise, you will format the earliest date in the florida_hurricane_dates list in two ways so you can decide which one you want to use: either the ISO standard or the typical US style.

Instructions

100 XP

Assign the earliest date in florida_hurricane_dates to first_date.

Print first_date in the ISO standard. For example, December 1st, 2000 would be "2000-12-01".

Print first_date in the US style, using .strftime(). For example, December 1st, 2000 would be "12/01/2000".

# Assign the earliest date to first_date

first_date = min(florida_hurricane_dates)

# Convert to ISO and US formats

iso = "Our earliest hurricane date: " + first_date.isoformat()

us = "Our earliest hurricane date: " + first_date.strftime("%m/%d/%Y")

print("ISO: " + iso)

print("US: " + us)

<script.py> output:

ISO: Our earliest hurricane date: 1950-08-31

US: Our earliest hurricane date: 08/31/1950

Correct! When in doubt, use the ISO format for dates. ISO dates are unambiguous. And if you sort them 'alphabetically', for example, in filenames, they will be in the correct order.
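Here is a small check of the sorting claim, assuming three sample dates of my own choosing. It compares plain string sorting of ISO strings against the US style.

```python
from datetime import date

dates = [date(2017, 10, 29), date(1950, 8, 31), date(1992, 8, 26)]

iso_strings = [d.isoformat() for d in dates]
us_strings = [d.strftime("%m/%d/%Y") for d in dates]

# Plain string sorting, as a file browser would do with filenames
print(sorted(iso_strings))  # year comes first, so this is chronological
print(sorted(us_strings))   # month comes first, so years end up out of order
```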

___________________________________________________________________________________________________________________________________

Representing dates in different ways

date objects in Python have a great number of ways they can be printed out as strings. In some cases, you want to know the date in a clear, language-agnostic format. In other cases, you want something which can fit into a paragraph and flow naturally.

Let's try printing out the same date, August 26, 1992 (the day that Hurricane Andrew made landfall in Florida), in a number of different ways, to practice using the .strftime() method.

A date object called andrew has already been created.

Instructions 1/3

35 XP

Print andrew in the format 'YYYY-MM'.

# Import date

from datetime import date

# Create a date object

andrew = date(1992, 8, 26)

# Print the date in the format 'YYYY-MM'

print(andrew.strftime('%Y-%m'))

Print andrew in the format 'MONTH (YYYY)', where MONTH is the full name (%B).

# Import date

from datetime import date

# Create a date object

andrew = date(1992, 8, 26)

# Print the date in the format 'MONTH (YYYY)'

print(andrew.strftime('%B (%Y)'))

Print andrew in the format 'YYYY-DDD' where DDD is the day of the year.

# Import date

from datetime import date

# Create a date object

andrew = date(1992, 8, 26)

# Print the date in the format 'YYYY-DDD'

print(andrew.strftime('%Y-%j'))

Nice! Pick the format that best matches your needs. For example, astronomers usually use the 'day number' out of 366 instead of the month and date, to avoid ambiguities between languages.
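For reference, this short sketch runs the three directives these formats call for side by side. Note that %m (lowercase) is the month while %M is the minute, and that the month name produced by %B depends on the locale your Python runs under.

```python
from datetime import date

andrew = date(1992, 8, 26)

print(andrew.strftime("%Y-%m"))    # year and zero-padded month number
print(andrew.strftime("%B (%Y)"))  # full month name (locale-dependent) and year
print(andrew.strftime("%Y-%j"))    # year and zero-padded day of the year
```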

<====================================================================================================================================>

Combining Dates and Times

Bike sharing programs have swept through cities around the world -- and luckily for us, every trip gets recorded! Working with all of the comings and goings of one bike in Washington, D.C., you'll practice working with dates and times together. You'll parse dates and times from text, analyze peak trip times, calculate ride durations, and more.

___________________________________________________________________________________________________________________________________

Creating datetimes by hand

Often you create datetime objects based on outside data. Sometimes though, you want to create a datetime object from scratch.

You're going to create a few different datetime objects from scratch to get the hang of that process. These come from the bikeshare data set that you'll use throughout the rest of the chapter.

Instructions 3/3

0 XP

Import the datetime class.

Create a datetime for October 1, 2017 at 15:26:26.

Print the results in ISO format.

# Import datetime

from datetime import datetime

# Create a datetime object

dt = datetime(2017, 10, 1, 15, 26, 26)

# Print the results in ISO 8601 format

print(dt.isoformat())

Import the datetime class.

Create a datetime for December 31, 2017 at 15:19:13.

Print the results in ISO format.

# Import datetime

from datetime import datetime

# Create a datetime object

dt = datetime(2017, 12, 31, 15, 19, 13)

# Print the results in ISO 8601 format

print(dt.isoformat())

Create a new datetime by replacing the year in dt with 1917 (instead of 2017)

# Import datetime

from datetime import datetime

# Create a datetime object

dt = datetime(2017, 12, 31, 15, 19, 13)

# Replace the year with 1917

dt_old = dt.replace(year=1917)

# Print the results in ISO 8601 format

print(dt_old.isoformat())

___________________________________________________________________________________________________________________________________

Counting events before and after noon

In this chapter, you will be working with a list of all bike trips for one Capital Bikeshare bike, W20529, from October 1, 2017 to December 31, 2017. This list has been loaded as onebike_datetimes.

Each element of the list is a dictionary with two entries: start is a datetime object corresponding to the start of a trip (when a bike is removed from the dock) and end is a datetime object corresponding to the end of a trip (when a bike is put back into a dock).

You can use this data set to understand better how this bike was used. Did more trips start before noon or after noon?

Instructions

100 XP

Within the for loop, complete the if statement to check if the trip started before noon.

Within the for loop, increment trip_counts['AM'] if the trip started before noon, and trip_counts['PM'] if it started after noon.

# Create dictionary to hold results

trip_counts = {'AM': 0, 'PM': 0}

# Loop over all trips

for trip in onebike_datetimes:

    # Check to see if the trip starts before noon

    if trip['start'].hour < 12:

        # Increment the counter for before noon

        trip_counts['AM'] += 1

    else:

        # Increment the counter for after noon

        trip_counts['PM'] += 1

print(trip_counts)

<script.py> output:

{'AM': 94, 'PM': 196}

Great! It looks like this bike is used about twice as much after noon as it is before noon. One obvious follow-up would be to see which hours the bike is most likely to be taken out for a ride.
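A possible way to approach the per-hour question is sketched below. It uses collections.Counter, which the exercise itself does not use, and only three trips copied from the printed onebike_datetime_strings listing, so the counts say nothing about the full data set.

```python
from collections import Counter
from datetime import datetime

# Three trips in the same shape as onebike_datetimes (a tiny sample, not the full list)
sample_trips = [
    {"start": datetime(2017, 10, 1, 15, 23, 25), "end": datetime(2017, 10, 1, 15, 26, 26)},
    {"start": datetime(2017, 10, 1, 15, 42, 57), "end": datetime(2017, 10, 1, 17, 49, 59)},
    {"start": datetime(2017, 10, 2, 6, 37, 10), "end": datetime(2017, 10, 2, 6, 42, 53)},
]

# Count how many trips started in each hour of the day
starts_per_hour = Counter(trip["start"].hour for trip in sample_trips)

# most_common() lists the busiest hours first
for hour, count in starts_per_hour.most_common():
    print(hour, count)
```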

___________________________________________________________________________________________________________________________________

Turning strings into datetimes

When you download data from the Internet, dates and times usually come to you as strings. Often the first step is to turn those strings into datetime objects.

In this exercise, you will practice this transformation.

Reference

%Y 4 digit year (0000-9999)

%m 2 digit month (1-12)

%d 2 digit day (1-31)

%H 2 digit hour (0-23)

%M 2 digit minute (0-59)

%S 2 digit second (0-59)

Instructions 1/3

35 XP

Determine the format needed to convert s to datetime and assign it to fmt.

Convert the string s to datetime using fmt.

# Import the datetime class

from datetime import datetime

# Starting string, in YYYY-MM-DD HH:MM:SS format

s = '2017-02-03 00:00:01'

# Write a format string to parse s

fmt = '%Y-%m-%d %H:%M:%S'

# Create a datetime object d

d = datetime.strptime(s, fmt)

# Print d

print(d)

Determine the format needed to convert s to datetime and assign it to fmt.

Convert the string s to datetime using fmt.

# Import the datetime class

from datetime import datetime

# Starting string, in YYYY-MM-DD format

s = '2030-10-15'

# Write a format string to parse s

fmt = '%Y-%m-%d'

# Create a datetime object d

d = datetime.strptime(s, fmt)

# Print d

print(d)

Determine the format needed to convert s to datetime and assign it to fmt.

Convert the string s to datetime using fmt.

# Import the datetime class

from datetime import datetime

# Starting string, in MM/DD/YYYY HH:MM:SS format

s = '12/15/1986 08:00:00'

# Write a format string to parse s

fmt = '%m/%d/%Y %H:%M:%S'

# Create a datetime object d

d = datetime.strptime(s, fmt)

# Print d

print(d)

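Great! Now you can parse dates in most common formats. Note that strptime() treats the leading zero as optional for directives such as %m, %d, and %H, so a non-zero-padded string such as 1/2/2018 still parses with '%m/%d/%Y'. Going the other way is harder: strftime() has no portable directive for non-zero-padded output, so if you need it you can build the string yourself from the .month, .day, and .year attributes.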
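A quick experiment with a non-zero-padded date is sketched below, as it behaves in CPython. The variable unpadded is my own name, and the f-string is just one way to produce output without leading zeros.

```python
from datetime import datetime

# A non-zero-padded date string
s = "1/2/2018"

# strptime() accepts a missing leading zero for %m and %d
d = datetime.strptime(s, "%m/%d/%Y")
print(d)

# For non-padded output, build the string from the attributes
unpadded = f"{d.month}/{d.day}/{d.year}"
print(unpadded)
```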

___________________________________________________________________________________________________________________________________

Parsing pairs of strings as datetimes

Up until now, you've been working with a pre-processed list of datetimes for W20529's trips. For this exercise, you're going to go one step back in the data cleaning pipeline and work with the strings that the data started as.

Explore onebike_datetime_strings in the IPython shell to determine the correct format. datetime has already been loaded for you.

Reference

%Y 4 digit year (0000-9999)

%m 2 digit month (1-12)

%d 2 digit day (1-31)

%H 2 digit hour (0-23)

%M 2 digit minute (0-59)

%S 2 digit second (0-59)

Instructions

100 XP

Outside the for loop, fill out the fmt string with the correct parsing format for the data.

Within the for loop, parse the start and end strings into the trip dictionary with start and end keys and datetime objects for values.

# Write down the format string

fmt = "%Y-%m-%d %H:%M:%S"

# Initialize a list for holding the pairs of datetime objects

onebike_datetimes = []

# Loop over all trips

for (start, end) in onebike_datetime_strings:

    trip = {'start': datetime.strptime(start, fmt),

            'end': datetime.strptime(end, fmt)}

    # Append the trip

    onebike_datetimes.append(trip)

Excellent! Now you know how to process lists of strings into a more useful structure. If you haven't come across this approach before, many complex data cleaning tasks follow this same format: start with a list, process each element, and add the processed data to a new list.	

In [1]: onebike_datetime_strings

Out[1]:

[('2017-10-01 15:23:25', '2017-10-01 15:26:26'),

('2017-10-01 15:42:57', '2017-10-01 17:49:59'),

('2017-10-02 06:37:10', '2017-10-02 06:42:53'),

('2017-10-02 08:56:45', '2017-10-02 09:18:03'),

('2017-10-02 18:23:48', '2017-10-02 18:45:05'),

('2017-10-02 18:48:08', '2017-10-02 19:10:54'),

___________________________________________________________________________________________________________________________________

Recreating ISO format with strftime()

In the last chapter, you used strftime() to create strings from date objects. Now that you know about datetime objects, let's practice doing something similar.

Re-create the .isoformat() method, using .strftime(), and print the first trip start in our data set.

Reference

%Y 4 digit year (0000-9999)

%m 2 digit month (1-12)

%d 2 digit day (1-31)

%H 2 digit hour (0-23)

%M 2 digit minute (0-59)

%S 2 digit second (0-59)

Instructions

100 XP

Complete fmt to match the format of ISO 8601.

Print first_start with both .isoformat() and .strftime(); they should match.

# Import datetime

from datetime import datetime

# Pull out the start of the first trip

first_start = onebike_datetimes[0]['start']

# Format to feed to strftime()

fmt = "%Y-%m-%dT%H:%M:%S"

# Print out date with .isoformat(), then with .strftime() to compare

print(first_start.isoformat())

print(first_start.strftime(fmt))

<script.py> output:

2017-10-01T15:23:25

2017-10-01T15:23:25

Awesome! There are a wide variety of time formats you can create with strftime(), depending on your needs. However, if you don't know exactly what you need, .isoformat() is a perfectly fine place to start.

___________________________________________________________________________________________________________________________________

Unix timestamps

Datetimes are sometimes stored as Unix timestamps: the number of seconds since midnight UTC on January 1, 1970. This is especially common with computer infrastructure, like the log files that websites keep when they get visitors.

Instructions

100 XP

Complete the for loop to loop over timestamps.

Complete the code to turn each timestamp ts into a datetime.

# Import datetime

from datetime import datetime

# Starting timestamps

timestamps = [1514665153, 1514664543]

# Datetime objects

dts = []

# Loop

for ts in timestamps:

    dts.append(datetime.fromtimestamp(ts))

# Print results

print(dts)

<script.py> output:

    [datetime.datetime(2017, 12, 30, 20, 19, 13), datetime.datetime(2017, 12, 30, 20, 9, 3)]

Nice! The largest number that some older computers can hold in one variable is 2147483647, which as a Unix timestamp falls in January 2038. On that day, many computers which haven't been upgraded will fail. Hopefully, none of them are running anything critical!
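One thing worth knowing: datetime.fromtimestamp() without a timezone returns local time, so the printed result depends on the machine that runs it. The sketch below passes tz=timezone.utc to get a reproducible answer, and also converts the largest signed 32-bit value. The names dts_utc and last_32bit are my own.

```python
from datetime import datetime, timezone

timestamps = [1514665153, 1514664543]

# Passing tz makes the result timezone-aware and independent of the local machine
dts_utc = [datetime.fromtimestamp(ts, tz=timezone.utc) for ts in timestamps]
print(dts_utc)

# The largest value a signed 32-bit integer can hold, read as a Unix timestamp
last_32bit = datetime.fromtimestamp(2**31 - 1, tz=timezone.utc)
print(last_32bit.isoformat())
```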

___________________________________________________________________________________________________________________________________

Turning pairs of datetimes into durations

When working with timestamps, we often want to know how much time has elapsed between events. Thankfully, we can use datetime arithmetic to ask Python to do the heavy lifting for us so we don't need to worry about day, month, or year boundaries. Let's calculate the number of seconds that the bike was out of the dock for each trip.

Continuing our work from a previous coding exercise, the bike trip data has been loaded as the list onebike_datetimes. Each element of the list consists of two datetime objects, corresponding to the start and end of a trip, respectively.

Instructions

100 XP

Within the loop:

Use the start and end objects to find the length of the trip and call it trip_duration.

Calculate the trip_length_seconds from trip_duration.

Append trip_length_seconds to the list onebike_durations.

# Initialize a list for all the trip durations

onebike_durations = []

for trip in onebike_datetimes:

    # Create a timedelta object corresponding to the length of the trip

    trip_duration = trip['end'] - trip['start']

    # Get the total elapsed seconds in trip_duration

    trip_length_seconds = trip_duration.total_seconds()

    # Append the results to our list

    onebike_durations.append(trip_length_seconds)

Success! Remember that timedelta objects are represented in Python as a number of days and seconds of elapsed time. Be careful not to use .seconds on a timedelta object, since you'll just get the number of seconds without the days!
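To see the difference between .seconds and .total_seconds(), here is a minimal sketch with a made-up trip that lasts just over one day. The start and end values are hypothetical.

```python
from datetime import datetime

# Hypothetical trip lasting one day and thirty seconds
start = datetime(2017, 10, 1, 23, 59, 30)
end = datetime(2017, 10, 3, 0, 0, 0)

duration = end - start

# .days and .seconds are separate components of the timedelta
print(duration.days, duration.seconds)

# .total_seconds() combines them into one number
print(duration.total_seconds())
```

Reading only .seconds here would report a 30-second trip and silently drop the full day.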

___________________________________________________________________________________________________________________________________

Average trip time

W20529 took 291 trips in our data set. How long were the trips on average? We can use the built-in Python functions sum() and len() to make this calculation.

Based on your last coding exercise, the data has been loaded as onebike_durations. Each entry is a number of seconds that the bike was out of the dock.

Instructions

100 XP

Calculate total_elapsed_time across all trips in onebike_durations.

Calculate number_of_trips for onebike_durations.

Divide total_elapsed_time by number_of_trips to get the average trip length.

# What was the total duration of all trips?

total_elapsed_time = sum(onebike_durations)

# What was the total number of trips?

number_of_trips = len(onebike_durations)

# Divide the total duration by the number of trips

print(total_elapsed_time / number_of_trips)

<script.py> output:

1178.9310344827586

Great work, and not remotely average! For the average to be a helpful summary of the data, we need for all of our durations to be reasonable numbers, and not a few that are way too big, way too small, or even malformed. For example, if there is anything fishy happening in the data, and our trip ended before it started, we'd have a negative trip length.
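As a rough illustration of how a couple of odd values can distort the average, the sketch below compares the mean with the median from the standard statistics module. The durations list is invented for this example, and filtering out non-positive values is just one possible sanity check, not the course's method.

```python
from statistics import mean, median

# Hypothetical durations in seconds: three ordinary trips and two suspicious ones
durations = [300.0, 420.0, 600.0, -3346.0, 76913.0]

# The mean is pulled far away from the typical trip by the extremes
print(mean(durations))

# The median is much less sensitive to them
print(median(durations))

# A simple sanity filter: a trip cannot end before it starts
plausible = [d for d in durations if d > 0]
print(len(plausible))
```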

___________________________________________________________________________________________________________________________________

The long and the short of why time is hard

Out of 291 trips taken by W20529, how long was the longest? How short was the shortest? Does anything look fishy?

As before, data has been loaded as onebike_durations.

Instructions

100 XP

Calculate shortest_trip from onebike_durations.

Calculate longest_trip from onebike_durations.

Print the results, turning shortest_trip and longest_trip into strings so they can print.

# Calculate shortest and longest trips

shortest_trip = min(onebike_durations)

longest_trip = max(onebike_durations)

# Print out the results

print("The shortest trip was " + str(shortest_trip) + " seconds")

print("The longest trip was " + str(longest_trip) + " seconds")

<script.py> output:

The shortest trip was -3346.0 seconds

The longest trip was 76913.0 seconds

Weird huh?! For at least one trip, the bike returned before it left. Why could that be? Here's a hint: it happened in early November, around 2AM local time. What happens to clocks around that time each year? By the end of the next chapter, we'll have all the tools we need to deal with this situation!
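Here is a small preview of why the clocks matter, using only fixed UTC offsets from the standard library. The start and end times are invented for illustration and are not the actual trip from the data set. I am assuming the ride starts at 1:50 AM while daylight time (UTC-4) is still in effect and ends at 1:05 AM after the clocks have fallen back to standard time (UTC-5).

```python
from datetime import datetime, timedelta, timezone

edt = timezone(timedelta(hours=-4))  # Eastern Daylight Time
est = timezone(timedelta(hours=-5))  # Eastern Standard Time

# Hypothetical wall-clock readings on November 5, 2017
naive_start = datetime(2017, 11, 5, 1, 50)
naive_end = datetime(2017, 11, 5, 1, 5)

# Without UTC offsets, the trip appears to end before it starts
print((naive_end - naive_start).total_seconds())

# With the correct offsets attached, the elapsed time is positive
aware_start = naive_start.replace(tzinfo=edt)
aware_end = naive_end.replace(tzinfo=est)
print((aware_end - aware_start).total_seconds())
```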

<====================================================================================================================================>

Time Zones and Daylight Saving

In this chapter, you'll learn to confidently tackle the time-related topic that causes people the most trouble: time zones and daylight saving. Continuing with our bike data, you'll learn how to compare clocks around the world, how to gracefully handle "spring forward" and "fall back," and how to get up-to-date timezone data from the dateutil library.

___________________________________________________________________________________________________________________________________

Creating timezone aware datetimes

In this exercise, you will practice setting timezones manually.

Instructions 1/3

35 XP

Import timezone.

Set the tzinfo to UTC, without using timedelta.

# Import datetime, timezone

from datetime import datetime, timezone

# October 1, 2017 at 15:26:26, UTC

dt = datetime(2017, 10, 1, 15, 26, 26, tzinfo=timezone.utc)

# Print results

print(dt.isoformat())

Set pst to be a timezone set for UTC-8.

Set dt's timezone to be pst.

# Import datetime, timedelta, timezone

from datetime import datetime, timedelta, timezone

# Create a timezone for Pacific Standard Time, or UTC-8

pst = timezone(timedelta(hours=-8))

# October 1, 2017 at 15:26:26, UTC-8

dt = datetime(2017, 10, 1, 15, 26, 26, tzinfo=pst)

# Print results

print(dt.isoformat())

Set aedt to be a timezone set for UTC+11.

Set dt's timezone to be aedt.

# Import datetime, timedelta, timezone

from datetime import datetime, timedelta, timezone

# Create a timezone for Australian Eastern Daylight Time, or UTC+11

aedt = timezone(timedelta(hours=11))

# October 1, 2017 at 15:26:26, UTC+11

dt = datetime(2017, 10, 1, 15, 26, 26, tzinfo=aedt)

# Print results

print(dt.isoformat())

<script.py> output:

2017-10-01T15:26:26+00:00

<script.py> output:

2017-10-01T15:26:26-08:00

<script.py> output:

2017-10-01T15:26:26+11:00

Great! Did you know that Russia and France are tied for the most time zones, with 12 each? The French mainland only has one timezone, but because France has so many overseas dependencies they really add up!

___________________________________________________________________________________________________________________________________

Setting timezones

Now that you have the hang of setting timezones one at a time, let's look at setting them for the first ten trips that W20529 took.

timezone and timedelta have already been imported.

Instructions

100 XP

Create edt, a timezone object whose UTC offset is -4 hours.

Within the for loop:

Set the tzinfo for trip['start'].

Set the tzinfo for trip['end'].

# Create a timezone object corresponding to UTC-4

edt = timezone(timedelta(hours=-4))

# Loop over trips, updating the start and end datetimes to be in UTC-4

for trip in onebike_datetimes[:10]:

    # Update trip['start'] and trip['end']

    trip['start'] = trip['start'].replace(tzinfo=edt)

    trip['end'] = trip['end'].replace(tzinfo=edt)

Awesome! Did you know that despite being over 2,500 miles (4,200 km) wide (about as wide as the continental United States or the European Union), China has only one official timezone? There's a second, unofficial timezone, too. It is used by much of the Uyghur population in the Xinjiang province in the far west of China.

___________________________________________________________________________________________________________________________________

What time did the bike leave in UTC?

Having set the timezone for the first ten rides that W20529 took, let's see what time the bike left in UTC. We've already loaded the results of the previous exercise into memory.

Instructions

100 XP

Within the for loop, set dt to be the trip['start'] but moved to UTC.

# Loop over the trips

for trip in onebike_datetimes[:10]:

    # Pull out the start and set it to UTC

    dt = trip['start'].astimezone(timezone.utc)

    # Print the start time in UTC

    print('Original:', trip['start'], '| UTC:', dt.isoformat())

Excellent! Did you know that there is no official time zone at the North or South pole? Since all the lines of longitude meet each other, it's up to each traveler (or research station) to decide what time they want to use.

Hint

Remember that .replace() just changes the timezone whereas .astimezone() actually moves the hours and days to match.
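The hint is easier to remember with both methods side by side. This is a minimal sketch using one datetime at UTC-4. The names relabelled and converted are my own.

```python
from datetime import datetime, timedelta, timezone

edt = timezone(timedelta(hours=-4))
dt = datetime(2017, 10, 1, 15, 23, 25, tzinfo=edt)

# .replace() keeps the clock reading and only swaps the timezone label
relabelled = dt.replace(tzinfo=timezone.utc)

# .astimezone() keeps the moment in time and shifts the clock reading
converted = dt.astimezone(timezone.utc)

print(relabelled.isoformat())
print(converted.isoformat())

# Only the converted value still refers to the same instant as dt
print(converted == dt, relabelled == dt)
```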

___________________________________________________________________________________________________________________________________

Putting the bike trips into the right time zone

Instead of setting the timezones for W20529 by hand, let's assign them to their IANA timezone: 'America/New_York'. Since we know their political jurisdiction, we don't need to look up their UTC offset. Python will do that for us.

Instructions

100 XP

Import tz from dateutil.

Assign et to be the timezone 'America/New_York'.

Within the for loop, set start and end to have et as their timezone (use .replace()).

# Import tz

from dateutil import tz

# Create a timezone object for Eastern Time

et = tz.gettz('America/New_York')

# Loop over trips, updating the datetimes to be in Eastern Time

for trip in onebike_datetimes[:10]:

    # Update trip['start'] and trip['end']

    trip['start'] = trip['start'].replace(tzinfo=et)

    trip['end'] = trip['end'].replace(tzinfo=et)

Great! Time zone rules actually change quite frequently. IANA time zone data gets updated every 3-4 months, as different jurisdictions make changes to their laws about time or as more historical information about timezones is uncovered. tz is smart enough to use the date in your datetime to determine which rules to use historically.

___________________________________________________________________________________________________________________________________

What time did the bike leave? (Global edition)

When you need to move a datetime from one timezone into another, use .astimezone() and tz. Often you will be moving things into UTC, but for fun let's try moving things from 'America/New_York' into a few different time zones.

Instructions 1/3

35 XP

Set uk to be the timezone for the UK: 'Europe/London'.

Change local to be in the uk timezone and assign it to notlocal.

# Create the timezone object

uk = tz.gettz('Europe/London')

# Pull out the start of the first trip

local = onebike_datetimes[0]['start']

# What time was it in the UK?

notlocal = local.astimezone(uk)

# Print them out and see the difference

print(local.isoformat())

print(notlocal.isoformat())

Set ist to be the timezone for India: 'Asia/Kolkata'.

Change local to be in the ist timezone and assign it to notlocal.

# Create the timezone object

ist = tz.gettz('Asia/Kolkata')

# Pull out the start of the first trip

local = onebike_datetimes[0]['start']

# What time was it in India?

notlocal = local.astimezone(ist)

# Print them out and see the difference

print(local.isoformat())

print(notlocal.isoformat())

Set sm to be the timezone for Samoa: 'Pacific/Apia'.

Change local to be in the sm timezone and assign it to notlocal.

# Create the timezone object

sm = tz.gettz('Pacific/Apia')

# Pull out the start of the first trip

local = onebike_datetimes[0]['start']

# What time was it in Samoa?

notlocal = local.astimezone(sm)

# Print them out and see the difference

print(local.isoformat())

print(notlocal.isoformat())

Did you notice the time offset for this one? It's at UTC+14! Samoa used to be UTC-10, but in 2011 it changed to the other side of the International Date Line to better match New Zealand, its closest trading partner. However, they wanted to keep the clocks the same, so the UTC offset shifted from -10 to +14, since 24-10 is 14. Timezones... not simple!
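The "24-10 is 14" arithmetic can be checked with two fixed offsets from the standard library. This is only a sketch: it uses plain UTC-10 and UTC+14 offsets rather than the real 'Pacific/Apia' rules, and the instant is an arbitrary example.

```python
from datetime import datetime, timedelta, timezone

old_offset = timezone(timedelta(hours=-10))
new_offset = timezone(timedelta(hours=14))

# One arbitrary moment in time, expressed in UTC
instant = datetime(2017, 10, 1, 19, 23, 25, tzinfo=timezone.utc)

before = instant.astimezone(old_offset)
after = instant.astimezone(new_offset)

# The clock reading is identical; only the calendar date moves forward by one day
print(before.isoformat())
print(after.isoformat())
```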

___________________________________________________________________________________________________________________________________

How many hours elapsed around daylight saving?

Since our bike data covers only the fall, you'll need a different example to learn about the start of daylight saving time.

Let's look at March 12, 2017, in the Eastern United States, when Daylight Saving kicked in at 2 AM.

If you create a datetime for midnight that night, and add 6 hours to it, how much time will have elapsed?

Instructions 1/3

35 XP

You already have a datetime called start, set for March 12, 2017 at midnight, set to the timezone 'America/New_York'.

Add six hours to start and assign it to end. Look at the UTC offset for the two results.

# Import datetime, timedelta, tz, timezone

from datetime import datetime, timedelta, timezone

from dateutil import tz

# Start on March 12, 2017, midnight, then add 6 hours

start = datetime(2017, 3, 12, tzinfo = tz.gettz('America/New_York'))

end = start + timedelta(hours=6)

print(start.isoformat() + " to " + end.isoformat())

Calculate the time between start and end. How much time does Python think has elapsed?

# Import datetime, timedelta, tz, timezone

from datetime import datetime, timedelta, timezone

from dateutil import tz

# Start on March 12, 2017, midnight, then add 6 hours

start = datetime(2017, 3, 12, tzinfo = tz.gettz('America/New_York'))

end = start + timedelta(hours=6)

print(start.isoformat() + " to " + end.isoformat())

# How many hours have elapsed?

print((end - start).total_seconds()/(60*60))

Move your datetime objects into UTC and calculate the elapsed time again.

Once you're in UTC, what result do you get?

# Import datetime, timedelta, tz, timezone

from datetime import datetime, timedelta, timezone

from dateutil import tz

# Start on March 12, 2017, midnight, then add 6 hours

start = datetime(2017, 3, 12, tzinfo = tz.gettz('America/New_York'))

end = start + timedelta(hours=6)

print(start.isoformat() + " to " + end.isoformat())

# How many hours have elapsed?

print((end - start).total_seconds()/(60*60))

# What if we move to UTC?

print((end.astimezone(timezone.utc) - start.astimezone(timezone.utc))\

.total_seconds()/(60*60))

+90 XP

When we compare times in local time zones, everything gets converted into clock time. Remember if you want to get absolute time differences, always move to UTC!      
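
The same check can be run for the fall transition. The sketch below assumes November 5, 2017 (the date of the ambiguous rides in the bike data) and wraps the UTC conversion in a helper; the name hours_between is a hypothetical helper introduced here for illustration.

```python
from datetime import datetime, timedelta, timezone
from dateutil import tz

eastern = tz.gettz('America/New_York')

def hours_between(start, end):
    """Absolute elapsed hours, computed after moving both datetimes to UTC."""
    delta = end.astimezone(timezone.utc) - start.astimezone(timezone.utc)
    return delta.total_seconds() / (60 * 60)

# Midnight on the day the clocks fall back, then add 6 hours of clock time
start = datetime(2017, 11, 5, tzinfo=eastern)
end = start + timedelta(hours=6)

# Subtracting in local time compares clock readings only
wall_clock_hours = (end - start).total_seconds() / (60 * 60)
# Subtracting in UTC counts the repeated hour as well
absolute_hours = hours_between(start, end)

print(wall_clock_hours, absolute_hours)
```

In spring the absolute difference is shorter than the clock difference; in fall it should come out longer, because one hour is repeated.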

________________________________________________________________________________________________________________________

March 29, throughout a decade

Daylight Saving rules are complicated: they're different in different places, they change over time, and they usually start on a Sunday (and so they move around the calendar).

For example, in the United Kingdom, as of the time this lesson was written, Daylight Saving begins on the last Sunday in March. Let's look at the UTC offset for March 29, at midnight, for the years 2000 to 2010.

Instructions

100 XP

Using tz, set the timezone for dt to be 'Europe/London'.

Within the for loop:

Use the .replace() method to change the year for dt to be y.

Call .isoformat() on the result to observe the results.

# Import datetime and tz

from datetime import datetime

from dateutil import tz

# Create starting date

dt = datetime(2000, 3, 29, tzinfo = tz.gettz('Europe/London'))

# Loop over the dates, replacing the year, and print the ISO timestamp

for y in range(2000, 2011):

    print(dt.replace(year=y).isoformat())

<script.py> output:

2000-03-29T00:00:00+01:00

2001-03-29T00:00:00+01:00

2002-03-29T00:00:00+00:00

2003-03-29T00:00:00+00:00

2004-03-29T00:00:00+01:00

2005-03-29T00:00:00+01:00

2006-03-29T00:00:00+01:00

2007-03-29T00:00:00+01:00

2008-03-29T00:00:00+00:00

2009-03-29T00:00:00+00:00

2010-03-29T00:00:00+01:00

+100 XP

Nice! As you can see, the rules for Daylight Saving are not trivial. When in doubt, always use tz instead of hand-rolling timezones, so it will catch the Daylight Saving rules (and rule changes!) for you.
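
The pattern in the output follows from the "last Sunday in March" rule. Here is a minimal sketch using only the standard library; the helper name last_sunday_of_march is an assumption made for illustration and is not part of dateutil.

```python
from datetime import date, timedelta

def last_sunday_of_march(year):
    """Return the date of the last Sunday in March for the given year."""
    last_day = date(year, 3, 31)
    # weekday() is 0 for Monday and 6 for Sunday, so step back to the Sunday
    return last_day - timedelta(days=(last_day.weekday() + 1) % 7)

# Years in which the switch had not yet happened at midnight on March 29
still_on_winter_time = [
    y for y in range(2000, 2011) if last_sunday_of_march(y).day >= 29
]

for y in range(2000, 2011):
    print(y, last_sunday_of_march(y).isoformat())
print(still_on_winter_time)
```

The years in still_on_winter_time should line up with the rows that show a +00:00 offset.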

________________________________________________________________________________________________________________________

Finding ambiguous datetimes

At the end of lesson 2, we saw something anomalous in our bike trip duration data. Let's see if we can identify what the problem might be.

The data is loaded as onebike_datetimes, and tz has already been imported from dateutil.

Instructions

100 XP

Loop over the trips in onebike_datetimes:

Print any rides whose start is ambiguous.

Print any rides whose end is ambiguous.

# Loop over trips

for trip in onebike_datetimes:

    # Rides with ambiguous start

    if tz.datetime_ambiguous(trip['start']):

        print("Ambiguous start at " + str(trip['start']))

    # Rides with ambiguous end

    if tz.datetime_ambiguous(trip['end']):

        print("Ambiguous end at " + str(trip['end']))

<script.py> output:

Ambiguous start at 2017-11-05 01:56:50-04:00

Ambiguous end at 2017-11-05 01:01:04-04:00

+100 XP

Good work! Note that tz.datetime_ambiguous() only catches ambiguous datetimes from Daylight Saving changes. Other weird edge cases, like jurisdictions which change their Daylight Saving rules, hopefully should be caught by tz. And if they're not, at least those kinds of things are pretty rare in most data sets!
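
For a quick feel of what counts as ambiguous, here is a small sketch on hand-picked datetimes. It also uses tz.datetime_exists(), a companion function in dateutil that the exercise itself does not use; the sample times are chosen for illustration.

```python
from datetime import datetime
from dateutil import tz

eastern = tz.gettz('America/New_York')

# 1:30 AM happens twice on November 5, 2017 (clocks fall back at 2 AM)
fall_back = datetime(2017, 11, 5, 1, 30, tzinfo=eastern)
# 2:30 AM never happens on March 12, 2017 (clocks spring forward at 2 AM)
spring_forward = datetime(2017, 3, 12, 2, 30, tzinfo=eastern)
# An ordinary afternoon for comparison
ordinary = datetime(2017, 10, 1, 15, 23, 25, tzinfo=eastern)

print(tz.datetime_ambiguous(fall_back))
print(tz.datetime_exists(spring_forward))
print(tz.datetime_ambiguous(ordinary), tz.datetime_exists(ordinary))
```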

________________________________________________________________________________________________________________________

Cleaning daylight saving data with fold

As we've just discovered, there is a ride in our data set which is being messed up by a Daylight Savings shift. Let's fix up the data set so we actually have a correct minimum ride length. We can use the fact that we know the end of the ride happened after the beginning to fix up the duration messed up by the shift out of Daylight Savings.

Since Python does not handle tz.enfold() when doing arithmetic, we must put our datetime objects into UTC, where ambiguities have been resolved.

onebike_datetimes is already loaded and in the right timezone. tz and timezone have been imported.

Instructions

100 XP

Complete the if statement to be true only when a ride's start comes after its end.

When start is after end, call tz.enfold() on the end so you know it refers to the one after the daylight savings time change.

After the if statement, convert the start and end to UTC so you can make a proper comparison.

trip_durations = []

for trip in onebike_datetimes:

    # When the start is later than the end, set the fold to be 1

    if trip['start'] > trip['end']:

        trip['end'] = tz.enfold(trip['end'])

    # Convert to UTC

    start = trip['start'].astimezone(timezone.utc)

    end = trip['end'].astimezone(timezone.utc)

    # Subtract the difference

    trip_length_seconds = (end-start).total_seconds()

    trip_durations.append(trip_length_seconds)

# Take the shortest trip duration

print("Shortest trip: " + str(min(trip_durations)))

+100 XP

Good work! Now you know how to handle some pretty gnarly edge cases in datetime data. To give a sense for how tricky these things are: we actually still don't know how long the rides are which only started or ended in our ambiguous hour but not both. If you're collecting data, store it in UTC or with a fixed UTC offset!
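
To make the effect of tz.enfold() concrete, here is a minimal sketch on a single ambiguous clock time. The sample time of 1:30 AM is an assumption chosen for illustration; it is not a value from the bike data.

```python
from datetime import datetime, timedelta, timezone
from dateutil import tz

eastern = tz.gettz('America/New_York')

# 1:30 AM on November 5, 2017 occurs twice; the default is the first pass
first_pass = datetime(2017, 11, 5, 1, 30, tzinfo=eastern)
# enfold() marks the same clock time as belonging to the second pass
second_pass = tz.enfold(first_pass)

# In UTC the two passes become distinct instants
first_utc = first_pass.astimezone(timezone.utc)
second_utc = second_pass.astimezone(timezone.utc)

print(first_utc.isoformat())
print(second_utc.isoformat())
print(second_utc - first_utc)
```

The two datetimes read the same on a wall clock, yet once converted to UTC they should sit one hour apart, which is why the exercise converts before subtracting.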

<====================================================================================================================================>

Easy and Powerful: Dates and Times in Pandas

To conclude this course, you'll apply everything you've learned about working with dates and times in standard Python to working with dates and times in Pandas. With additional information about each bike ride, such as what station it started and stopped at and whether or not the rider had a yearly membership, you'll be able to dig much more deeply into the bike trip data. In this chapter, you'll cover powerful Pandas operations, such as grouping and plotting results by time.

________________________________________________________________________________________________________________________

Loading a csv file in Pandas

The capital-onebike.csv file covers the October, November and December rides of the Capital Bikeshare bike W20529.

Here are the first two columns:

Start date            End date              ...

2017-10-01 15:23:25   2017-10-01 15:26:26   ...

2017-10-01 15:42:57   2017-10-01 17:49:59   ...

Instructions

100 XP

Import Pandas.

Complete the call to read_csv() so that it correctly parses the date columns Start date and End date.

# Import pandas

import pandas as pd

# Load CSV into the rides variable

rides = pd.read_csv('capital-onebike.csv',

parse_dates = ['Start date','End date'])

# Print the initial (0th) row

print(rides.iloc[0])

<script.py> output:

Start date 2017-10-01 15:23:25

End date 2017-10-01 15:26:26

Start station number 31038

Start station Glebe Rd & 11th St N

End station number 31036

End station George Mason Dr & Wilson Blvd

Bike number W20529

Member type Member

Name: 0, dtype: object

+100 XP

Great! Did you know that pandas has a pd.read_excel(), pd.read_json(), and even a pd.read_clipboard() function to read tabular data that you've copied from a document or website? Most have date parsing functionality too.
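
To see what parse_dates changes without needing the CSV file, here is a small sketch that reads the two sample rows shown earlier from an in-memory string. The names csv_text, raw and parsed are assumptions made for illustration.

```python
import io
import pandas as pd

# The two sample rows, reduced to the two date columns
csv_text = """Start date,End date
2017-10-01 15:23:25,2017-10-01 15:26:26
2017-10-01 15:42:57,2017-10-01 17:49:59
"""

# Without parse_dates the columns are read as plain text
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates they become datetime columns
parsed = pd.read_csv(io.StringIO(csv_text),
                     parse_dates=['Start date', 'End date'])

print(raw.dtypes)
print(parsed.dtypes)
```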

________________________________________________________________________________________________________________________

Making timedelta columns

Earlier in this course, you wrote a loop to subtract datetime objects and determine how long our sample bike had been out of the docks. Now you'll do the same thing with Pandas.

rides has already been loaded for you.

Instructions

100 XP

Subtract the Start date column from the End date column to get a Series of timedeltas; assign the result to ride_durations.

Convert ride_durations into seconds and assign the result to the 'Duration' column of rides.

# Subtract the start date from the end date

ride_durations = rides['End date'] - rides['Start date']

# Convert the results to seconds

rides['Duration'] = ride_durations.dt.total_seconds()

print(rides['Duration'].head())

Hint

Access columns in rides with the [] operator.

The method to turn a timedelta into a number is .dt.total_seconds().

Assign a Series x to be column "x" in a dataframe df with df['x'] = x.

+0 XP

Great! Because Pandas supports method chaining, you could also perform this operation in one line: rides['Duration'] = (rides['End date'] - rides['Start date']).dt.total_seconds()
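
As a self-contained check of the one-line version, the sketch below rebuilds the first two rides by hand. The name rides_sample is an assumption made for illustration, standing in for the full rides DataFrame.

```python
import pandas as pd

# The first two rides from the CSV, typed in by hand
rides_sample = pd.DataFrame({
    'Start date': pd.to_datetime(['2017-10-01 15:23:25',
                                  '2017-10-01 15:42:57']),
    'End date': pd.to_datetime(['2017-10-01 15:26:26',
                                '2017-10-01 17:49:59']),
})

# Subtracting datetime columns gives timedeltas; convert them to seconds
rides_sample['Duration'] = (
    rides_sample['End date'] - rides_sample['Start date']
).dt.total_seconds()

print(rides_sample['Duration'].tolist())
```

The first ride lasts 3 minutes and 1 second, the second just over two hours, so the two durations should be 181 and 7622 seconds.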

________________________________________________________________________________________________________________________

How many joyrides?

Suppose you have a theory that some people take long bike rides before putting their bike back in the same dock. Let's call these rides "joyrides".

You only have data on one bike, so while you can't draw any bigger conclusions, it's certainly worth a look.

Are there many joyrides? How long were they in our data set? Use the median instead of the mean, because we know there are some very long trips in our data set that might skew the answer, and the median is less sensitive to outliers.

Instructions

100 XP

Create a Pandas Series which is True when Start station and End station are the same, and assign the result to joyrides.

Calculate the median duration of all rides.

Calculate the median duration of joyrides.

# Create joyrides

joyrides = (rides['Start station'] == rides['End station'])

# Total number of joyrides

print("{} rides were joyrides".format(joyrides.sum()))

# Median of all rides

print("The median duration overall was {:.2f} seconds"\

.format(rides['Duration'].median()))

# Median of joyrides

print("The median duration for joyrides was {:.2f} seconds"\

.format(rides[joyrides]['Duration'].median()))

________________________________________________________________________________________________________________________

It's getting cold outside, W20529

Washington, D.C. has mild weather overall, but the average high temperature in October (68ºF / 20ºC) is certainly higher than the average high temperature in December (47ºF / 8ºC). People also travel more in December, and they work fewer days so they commute less.

How might the weather or the season have affected the length of bike trips?

Instructions 1/2

50 XP

Resample rides to the daily level, based on the Start date column.

Plot the .size() of each result.

# Import matplotlib

import matplotlib.pyplot as plt

# Resample rides to daily, take the size, plot the results

rides.resample('D', on = 'Start date')\

.size()\

.plot(ylim = [0, 15])

# Show the results

plt.show()

Since the daily time series is so noisy for this one bike, change the resampling to be monthly.

# Import matplotlib

import matplotlib.pyplot as plt

# Resample rides to monthly, take the size, plot the results

rides.resample('M', on = 'Start date')\

.size()\

.plot(ylim = [0, 150])

# Show the results

plt.show()

+100 XP

Nice! As you can see, the pattern is clearer at the monthly level: there were fewer rides in November, and then fewer still in December, possibly because the temperature got colder.
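
Part of the reason the daily series is noisy is that .resample() also reports days with no rides at all. A minimal sketch on toy data: the first two start times come from the CSV sample, the third is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'Start date': pd.to_datetime([
        '2017-10-01 15:23:25',
        '2017-10-01 15:42:57',
        '2017-10-03 09:00:00',  # made-up ride, two days later
    ]),
})

# One bucket per calendar day between the first and last ride
daily_counts = toy.resample('D', on='Start date').size()

print(daily_counts)
```

October 2 has no rides but still appears in the result with a count of 0.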

________________________________________________________________________________________________________________________

Members vs casual riders over time

Riders can either be "Members", meaning they pay yearly for the ability to take a bike at any time, or "Casual", meaning they pay at the kiosk attached to the bike dock.

Do members and casual riders drop off at the same rate over October to December, or does one drop off faster than the other?

As before, rides has been loaded for you. You're going to use the Pandas method .value_counts(), which returns the number of instances of each value in a Series. In this case, the counts of "Member" or "Casual".

Instructions

100 XP

Set monthly_rides to be a resampled version of rides, by month, based on start date.

Use the method .value_counts() to find out how many Member and Casual rides there were, and divide them by the total number of rides per month.

# Resample rides to be monthly on the basis of Start date

monthly_rides = rides.resample('M',on='Start date')['Member type']

# Take the ratio of the .value_counts() over the total number of rides

print(monthly_rides.value_counts() / monthly_rides.size())

<script.py> output:

Start date Member type

2017-10-31 Member 0.768519

Casual 0.231481

2017-11-30 Member 0.825243

Casual 0.174757

2017-12-31 Member 0.860759

Casual 0.139241

Name: Member type, dtype: float64

Nice! Note that by default, .resample() labels Monthly resampling with the last day in the month and not the first. It certainly looks like the fraction of Casual riders went down as the number of rides dropped. With a little more digging, you could figure out if keeping Member rides only would be enough to stabilize the usage numbers throughout the fall.

________________________________________________________________________________________________________________________

Combining groupby() and resample()

A very powerful method in Pandas is .groupby(). Whereas .resample() groups rows by some time or date information, .groupby() groups rows based on the values in one or more columns. For example, rides.groupby('Member type').size() would tell us how many rides there were by member type in our entire DataFrame.

.resample() can be called after .groupby(). For example, how long was the median ride by month, and by Membership type?

Instructions

100 XP

Complete the .groupby() call to group by 'Member type', and the .resample() call to resample according to 'Start date', by month.

Print the median Duration for each group.

# Group rides by member type, and resample to the month

grouped = rides.groupby('Member type')\

.resample('M', on = 'Start date')

# Print the median duration for each group

print(grouped['Duration'].median())

<script.py> output:

Member type Start date

Casual 2017-10-31 1636.0

2017-11-30 1159.5

2017-12-31 850.0

Member 2017-10-31 671.0

2017-11-30 655.0

2017-12-31 387.5

Name: Duration, dtype: float64

Nice! It looks like casual riders consistently took longer rides, but that both groups took shorter rides as the months went by. Note that, by combining grouping and resampling, you can answer a lot of questions about nearly any data set that includes time as a feature. Keep in mind that you can also group by more than one column at once.

________________________________________________________________________________________________________________________

Timezones in Pandas

Earlier in this course, you assigned a timezone to each datetime in a list. Now with Pandas you can do that with a single method call.

(Note that, just as before, your data set actually includes some ambiguous datetimes on account of daylight saving; for now, we'll tell Pandas to not even try on those ones. Figuring them out would require more work.)

Instructions 1/2

50 XP

Make the Start date column timezone aware by localizing it to 'America/New_York' while ignoring any ambiguous datetimes.

# Localize the Start date column to America/New York

rides['Start date'] = rides['Start date'].dt.tz_localize('America/New_York',

ambiguous='NaT')

# Print first value

print(rides['Start date'].iloc[0])

Convert the Start date column to the timezone 'Europe/London'.

# Localize the Start date column to America/New York

rides['Start date'] = rides['Start date'].dt.tz_localize('America/New_York',

ambiguous='NaT')

# Print first value

print(rides['Start date'].iloc[0])

# Convert the Start date column to Europe/London

rides['Start date'] = rides['Start date'].dt.tz_convert('Europe/London')

# Print the new value

print(rides['Start date'].iloc[0])
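
Here is a small sketch of what ambiguous='NaT' does, run on two start times taken from the outputs earlier in this post. The names starts, localized and in_london are assumptions made for illustration.

```python
import pandas as pd

starts = pd.Series(pd.to_datetime([
    '2017-10-01 15:23:25',  # an ordinary afternoon
    '2017-11-05 01:56:50',  # falls in the repeated hour
]))

# Ambiguous times become NaT instead of raising an error
localized = starts.dt.tz_localize('America/New_York', ambiguous='NaT')

# Converting moves each instant into another timezone; NaT stays NaT
in_london = localized.dt.tz_convert('Europe/London')

print(localized)
print(in_london)
```

The ambiguous ride is dropped to NaT, while the ordinary one should shift by five hours when viewed from London.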

________________________________________________________________________________________________________________________

How long per weekday?

Pandas has a number of datetime-related attributes within the .dt accessor. Many of them are ones you've encountered before, like .dt.month. Others are convenient and save time compared to standard Python, like .dt.day_name() (available as the attribute .dt.weekday_name in pandas versions before 1.0).

Instructions

100 XP

Add a new column to rides called 'Ride start weekday', which is the weekday of the Start date.

Print the median ride duration for each weekday.

# Add a column for the weekday of the start of the ride

rides['Ride start weekday'] = rides['Start date'].dt.day_name()

# Print the median trip time per weekday

print(rides.groupby('Ride start weekday')['Duration'].median())

<script.py> output:

Ride start weekday

Friday 724.5

Monday 810.5

Saturday 462.0

Sunday 902.5

Thursday 652.0

Tuesday 641.5

Wednesday 585.0

Name: Duration, dtype: float64

+100 XP

Well done! There are .dt attributes for all of the common things you might want to pull out of a datetime, such as the day, month, year, hour, and so on, and also some additional convenience ones, such as quarter and week of the year out of 52.
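
A few of these accessors side by side, as a minimal sketch on two start times taken from the outputs earlier in this post. It assumes a pandas version that provides .dt.day_name(), which is the method form of the weekday name lookup.

```python
import pandas as pd

starts = pd.Series(pd.to_datetime([
    '2017-10-01 15:23:25',
    '2017-11-05 01:56:50',
]))

# Name of the weekday, calendar quarter, and hour of the day
weekday_names = starts.dt.day_name().tolist()
quarters = starts.dt.quarter.tolist()
hours = starts.dt.hour.tolist()

print(weekday_names)
print(quarters)
print(hours)
```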

________________________________________________________________________________________________________________________

How long between rides?

For your final exercise, let's take advantage of Pandas indexing to do something interesting. How much time elapsed between rides?

Instructions

100 XP

Calculate the difference in the Start date of the current row and the End date of the previous row and assign it to rides['Time since'].

Convert rides['Time since'] to seconds to make it easier to work with.

Resample rides to be in monthly buckets according to the Start date.

Divide the average by (60*60) to get the number of hours on average that W20529 waited in the dock before being picked up again.

# Shift the end dates down one row, then subtract them from the start dates

rides['Time since'] = rides['Start date'] - (rides['End date'].shift(1))

# Move from a timedelta to a number of seconds, which is easier to work with

rides['Time since'] = rides['Time since'].dt.total_seconds()

# Resample to the month

monthly = rides.resample('M',on='Start date')

# Print the average hours between rides each month

print(monthly['Time since'].mean()/(60*60))

<script.py> output:

Start date

2017-10-31 5.519242

2017-11-30 7.256443

2017-12-31 9.202380

Freq: M, Name: Time since, dtype: float64

+100 XP

Great job! As you can see, there are a huge number of Pandas tricks that let you express complex logic in just a few lines, assuming you understand how the indexes actually work. If you haven't taken it yet, have you considered taking Pandas Foundations? In addition to lots of other useful Pandas information, it covers working with time series (like stock prices) in Pandas. Time series have many overlapping techniques with datetime data.
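
To see what .shift(1) does here, the sketch below applies it to the first two rides from the CSV sample. The name rides_sample and the helper column 'Previous end' are assumptions made for illustration.

```python
import pandas as pd

rides_sample = pd.DataFrame({
    'Start date': pd.to_datetime(['2017-10-01 15:23:25',
                                  '2017-10-01 15:42:57']),
    'End date': pd.to_datetime(['2017-10-01 15:26:26',
                                '2017-10-01 17:49:59']),
})

# shift(1) moves every value down one row, so each ride sees the previous end
rides_sample['Previous end'] = rides_sample['End date'].shift(1)

# Idle time in seconds between the previous ride's end and this ride's start
rides_sample['Time since'] = (
    rides_sample['Start date'] - rides_sample['Previous end']
).dt.total_seconds()

print(rides_sample[['Start date', 'Previous end', 'Time since']])
```

The first ride has no predecessor, so its 'Time since' is missing; .mean() skips missing values by default, so the monthly averages are unaffected.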
        

Saturday, January 15, 2022
