Exploring data efficiently with pandas

MP #33: Reimplementing an older project with modern tools.

In the last post we explored some climate data using nothing but the Python standard library for the data processing work. This time we’ll repeat the same exploration, but with tools from the pandas library. We’ll look at how this choice affects the readability of the code, how well we understand the work being done, and the overall performance. We’ll also use an updated dataset that includes readings through May of this year.

The goal of this post is to show how much more concise data analysis code can be when using pandas. A later series will go into more detail about how pandas works, and how you can use it in your own projects.

Extracting data

Parsing data is much easier using pandas:

# high_temp_trends.py

from pathlib import Path

import pandas as pd

def get_data(path):
    """Extract dates and high temperatures."""
    df = pd.read_csv(path)
    df['DATE'] = pd.to_datetime(df['DATE'])
    
    return df['DATE'], df['TMAX']

# Extract data.
path = Path('wx_data/sitka_highs_1944_2023.csv')
dates, highs = get_data(path)
print(f"\nFound {len(highs):,} data points.")

The main difference between this code and the previous version is in the body of get_data(). Using pandas, you can read all the data from a CSV file with a single call to the function read_csv(). The data is read into a pandas structure called a DataFrame, and we assign this to the variable df.

The two main data structures in pandas are Series objects and DataFrame objects. A Series is a sequence of values, and each Series object has a name. A DataFrame is a two-dimensional table of data; each column in a DataFrame is a Series.

pandas has many functions for carrying out common data cleaning operations quickly and concisely. Here we take the column of the DataFrame labeled 'DATE' and run it through the to_datetime() function. This function recognizes many common date and timestamp formats, and it converts string values to datetime objects. It does this quickly and accurately.
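
To make these ideas concrete, here’s a minimal sketch using a small hand-built DataFrame; the two rows of data here are made up for illustration:

import pandas as pd

# Each column in a DataFrame is a Series with a name.
df = pd.DataFrame({
    'DATE': ['1944-09-01', '1944-09-02'],
    'TMAX': [57, 60],
})

print(type(df['DATE']))    # <class 'pandas.core.series.Series'>
print(df['DATE'].name)     # DATE

# to_datetime() converts the date strings to datetime values.
df['DATE'] = pd.to_datetime(df['DATE'])
print(df['DATE'].dtype)    # datetime64[ns]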

Since I want to focus on just the high temperatures, this function returns only two Series objects from the DataFrame: a Series containing the dates and a Series containing the high temperatures. With the updated dataset, the program reads in 28,093 data points:

$ time python high_temp_trends.py

Found 28,093 data points.
python high_temp_trends.py  0.54s user 0.13s system 114% cpu 0.580 total

The original body of get_data() had 13 lines of code. This function does the same work in just 3 lines of code. Once you become familiar with DataFrame objects, this code is much easier to follow than the standard library approach. The data takes slightly longer to load; it took less than 0.2 seconds using the standard library, and it takes just over 0.5 seconds here. This increase is mostly due to the time it takes to import pandas.1

Analyzing data

The code for visualizing data is the same as it was in the previous post, and the initial output is the same as well. So let’s focus on analyzing the data, using pandas.

Here’s the code for calculating moving averages using pandas:

averages = temps.rolling(30).mean()

The rolling() method returns a sequence of “windows” over a dataset. Here the code temps.rolling(30) returns a sequence of windows, each of which contains 30 high temperature readings. Chaining the mean() method onto this call gives us a 30-day moving average over the entire dataset.

We can calculate moving averages in one line of code now, so we don’t need a get_averages() function anymore.
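
To see what rolling() produces on a small scale, here’s a minimal sketch using a hand-built Series of five made-up readings and a 3-reading window:

import pandas as pd

temps = pd.Series([40, 42, 44, 46, 48])

# Each value is the mean of a 3-reading window. The first two
# windows don't contain 3 readings yet, so they come out as NaN.
print(temps.rolling(3).mean())
# 0     NaN
# 1     NaN
# 2    42.0
# 3    44.0
# 4    46.0
# dtype: float64

Those leading NaN values will come up again in a moment; every window that doesn’t yet contain a full set of readings produces NaN.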

Here’s a program that generates five plots of varying window sizes, from 1 year to 30 years:

from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt

def get_data(path):
    ...

def plot_data(dates, temps, title,
        filename="output/high_temps.png"):
    ...

# Extract data.
path = Path('wx_data/sitka_highs_1944_2023.csv')
dates, highs = get_data(path)
print(f"\nFound {len(highs):,} data points.")

for num_years in [1, 3, 5, 10, 30]:
    timespan = num_years * 365
    avgs = highs.rolling(timespan).mean()

    # Plot the data.
    title = f"{num_years}-year Moving Average Temperature"

    filename = f"output/high_temps_{num_years}_year_moving_average.png"
    plot_data(dates, avgs, title, filename=filename)

    print(f"  Generated plot: {filename}")

After extracting the data, we set up a loop over each of the timespans we want to visualize. On each pass through the loop, we get the set of moving averages for the given timespan and plot the resulting data.

Here are two of the plots, for 1- and 5-year timespans:

Plot of 1-year moving averages, from about 1950 through 2022.
Plot of 5-year moving averages, from about 1950 through 2022.

Each of these is missing some data. When we use pandas to read in data from a CSV file, it doesn’t skip over invalid rows. Instead, it fills the missing values with NaN, which is short for Not a Number. If we keep these values in the dataset when running our analysis, they lead to holes in the visualizations.
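
Here’s a minimal sketch of this behavior, reading a three-row CSV (written inline as a string here) with one missing temperature:

import io

import pandas as pd

# The TMAX value is missing from the second row.
csv_data = "DATE,TMAX\n2023-01-01,41\n2023-01-02,\n2023-01-03,39\n"
df = pd.read_csv(io.StringIO(csv_data))

print(df)
#          DATE  TMAX
# 0  2023-01-01  41.0
# 1  2023-01-02   NaN
# 2  2023-01-03  39.0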

Examining invalid data

Before deciding what to do with the invalid data, it’s a good idea to see how much data is missing or invalid.

Here’s a snippet that tells us how many data points were invalid:

...
# Extract data.
path = Path('wx_data/sitka_highs_1944_2023.csv')
dates, highs = get_data(path)
print(f"\nFound {len(highs):,} data points.")

num_invalid = highs.isna().sum()
print(f"There are {num_invalid} invalid readings.")

The isna() method checks whether each item in a Series is NaN. The snippet highs.isna() returns a Series that’s the same length as highs, but consists only of the values True and False. Every NaN element in highs generates a True value in the resulting Series, and every other element generates a False value.

Here’s what the resulting Series looks like:

0       False
1       False
2       False
...
456     True
...
28092   False

In Python, int(True) is 1 and int(False) is 0. Calling sum() on a sequence of Boolean values is therefore a nice way to count how many True values there are in the sequence.
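
For example:

# Each True counts as 1, and each False counts as 0.
flags = [True, False, True, True]
print(sum(flags))    # 3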

All that to say, highs.isna().sum() counts the number of invalid readings in highs:

Found 28,093 data points.
There are 42 invalid readings.

Since we’re primarily interested in the 30-year timeframe, we can safely ignore 42 invalid readings. A 30-year window covers 30 * 365 = 10,950 readings, so 42 invalid readings will have a negligible impact on the average, at least in this exploratory work.
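
A quick back-of-the-envelope check shows just how small that impact is:

window_size = 30 * 365     # 10,950 readings in a 30-year window
print(42 / window_size)    # 0.0038..., less than 0.4% of one window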

Ignoring invalid data

pandas provides a straightforward way to drop invalid data: the dropna() method removes any rows containing NaN values from a DataFrame.

So, all we need to do is add one line to the get_data() function to generate clean plots:

def get_data(path):
    """Extract data from file."""
    df = pd.read_csv(path)
    df['DATE'] = pd.to_datetime(df['DATE'])

    # Drop invalid readings.
    df.dropna(inplace=True)
    
    return df['DATE'], df['TMAX']

The inplace=True argument tells pandas to modify the existing DataFrame, rather than returning a new DataFrame without the invalid data.2
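
Here’s a minimal sketch of the difference, using a small hand-built DataFrame:

import pandas as pd

df = pd.DataFrame({'TMAX': [41.0, None, 39.0]})

# By default, dropna() returns a new DataFrame and leaves the
# original untouched.
clean_df = df.dropna()
print(len(df), len(clean_df))    # 3 2

# With inplace=True, df itself is modified, and the method
# returns None.
df.dropna(inplace=True)
print(len(df))                   # 2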

Adding this one line fixes the holes in all of the plots. Here’s the 5-year plot:

Plot of 5-year moving averages, from about 1950 through 2022.

And here’s the updated plot for a 30-year moving average:

Plot of 30-year moving averages, from about 1975 through 2022.

It looks like the drop in the early 2010s has reversed, and Sitka’s 30-year average temperature has just about matched its all-time high.

Looking at performance

This program runs plenty fast, but let’s profile a version of this code that generates the 30-year plot, to see how it compares to the standard library approach:

$ python -m cProfile -s cumtime high_temp_trends_30_yr.py

Found 28,051 data points.
         1501721 function calls (1474345 primitive calls) in 1.608 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       84    0.002    0.000    1.792    0.021 __init__.py:1(<module>)
    586/1    0.006    0.000    1.609    1.609 {built-in method builtins.exec}
        1    0.000    0.000    1.609    1.609 high_temp...30_yr.py:1(<module>)
   695/10    0.004    0.000    1.023    0.102 <frozen...

Even with more data points, there are fewer function calls than in the standard library version. That makes sense, because we’re using pandas functions and methods in place of blocks of standard library code.

Without profiling, the 30-year program takes 1.1s to run, and the program that generates the full set of five plots takes only 2.2s.

Conclusions

While you can go a long way just using what’s in the Python standard library, it’s well worth your time to learn the basics of pandas if you’re going to be working with data on a regular basis. You can do more with less code, your code will be more readable, and it should run efficiently as well.

If your code starts to run slower than you like, make sure you profile it. Not everything in third-party libraries is better and faster, and a good profiling habit will help you identify the situations where third-party code is less efficient.

Resources

You can find the code files from this post in the mostly_python GitHub repository.

pandas documentation

If you’d like to look at some pandas documentation, here are a few good places to start:

pandas documentation home page: https://pandas.pydata.org/docs/
Getting started tutorials: https://pandas.pydata.org/docs/getting_started/index.html
10 minutes to pandas: https://pandas.pydata.org/docs/user_guide/10min.html

  1. Note that it can take much longer to run your program the first time. For example, when I first ran high_temp_trends.py, it took just over 12 seconds. Running it again immediately after that took about 0.5 seconds. This happens with many larger libraries, because once the library has been loaded, much of its code remains cached in RAM. When you haven’t used a library in a while, you’ll see that initial startup time again.

  2. By default, methods in pandas tend to return new objects rather than modifying existing data. This helps avoid accidental data loss.