Python Lists: A closer look, part 5

MP #10: Profiling Python code to support focused optimization efforts.

Note: The previous post in this series discussed when to use tuples instead of lists. The next post discusses when to consider using NumPy arrays.

Knowing how to profile your code is a critical skill. At some point in your career[1] you’ll almost certainly work on some code that runs slower than you’d like. When this happens, profiling your code will tell you exactly which parts of the program are slowing everything down, so you’ll know where to focus your optimization efforts. You’ll also be able to prove that your proposed changes will result in measurable improvements to the program’s performance.

We’ll look at two profiling tools, timeit and cProfile. Both of these are included in the Python standard library. We’ll look at examples for each tool, and use cProfile to optimize a small program. If you’ve never used profiling tools before, you’re in for a treat: you’ll gain much greater insight into what happens when you run your programs.

Timing snippets of Python code

The timeit module is used for profiling short snippets of Python code. It typically runs your code many times, and reports an accurate estimate of the time it takes to run that snippet.

Let’s see how long it takes Python to create a list with five hard-coded integers in it:

squares = [1, 4, 9, 16, 25]

The most straightforward way to use timeit is to run it on the command line. To profile a single line of code, call the timeit module and include the line of code in quotes:

$ python3 -m timeit "squares = [1, 4, 9, 16, 25]"
5000000 loops, best of 5: 46.6 nsec per loop

Profiling code is highly dependent on what else is happening on your system, and there can be a lot going on in the background of today’s systems. To address that, timeit runs your code many times, and looks at all of the results. The output here shows that the code being tested was run in a loop of 5,000,000 iterations, and 5 of these loops were completed. Python takes the best of these 5 loops, and reports the average time per iteration within that loop. In this case, the snippet took 46.6 nanoseconds to run.[2]

You might ask why Python focuses on the fastest loop when reporting the execution time. When measuring code, profiling tools typically look for the fastest execution time because that’s the best indication of how long Python is actually taking to run your code. Anything that slows down execution time is almost certainly something outside of Python’s control, such as another task that’s being run in the background.

If you’re unfamiliar with some of these units, here’s a brief summary:

A nanosecond is one billionth of a second. Nanoseconds are often abbreviated as ns; timeit reports them as nsec.

A microsecond is one millionth of a second, equivalent to 1,000 nanoseconds. Microseconds are often abbreviated as µs; timeit reports them as usec.

A millisecond is one thousandth of a second, equivalent to 1,000 microseconds, or 1,000,000 nanoseconds. Milliseconds are often abbreviated as ms; timeit reports them as msec.

So Python takes about 47 nanoseconds to build a list with five hard-coded integers. Let’s see if building a tuple is faster:

$ python3 -m timeit "squares = (1, 4, 9, 16, 25)"
50000000 loops, best of 5: 13.2 nsec per loop

Wow, it only takes about 13 nanoseconds! It’s tempting to make broad generalizations from a comparison like this; some people will run one test like this and conclude that tuples are always faster than lists. But that’s far too broad a statement to make. We can make much more specific claims, however, such as: defining a short tuple of hard-coded integers is faster than defining a short list of the same integers.
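
One reason for that gap, at least in CPython: a tuple made entirely of constants can be folded into a single constant at compile time, while a list has to be constructed fresh every time the line runs. You can see this with the standard library’s dis module; here’s a minimal sketch (the exact bytecode varies by Python version):

import dis

# The tuple literal is constant-folded, so loading it is a single
# LOAD_CONST instruction; the list still has to be built at runtime.
dis.dis("squares = (1, 4, 9, 16, 25)")
dis.dis("squares = [1, 4, 9, 16, 25]")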

Let’s see how well that comparison holds up for a much longer sequence. Here’s a list with one million integers:

$ python3 -m timeit "squares = [x**2 for x in range(1_000_000)]"
1 loop, best of 5: 75.1 msec per loop

On my system, this took 75 milliseconds. Notice that this code was run only 5 times in total (a loop count of 1, with 5 repetitions of the timing loop). Python tries to run the code multiple times, but also tries to balance that with how long it will take to run the profiling test. You can specify exactly how many times to run the snippet, if that’s important in your profiling work.
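
On the command line, the -n option sets the loop count and -r sets the number of repetitions; the module’s Python API offers the same control. Here’s a minimal sketch using timeit.repeat():

import timeit

# Run the snippet 10 times per timing loop, repeating the loop 5 times;
# this mirrors -n 10 -r 5 on the command line.
times = timeit.repeat(
    "squares = [x**2 for x in range(1_000_000)]",
    number=10,
    repeat=5,
)
print(f"Best: {min(times)/10:.4f} sec per run")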

Let’s see how long it takes to make the same sequence as a tuple:

$ python3 -m timeit "squares = tuple(x**2 for x in range(1_000_000))"
1 loop, best of 5: 93.7 msec per loop

It actually takes longer to make a tuple with a million integers from a generator expression. As mentioned briefly in the last post in this series, Python needs to dynamically generate this sequence and then turn it into the static tuple object.

Profiling multiline statements with timeit

The timeit module is really meant for testing small snippets, but you can also test multiline blocks. To do this, you pass each line as a separate argument in the call to timeit. Let’s see how an append loop compares to a comprehension when building a long list.

Here’s how we’d build a list of one million square numbers with an append loop:

squares = []
for x in range(1_000_000):
    squares.append(x**2)

Each of these lines becomes its own string when passed to timeit:

$ python3 -m timeit "squares = []" "for x in range(1_000_000):" "  squares.append(x**2)"
1 loop, best of 5: 80.2 msec per loop

Each indented line needs leading spaces to represent its indentation. The append loop takes 80 milliseconds, compared to 75 milliseconds for the comprehension.
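
If the separate-arguments syntax gets awkward, the Python API accepts the whole block as a single triple-quoted string; here’s a minimal sketch:

import timeit

# The same append loop, passed as one multiline statement.
stmt = """
squares = []
for x in range(1_000_000):
    squares.append(x**2)
"""
print(f"{timeit.timeit(stmt, number=10)/10:.4f} sec per run")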

Don’t go overboard in profiling multiline blocks. If you have more than a few lines to profile, you’re probably better off using a tool that’s meant for profiling whole programs and larger blocks. With that in mind, let’s move on to cProfile.

Profiling entire programs

The cProfile module, included in the standard library, is meant to be run over entire programs. The simplest way to get started with it is on the command line. When you run cProfile against one of your program files, it gives you a complete breakdown of where your program spends its time.

Let’s look at a program that builds a long list of integers, and then reports the sum of those integers:

# sum_squares.py

def get_squares():
    """Return a list of square numbers."""
    squares = []
    for x in range(50_000_000):
        square = x**2
        squares.append(square)

    return squares

def get_sum(squares):
    """Return the sum of all squares."""
    total = 0
    for square in squares:
        total += square

    return total

squares = get_squares()
total = get_sum(squares)
print(f"Sum: {total:,}")

I structured this program using functions, because it will make the profiling results a little more interesting. Also, most real-world programs that slow down enough to be worth profiling will have some kind of structure to them involving classes and functions.

Here’s the output, without any profiling:

Sum: 41,666,665,416,666,675,000,000

The sum of the first fifty million squares is over 41.6 sextillion. Note that this program, without any profiling, took just over 6 seconds to run on my system.

To profile this program, we run it through the cProfile module:

$ python3 -m cProfile -s cumtime sum_squares.py
Sum: 41,666,665,416,666,675,000,000
         50000006 function calls in 20.906 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   20.906   20.906 {built-in method builtins.exec}
        1    0.000    0.000   20.906   20.906 sum_squares.py:1(<module>)
        1   13.250   13.250   17.413   17.413 sum_squares.py:1(get_squares)
 50000000    4.163    0.000    4.163    0.000 {method 'append' of 'list' objects}
        1    3.493    3.493    3.493    3.493 sum_squares.py:10(get_sum)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.print}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

There’s quite a lot here, even though this is much less output than you’ll typically see when profiling a whole program. Let’s go through this one section at a time.

Summary information

In the command that was run, the argument -s cumtime tells Python to sort the output by the cumulative time each section of code took to run. That’s helpful, because cumulative time is a number we can focus on decreasing.

The first line of output is the output of the actual program, which we saw when we ran it without any profiling. The next line tells us that 50,000,006 function calls were made when running this program, and that the overall execution time, including profiling, took 20.906 seconds.

Running your code through a profiling tool will always take longer than running it without profiling. Python will never have less work to do when profiling. It has to run your code, and do a bunch of work to keep track of how long all the parts of your code are taking to run.
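
One more command-line option worth knowing: cProfile can save its raw results to a file with -o, and the standard library’s pstats module can load and re-sort them later. A minimal sketch (the filename is just an example):

$ python3 -m cProfile -o profile_results sum_squares.py

import pstats

# Load the saved results, sort by cumulative time, and show the
# ten most expensive entries.
stats = pstats.Stats("profile_results")
stats.sort_stats("cumtime").print_stats(10)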

The first two lines of measurements

Here are the first two lines of measurements that cProfile reports:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   20.906   20.906 {built-in method builtins.exec}
        1    0.000    0.000   20.906   20.906 sum_squares.py:1(<module>)

The code that’s being reported on here is a built-in method called builtins.exec. This is a function that Python calls to run the program. Effectively, this is a report of how long it took to run the entire program, including profiling. The second line refers to the entire sum_squares.py module. It also represents the entire run time.

The call to get_squares()

Here’s the first line of output that refers to a specific function call:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      ...
        1   13.250   13.250   17.413   17.413 sum_squares.py:1(get_squares)

This line refers to the call to get_squares(). There is 1 call (ncalls) to this function, and it spends 17.413 seconds (cumtime) in this function. The cumulative time measurement includes any calls that are made from within the function get_squares().

The total time (tottime) excludes any calls to functions from within get_squares(). In this case, the total time excludes time spent on the append() call, while the cumulative time includes that time. I like to focus on the cumulative time, because I usually want to know how long a particular function takes regardless of where that time is spent within the function.[3]
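
If the distinction between tottime and cumtime feels abstract, here’s a minimal sketch you can run to see it (the function names are just illustrative):

import cProfile
import time

def helper():
    time.sleep(0.5)

def outer():
    helper()
    time.sleep(0.25)

# outer's tottime is near zero, because its own code does almost no work;
# its cumtime is roughly 0.75 seconds, because cumtime includes the time
# spent in helper() and in the sleep calls.
cProfile.run("outer()", sort="cumtime")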

The call to append()

Here’s the line that reports on the call to append():

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      ...
 50000000    4.163    0.000    4.163    0.000 {method 'append' of 'list' objects}

This is a little more interesting. There were 50,000,000 calls to append(), which is exactly what we’d expect to see when building a sequence of 50,000,000 items. Each individual call was too quick to register at this precision (the percall columns show 0.000), but completing all of these calls took 4.163 seconds.

The call to get_sum()

Here’s the report for the call to get_sum():

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      ...
        1    3.493    3.493    3.493    3.493 sum_squares.py:10(get_sum)

There was one call to get_sum(), and it took 3.493 seconds to run.

The rest of the output

Here’s the rest of the output:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      ...
        1    0.000    0.000    0.000    0.000 {built-in method builtins.print}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

There’s a call to print(), and then a call to end the profiling work. Neither of these takes more than a thousandth of a second.

Making sense of it all

Let’s look at the output for the parts of the program we have control over, and try to make sense of what it tells us about any optimization efforts we might undertake:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      ...
        1   13.250   13.250   17.413   17.413 sum_squares.py:1(get_squares)
 50000000    4.163    0.000    4.163    0.000 {method 'append' of 'list' objects}
        1    3.493    3.493    3.493    3.493 sum_squares.py:10(get_sum)

First of all, most of the execution time is spent in get_squares(). Any optimization efforts we make outside of get_squares() might help, but will never bring the execution time under 17.4 seconds (when profiling). This is really helpful information in more complex projects, where we might be tempted to focus on some other part of the codebase. Profiling points you directly to the parts of your program that represent actual bottlenecks.

Likewise, any optimization we do on the get_sum() function can save us at most 3.5 seconds. Let’s make one optimization to each of these functions, and see what the new profiling output looks like.

Making optimizations

To illustrate a realistic optimization process, we’ll make two rounds of optimizations and rerun cProfile against each version of the program.

Optimizing get_squares()

We can reduce get_squares() to a single line of code:

def get_squares():
    """Return a list of square numbers."""
    return [x**2 for x in range(50_000_000)]

Let’s profile the program again, and see how the measured performance changes:

$ python3 -m cProfile -s cumtime sum_squares.py
Sum: 41,666,665,416,666,675,000,000
         7 function calls in 8.211 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    8.211    8.211 {built-in method builtins.exec}
        1    0.000    0.000    8.211    8.211 sum_squares.py:1(<module>)
        1    0.000    0.000    4.778    4.778 sum_squares.py:1(get_squares)
        1    4.778    4.778    4.778    4.778 sum_squares.py:3(<listcomp>)
        1    3.433    3.433    3.433    3.433 sum_squares.py:5(get_sum)

This is pretty interesting; we went from just over 50 million function calls to only 7. This highlights one of the main advantages of building a list using a comprehension instead of an append() loop. The comprehension builds the entire list using lower-level code, while the loop-based approach involves a separate call to append() for each element in the sequence.

This simplification drops the time spent in get_squares() from 17.4 seconds to 4.8 seconds. The overall execution time, with profiling, dropped from 20.9 seconds to 8.2 seconds.
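
As an aside, the dis module shows what that lower-level code looks like: inside a comprehension, CPython appends each element with a dedicated LIST_APPEND bytecode instruction, rather than looking up and calling the append method on every pass (details vary by Python version). A rough sketch:

import dis

# The comprehension's inner code appends with LIST_APPEND, avoiding a
# method lookup and call for every element.
dis.dis("[x**2 for x in range(10)]")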

Optimizing get_sum()

Python has a built-in function for calculating the sum of a sequence of numbers, which is almost certainly faster than our own implementation. Let’s use that in get_sum() and see how much difference it makes:

def get_sum(squares):
    """Return the sum of all squares."""
    return sum(squares)

Here are the results of profiling sum_squares.py with this change:

$ python3 -m cProfile -s cumtime sum_squares.py
Sum: 41,666,665,416,666,675,000,000
         8 function calls in 5.955 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    5.955    5.955 {built-in method builtins.exec}
        1    0.000    0.000    5.955    5.955 sum_squares.py:1(<module>)
        1    0.000    0.000    4.794    4.794 sum_squares.py:1(get_squares)
        1    4.794    4.794    4.794    4.794 sum_squares.py:3(<listcomp>)
        1    0.000    0.000    1.161    1.161 sum_squares.py:5(get_sum)
        1    1.161    1.161    1.161    1.161 {built-in method builtins.sum}

We’ve added a function call, because we’re using the built-in function sum() instead of writing our own code to calculate the sum. But that built-in function is more efficient than our loop, so the execution time for get_sum() drops from 3.4 seconds to 1.2 seconds. This isn’t a huge speedup, but it’s helpful, and the code is simpler as well. When simpler code runs faster, it’s almost always a straightforward decision to use the simpler code.

Checking performance without profiling

Execution time has dropped, but we’ve been running sum_squares.py with profiling. How fast is it now without profiling?

$ time python3 sum_squares.py 
Sum: 41,666,665,416,666,675,000,000
python3 sum_squares.py  5.05s user 0.56s system 99% cpu 5.632 total

The overall execution time has dropped from about 6 seconds to about 5 seconds, for a speedup of about 17%.

This example was primarily meant to show how clearly the output of cProfile can point you to parts of your code that are worth optimizing, and help you prove that your optimizations have led to a measurable decrease in execution time. There are other optimizations that can be made to this code, but we’ll leave those for a later post.
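
If you’d rather trigger profiling from inside a program than on the command line, cProfile has a programmatic interface as well. Here’s a minimal sketch that profiles a single call:

import cProfile

def get_squares():
    """Return a list of square numbers."""
    return [x**2 for x in range(50_000_000)]

# Profile just this call, with output sorted by cumulative time.
cProfile.run("squares = get_squares()", sort="cumtime")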

Conclusions

When your code slows down, your intuition and understanding of how Python works are invaluable in revising the code so it runs more efficiently. However, that’s not enough: a profiling tool and an overall profiling strategy will either confirm what you’re already thinking, or point you to unexpected bottlenecks. Profiling also lets you measure the actual impact of specific optimizations.

We almost always have to make a tradeoff between writing simple, readable code, and writing highly efficient code. When a program slows down, we often need to introduce some additional complexity to improve efficiency. Profiling helps you justify the additional complexity needed to make improvements, while avoiding complexity that doesn’t meaningfully improve efficiency.[4]

Resources

You can find the code files from this post in the mostly_python GitHub repository. You might also be interested in looking over the official documentation for timeit, and for cProfile as well.


Further exploration

1. Using timeit

Choose two approaches to a simple task in Python, or two variations of a simple task. For example, convert "hello" to title case, and "hello python world" to title case. Use timeit to determine how much slower one version is than the other.

2. Using cProfile

Run cProfile against some code you've written that has a moderate degree of complexity. Does it point to any bottlenecks you weren't specifically aware of already?


  1. I use the word career to describe the work that all programmers do, whether it’s paid or unpaid work. Most people who learn programming write code off and on all their lives, whether it’s their full-time job or just a hobby.

  2. The wording in this output strikes me as a little misleading. It reads as if it’s running 5,000,000 loops. It’s actually running 5 loops with 5,000,000 iterations in each loop. Here’s what the official documentation says:

    In the output, there are three fields. The loop count, which tells you how many times the statement body was run per timing loop repetition. The repetition count (‘best of 5’) which tells you how many times the timing loop was repeated, and finally the time the statement body took on average within the best repetition of the timing loop. That is, the time the fastest repetition took divided by the loop count.

  3. To save you some time searching through the documentation, here’s what the official docs say about these columns:

    ncalls: the number of calls

    tottime: the total time spent in the given function (and excluding time made in calls to sub-functions)

    percall: the quotient of tottime divided by ncalls

    cumtime: the cumulative time spent in this and all subfunctions (from invocation till exit). This figure is accurate even for recursive functions.

    percall: the quotient of cumtime divided by primitive calls

  4. I recognize that the optimizations in this example made the program simpler and faster. That was intentional, to keep the focus on understanding the output of cProfile with a fairly small example program. Oftentimes you start with simpler code that needs to be made more complex in order to run more efficiently. That tradeoff is an interesting one, and concrete, measurable profiling data is incredibly valuable in deciding whether those tradeoffs are worthwhile.