Measuring efficiency

MP 134: Be careful about performance assumptions until you measure them!

Note: In the next few weeks, I'm going to slow the pace slightly on Mostly Python. I've been writing at least one post per week for two full years now. I really enjoy writing these posts, but that's become an unsustainable pace; I can't seem to write many short posts! I start writing about something, and I want to keep following the interesting threads that come up. I've often wondered if it's a bit much to read on a weekly basis as well.

I'll be writing at least one post every two weeks. That's a much more sustainable pace, and keeps me from forcing some posts just to get something out. When appropriate, I'll release posts more frequently from time to time.

If you're a paid subscriber and this makes you reconsider your paid subscription, please feel free to reach out and I'll cancel or refund your subscription. Thank you for reading, and thank you for supporting my ongoing work.


A couple weeks ago I wrote about the any() function. It's a really useful built-in function that I've used regularly over the years, but it's often taken me a round of refactoring to actually use it in my projects.

In that post I showed that any() can accept a sequence such as a list, but it can also accept a generator expression. I stated that using a generator expression should be faster than using a list:

I believe this version [using a generator expression] should be more efficient, as it should stop at the first item in the sequence that meets the specified condition. If you pass a comprehension, I believe the entire list will be built before any() is executed.

Since I wrote that I've been wanting to check if it's true. List comprehensions often surprise me when it comes to performance, so I haven't been confident that this claim was true. Let's find out!

Numbers instead of fruit

In the original post I used an example about fruit and desserts. That was nice for introducing concepts, but it's hard to make sequences of varying sizes with such a specific concept. Let's use numbers instead, using the range() function.

Let's look at what range() produces, and what it means for use with any():

>>> range(5)
range(0, 5)
>>> type(range(5))
<class 'range'>
>>> list(range(5))
[0, 1, 2, 3, 4]
>>> list(range(-5, 1))
[-5, -4, -3, -2, -1, 0]

The range() function by itself doesn't do much more than define an expression for generating a sequence. If you call type() on a range expression, you get back the class range. You can either use it as a generator expression, or cast it to a list to get the full sequence all at once. You can start a sequence at a negative value, but you have to include explicit start and end values. (If you run range(-5) it will try to generate a sequence of increasing integers from 0 to -5, which has no values.)

What's important to note here is that we have a sequence of one value that evaluates to False, and then a bunch values that evaluate to True:

>>> [bool(x) for x in range(5)]
[False, True, True, True, True]

If we want the False value to appear at the end of the sequence, we can start the range with a negative value:

>>> [bool(x) for x in range(-5, 1)]
[True, True, True, True, True, False]

These two sequences won't have much performance difference with any(), because it will find a True value at the first or second position, and not look at the rest of the list. However, throwing a not in there creates a list where the only True value appears at the end of the list:

>>> [not bool(x) for x in range(-5, 1)]
[False, False, False, False, False, True]

Now we can create a variety of expressions and lists that we can feed into any(), and explore how it performs under different conditions.

Using any() with an expression

One of the first posts I ever wrote on Mostly Python was a brief intro to profiling in Python, because it's such an important topic. Let's use timeit to evaluate the sequences we've been looking at.

Let's call any() with an expression where the first item evaluates to False, and the rest of the items evaluate to True:

$ python -m timeit "any(x for x in range(5))"
1000000 loops, best of 5: 371 nsec per loop

The any() call here was executed 1,000,000 times, and the average of the fastest 5 runs was 371 nanoseconds.

Using any() with a list of values

Now let's use the same range of values, but we'll cast it to a list in the call to any():

$ python -m timeit "any(list(x for x in range(5)))"
1000000 loops, best of 5: 371 nsec per loop

This also ran a million times, and had the same exact performance! Maybe any() sees the list() call and drops it because it's not needed? Maybe it's so little work to create a short list that there's no difference between evaluating a generator expression and evaluating a short list? Let's run some more tests.

Expressions and lists

Let's take a step back and compare the time it takes to create a generator expression, vs the time it takes to create a list from a generator expression:

$ python -m timeit "(x for x in range(5))"
2000000 loops, best of 5: 148 nsec per loop
$ python -m timeit "list(x for x in range(5))"
1000000 loops, best of 5: 338 nsec per loop

These results make sense to me. It takes 148 nanoseconds to define a generator expression. It takes a little more than twice as long to generate a list from that same generator expression.

This doesn't clarify anything for me about why the two any() calls took exactly the same amount of time. It seems like, even with a short sequence, calling any() on an expression should run faster than calling it on a fully-evaluated list.

Different lengths of sequences

Let's see if the performance diverges with a longer sequence. Let's first look at the time to create the sequence itself:

$ python -m timeit "(x for x in range(1_000))" 
2000000 loops, best of 5: 156 nsec per loop
$ python -m timeit "list(x for x in range(1_000))"
10000 loops, best of 5: 28.3 usec per loop

This makes sense. It takes a tiny bit longer to define an expression for a longer sequence, but it's still on the order of 150 nanoseconds. An expression is really just a rule for defining a sequence, and it doesn't take much time to define the rule.

Generating the actual list of elements, however, takes a lot longer. If you're not too familiar with fractions of a second, 28.3 microseconds is 28,300 nanoseconds. It takes about 200 times as long to build a list of 1,000 items as it does to define an expression for that same sequence.

Now let's see how that carries over to any():

$ python -m timeit "any(x for x in range(1_000))"  
1000000 loops, best of 5: 378 nsec per loop

This is what I expected to see. It takes 378 nanoseconds for an expression with 1,000 items, compared to 371 nanoseconds for the expression with 5 items. Calling any() with an expression for a sequence where one of the earliest items evaluates to True takes little time, because most of the sequence is not evaluated. The call to any() returns True when it hits that first True value in the expression, and it doesn't examine any of the remaining items.

Here's what it looks like for a list of 1,000 items:

$ python -m timeit "any(list(x for x in range(1_000)))"
10000 loops, best of 5: 28.2 usec per loop

This makes sense as well. Even though the second item in the list that's being created evaluates to True, and any() returns True as soon as it finds that item, we're telling Python to build the entire list and then feed it to any(). As a result of all that work, it takes 28.2 microseconds (28,200 nanoseconds) to run the overall line of code.

Calling any() when only the last item is True

Expressions start to win for performance when a True value appears early in the sequence. Let's run some of the same tests with a sequence where only the last item evaluates to True:

$ python -m timeit "any(not x for x in range(-5, 1))"      
500000 loops, best of 5: 483 nsec per loop
$ python -m timeit "any(list(not x for x in range(-5, 1)))"
500000 loops, best of 5: 423 nsec per loop

This is quite surprising! When most or all of a short sequence needs to be examined, it's faster to build a list and pass it to any(), than it is to give it the equivalent expression. There's also a pretty significant difference between these two times. 60 nanoseconds isn't much in absolute time, but it's a 14% difference relatively.

It's worth pointing out that both these runs take longer than the equivalent runs where the second item in the sequence was True. That makes sense intuitively, because any() is doing more work when the only True value comes at the end of the list.

Now let's look at a longer expression, and a longer list:

$ python -m timeit "any(not x for x in range(-1_000, 1))"
10000 loops, best of 5: 30.6 usec per loop
$ python -m timeit "any(list(not x for x in range(-1_000, 1)))"
10000 loops, best of 5: 34.1 usec per loop

Now there's a reversal. With 1,000 items in the sequence, the expression-based approach is a little faster. Notice that both runs take on the order of 30 microseconds, though. The call to any() involves looking at every item in the sequence, all the way up to the last item. This seems to be a comparison between retrieving items from a generator until it's exhausted, vs building a list and then accessing every item in the list one at a time.

There's some interesting optimization happening somewhere here, though. I was expecting to see a roughly 2x performance hit when using a full list. In that approach, Python needs to iterate over the entire range to build the list, and then iterate over the entire list when running any(). When calling any() with just the expression, there's only one iteration over the range. I think the performance hit is much lower than 2x because the low-level code for building lists from generator expressions is highly optimized, and most of the time here is spent accessing items from the list after it's built.

This difference doesn't change much with an even longer list:

$ python -m timeit "any(not x for x in range(-1_000_000, 1))"
10 loops, best of 5: 30.2 msec per loop
$ python -m timeit "any(list(not x for x in range(-1_000_000, 1)))"
10 loops, best of 5: 33.9 msec per loop

If you have a million items and only the last one is True, calling any() takes about 30 milliseconds regardless of which approach you use.

Conclusions

For most real-world projects, I'm a firm believer that a good approach is to build something that works, and then consider performance. Performant code that doesn't do what it's supposed to creates way more problems than slow, correct code. Many times, initial code that works is fast enough. Most of us would be perfectly happy with code that runs in 300 nanoseconds, and not find any real-world benefit from modifying that code to run in 30 nanoseconds, even though there's technically a 10x performance improvement.

When you do get to the point of focusing on performance, intuition and understanding of what's happening is important. But Python has gone through decades of performance optimizations, and the underlying hardware and software has gone through similar changes. Intuition is important, but it has to be tied to measurement. There are plenty of surprising and counterintuitive results when people actually measure performance, as exemplified by some of the results in this post.

All that said, you shouldn't write slow code on purpose. If you know a function like any() can accept an expression and you don't need the full list for anything else, pass it an expression instead of a list. If you have a large list and need to call a function like any() and it becomes a bottleneck, consider whether you can write an expression that's equivalent to the list and feed that to any() instead. But regardless of the approach you use, run some performance tests and prove to yourself (and others) that your approach is actually more efficient in your use case.