Python Lists: A closer look, part 9

MP #17: Why modifying a list inside a loop doesn't work

Note: The previous post in this series discussed what to watch out for when passing a list as an argument. The next post explains why to avoid using an empty list as a default argument, and how to remove all occurrences of an element from a list.

The remaining posts in this series examine a few common “gotchas” that can happen when working with lists. If you’ve run into these issues before, hopefully these posts will help you understand the behavior you saw. If you haven’t run into these issues yet, hopefully you’ll recognize them when they do show up in your own work, and you’ll know how to address the issues when you see them.

In this post we’ll examine what happens when you try to modify a list inside a simple for loop. It might seem to work in some situations, but it will almost certainly cause incorrect and unpredictable behavior if you rely on this approach.

Removing unwanted items

One of the most common mistakes people make after learning about for loops is attempting to add or remove items while inside a loop. This doesn’t work, because in order to set up a for loop Python builds an iterator to efficiently process all the items in the list. If you change the number of items in the list while the loop is running, the indexes that Python is using for the list will no longer correspond to the items in the current version of the list.

As an example, let’s continue with the context of valid and invalid requests from the last post. Here’s a list containing a number of requests, and an attempt to remove the bad requests from the list:

# process_requests_gotcha.py

# Make a mix of valid and invalid requests.
requests = [
    'request_0', 'request_1_bad', 'request_2',
    'request_3_bad', 'request_4_bad', 'request_5',
    'request_6', 'request_7_bad', 'request_8',
    'request_9_bad', 'request_10', 'request_11_bad'
]

# Remove invalid requests.
for request in requests:
    if 'bad' in request:
        requests.remove(request)

# Show what's left.
print(requests)

We make a list of requests, and then write a simple for loop over the list. Inside the loop, we call remove() for any item that has the word 'bad' in it.

We expect to see only valid requests remaining, but here’s the output:

['request_0', 'request_2', 'request_4_bad',
 'request_5', 'request_6', 'request_8',
 'request_10']

Why some items weren’t removed

When Python executes a simple for loop, it uses indexes to keep track of which item is coming next. If the list changes during the loop’s iteration, those indexes no longer correspond 1:1 with the items in the list.

Let’s see exactly what’s happening by inserting some diagnostics into the loop:

# process_requests_gotcha_diagnostic.py

# Make a mix of valid and invalid requests.
requests = [
    ...
]

# Remove invalid requests.
for index, request in enumerate(requests):
    print(f"Examining request: {index} - {request}")
    if 'bad' in request:
        requests.remove(request)

# Show what's left.
print(requests)

We use enumerate() to access the index on each pass through the loop. Before checking each request, we print the index and the request that’s currently being examined.

Let’s look at the output one line at a time, to follow what’s going on. Here’s the first line of output:

Examining request: 0 - request_0

The index starts at 0, and the request that’s being examined is request_0, as we’d expect. It’s a valid request, so it stays in the list.

Examining request: 0 - request_0
Examining request: 1 - request_1_bad

The next index is 1, and the request that’s being examined is request_1_bad. This item is removed from the list.

Examining request: 0 - request_0
Examining request: 1 - request_1_bad
Examining request: 2 - request_3_bad

Here’s where things start to break down. The next index is 2; Python is processing the indexes in the correct order. However, the item corresponding to index 2 is no longer request_2, because that item got shifted one space to the left when request_1_bad was removed. Now the item at index 2 is request_3_bad. The item request_2 happens to be valid, but it was never even examined! On this iteration, request_3_bad is removed from the list.

Examining request: 0 - request_0
Examining request: 1 - request_1_bad
Examining request: 2 - request_3_bad
Examining request: 3 - request_5

The next index is 3, but now two items have been removed from the list. The current item at index 3 is actually request_5. This request is valid, but our luck has run out; request_4_bad was never examined, so it was never removed from the list.

This is why some invalid requests remain in the list, but others get removed. This also shows how people get away with making this kind of error sometimes: if all the skipped items happened to be valid, you’d probably assume your loop examined every item. This is a situation where people can end up writing code that seems to work on a small sample dataset, but falls apart on a different or larger dataset.

Here’s the rest of the output:

Examining request: 0 - request_0
Examining request: 1 - request_1_bad
Examining request: 2 - request_3_bad
Examining request: 3 - request_5
Examining request: 4 - request_6
Examining request: 5 - request_7_bad
Examining request: 6 - request_9_bad
Examining request: 7 - request_11_bad
['request_0', 'request_2', 'request_4_bad', ..., 'request_10']

The mismatch between the index and the request being examined grows as the loop continues. When there are no items left to examine, Python ends the for loop.1

The fix: use a comprehension

Usually, the best way to approach this situation is to use a comprehension. Rather than removing the items that we don’t want, we build a list of the items we do want:

# process_requests_good.py

# Make a mix of valid and invalid requests.
requests = [
    ...
]

# Keep only valid requests.
requests = [r for r in requests if 'bad' not in r]

# Show what's left.
print(requests)

This code works, but you should notice more than just the change in syntax; this is a different way of thinking about the problem. Instead of focusing on removing invalid requests, we shift to a focus on what it takes for a request to be considered valid. A comprehension doesn’t remove items from an existing sequence. It builds a new list based on the existing sequence.

Here’s the output:

['request_0', 'request_2', 'request_5', 'request_6',
 'request_8', 'request_10']

If the one-line comprehension doesn’t strike you as readable, consider using the multi-line version that many people prefer:

# Keep only valid requests.
requests = [
    request
    for request in requests
    if 'bad' not in request
    ]

This formatting makes it easier to use more descriptive variable names inside the comprehension, and it separates the different logical parts of the expression as well.

This is what people mean by Pythonic

You may have heard people speak of “Pythonic” code; this is a perfect example of what people mean by that phrase. There are a number of other ways to remove unwanted items from a sequence. For example you could write some code to manage the index yourself, so it doesn’t skip over any items as you’re modifying the list. That would be the appropriate solution in some languages.

However, addressing this situation with a comprehension is the most appropriate, readable, and (almost always) efficient way to do this in Python. When you write code that not only works, but uses Python in idiomatic ways, we say your code is Pythonic.

A quick caveat about large lists

There’s one main drawback to this approach for large lists. A comprehension always generates a new list; it doesn’t modify the original list. As Python processes the comprehension, it keeps the entire original list in memory until the comprehension is complete.

So, for a very large list where you’re retaining a significant percentage of the items, your program will hold onto the entire original list while building a second large list. In this situation, you may want to write a less Pythonic but more efficient solution that removes the items from the original list, without the burden of building a second list.

What if you’re not comfortable with comprehensions yet?

If you’re not yet comfortable using comprehensions, or if the logic for processing each item is more complex, create a new list using a loop. Start by creating an empty list, and set up a loop over the existing list. Build the items you want inside the loop, and append each item to the new list.

Here’s what that would look like for the example in this post:

# Keep only valid requests.
valid_requests = []
for request in requests:
    if 'bad' not in request:
        valid_requests.append(request)

This is longer than the comprehension, but it’s quite readable and it uses the same logic as a comprehension. We’re still building a list of items we want, rather than trying to remove items we don’t want.

Conclusions

Although you might have gotten away with it previously in simple situations, don’t modify the contents of a list using a simple for loop. When you need to modify the contents of a list, consider using a comprehension to build the list you want from the list you have. If a comprehension isn’t working, write a full loop to generate the new list instead.

Resources

You can find the code files from this post in the mostly_python GitHub repository.


Further exploration

1. Using Python Tutor

It can be really helpful to see what’s happening to a list when you modify it inside a loop. Try running the code from this example on Python Tutor.


  1. When executing a simple for loop, Python actually runs a while loop using the built-in next() function. Here’s a representation of how Python implements the for loop shown in this post:

    # process_requests_iter.py
    
    # Make a mix of valid and invalid requests.
    requests = [
        'request_0', 'request_1_bad', 'request_2',
        'request_3_bad', 'request_4_bad', 'request_5',
        'request_6', 'request_7_bad', 'request_8',
        'request_9_bad', 'request_10', 'request_11_bad'
    ]
    
    # Remove invalid requests.
    requests_iter = iter(requests)
    while True:
        try:
            request = next(requests_iter)
        except StopIteration:
            break
        else:
            if 'bad' in request:
                requests.remove(request)
    
    # Show what's left.
    print(requests)

    Python builds an iterator object from the list requests. It then sets up an infinite while loop. On each pass through the loop, it calls next() to get the next request from the iterator. If it gets a request, it runs the code in the else block and checks for a bad request. When the iterator reaches the end of the sequence, it returns a StopIteration exception, and execution breaks out of the while loop.

    An iterator can’t move backwards; next() only ever moves forward through a sequence. It does this by keeping track of which index it should use on the next iteration. So when you remove an item from a sequence that’s being iterated, you end up skipping over some items in the sequence.