Python Lists: A closer look, part 7

MP #14: When to use sets

Note: The previous post in this series discussed NumPy arrays. The next post focuses on what happens when you pass a list as an argument to a function.

Python’s sets don’t get as much attention as lists or arrays. But when they fit your use case, they make certain kinds of operations simple, elegant, and efficient.

When I first learned about Python I didn’t come across sets right away, so I implemented some set functionality from scratch using lists for a while. When I finally learned about sets, some of my code became much simpler and more readable.

Developer surveys

Each year Stack Overflow conducts a survey for programmers, and they make a sanitized version of their dataset public. For this example I’ll focus on the responses to one question:

Which programming, scripting, and markup languages have you done extensive development work in over the past year?

I’ve saved all of the responses to this question from the 2022 survey in a JSON file. Here’s the first part of that file:

# responses.json

["NA", "JavaScript", "TypeScript", ..., "SQL", "TypeScript"]

Let’s load this into Python, and see how many responses there were:

# explore_responses.py

from pathlib import Path
import json

path = Path('responses.json')
contents = path.read_text()
responses = json.loads(contents)

num_responses = len(responses)
print(f"Found {num_responses:,} responses.")

We read the contents of the file and create a list called responses. Here’s the output:

Found 370,114 responses.

About 70,000 people took part in the survey. But respondents could select more than one language, so there were about 370,000 responses to this question! When I see data like this, I almost always want to know how many unique responses were given.

A naive solution to finding unique items

Before I learned about sets, here’s how I would have approached this problem:

# explore_responses_naive.py

...
# Find the unique responses.
unique_responses = []
for response in responses:
    if response not in unique_responses:
        unique_responses.append(response)

num_unique = len(unique_responses)
print(f"Found {num_unique} unique responses.")
print(unique_responses)

My approach was always to make a new empty list called something like unique_responses, and then loop over the original list. For every item in the original list, I’d add it to unique_responses only if it wasn’t already there.

This is relatively straightforward code, and it works:

Found 370,114 responses.
Found 43 unique responses.
['NA', 'JavaScript', 'TypeScript', ..., 'Solidity', 'COBOL']

What are sets?

A set is a collection where every item must be unique. If you build a set from an existing collection, the set will contain every item in the original collection, but each item will only appear once:

>>> languages = ['C', 'C++', 'C', 'C#', 'C++', 'C#', 'C']
>>> set(languages)
{'C#', 'C', 'C++'}

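This is simple and clean. It’s also efficient: sets are hash-based, so Python can spot a duplicate without scanning every item it has already collected.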
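To get a rough feel for the efficiency difference, here’s a minimal sketch that times both approaches. It uses made-up data rather than the survey file; the labels, the sample size, and the function names unique_with_list() and unique_with_set() are all assumptions for illustration. Exact timings will vary by machine, so none are shown here.

```python
import random
import timeit

# Hypothetical stand-in for the survey data: 20,000 picks from 43 labels.
random.seed(0)
labels = [f"lang_{i}" for i in range(43)]
sample = [random.choice(labels) for _ in range(20_000)]


def unique_with_list(items):
    """List-based approach: each `not in` check scans the list so far."""
    unique = []
    for item in items:
        if item not in unique:
            unique.append(item)
    return unique


def unique_with_set(items):
    """Set-based approach: duplicates are dropped as the set is built."""
    return set(items)


# Run each approach a few times and report the total time taken.
list_time = timeit.timeit(lambda: unique_with_list(sample), number=5)
set_time = timeit.timeit(lambda: unique_with_set(sample), number=5)

print(f"list approach: {list_time:.4f}s")
print(f"set approach:  {set_time:.4f}s")
```

Both functions find the same unique items. The gap between them tends to grow as the number of distinct items grows, because the list version has more to scan on every check.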

A better solution to the survey problem

This makes the survey analysis code much nicer:

# explore_responses_set.py

...
# Find the unique responses.
unique_responses = set(responses)

num_unique = len(unique_responses)
print(f"Found {num_unique} unique responses.")
print(unique_responses)

The same work is done in a single line of code. The results are identical, except we have a set instead of a list:

Found 370,114 responses.
Found 43 unique responses.
{'Ruby', 'COBOL', 'Swift', ..., 'Scala', 'Bash/Shell'}

Sets are unordered, so this truncated listing shows different items than the list version did, but both contain the same 43 items. If you need to keep working with a list, you can change the line shown above to:

unique_responses = list(set(responses))
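One thing to watch for: converting a set back to a list gives you the items in an arbitrary order. If order matters, here are two options you might consider. This is a sketch using a short made-up list, not the real survey data, and the names unique_in_order and unique_sorted are just for illustration.

```python
# Hypothetical sample; the real file has far more entries.
responses = ["NA", "JavaScript", "TypeScript", "JavaScript", "SQL", "TypeScript"]

# Dictionary keys are unique and keep insertion order (guaranteed since
# Python 3.7), so this keeps the first occurrence of each item.
unique_in_order = list(dict.fromkeys(responses))
print(unique_in_order)  # ['NA', 'JavaScript', 'TypeScript', 'SQL']

# sorted() accepts a set and returns a list in a predictable order.
unique_sorted = sorted(set(responses))
print(unique_sorted)  # ['JavaScript', 'NA', 'SQL', 'TypeScript']
```

The dict.fromkeys() version is not a set operation at all; it’s a different technique that happens to solve the same problem while preserving order.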

More about sets

If you haven’t worked with sets before, there are a few things you should know:

Defining sets

Sets are indicated by curly braces [1]:

>>> languages = {'Python', 'Rust', 'C'}
>>> languages
{'Python', 'C', 'Rust'}

Sets are unordered collections, not sequences. Because they’re unordered, there is no concept of indexing with sets. You can loop over a set using a standard for loop, but you can’t create slices.
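Here’s a short sketch of what that looks like in practice. It also shows one related gotcha: empty curly braces create a dictionary, so an empty set has to be made with set(). The variable name ordered is just for illustration.

```python
languages = {'Python', 'Rust', 'C'}

# Indexing fails because a set has no positions.
try:
    languages[0]
except TypeError as error:
    print(f"TypeError: {error}")

# Looping works, but the order is not something to rely on.
for language in languages:
    print(language)

# Use sorted() when a stable order matters; it returns a list.
ordered = sorted(languages)
print(ordered[0])  # C

# Empty braces make a dictionary, so an empty set needs set().
print(type({}).__name__)     # dict
print(type(set()).__name__)  # set
```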

Useful operations

Sets were designed for operations such as union, intersection, difference, and others that are typically carried out on mathematical sets. This enables a range of useful operations on datasets.

For example, sets provide an elegant way of answering a number of questions you might ask about two collections. Let’s start with two sets, the languages I’m familiar with and the languages you’re familiar with:

>>> my_languages = {'Python', 'Java', 'C'}
>>> your_languages = {'Python', 'Rust', 'C'}

What languages are we both familiar with?

>>> my_languages & your_languages
{'Python', 'C'}

When placed between sets, & is the intersection operator. It finds all elements that occur in both sets.

If we combine forces and work as a team, what’s the full range of languages that we can work with?

>>> my_languages | your_languages
{'C', 'Java', 'Python', 'Rust'}

When placed between sets, | is the union operator. It combines all elements from both sets, removing any duplicates in the process.

What languages are you familiar with, that I’m not?

>>> your_languages - my_languages
{'Rust'}

The difference operator (-) removes any elements that are in the second set from the first set.
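Each of these operators also has a method form, and there are a few related operations worth knowing about. The sketch below reuses the two sets from this section; the names only_one_of_us and shared are made up for illustration.

```python
my_languages = {'Python', 'Java', 'C'}
your_languages = {'Python', 'Rust', 'C'}

# Method forms of the operators. Unlike the operators, the methods
# accept any iterable, such as a list.
print(my_languages.intersection(['Python', 'Rust', 'C']))
print(my_languages.union(your_languages))
print(your_languages.difference(my_languages))

# Symmetric difference: languages that exactly one of us knows.
only_one_of_us = my_languages ^ your_languages
print(sorted(only_one_of_us))  # ['Java', 'Rust']

# Subset check: is everything in the shared set also in my set?
shared = my_languages & your_languages
print(shared <= my_languages)  # True
```

None of these calls modify the original sets; each one returns a new set.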

Do you know Java?

>>> 'Java' in your_languages
False

Modifying sets

You can add items to sets:

>>> my_languages.add('Rust')
>>> my_languages
{'Python', 'C', 'Java', 'Rust'}

And you can remove items from sets:

>>> my_languages.remove('Java')
>>> my_languages
{'Python', 'C', 'Rust'}
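One detail to be aware of is what happens when the item isn’t in the set. Here’s a brief sketch, starting from the set as it stands after the removal above:

```python
my_languages = {'Python', 'C', 'Rust'}

# remove() raises KeyError when the item is missing.
try:
    my_languages.remove('Java')
except KeyError:
    print("'Java' is not in the set.")

# discard() does nothing when the item is missing.
my_languages.discard('Java')

# update() adds every item from an iterable; duplicates are ignored.
my_languages.update(['Go', 'Rust'])
print(sorted(my_languages))  # ['C', 'Go', 'Python', 'Rust']
```

Use remove() when a missing item signals a bug you want to hear about, and discard() when you just want to make sure the item is gone.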

For a full listing of what you can do with sets, see the official documentation.

Conclusions

Sets don’t usually get as much coverage in intro Python courses as other data structures such as lists and dictionaries. If you don’t need them right away in your own projects, it can be easy to forget they exist.

Like arrays, sets are quite powerful when they fit your use case. If you need to find unique items in a collection, or compare elements across multiple collections, take a moment to consider whether sets will help you write simpler, more efficient code.

Resources

You can find the code files from this post in the mostly_python GitHub repository.


  1. At first glance sets look like dictionaries, since both use curly braces. But Python can tell them apart by their contents: a set is a collection of individual items separated by commas, while a dictionary is a collection of key-value pairs, where each key is joined to its value by a colon and the pairs are separated by commas.