Debugging in Python, part 7: Bugs in third-party libraries
MP 146: What happens when the bug actually is in one of your project's dependencies?
Note: This post is part of an ongoing series about debugging in Python. The posts in this series will only be available to paid subscribers for the first 6 weeks. After that they will be available to everyone. Thank you to everyone who supports my ongoing work on Mostly Python.
Most of the time when our code fails to run, the bug is in code that we've written. Dependencies are usually well tested, so they're rarely the cause of the problem. But no code is perfect, and sometimes a dependency really is the source of the problem. In this post we'll see what it looks like when a bug appears in third-party code.
An interesting traceback
Let's use the same strat_players.py project from the last post. I generated a bug in a .py file within the virtual environment, in code that I knew was on the program's execution path. Consider the resulting traceback:
$ python strat_players.py
Traceback (most recent call last):
  File "strat_players.py", line 5, in <module>
    import player_data
  File "player_data.py", line 3, in <module>
    import pandas as pd
  File ".venv/lib/python3.12/site-packages/pandas/__init__.py", line 49, in <module>
    from pandas.core.api import (
  File ".venv/lib/python3.12/site-packages/pandas/core/api.py", line 9, in <module>
    from pandas.core.dtypes.dtypes import (
  File ".venv/lib/python3.12/site-packages/pandas/core/dtypes/dtypes.py", line 60, in <module>
    from pandas.core.dtypes.inference import (
  File ".venv/lib/python3.12/site-packages/pandas/core/dtypes/inference.py", line 335
    """
    ^
IndentationError: expected an indented block after function definition on line 334
Just like the last post, which focused on a bug in our own code, this traceback includes references to some code we've written and some third-party code.
Let's take the same approach we always have, of starting at the end of the traceback.
  File ".venv/lib/python3.12/site-packages/pandas/core/dtypes/inference.py", line 335
    """
    ^
IndentationError: expected an indented block after function definition on line 334
The exception that's being raised is coming from a .py file inside the virtual environment. It looks like there's an indentation issue in a file inside the pandas source code.
Scanning up the traceback, we see some more references to files in the virtual environment. Let's go back to the beginning, where we can focus on the execution path from our code's perspective:
  File "strat_players.py", line 5, in <module>
    import player_data
The first line of our code that leads to the error is the statement that imports the player_data module. There doesn't seem to be anything wrong with this import statement, so let's keep going:
  File "player_data.py", line 3, in <module>
    import pandas as pd
This is interesting. The statement that imports pandas is part of the traceback. That's unusual, unless we've forgotten to install pandas, misspelled the library's name, or forgotten to activate the virtual environment where pandas was installed. But those kinds of issues would lead to something like a ModuleNotFoundError, not an IndentationError.
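To make that distinction concrete, here's a minimal sketch that triggers both kinds of failure. The module names no_such_dependency_xyz and broken_dep are hypothetical, and the broken module is a tiny stand-in written to a temporary directory, not real pandas code.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for a dependency with an indentation bug:
# the second def line is indented to the same level as its body.
BROKEN_SOURCE = (
    "def first():\n"
    "    return 1\n"
    "\n"
    "    def second():\n"
    "    return 2\n"
)


def import_outcome(module_name: str) -> str:
    """Return the name of the exception raised by importing module_name."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:
        return type(exc).__name__
    return "ok"


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "broken_dep.py").write_text(BROKEN_SOURCE)
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        # This module doesn't exist anywhere on the import path.
        missing = import_outcome("no_such_dependency_xyz")
        # This module exists, but Python can't parse it.
        broken = import_outcome("broken_dep")
    finally:
        sys.path.remove(tmp)

print(missing)  # ModuleNotFoundError
print(broken)   # IndentationError
```

A missing or misnamed library fails because Python can't find it. A library with a bug like the one in this traceback is found just fine; it fails because of what's inside it.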
  File ".venv/lib/python3.12/site-packages/pandas/__init__.py", line 49, in <module>
    from pandas.core.api import (
This section can be a little harder to parse if you haven't looked inside Python packages before. A package's __init__.py file runs when the package is imported, and libraries often use it to manage their internal structure. Here the top-level __init__.py file is trying to import something from an internal api module.
  File ".venv/lib/python3.12/site-packages/pandas/core/api.py", line 9, in <module>
    from pandas.core.dtypes.dtypes import (
That api.py module then tries to import resources from a module named dtypes. These resources help define and manage pandas data structures. If we wanted to see exactly what's being imported, we could open the api.py file and look at the lines that follow line 9, which list everything that's being imported. The next block in the full traceback shows that dtypes.py, in turn, imports from a module named inference.
This brings us back to the last block in the traceback:
  File ".venv/lib/python3.12/site-packages/pandas/core/dtypes/inference.py", line 335
    """
    ^
IndentationError: expected an indented block after function definition on line 334
There's a file called inference.py, which seems to have an indentation issue. Let's go find that file.
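Separating our own frames from frames inside the virtual environment, as we just did by eye, can also be sketched in code. This is only an illustration: the fakelib path and the boom() function are hypothetical, and it assumes dependency code always lives under a site-packages directory.

```python
import traceback

# Hypothetical path, chosen only so the frame looks like dependency code.
FAKE_DEP_PATH = ".venv/lib/python3.12/site-packages/fakelib/core.py"
FAKE_DEP_SOURCE = "def boom():\n    raise ValueError('bug inside the dependency')\n"


def split_frames(exc: BaseException) -> tuple[list[str], list[str]]:
    """Split an exception's traceback frames into project and dependency frames."""
    ours, deps = [], []
    for frame in traceback.extract_tb(exc.__traceback__):
        entry = f"{frame.filename}:{frame.lineno} in {frame.name}"
        # Assumption: anything under site-packages is third-party code.
        if "site-packages" in frame.filename:
            deps.append(entry)
        else:
            ours.append(entry)
    return ours, deps


# Compile the stand-in source under the fake path, so its frames report that path.
namespace: dict = {}
exec(compile(FAKE_DEP_SOURCE, FAKE_DEP_PATH, "exec"), namespace)

try:
    namespace["boom"]()
except ValueError as exc:
    ours, deps = split_frames(exc)

print("our frames:", ours)
print("dependency frames:", deps)
```

One caveat: for a syntax-level error like the IndentationError in this post, the broken file never runs, so it doesn't appear as a frame. Its location is stored on the exception itself, in the filename and lineno attributes.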
Virtual environments aren't magic
When people are newer to Python, or just focused on getting things to work, it's easy to think of virtual environments as some kind of magic that makes Python work when using third-party packages. There's a lot that gets done for us when creating and using virtual environments, but it's helpful to get past that magical feeling and understand that they're just a large set of files and directories that tools like pip and uv manage for us.
You're free to look at any file in a virtual environment. You're also free to modify those files, although you should understand that any changes you make will be undone the next time you reinstall that package, or destroy and rebuild the environment. You shouldn't plan to regularly modify files in a virtual environment. But if you think the bug you're working on might come from one of your project's dependencies, and you want to see if there's a quick fix, there's nothing stopping you from modifying files in your virtual environment. If you mess something up, Git won't help you, because virtual environments usually aren't tracked in version control. But you can always reinstall the library and you'll be right back where you started.
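If you want to see for yourself that an environment is just files and directories, here's a small sketch that lists some of what's installed in the active one. The helper name site_packages_entries() is made up for this example, and it assumes the "purelib" path reported by sysconfig is where your packages are installed, which is the usual case.

```python
import sysconfig
from pathlib import Path


def site_packages_entries(limit: int = 10) -> list[str]:
    """Return the names of the first few entries in site-packages."""
    site_packages = Path(sysconfig.get_paths()["purelib"])
    if not site_packages.is_dir():
        return []
    return sorted(path.name for path in site_packages.iterdir())[:limit]


# Where the active interpreter installs pure-Python packages.
print(sysconfig.get_paths()["purelib"])

# Each entry is an ordinary file or directory you can open in your editor.
for name in site_packages_entries():
    print(name)
```

Run this with your virtual environment active, and the names you see should match the libraries you've installed.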
Reading virtual environment paths
Consider this path from the last block of the traceback:
.venv/lib/python3.12/site-packages/pandas/core/dtypes/inference.py
Most library source code in a virtual environment is in the lib/python3.xx/site-packages/ directory. You can open any .py file you find there in the editor you normally work with. Let's look at the section around line 334 in inference.py:
    def is_hashable(obj) -> TypeGuard[Hashable]:
    """
    Return True if hash(obj) will succeed, False otherwise.

    Some types will pass a test against collections.abc.Hashable but fail when
    they are actually hashed with hash().

    Distinguish between these and other types by trying the call to hash() and
    seeing if they raise TypeError.
    ...
    """
    try:
        hash(obj)
    except TypeError:
        return False
    else:
        return True
Hash functions take some input, such as a Python object, and generate a fixed-size value from that object. Equal objects always produce the same hash, although two different objects can occasionally end up with the same one. Hashes are used for many things, including indexing and checking for changes against a known value. pandas makes extensive use of hashing in managing the data you're working with. In its data processing work, pandas needs to figure out if objects are hashable before they can be used in certain ways. The is_hashable() function here is called before trying to hash objects.
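The docstring's point about collections.abc.Hashable is easy to see with a small example. The function below is a simplified copy of the one shown above, without the TypeGuard annotation, so it runs without pandas installed.

```python
from collections.abc import Hashable


def is_hashable(obj) -> bool:
    """Simplified copy of the pandas function: try hash() and see if it fails."""
    try:
        hash(obj)
    except TypeError:
        return False
    else:
        return True


# A tuple is nominally hashable, but this one contains a list, which is not.
candidate = ([1, 2],)

print(isinstance(candidate, Hashable))  # True
print(is_hashable(candidate))           # False
print(is_hashable((1, 2)))              # True
```

The isinstance() check only asks whether the type defines a hash method. Actually calling hash() is the only way to find out whether this particular object can be hashed.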
This is exactly the kind of small function that's often executed when we run library code. Here we can see a simple error: the def statement is indented at the same level as the body of the function. The fix is to unindent the function definition:
def is_hashable(obj) -> TypeGuard[Hashable]:
    """
    Return True if hash(obj) will succeed, False otherwise.
    ...
Now the program runs successfully again:
$ python strat_players.py
$
Most bugs in library code that actually make it into a public release aren't this simple, but it can happen. I once caught a fairly simple bug in a project that had several thousand stars on GitHub, where the author and contributors hadn't yet developed a comprehensive test suite.
It's important to trust what a traceback tells you, and know that the Python code that runs when a library is called is just like any other Python code. Most bugs involving libraries tend to be more subtle logical errors, which we'll get to shortly. It's important to look at bugs with more straightforward resolutions before getting into more difficult and subtle bugs.
Beyond the virtual environment
If this were an actual bug in pandas, our fix would disappear the next time we update pandas, or the next time we recreate the virtual environment. So how should you deal with a bug like this? A detailed answer to that question is beyond the scope of this post, but here are a few thoughts to guide you in that kind of situation:
- Check if the bug has already been reported. When you find a bug in a popular library, there's a good chance someone has already reported it. But that's not always the case, especially if you're using the library in a less common way. Either way, it's good to scan recent issues on the project's GitHub page, and see if anyone has already reported it. If they have, and you have new information about the bug, make a comment. You can also watch the bug to see when the fix is likely to make it into a release. If no one has identified a fix, you can share what's worked for you and consider submitting a PR.
- Report the bug. If no one has reported the bug yet, go ahead and open an issue. Share the error that was raised, and the fix that worked for you.
- Install the updated version of the library when the fix is released. If the bug is being addressed, make sure you install the updated version of the library once the fix makes it into a public release.
- Clone the library outside your project, and make an editable install. If the project is not likely to be updated soon, you can maintain a fork of the project on your own. This isn't a great long-term solution, but it can buy you time until the public library has a new release. To do this, you usually clone the library's repository outside your own project. Then you make an editable install, with a command like this:
  $ pip install -e /path/to/fork/of/pandas/
  Instead of copying pandas to the site-packages/ directory in your virtual environment, the environment will run the code from the external path. You can then manage your copy of the library with Git, outside of your main project. You can use this setup to push a PR to the library's GitHub repository. In extreme cases, this is how projects are forked when maintainers abandon heavily-used projects, for all kinds of reasons.
Practicing
It can be a bit hard to practice finding and fixing bugs in libraries, because libraries tend to have large codebases. When we run our own projects, we usually use just a narrow slice of the overall library's code. So if you introduce a single bug into a library like pandas, it's quite unlikely that your specific execution path will hit that code.
You can get a sense of exactly which library code is being run by using Python's trace module. Here's what a simple trace call looks like:
$ python -m trace -l strat_players.py
functions called:
filename: python3.12/__future__.py, modulename: __future__, funcname: <module>
filename: python3.12/__future__.py, modulename: __future__, funcname: _Feature
filename: python3.12/__future__.py, modulename: __future__, funcname: _Feature.__init__
...
The -l flag tells trace to list all functions that are called during execution. This is just the first few lines of output; the full output has over 4,000 lines.
You can use a tool like grep to only show lines you're interested in:
$ python -m trace -l strat_players.py | grep "pandas/"
filename: pandas/__init__.py, modulename: __init__, funcname: <module>
filename: pandas/_config/__init__.py, modulename: __init__, funcname: <module>
filename: pandas/_config/__init__.py, modulename: __init__, funcname: using_copy_on_write
...
This filters the output, only showing lines that include pandas/ in the path. This still generates over 1,000 lines, but you can scroll through and see which pandas files are being used, and which functions are being called.
You can do one more level of filtering, to show only the functions in a specific library file that are being called:
$ python -m trace -l strat_players.py | grep "pandas/" | grep inference.py
...
filename: pandas/core/dtypes/inference.py, modulename: inference, funcname: is_dataclass
filename: pandas/core/dtypes/inference.py, modulename: inference, funcname: is_dict_like
filename: pandas/core/dtypes/inference.py, modulename: inference, funcname: is_hashable
filename: pandas/core/dtypes/inference.py, modulename: inference, funcname: is_named_tuple
This shows only the functions in inference.py that are called when running strat_players.py. If you insert a bug into any of the functions listed here, you'll be much more likely to see a traceback the next time you run the program.
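If you'd rather do this filtering in Python than with grep, the trace module can also be used as a library. The following is a minimal sketch, not a polished tool. It traces a small stand-in workload that uses json, since that ships with Python, and it reads the calledfuncs attribute of the results object, which as far as I know is an implementation detail and not part of the documented API.

```python
import json
import trace


def workload():
    """Stand-in for a real program; it calls into the json library."""
    return json.dumps({"players": [1, 2, 3]})


def called_functions(func, path_fragment: str) -> list[str]:
    """Run func under trace, and list called functions whose file path matches."""
    tracer = trace.Trace(count=False, trace=False, countfuncs=True)
    tracer.runfunc(func)

    # Assumption: calledfuncs holds (filename, modulename, funcname) tuples.
    called = tracer.results().calledfuncs
    return sorted(
        {
            funcname
            for filename, modulename, funcname in called
            if path_fragment in filename
        }
    )


for name in called_functions(workload, "json"):
    print(name)
```

This is the same idea as the grep pipeline: run the code once, record every function that's called, and keep only the ones from the library you care about.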
I've been developing py-bugger as I write this series. Currently, you can use the --target-file and --target-lines CLI arguments to induce bugs in specific blocks of code. I may add a feature that lets you name the functions where you want a bug to be inserted; that would make py-bugger much more useful for practicing debugging issues in dependencies.
Conclusions
Bugs are inevitable when building most real-world projects. Most of the bugs we run into are a result of our own mistakes. Occasionally, however, we'll see issues that are a result of errors that make it into production in a public third-party library. It's important to focus on our own code, but also keep in mind the possibility that the error comes from a dependency. If that is the case, it's good to know how to deal with it. Remember that all the Python code in a virtual environment is just like any other Python code; you're free to modify it, and see if you can find a fix for the issue you're facing.
In the next post, we'll move away from bugs that lead to exceptions. We'll look at logical errors, which allow your programs to finish running, but cause them to generate incorrect output.