OOP in Python, part 15: Class structure in pathlib

MP 59: How file paths are modeled in pathlib.

Note: This post is part of a series about OOP in Python. The previous post discussed how classes are used to implement exceptions in Python. The next post looks at the class structure in the Matplotlib library.


In the last post we looked at the Exception class hierarchy, much of which is implemented in C. In this post we’ll look at a much newer library, pathlib, which is implemented almost entirely in Python.

The need for pathlib

In the old days of Python, people used strings to represent file paths. This was problematic for a number of reasons. One of the most significant issues arose when dealing with different operating systems.

As a brief example, consider a program that shows a file’s location before doing any other work:

path = 'static/note_images/a1.png'
print(f"File location: {path}")

This is taken from a project I’ve been working on recently that helps people learn the grand staff when playing piano. Here’s the image file:

The second-lowest A note on a piano, commonly referred to as A1.

This simple program works on my macOS system:

File location: static/note_images/a1.png

But file paths on Windows use backslashes, so that path is different on Windows:

path = 'static\note_images\a1.png'
print(f"File location: {path}")

This path looks like it should be correct on Windows, but here’s the output:

File location: static
ote_images1.png

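This program falls apart because Python treats a backslash in a regular string as the start of an escape sequence. The sequence \n is interpreted as a newline, and \a is interpreted as the ASCII bell character, which is why the a before 1.png disappears as well.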

To fix this, you’d see paths written like this on Windows:

path = 'static\\note_images\\a1.png'
print(f"File location: {path}")

This works because the first backslash escapes the second one, so each pair produces a single backslash in the resulting string:

File location: static\note_images\a1.png

However, this is a pretty inelegant and inefficient way of handling something as important as file and directory paths.
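A raw string is another common way to write the same Windows path without doubling every backslash. The short sketch below (the variable names are just for illustration) checks that both spellings produce the same string, and that an escaped backslash is a single character:

```python
# Two spellings of the same Windows-style path.
escaped = 'static\\note_images\\a1.png'   # each \\ is one escaped backslash
raw = r'static\note_images\a1.png'        # raw string: backslashes are literal

print(escaped == raw)   # True
print(escaped)          # static\note_images\a1.png
print(len('\\'))        # 1
```

Either spelling still leaves you with a plain string that only makes sense on Windows, which is the larger problem.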

Beyond just making paths look consistent across OSes, there are a number of things we’d like to do with fully-featured path objects that we can’t do with strings:

  • Get parts of a path: root, parent, filename, file extension.
  • Check if a file or directory exists.
  • Find out if a path represents a file or a directory.
  • Read from a file.
  • Write to a file.
  • Many more common file and directory operations.

Python has a lot of resources in the os module for working with files and directories. But pathlib, which represents paths as dedicated objects, has made it much easier and more intuitive to work with paths in Python, in a way that facilitates cross-platform functionality.
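As a rough sketch of what the operations listed above look like with pathlib’s documented API, the following example works inside a temporary directory. The file and directory names are made up for the demonstration:

```python
import tempfile
from pathlib import Path

# Work inside a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    note_path = Path(tmp_dir) / 'note_images' / 'a1.txt'

    # Create the parent directory, then write to and read from the file.
    note_path.parent.mkdir(parents=True)
    note_path.write_text('A1')
    contents = note_path.read_text()

    # Get parts of the path.
    name = note_path.name                  # 'a1.txt'
    suffix = note_path.suffix              # '.txt'
    parent_name = note_path.parent.name    # 'note_images'

    # Ask the filesystem about the path.
    exists = note_path.exists()
    is_file = note_path.is_file()
    parent_is_dir = note_path.parent.is_dir()

print(name, suffix, parent_name, contents)
print(exists, is_file, parent_is_dir)
```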

A simple example with pathlib

Let’s see what the previous example looks like if we use pathlib instead of strings:

from pathlib import Path

path = Path('static/note_images/a1.png')
print(f"File location: {path}")

We import the Path class from the pathlib module. We then make a Path object, using forward slashes. This file generates the same output as the previous example on macOS:

File location: static/note_images/a1.png

That’s good, but here’s the important part. On Windows the same program file, including the forward slashes, generates this output:

File location: static\note_images\a1.png

The file path is no longer a string that knows nothing about file paths and operating systems. Instead it’s a full Path object, which has lots of functionality built in that makes it aware of common file operations, including how they should be represented on each OS. Here the variable path is being formatted appropriately for whichever operating system the program is running on.
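If you don’t have both operating systems at hand, you can approximate this comparison with pathlib’s PurePosixPath and PureWindowsPath classes, which can be instantiated on any OS because they never touch the filesystem. This is only a sketch of the formatting behavior, not of everything Path does:

```python
from pathlib import PurePosixPath, PureWindowsPath

# The same forward-slash input, interpreted under each set of path rules.
posix_path = PurePosixPath('static/note_images/a1.png')
windows_path = PureWindowsPath('static/note_images/a1.png')

print(f"File location: {posix_path}")    # File location: static/note_images/a1.png
print(f"File location: {windows_path}")  # File location: static\note_images\a1.png
```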

The pathlib module does a lot of work behind the scenes to check what the current operating system is, and how things should be handled given that information. Let’s see how OOP principles were used to implement this kind of functionality.

Path objects

Since we just made a Path object, let’s first see how that class is implemented. Here’s the definition of the class, along with part of its docstring:

class Path(PurePath):
    """PurePath subclass that can make system calls.

    Depending on your system, instantiating a Path will return
    either a PosixPath or a WindowsPath object...
    """

This is interesting already! The class Path inherits from PurePath, so we’ll take a look at that class in a bit. But also, calling Path() doesn’t return an instance of Path. Instead it either returns an instance of PosixPath, or an instance of WindowsPath. We’ll have to look at those classes as well.

A real-world use of __new__()

Let’s first see how calling Path() returns an instance of a different class. We saw in an earlier post that __new__() is the method responsible for creating new instances of a class, so let’s look at that method:

class Path(PurePath):
    ...

    def __new__(cls, *args, **kwargs):
        if cls is Path:
            cls = WindowsPath if os.name == 'nt' else PosixPath
        self = cls._from_parts(args)
        if not self._flavour.is_supported:
            raise NotImplementedError(...)
            
        return self

The __new__() method gets a reference to the type of class that’s being created, which is passed to the cls argument. The code shown here checks the value of os.name. If this value is 'nt' then we’re on Windows, and it changes the cls type to WindowsPath. If the value is anything else, it changes the cls type to PosixPath. This is the type that’s appropriate for macOS and most Linux systems.
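To experiment with this pattern outside of pathlib, here is a minimal sketch using hypothetical classes (Config, PosixConfig, and WindowsConfig are invented for this example and have nothing to do with pathlib). It swaps in a subclass inside __new__() in the same way:

```python
import os


class Config:
    """Hypothetical base class; calling Config() returns an OS-specific subclass."""

    def __new__(cls, *args, **kwargs):
        # Only redirect when the base class itself is requested.
        if cls is Config:
            cls = WindowsConfig if os.name == 'nt' else PosixConfig
        # Create a bare instance of the chosen class; __init__ runs afterwards.
        return object.__new__(cls)

    def __init__(self, name):
        self.name = name


class PosixConfig(Config):
    line_ending = '\n'


class WindowsConfig(Config):
    line_ending = '\r\n'


config = Config('demo')
print(type(config).__name__)   # PosixConfig or WindowsConfig, depending on the OS
print(config.name)             # demo
```

Because the object returned from __new__() is still an instance of Config, Python goes on to call __init__() on it as usual.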

Path methods: path.cwd()

Many of the methods defined in Path are wrappers around calls to functions in the os module, or calls to other methods within Path itself. The overall effect is to make it easier for end users to do what they need to with path objects.

For example, here’s the implementation of the path.cwd() method:

class Path(PurePath):
    ...

    @classmethod
    def cwd(cls):
        """Return a new path pointing to the current working directory
        (as returned by os.getcwd()).
        """
        return cls(os.getcwd())

This is a one-line method, but it does a lot to simplify things for end users. It’s a thin wrapper around the os.getcwd() function. That might not seem particularly beneficial, but take a look at the output of path.cwd() compared to os.getcwd():

>>> path.cwd()
PosixPath('/Users/eric/.../mp59_oop15')
>>> os.getcwd()
'/Users/eric/.../mp59_oop15'

The main difference here is what gets returned. With path.cwd(), you get back a Path object that’s aware of how your system works. With os.getcwd(), you’re stuck with a string.

This helps explain the last line of the path.cwd() method:

return cls(os.getcwd())

The call to os.getcwd() returns a string representation of a path. If you remember that cls is a reference to the current class type, you can start to see that wrapping cls() around that return value gives us back a new Path object. Except it won’t be a Path object; it will be a WindowsPath object on Windows, and a PosixPath object on macOS and Linux.

This is pretty interesting to think about. path.cwd() is a method that, for its return value, creates an instance of its own class. It’s a little Inception-like, for those who’ve seen that movie.
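The same idea can be tried in isolation. In this sketch, Note and PianoNote are hypothetical classes made up for illustration; the point is only that cls refers to whichever class the method was called on, so subclasses get instances of themselves:

```python
class Note:
    """Hypothetical class with a classmethod that builds an instance."""

    def __init__(self, name):
        self.name = name

    @classmethod
    def lowest_a(cls):
        # cls is the class the method was called on, not necessarily Note.
        return cls('A0')


class PianoNote(Note):
    pass


print(type(Note.lowest_a()).__name__)        # Note
print(type(PianoNote.lowest_a()).__name__)   # PianoNote
```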

The difference is even more noticeable on Windows:

>>> path.cwd()
WindowsPath('C:/Users/eric/.../mp59_oop15')
>>> os.getcwd()
'C:\\Users\\eric\\...\\mp59_oop15'

The Path object representing the current working directory on Windows is identical to what we get on other systems, except for the overall class type. All the custom information and behavior needed for that OS is contained in the class. The representation of the path, in Python code, is consistent across all systems.

The return value from os.getcwd() is a plain string, displayed here with doubled backslashes because that is how Python’s repr shows a single backslash. More important, though, is what you can do with the return value. When using the methods from pathlib, you can continue to work with what’s returned, because it’s another path object.
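Here is a small sketch of what “continuing to work with what’s returned” can look like. The static and note_images directory names are reused from the earlier example and don’t need to exist for this to run:

```python
import os
from pathlib import Path

cwd_path = Path.cwd()       # a PosixPath or WindowsPath, depending on the OS
cwd_string = os.getcwd()    # a plain str

print(type(cwd_path).__name__, type(cwd_string).__name__)
print(cwd_path == Path(cwd_string))

# The path object can be extended and inspected without any string handling.
images_dir = cwd_path / 'static' / 'note_images'
print(images_dir.name)          # note_images
print(images_dir.parent.name)   # static
```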

Path methods: path.exists()

Let’s look at how one of the most useful methods in the Path class is implemented. The path.exists() method tells you whether a file or directory exists, so you can verify it exists before taking any other actions. It saves you from having to use try-except blocks to handle the possibility of missing files or directories.

Here’s the method:

class Path(PurePath):
    ...

    def exists(self):
        """
        Whether this path exists.
        """
        try:
            self.stat()
        except OSError as e:
            if not _ignore_error(e):
                raise
            return False
        except ValueError:
            # Non-encodable path
            return False
        return True

This is another thin wrapper, around another method in the same class. The path.stat() method makes a call to os.stat(), which is a wrapper for a system-level stat call. Many people don’t know about stat, and don’t necessarily need to if all they really want to know is whether a path exists or not. The path.exists() method makes the stat() call, and interprets the results in a useful way.

All of this results in a more intuitive API for working with paths:

>>> path
PosixPath('static/note_images/a1.png')
>>> path.exists()
True
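To try exists() without depending on any particular project layout, this sketch creates an empty placeholder file in a temporary directory. The file name is reused from the example above, and the file contains no real image data:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    image_path = Path(tmp_dir) / 'a1.png'

    existed_before = image_path.exists()   # nothing has been written yet
    image_path.write_bytes(b'')            # create an empty placeholder file
    existed_after = image_path.exists()

    # stat() is the call that exists() relies on; it also reports the file size.
    size = image_path.stat().st_size

print(existed_before, existed_after, size)   # False True 0
```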

Let’s look at the OS-specific path classes, and then come back to the PurePath class.

The PosixPath and WindowsPath classes

Here’s the entire implementation of PosixPath, which is what you get when you call Path() on most non-Windows systems:

class PosixPath(Path, PurePosixPath):
    """Path subclass for non-Windows systems.

    On a POSIX system, instantiating a Path should return this object.
    """
    __slots__ = ()

This is a small class that combines the behavior of the Path and PurePosixPath classes. If you haven’t seen __slots__ before, it’s a way of restricting the set of attributes that can be defined for an instance of a class. The empty tuple here means PosixPath adds no attributes of its own; its instances are limited to the slots declared by its parent classes, so you can’t attach arbitrary new attributes to them.
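A minimal sketch of how an empty __slots__ behaves, using two hypothetical classes (Base and Child are invented for this example):

```python
class Base:
    # Instances may only have a 'value' attribute.
    __slots__ = ('value',)

    def __init__(self, value):
        self.value = value


class Child(Base):
    # Adds no new slots, so instances still have no __dict__.
    __slots__ = ()


child = Child(1)
child.value = 2   # allowed: 'value' is a slot inherited from Base

try:
    child.extra = 3   # not a declared slot
except AttributeError as error:
    rejected = True
    print(f"Rejected: {error}")
else:
    rejected = False
```

If Child left out __slots__ entirely, its instances would get a __dict__ again and the assignment to extra would succeed.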

WindowsPath has an almost identical structure:

class WindowsPath(Path, PureWindowsPath):
    """Path subclass for Windows systems.

    On a Windows system, instantiating a Path should return this object.
    """
    __slots__ = ()

    def is_mount(self):
        raise NotImplementedError("Path.is_mount() is unsupported...")

The only difference here is the is_mount() method, which overrides a parent class’s is_mount() method. This makes sure anyone who calls path.is_mount() on Windows gets an appropriate message that the method isn’t available on Windows.

Now let’s move back up the hierarchy, and see what the “pure” path classes look like.

The PurePosixPath and PureWindowsPath classes

Here are two of the classes that PosixPath and WindowsPath inherit from:

class PurePosixPath(PurePath):
    """PurePath subclass for non-Windows systems..."""
    _flavour = _posix_flavour
    __slots__ = ()

class PureWindowsPath(PurePath):
    """PurePath subclass for Windows systems..."""
    _flavour = _windows_flavour
    __slots__ = ()

Both of these are thin classes in the hierarchy. Each one defines a _flavour attribute, either _posix_flavour or _windows_flavour. These are the pieces that help determine things like whether a forward slash or a backslash should be used when formatting paths.

What is _flavour?!

The underscore in _flavour tells us it’s not meant to be used outside the class. But we’re trying to understand the inner workings of the module, so let’s figure out where it’s defined.

Two lines in the middle of pathlib.py stand out because they aren’t part of any class:

_windows_flavour = _WindowsFlavour()
_posix_flavour = _PosixFlavour()

These two lines define one instance of the class _WindowsFlavour, and one instance of the class _PosixFlavour.

Here are the first parts of those two classes:

class _WindowsFlavour(_Flavour):
    sep = '\\'
    altsep = '/'
    has_drv = True
    pathmod = ntpath

    is_supported = (os.name == 'nt')
    ...

class _PosixFlavour(_Flavour):
    sep = '/'
    altsep = ''
    has_drv = False
    pathmod = posixpath

    is_supported = (os.name != 'nt')
    ...

Here you can start to see how paths are handled differently on each OS. The attribute sep is short for separator, which is the path separator Python needs to use on each OS. On Windows, that’s a single backslash, which has to be written as '\\' in Python source because the backslash is also the escape character. On non-Windows systems, that’s a single forward slash, /. You can also see how the is_supported attribute is set, based on the value of os.name.
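The pathmod values in these classes, ntpath and posixpath, are standard library modules that can be imported on any OS. As a quick check of the separator values, here is a sketch that reads them from those modules directly rather than from pathlib’s private classes:

```python
import ntpath
import posixpath

# The Windows separator is one character, even though its repr shows two.
print(repr(ntpath.sep), len(ntpath.sep))   # '\\' 1
print(repr(ntpath.altsep))                 # '/'
print(repr(posixpath.sep))                 # '/'

# Each module joins path components with its own separator.
print(ntpath.join('static', 'note_images', 'a1.png'))      # static\note_images\a1.png
print(posixpath.join('static', 'note_images', 'a1.png'))   # static/note_images/a1.png
```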

I won’t include it here directly, but if you look at the source for _WindowsFlavour there are a number of longer comments that document what people have learned about handling paths on Windows systems over the course of developing and maintaining the pathlib module. One longer comment begins: “Interesting findings about extended paths.” The people who develop and maintain these libraries don’t start out knowing everything about each operating system. They’ve just carefully defined what they want the library to be able to do, researched how to make that happen, and documented their findings so that others don’t have to repeat all that work.

The _Flavour class

Both _WindowsFlavour and _PosixFlavour inherit from _Flavour. That base class provides some functionality for dealing with paths on specific operating systems. I’ll show one small part of that class:

class _Flavour(object):
    """A flavour implements a particular (platform-specific)
    set of path semantics.
    """

    def __init__(self):
        self.join = self.sep.join

    ...

If you haven’t come across it yet, join() is a built-in string method. It lets you do things like this:

>>> flavors = ['chocolate', 'vanilla', 'strawberry']
>>> ', '.join(flavors)
'chocolate, vanilla, strawberry'

The join() method lets you specify a separator, in this case a comma followed by a space. It then joins all the items in a sequence into one string, using that separator.

Consider this line of code from _Flavour.__init__():

self.join = self.sep.join

This doesn’t override anything on str. It stores the separator string’s bound join() method as an attribute of the flavour instance, so calling join() on a flavour always uses the separator that’s appropriate for that flavour.
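To see what assigning self.sep.join to self.join does in isolation, here is a minimal sketch with hypothetical classes (SlashJoiner and BackslashJoiner are made up for this example and are not part of pathlib):

```python
class SlashJoiner:
    sep = '/'

    def __init__(self):
        # self.sep.join is a bound method of the separator string,
        # so the separator travels along with it.
        self.join = self.sep.join


class BackslashJoiner(SlashJoiner):
    sep = '\\'


parts = ['static', 'note_images', 'a1.png']

print(SlashJoiner().join(parts))       # static/note_images/a1.png
print(BackslashJoiner().join(parts))   # static\note_images\a1.png
```

The subclass only changes the class attribute sep; the inherited __init__() picks up the new separator automatically.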

Most of this code is meant for internal use, but we can play around with some of these attributes and methods if we understand how everything fits together.

Let’s explore this in a terminal session, starting on macOS. Here’s the path we’ve been working with:

>>> path = Path('static/note_images/a1.png')

Now let’s see its _flavour attribute:

>>> path._flavour
<pathlib._PosixFlavour object at 0x100a825d0>

It’s an instance of _PosixFlavour. Now let’s see the separator:

>>> path._flavour.sep
'/'

A path object doesn’t have a sep attribute. It has a _flavour attribute, which itself has a sep attribute. To get a path’s separator, you have to work through its _flavour attribute.[1]

Path objects have an attribute _parts, which consists of each element in the path:

>>> path._parts
['static', 'note_images', 'a1.png']

Putting all this together, we can rebuild a path from its parts by calling the join() method from _flavour:

>>> path._flavour.join(path._parts)
'static/note_images/a1.png'

Notice that the parts were put back together using a forward slash, without us ever specifying what the separator should be. That separator was fixed when _Flavour.__init__() set self.join to self.sep.join.

Much of this looks the same on Windows, but key parts are different:

>>> path = Path('static/note_images/a1.png')
>>> path._flavour
<pathlib._WindowsFlavour object at 0x00000195E423C0D0>
>>> path._flavour.sep
'\\'
>>> path._parts
['static', 'note_images', 'a1.png']
>>> path._flavour.join(path._parts)
'static\\note_images\\a1.png'

We start out with the same path object. The value of _flavour is a _WindowsFlavour object, and the separator is a single backslash, displayed as '\\' because the repr escapes it. The parts of the path are identical, as they should be. Calling join() generates a path using the OS-specific backslash separator.
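Names that start with an underscore, such as _flavour and _parts, are private and may change between Python versions. A similar comparison can be sketched with the documented parts attribute and the pure path classes, which also lets you see both results on a single machine:

```python
from pathlib import PurePosixPath, PureWindowsPath

for path_class in (PurePosixPath, PureWindowsPath):
    path = path_class('static/note_images/a1.png')
    # Rebuild the path from its parts; the class supplies the separator.
    rebuilt = path_class(*path.parts)
    print(path_class.__name__, path.parts, str(rebuilt))

posix_parts = PurePosixPath('static/note_images/a1.png').parts
windows_parts = PureWindowsPath('static/note_images/a1.png').parts
print(posix_parts == windows_parts)   # True
```

Note that the public parts attribute is a tuple rather than a list.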

This is exactly how pathlib works internally. It looks complex from the outside, but it’s a complexity that lets all the OS-agnostic and OS-specific parts work together efficiently, and maintainably. End users have to think very little about OS-specific implementations.

The PurePath class

All this brings us back to the PurePath class. Three classes (Path, PurePosixPath, and PureWindowsPath) all inherit from PurePath. Let’s take a look at its implementation.

Here’s the first part of PurePath:

class PurePath(object):
    """Base class for manipulating paths without I/O...
    """
    __slots__ = (
        '_drv', '_root', '_parts',
        '_str', '_hash', '_pparts', '_cached_cparts',
    )

    ...

The PurePath class implements path behaviors that aren’t related to input or output actions. These include actions like getting the parts of a path, getting an OS-specific representation of the path, and more.

The __slots__ attribute shows us the small set of attributes a PurePath object can have. One of these is the _parts attribute we just looked at.

Let’s close this out by looking at some of the methods in PurePath. Here’s the as_posix() method:

class PurePath(object):
    ...

    def as_posix(self):
        """Return the string representation of the path with forward (/)
        slashes."""
        f = self._flavour
        return str(self).replace(f.sep, '/')

Even if you’re on Windows, it’s sometimes useful to represent a path with forward slashes. This method checks the path’s _flavour attribute, builds a string representation of the path, and then replaces the OS-specific separator with a single forward slash.
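Here is a short sketch of as_posix() in action. It uses PureWindowsPath so the Windows behavior can be seen on any OS:

```python
from pathlib import PureWindowsPath

windows_path = PureWindowsPath('static/note_images/a1.png')

print(str(windows_path))         # static\note_images\a1.png
print(windows_path.as_posix())   # static/note_images/a1.png
```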

Here’s a method that returns the path’s file extension, if there is one:

class PurePath(object):
    ...

    @property
    def suffix(self):
        """
        The final component's last suffix, if any.
        This includes the leading period. For example: '.txt'
        """
        name = self.name
        i = name.rfind('.')
        if 0 < i < len(name) - 1:
            return name[i:]
        else:
            return ''

This method gets the name attribute of the path, which is a string. It then uses rfind() to find the rightmost dot in name. The check 0 < i < len(name) - 1 rules out names with no dot, names whose only dot is the first character (such as .bashrc), and names that end with a dot. If the check passes, the method uses that index to return everything from the final dot to the end of the string. So for a path like static/note_images/a1.png, name would be 'a1.png'. It would find that the dot is the third character in the string, and return everything from that dot to the end of the string:

>>> path = Path('static/note_images/a1.png')
>>> path.suffix
'.png'
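To explore the edge cases, the logic shown above can be restated as a standalone function. Here last_suffix() is a hypothetical helper that mirrors the listing; it is not part of pathlib, and the real property may behave differently in other Python versions:

```python
def last_suffix(name):
    """Mirror the suffix logic shown above for a bare file name."""
    i = name.rfind('.')
    # The dot must not be the first or the last character of the name.
    if 0 < i < len(name) - 1:
        return name[i:]
    return ''


for name in ('a1.png', 'archive.tar.gz', '.bashrc', 'notes.', 'README'):
    print(repr(name), '->', repr(last_suffix(name)))
```

Only the first two names produce a suffix ('.png' and '.gz'); the other three return an empty string.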

There are many other methods in PurePath, all of which are meant to help make common file and directory tasks intuitive and simple in a cross-platform manner.

Conclusions

It can seem hard to know what to take away from some of these detailed looks at complicated implementations. You might be asking yourself, “How would I ever design a hierarchy like this?!”

The big takeaway for me is not to make a habit of building a hierarchy like this for its own sake. Instead, think carefully about what you want to accomplish, and list out all the ways your codebase will be used. What kinds of instances will people want to make? How will they use those instances? The goal is to build a hierarchy that makes it intuitive for end users to do the work they need to do, and to develop a library that can be maintained so it will do that work reliably and correctly for the foreseeable future. This all holds even if you’re the only end user at the moment.

This diagram representing the class hierarchy is shown at the top of the pathlib docs:

The pathlib class hierarchy. PurePosixPath, PureWindowsPath, and Path all inherit from PurePath. PosixPath inherits from PurePosixPath and Path, while WindowsPath inherits from PureWindowsPath and Path. This hierarchy is organized to maintain a clear separation between I/O and non-I/O behavior, and Windows and non-Windows behavior.
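The relationships in that diagram can be checked from Python with issubclass(). This sketch only inspects the classes and never instantiates them, so it should run on any OS:

```python
from pathlib import (
    Path, PosixPath, PurePath, PurePosixPath, PureWindowsPath, WindowsPath,
)

# (child, parent) pairs taken from the diagram described above.
relationships = [
    (PurePosixPath, PurePath),
    (PureWindowsPath, PurePath),
    (Path, PurePath),
    (PosixPath, PurePosixPath),
    (PosixPath, Path),
    (WindowsPath, PureWindowsPath),
    (WindowsPath, Path),
]

for child, parent in relationships:
    print(f"{child.__name__} inherits from {parent.__name__}: "
          f"{issubclass(child, parent)}")

# The Windows and non-Windows branches stay separate.
print(issubclass(PosixPath, PureWindowsPath))   # False
```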

I hope this discussion has helped you understand a hierarchy such as this one a little better, and gain some understanding of how pathlib works as well.


  1. The use of _Flavour in pathlib is a great example of composition, which I’ll discuss in more detail before closing out this series.