Cleaning up a long series of commits

MP 127: How do you make sure you've cleaned up all your diagnostic code?

I've been doing a bunch of refactoring work on a moderately sized project recently. I'm always looking for ways to work more efficiently, and use tools like Git better. It's a really satisfying aspect of focusing on just one project for a while.

When I'm doing nontrivial refactoring work, I often comment out existing code instead of removing it. That way I can see what was working, while implementing a simpler or more efficient approach. I always aim to remove that commented code before merging the refactored code back to the main branch.

Something I've often struggled with is making sure I've removed all the commented code before merging my changes back to the main branch. Working alone, I usually don't have anyone else to do a code review, so I need a practice that reliably identifies commented code. It turns out a single git diff call can do this quite effectively.

My refactoring workflow

I follow a workflow that's probably similar to what many people use when doing refactoring:

  • Make sure my main branch is fully up to date with remote.
  • Make a new branch, named for the specific refactoring work I'm focusing on.
  • Make as many commits as I want while refactoring, including broken states and states with lots of diagnostic code that shouldn't make it into production.
  • Comment out the lines of code that I'm refactoring, and write new code just above or below the commented code.
  • Clean up all commented and diagnostic code when the refactoring work is done.
  • Make a pull request, and merge using squash and merge.

This workflow lets me make fine-grained commits during the development and refactoring stages, while ending up with a clean history in the project.

Example: Simplifying a plugin interface

The project I'm working on, as many of you know, focuses on automating Django deployments. The project is made up of a core that inspects the user's project, and a set of plugins that handle configuration for different hosting platforms. Getting that core-plugin interface correct is critical to the success of this project.

The core-plugin interface is defined in a file called hookspecs.py:

import pluggy

hookspec = pluggy.HookspecMarker("simple_deploy")

@hookspec
def automate_all_supported():
    """Specify whether --automate-all is supported on the specified platform."""

@hookspec
def simple_deploy_get_automate_all_msg():
    """Get a platform-specific message for confirming the --automate-all option."""

@hookspec
def simple_deploy_get_platform_name():
    """Get the name of the platform that's being deployed to."""

@hookspec
def simple_deploy_deploy(sd_config):
    """Carry out all platform-specific configuration and deployment work."""
hookspecs.py

If you're unfamiliar with how plugins are structured, these @hookspec blocks specify the functions that plugins can implement. The highlighted functions all pass single pieces of information from the plugin back to core: whether the plugin supports fully automated deployments, a message about how the plugin handles automated deployments, and the name of the hosting platform.

I wanted to refactor this, and have the plugins return a single config object instead of a growing list of individual resources. Here's what hookspecs.py looked like in the middle of my refactoring work:

import pluggy

hookspec = pluggy.HookspecMarker("simple_deploy")

@hookspec
def simple_deploy_get_plugin_config():
    """Get plugin-specific attributes required by core."""

# @hookspec
# def automate_all_supported():
#     """Specify whether --automate-all is supported on the specified platform."""

# @hookspec
# def simple_deploy_get_automate_all_msg():
#     """Get a platform-specific message for confirming the --automate-all option."""

# @hookspec
# def simple_deploy_get_platform_name():
#     """Get the name of the platform that's being deployed to."""

@hookspec
def simple_deploy_deploy(sd_config):
    """Carry out all platform-specific configuration and deployment work."""
hookspecs.py

Now hookspecs.py only requires one function for returning resources to core, simple_deploy_get_plugin_config(). Any additional information that needs to be shared can be added to the config object this function returns, instead of continuing to change the core-plugin interface. That makes breaking changes between core and plugins much less likely. (I'm not sure all these functions need to be prefixed with simple_deploy_, so another refactor may simplify these function names.)

I kept the original hookspecs in this file, commented out, until the new function was working. It was helpful to see what these function names were when I was looking for the places they were used in the rest of the codebase.

Cleaning up commented code

This was a somewhat complicated refactor, because it affected several files in core and several files in each of the three plugins that currently exist. When it came time to clean everything up before merging this refactoring work back to main, it was easy to clean up hookspecs.py. The entire file fits on a single screen, and it's where all this refactoring work started.

However, I wanted to make sure to clean up all the other commented code as well. Some of the files in this project are over 500 lines long, and some of the refactoring involved changing single lines of code. Those single lines are really helpful to keep around (commented out) during refactoring, but they're the kind of clutter that accumulates over time if you don't clean them up right away. And when they linger, it's harder to confidently remove them later because you don't remember if they were intended to be uncommented at some point, or removed entirely.

There's a simple way to look for this kind of cruft before merging the refactor branch back to main. The command git diff by default compares your current code to the most recent commit. However, you can diff any two states of your project. So here's the command I ran:

$ git status
git status
On branch plugin_config
Your branch is up to date with 'origin/plugin_config'.

nothing to commit, working tree clean
$ git diff main
diff --git a/simple_deploy/hookspecs.py b/simple_deploy/hookspecs.py
index cec488d..2ab1185 100644
--- a/simple_deploy/hookspecs.py
+++ b/simple_deploy/hookspecs.py
@@ -3,20 +3,23 @@ import pluggy
 hookspec = pluggy.HookspecMarker("simple_deploy")
 
 
-@hookspec
-def automate_all_supported():
-    """Specify whether --automate-all is supported on the specified platform."""
+# @hookspec
+# def automate_all_supported():
+#     """Specify whether --automate-all is supported on the specified platform."""
...

Running git status shows we're on the plugin_config branch, and that all changes have been pushed to the remote repository. Running git diff main compares the head of the entire refactoring branch against the current state of main. Instead of comparing individual refactoring steps, we get an overview of all changes that will be merged back to main. In the first bit of output, we can see that the commented out function automate_all_supported() is still there.

That's interesting, but not super helpful because I wasn't likely to miss that code. It would also be easy to clean up that code if I found it in the project later on.

Sections like this are much more helpful:

An excerpt from running git diff main from a refactoring branch. Around line 508 in deploy.py, there's some commented-out code that should be removed before merging back to main.

There's a file called deploy.py that's over 500 lines long. In that file, there's a section that performs validation on the plugin the user has installed. One validation step makes sure that all required hooks are defined by the plugin. This validation step is simpler now that only a single config object is being returned.

Three lines are still commented out from the old validation work. This is exactly the kind of block that's easy to miss if you don't have an automated way to find commented code. With my terminal configuration I can scroll through the output of git diff main, looking for any green blocks starting with #, followed by a line of code. Since I just did this refactoring work, I recognize each of these sections and can usually tell that it's safe to remove those lines.

If you use this workflow you can make a commit, followed by another diff call:

$ git commit -am "Remove unused code."
$ git diff main

These commits won't persist if you're using squash and merge, and it means each diff call is cleaner than the last. You should be left with only changes that you actually want merged back to main.

Conclusions

Some people don't like seeing commented code, even in refactoring work. But many of us find it incredibly useful. If you like to keep commented code around during your development work, make sure you have a clear way to get rid of it before making your final commits and merges. Keep in mind that git diff is a really flexible command, which can be used to examine changes across a wide variety of states in your project.