Abstract Nonsense

Better Python path handling with Pathlib and Git

One of the most frustrating problems in ad-hoc data science projects is broken file paths.

You write a script that loads a model, grabs data, or instantiates a config from your local disk. It works perfectly on your machine, then someone else runs it and… catastrophe: FileNotFoundError: No such file or directory. Uh oh, looks like someone just got bit by hard-coded paths or assumptions about where the script is being run from.
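To make that failure concrete, here is a minimal sketch of how a relative path resolves against the current working directory and not against the script's location. The throwaway directory layout, the `exists_from` helper, and the CSV contents are all made up for illustration.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical relative path, as it might appear hard-coded in a script.
relative = Path("data/iris.csv")


def exists_from(cwd: Path) -> bool:
    """Report whether `relative` points at a real file when run from `cwd`."""
    previous = Path.cwd()
    os.chdir(cwd)
    try:
        return relative.exists()
    finally:
        os.chdir(previous)  # always restore the original working directory


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in project with a data/ and a src/ folder.
    project = Path(tmp) / "my-project"
    (project / "data").mkdir(parents=True)
    (project / "src").mkdir()
    (project / "data" / "iris.csv").write_text("sepal_length,species\n")

    found_from_root = exists_from(project)         # launched from the project root
    found_from_src = exists_from(project / "src")  # launched from inside src/
    print(found_from_root, found_from_src)
```

The same path string is found from one directory and missing from the other, which is exactly the kind of surprise a collaborator hits when they launch the script from somewhere else.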

Since I’m normally working inside a git repo, I’ve started writing all my file paths as relative to the repo root, and dynamically finding that path with GitPython. For example, supposing I have the following project structure:

my-project/
├── README.md
├── data
│   └── iris.csv
└── src
    ├── model.py
    └── preprocess.py

and I want to grab data/iris.csv from within preprocess.py (or perhaps from a much more deeply nested sub-folder).

Instead of using a relative path (../data/iris.csv) or an absolute path (/Users/yossi/my-project/data/iris.csv), I can do the following:

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "gitpython",
# ]
# ///

from pathlib import Path
import git

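# With no path argument, GitPython starts from the current working directory
# and (because of search_parent_directories=True) walks upwards until it finds
# a repo. rev-parse --show-toplevel then returns the root of its working tree.
repo_root = Path(
    git.Repo(search_parent_directories=True).git.rev_parse("--show-toplevel")
)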

print((repo_root / "data" / "iris.csv").read_text())

Side note: I really love pathlib’s overloading of the / operator as syntactic sugar for joining paths. I’m also a big fan of Path.read_text instead of the standard with open(file_path) as f context manager (in most cases).
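As a quick illustration of both points, here is a small self-contained sketch. It writes a made-up two-line CSV into a temporary directory (the file contents and variable names are my own, purely for demonstration) and reads it back both ways.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # The / operator joins path segments; this is the same path as
    # Path(tmp, "data", "iris.csv").
    csv_path = root / "data" / "iris.csv"
    same_path = csv_path == Path(tmp, "data", "iris.csv")

    csv_path.parent.mkdir(parents=True)
    csv_path.write_text("sepal_length,species\n5.1,setosa\n", encoding="utf-8")

    # One call: opens the file, reads everything, and closes it again.
    via_read_text = csv_path.read_text(encoding="utf-8")

    # The longer equivalent with an explicit context manager.
    with open(csv_path, encoding="utf-8") as f:
        via_open = f.read()

    print(via_read_text == via_open)
```

The context manager still earns its keep when a file is too large to read in one go or needs to be processed line by line, which is why I only reach for read_text "in most cases".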

Also, if you haven’t seen it before, I’m using Python inline script metadata here to specify the dependencies. You can run this script with uv run --script src/preprocess.py and uv will take care of resolving and installing the dependencies within a venv.

I’m anticipating that this will probably trigger a lot of people:

  1. But now you have an extra dependency on the GitPython package!
  2. What happens if git isn’t installed on the system? (Looking at you, Windows).
  3. What happens if there are git submodules?

And to all this, I say: “yes, that’s true”. I don’t believe that this is idiomatic or good for production or anything close to it. But for ad-hoc collaborative projects, littered with *.ipynb notebooks (this, even I cannot stand), it restores some level of sanity amidst the throes of data science passion.
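For the first two objections, one possible fallback is to skip GitPython entirely and walk up the directory tree looking for a `.git` entry with plain pathlib. This is a rough sketch, not a drop-in replacement: `find_repo_root` is a hypothetical helper name, and the temporary directory with a fake `.git` folder only stands in for a real repository.

```python
import tempfile
from pathlib import Path


def find_repo_root(start: Path) -> Path:
    """Walk upwards from `start` until a directory containing `.git` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        # `.git` is a directory in a normal clone but a plain file in
        # worktrees and submodules, so only test that it exists.
        if (candidate / ".git").exists():
            return candidate
    raise FileNotFoundError(f"No .git found in {start} or any parent directory")


with tempfile.TemporaryDirectory() as tmp:
    # Fake project: an empty .git folder plus some deeply nested sub-folders.
    project = Path(tmp).resolve() / "my-project"
    (project / ".git").mkdir(parents=True)
    nested = project / "src" / "deep" / "nested"
    nested.mkdir(parents=True)

    found = find_repo_root(nested)
    is_match = found == project
    print(is_match)
```

Inside a script you could call `find_repo_root(Path(__file__))` so the lookup no longer depends on where the script is launched from. Note that it stops at the nearest `.git`, so inside a submodule it returns the submodule's root and not the parent repository's.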

As to the Python module import system? That’s a rant for another day… I’m still scarred by seeing too many bastardised sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) calls.

Please let me know what your preferred path-wrangling design pattern is. I’m just trying to find my path here after all…