Better Python path handling with Pathlib and Git
One of the most frustrating problems in ad-hoc data science projects is broken file paths.
You write a script that loads a model, grabs data, or instantiates a config from your local disk. It works perfectly on your machine, then someone else runs it and… catastrophe: `FileNotFoundError: No such file or directory`. Uh oh, looks like someone just got bit by hard-coded paths or assumptions about where the script is being run from.
Since I’m normally working inside a `git` repo, I’ve started writing all my file paths relative to the repo root, and dynamically finding that root with GitPython. For example, suppose I have the following project structure:
```
my-project/
├── README.md
├── data
│   └── iris.csv
└── src
    ├── model.py
    └── preprocess.py
```
and I want to grab `data/iris.csv` from within `preprocess.py` (or perhaps from much more deeply nested sub-folders). Instead of using a relative path (`../data/iris.csv`) or an absolute path (`/Users/yossi/my-project/data/iris.csv`), I can do the following:
```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "gitpython",
# ]
# ///
from pathlib import Path

import git

# Find the repo root by asking git for its top-level directory,
# searching upwards from the current working directory.
repo_root = Path(
    git.Repo(search_parent_directories=True).git.rev_parse("--show-toplevel")
)

print((repo_root / "data" / "iris.csv").read_text())
```
Side note: I really love pathlib’s overloading of the `/` operator as syntactic sugar for building up paths. I’m also a big fan of `Path.read_text` instead of the standard `with open(file_path) as f` context manager (in most cases).
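To make that concrete, here is a minimal sketch of the two styles side by side (the `notes.txt` file is purely hypothetical):

```python
from pathlib import Path

# The / operator joins path segments without any os.path.join noise.
notes_path = Path("data") / "notes.txt"  # hypothetical file, for illustration only

# Path.read_text opens, reads, and closes the file in one call...
contents = notes_path.read_text()

# ...which is equivalent to the classic context-manager dance.
with open(notes_path) as f:
    contents = f.read()
```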
Also, if you haven’t seen it before, I’m using Python inline script metadata here to specify the dependencies. You can run this script with `uv run --script src/preprocess.py` and `uv` will take care of resolving and installing the dependencies within a `venv`.
I’m anticipating that this will probably trigger a lot of people:
- But now you have an extra dependency on the `GitPython` package!
- What happens if `git` isn’t installed on the system? (Looking at you, Windows.)
- What happens if there are `git` submodules?
And to all this, I say: “yes, that’s true”. I don’t believe that this is idiomatic or good for production or anything close to it. But for ad-hoc collaborative projects, littered with `*.ipynb` notebooks (this, even I cannot stand), it restores some level of sanity amidst the throes of data science passion.
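That said, if the extra dependency (or a missing `git` binary) really bothers you, a rough dependency-free sketch is to walk up the directory tree yourself until you hit a `.git` entry. The `find_repo_root` helper below is my own hypothetical stand-in, not part of GitPython or the standard library:

```python
from pathlib import Path


def find_repo_root(start: Path | None = None) -> Path:
    """Walk upwards from `start` (default: cwd) until a `.git` entry is found."""
    current = (start or Path.cwd()).resolve()
    for candidate in (current, *current.parents):
        # .git is a directory in a normal clone, but a file in worktrees and
        # submodules, so check for existence rather than is_dir().
        if (candidate / ".git").exists():
            return candidate
    raise FileNotFoundError("No .git found in any parent directory")


repo_root = find_repo_root()
print((repo_root / "data" / "iris.csv").read_text())
```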
As to the Python module import system? That’s a rant for another day…
I’m still scarred by seeing too many bastardised `sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))` incantations.
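For what it’s worth, even that hack reads better with pathlib; here’s a sketch of the equivalent (assuming, as in the snippet above, that you want the grandparent directory of the current file on `sys.path`):

```python
import sys
from pathlib import Path

# parents[1] is the parent of the directory containing this file, i.e. the same
# thing as os.path.dirname(os.path.dirname(os.path.abspath(__file__))).
sys.path.append(str(Path(__file__).resolve().parents[1]))
```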
Please let me know what your preferred path-wrangling design pattern is. I’m just trying to find my path here after all…