Version-controlling Jupyter notebooks means tracking every meaningful change to the notebook’s code, narrative, and outputs in a reproducible, collaborative, and reviewable way—usually with Git plus specialized diff or text-conversion tools.
Version control is the practice of tracking and managing changes to files over time so that multiple people can collaborate, review history, and restore previous states when needed. In software engineering this normally involves storing source code in a line-oriented text file and using Git to manage commits, branches, and pull requests.
Jupyter notebooks (.ipynb) complicate this process because they are JSON documents that bundle executable code, rich markdown, cell metadata, and often bulky cell outputs such as images or dataframes. A single character typed in a cell can rewrite a large chunk of the underlying JSON, making a standard Git diff almost unreadable and merge conflicts nearly impossible to resolve manually.
Because each cell is stored as an object in a JSON array together with its execution count and outputs, re-ordering cells moves whole blocks of JSON and re-executing them changes the execution_count fields, which produces large diffs unrelated to the actual change.
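To make this concrete, the sketch below builds a tiny notebook as a Python dict (a hypothetical two-cell example, not taken from a real project), simulates a re-run by bumping the execution counts, and counts the changed lines in an ordinary line-based diff of the serialized JSON. The helper names `make_notebook` and `changed_lines` are illustrative only.

```python
import copy
import difflib
import json


def make_notebook():
    """Return a minimal nbformat-4-style notebook as a plain dict (hypothetical example)."""
    return {
        "cells": [
            {"cell_type": "code", "execution_count": 1, "metadata": {},
             "outputs": [], "source": ["x = 1\n"]},
            {"cell_type": "code", "execution_count": 2, "metadata": {},
             "outputs": [{"output_type": "execute_result", "execution_count": 2,
                          "data": {"text/plain": ["2"]}, "metadata": {}}],
             "source": ["x + 1"]},
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }


def changed_lines(before, after):
    """Count added and removed lines in a unified diff of the serialized JSON."""
    a = json.dumps(before, indent=1, sort_keys=True).splitlines()
    b = json.dumps(after, indent=1, sort_keys=True).splitlines()
    diff = difflib.unified_diff(a, b, lineterm="")
    return sum(
        1 for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )


original = make_notebook()
rerun = copy.deepcopy(original)
for cell in rerun["cells"]:
    cell["execution_count"] += 2  # same code, simply executed again
    for out in cell["outputs"]:
        if "execution_count" in out:
            out["execution_count"] = cell["execution_count"]

# The source of every cell is identical, yet the diff is not empty.
print(changed_lines(original, rerun))
```

In a real notebook the effect is usually larger, because re-execution can also change output payloads such as timestamps, object addresses, or image data.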
Two collaborators who execute the notebook in different orders or have different output sizes are likely to generate conflicting JSON fragments. Resolving these by hand is error-prone.
Embedded images, dataframes, and widgets can balloon a commit from KBs to MBs. Repository size and clone times grow rapidly.
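A rough way to see how much of a notebook is output rather than code is to measure the serialized size of each cell's `outputs` list. The sketch below does this for a hypothetical notebook whose single cell carries a fake 50 kB image; `output_share` is an illustrative helper, and the numbers are an estimate, not an exact accounting of the file on disk.

```python
import base64
import json


def output_share(notebook):
    """Return (total_bytes, output_bytes) for a notebook given as a dict."""
    total = len(json.dumps(notebook).encode("utf-8"))
    outputs = sum(
        len(json.dumps(cell.get("outputs", [])).encode("utf-8"))
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
    )
    return total, outputs


# Hypothetical notebook: one line of code whose output is a 50 kB image.
fake_png = base64.b64encode(b"\x00" * 50_000).decode("ascii")
nb = {
    "cells": [{
        "cell_type": "code", "execution_count": 1, "metadata": {},
        "source": ["plot(df)"],
        "outputs": [{"output_type": "display_data", "metadata": {},
                     "data": {"image/png": fake_png}}],
    }],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
}

total, outputs = output_share(nb)
print(f"{outputs / total:.0%} of the file is output")
```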
Commit the raw .ipynb notebook. Pros: zero setup, works with standard Git hosting services. Cons: unreadable diffs, high chance of merge conflicts.
nbdime, ReviewNB, and GitHub’s built-in notebook viewer render visual, cell-level diffs that ignore irrelevant metadata. They integrate with Git hooks or your pull-request UI:
# Install nbdime
pip install nbdime
# Register nbdime's diff/merge drivers and tools for the current repository;
# add --global to enable them for all repositories
nbdime config-git --enable
Pros: rich, cell-aware diff/merge; minimal workflow changes. Cons: collaborators must adopt the same tooling.
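The idea behind a cell-aware diff can be illustrated with the standard library alone. The sketch below is not nbdime's algorithm; it is a simplified stand-in that compares only the source of each cell and ignores outputs, execution counts, and metadata. The function names are hypothetical.

```python
import difflib


def cell_sources(notebook):
    """Extract each cell's source as one string, ignoring outputs and metadata."""
    sources = []
    for cell in notebook.get("cells", []):
        src = cell.get("source", "")
        # nbformat allows source to be a single string or a list of lines
        sources.append(src if isinstance(src, str) else "".join(src))
    return sources


def cell_level_diff(before, after):
    """Return (operation, old_cells, new_cells) tuples for the cells that changed."""
    a, b = cell_sources(before), cell_sources(after)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


old = {"cells": [
    {"cell_type": "code", "execution_count": 1, "outputs": [], "source": "x = 1"},
    {"cell_type": "code", "execution_count": 2, "outputs": [], "source": "print(x)"},
]}
new = {"cells": [
    {"cell_type": "code", "execution_count": 7, "outputs": [], "source": "x = 1"},
    {"cell_type": "code", "execution_count": 8, "outputs": [], "source": "print(x + 1)"},
]}

# Only the second cell is reported; the changed execution counts are ignored.
print(cell_level_diff(old, new))
```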
Jupytext synchronizes a notebook with one or more text representations: Markdown, R Markdown, or a .py/.R script containing special # %% cell markers. Your repo stores the text file; the .ipynb is regenerated locally on open.
# One-time setup
pip install jupytext
jupytext --set-formats ipynb,py:percent my_notebook.ipynb
Pros: fully readable diffs, easy conflict resolution, smaller repo size. Cons: higher cognitive load (two files), need to educate the team.
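To show why the paired text file diffs so cleanly, here is a minimal sketch that renders a notebook dict in the style of the percent format. It is a simplification written for illustration, not Jupytext's implementation: it skips raw cells, cell metadata, and the commented YAML header that Jupytext normally writes at the top of the file. `to_percent_script` is a hypothetical name.

```python
def to_percent_script(notebook):
    """Render a notebook dict as text in the style of the py:percent format."""
    chunks = []
    for cell in notebook.get("cells", []):
        src = cell.get("source", "")
        text = src if isinstance(src, str) else "".join(src)
        if cell.get("cell_type") == "code":
            chunks.append("# %%\n" + text)
        elif cell.get("cell_type") == "markdown":
            # Markdown is stored as Python comments so the file stays valid Python
            commented = "\n".join(
                ("# " + line) if line else "#" for line in text.split("\n")
            )
            chunks.append("# %% [markdown]\n" + commented)
    return "\n\n".join(chunk.rstrip("\n") for chunk in chunks) + "\n"


nb = {"cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# Churn model"]},
    {"cell_type": "code", "execution_count": 3, "metadata": {},
     "outputs": [], "source": ["import pandas as pd\n"]},
]}

print(to_percent_script(nb))
```

Because outputs and execution counts never reach the text file, re-running the notebook leaves it unchanged, and an edit to one line of code shows up as a one-line diff.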
Notebook code may depend on library versions, environment variables, or data stores. Commit one or more of the following: requirements.txt or environment.yml, or a conda-lock file.
Run jupyter nbconvert --ClearOutputPreprocessor.enabled=True before commit, or use Git pre-commit hooks like nbstripout.
Format or lint notebooks with black-nb or flake8-nb, and validate execution with pytest --nbval.
The following opinionated workflow balances readability with minimal friction:
Create a feature branch: git checkout -b feature/customer-churn-model
Work in Jupyter with the jupytext extension installed.
Pair the notebook with a .py text representation (jupytext --set-formats ipynb,py:percent).
Commit the .py partner file and a stripped .ipynb (if needed). The nbstripout pre-commit hook clears outputs automatically.
Open a pull request so reviewers can comment on the .py file inline, or on a rendered notebook via ReviewNB.
Galaxy focuses on writing, sharing, and reviewing SQL rather than Python notebooks. If your analytics workflow relies primarily on SQL rather than notebooks, you would author and version your queries directly inside Galaxy’s fast editor, leaving notebooks for exploratory or visualization tasks. For Git-based review, you could store exported SQL from Galaxy alongside Jupytext-managed notebooks in the same repository, maintaining a single source of truth for both query logic and ad-hoc analysis.
Bloated commits happen when analysts forget to clear outputs and Git LFS is not enabled. Fix this by automating nbstripout or by using Jupytext to keep outputs separate.
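Conceptually, stripping a notebook amounts to emptying each code cell's outputs and resetting its execution count. The sketch below illustrates that core step on a notebook dict; it is a simplified illustration, not nbstripout's code, and real tools also handle details such as selected metadata. `strip_outputs` is a hypothetical helper.

```python
import copy


def strip_outputs(notebook):
    """Return a copy of the notebook with code-cell outputs and execution counts cleared."""
    stripped = copy.deepcopy(notebook)
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


nb = {"cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["Notes"]},
    {"cell_type": "code", "execution_count": 12, "metadata": {},
     "source": ["1 + 1"],
     "outputs": [{"output_type": "execute_result", "execution_count": 12,
                  "data": {"text/plain": ["2"]}, "metadata": {}}]},
]}

clean = strip_outputs(nb)
print(clean["cells"][1]["outputs"], clean["cells"][1]["execution_count"])
```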
Out-of-order cell execution creates hidden state. Always Restart & Run All before commit and consider pytest --nbval in CI to enforce deterministic execution.
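A cheap heuristic for spotting out-of-order execution is to check that the execution counts of the non-empty code cells read 1, 2, 3, ... from top to bottom, which is what a fresh Restart & Run All produces. The sketch below assumes the notebook was committed with its execution counts intact; `executed_in_order` is a hypothetical helper and only inspects metadata, so it complements a real re-execution check such as pytest --nbval without replacing it.

```python
def executed_in_order(notebook):
    """True if non-empty code cells carry execution counts 1, 2, 3, ... top to bottom."""
    counts = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
        and "".join(cell.get("source", "")).strip()
    ]
    return counts == list(range(1, len(counts) + 1))


linear = {"cells": [
    {"cell_type": "code", "execution_count": 1, "source": ["x = 1"], "outputs": []},
    {"cell_type": "markdown", "source": ["Explanation"]},
    {"cell_type": "code", "execution_count": 2, "source": ["x + 1"], "outputs": []},
]}
shuffled = {"cells": [
    {"cell_type": "code", "execution_count": 5, "source": ["x = 1"], "outputs": []},
    {"cell_type": "code", "execution_count": 3, "source": ["x + 1"], "outputs": []},
]}

print(executed_in_order(linear), executed_in_order(shuffled))
```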
Long, multipurpose notebooks cause merge headaches and confuse reviewers. Split ETL, EDA, and modeling into separate logically named notebooks or convert stable logic into Python modules or SQL queries managed in Galaxy.
Effective version control turns notebooks from disposable scratchpads into maintainable, collaborative artifacts. Whether you adopt Jupytext, nbdime, or another strategy, the essential steps are the same: store a diff-friendly representation, strip bloat, document the environment, and automate validation. With these guardrails in place, data teams can iterate quickly without sacrificing reproducibility or code quality.
Notebooks dominate exploratory data science, yet their JSON format breaks traditional diff-and-merge workflows. Without proper version control, teams lose reproducibility, slow code reviews, and risk shipping incorrect analyses. Mastering notebook-aware versioning aligns data science with software-engineering rigor, enabling faster collaboration and more reliable insights.
Should I keep the .ipynb file in Git at all?
Yes, but strip outputs and ensure a stable cell order. The JSON is still useful for rendering in viewers like GitHub or VS Code, but the text partner file (via Jupytext) should be the primary diff target.
Enable nbdime as your Git mergetool. It performs cell-aware merges and opens a web UI for manual resolution. If you use Jupytext, you can merge the plain-text script first and regenerate the notebook.
Both provide cell-level diffs. nbdime is an open-source CLI/UI tool integrated with Git, whereas ReviewNB is a SaaS platform that plugs directly into GitHub pull requests and supports comments on individual cells.
Galaxy is optimized for SQL workflows rather than Jupyter notebooks. However, you can keep notebooks and Galaxy-authored SQL side-by-side in the same Git repository, applying the notebook practices discussed here alongside Galaxy’s built-in query history.