Compare two files

“Diffs” are the lingua franca of change. They’re the compact narratives that tell you what moved between two versions of a thing—source code, prose, a dataset—without forcing you to reread everything. Behind those few symbols (+, -, @@) lives a deep stack of algorithms, heuristics, and formats that balance optimality, speed, and human comprehension. This article is a practical, algorithms-to-workflows tour of diffs: how they’re computed, how they’re formatted, how merge tools use them, and how to tune them for better reviews. Along the way, we’ll ground claims in primary sources and official docs—because tiny details (like whether whitespace counts) really matter.

What a “diff” actually is

Formally, a diff describes a shortest edit script (SES) to transform an “old” sequence into a “new” one using insertions and deletions (and sometimes substitutions, which can be modeled as delete+insert). In practice, most programmer-facing diffs are line-oriented and then optionally refined to words or characters for readability. The canonical outputs are the context and unified formats; the latter, which is what you usually see in code review, compresses output with a concise header and “hunks,” each showing a neighborhood of context around changes. The unified format is selected via -u/--unified and is the de facto standard for patching; patch generally benefits from context lines to apply changes robustly.
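To make the hunk structure concrete, here is a minimal sketch using Python's standard difflib module. Python is our choice for illustration only; the file names a.txt and b.txt and the sample lines are made up.

```python
import difflib

# Hypothetical file contents; each element is one line, newline included.
old = ["alpha\n", "beta\n", "gamma\n"]
new = ["alpha\n", "BETA\n", "gamma\n"]

# n=1 asks for one line of context around each change (similar to diff -U1).
patch = list(difflib.unified_diff(old, new, fromfile="a.txt", tofile="b.txt", n=1))

print("".join(patch), end="")
```

The result starts with the two file header lines, followed by an @@ hunk header and then the body, where unchanged context lines begin with a space, removed lines with - and added lines with +.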

The GNU diff manual catalogs the switches you reach for when you want less noise and more signal: ignoring blanks, expanding tabs for alignment, or asking for a “minimal” edit script even if it’s slower. These options don’t alter the files themselves; they change which differences are treated as significant, how aggressively the algorithm searches for smaller scripts, and how the result is presented to humans.

From LCS to Myers: how diffs are computed

Most text diffs are built on the Longest Common Subsequence (LCS) abstraction. Classic dynamic programming solves LCS in O(mn) time and space, but that’s too slow and memory-hungry for large files. Hirschberg’s algorithm showed how to compute optimal alignments in linear space (still O(mn) time) using divide-and-conquer, a foundational space-saving technique that influenced practical diff implementations.
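As a rough illustration of the classic dynamic-programming approach (the O(mn) table, not Hirschberg's linear-space refinement), the sketch below fills the full table and then walks it to recover one longest common subsequence. The function name lcs is ours.

```python
def lcs(a, b):
    """Return one longest common subsequence of two sequences (O(mn) time and space)."""
    m, n = len(a), len(b)
    # table[i][j] = LCS length of a[i:] and b[j:]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the top-left corner to rebuild the subsequence.
    result = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1  # dropping a[i] loses nothing: it corresponds to a deletion
        else:
            j += 1  # skipping b[j] corresponds to an insertion
    return result
```

Everything in a that is not part of the result is a deletion and everything in b that is not part of it is an insertion, which is how an LCS turns into an edit script. The table holds (m+1)(n+1) entries, which is the memory cost the paragraph above warns about.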

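For speed and quality, the breakthrough was Eugene W. Myers’s 1986 algorithm, which finds an SES in O(ND) time (N = combined length of the two inputs, D = size of the shortest edit script), with a refinement that runs in linear space. Myers models edits in an “edit graph” and advances along furthest-reaching frontiers one edit at a time, so files that differ only slightly (small D) are compared quickly. The algorithm itself yields a minimal script; practical tools layer speed heuristics on top that can give up strict minimality on hard inputs, which is why options such as --minimal exist. That combination of speed and small output is why “Myers” remains the default in many tools.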
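A compact sketch of the greedy furthest-reaching idea follows. It computes only D, the length of the shortest edit script, and omits both path recovery and the linear-space refinement; the function name is ours, and real tools differ in many details.

```python
def shortest_edit_length(a, b):
    """Length of the shortest insert/delete script turning a into b (greedy O(ND) search)."""
    n, m = len(a), len(b)
    # furthest[k] = largest x reached so far on diagonal k, where k = x - y
    furthest = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            # Decide whether to arrive from diagonal k+1 (an insertion, moving down)
            # or from diagonal k-1 (a deletion, moving right).
            if k == -d or (k != d and furthest[k - 1] < furthest[k + 1]):
                x = furthest[k + 1]
            else:
                x = furthest[k - 1] + 1
            y = x - k
            # Follow the "snake": matching elements cost nothing.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            furthest[k] = x
            if x >= n and y >= m:
                return d
    return n + m  # not reached: the loop always returns by d = n + m
```

The outer loop runs D+1 times and each pass touches at most 2D+1 diagonals, so nearly identical inputs finish after only a few passes regardless of their length.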

There’s also the Hunt–Szymanski family, which accelerates LCS when few positions match (by pre-indexing matches and chasing increasing subsequences), and is historically linked to early diff variants. These algorithms illuminate trade-offs: in inputs with sparse matches, they can run sub-quadratically. For a practitioner’s overview bridging theory and implementation, see Neil Fraser’s notes.

When “optimal” isn’t readable: patience and histogram strategies

Myers aims for minimal edit scripts, but “minimal” ≠ “most readable.” Large blocks reordered or duplicated can trick a pure SES algorithm into awkward alignments. Enter patience diff, attributed to Bram Cohen: it anchors on unique lines (those that appear exactly once on each side) to stabilize alignments, often producing diffs humans find cleaner, especially in code with moved functions or reorganized blocks. Many tools expose this via a “patience” option (e.g., Git’s diff.algorithm setting).
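The anchoring step can be sketched as follows, assuming the usual description of patience diff: collect lines that occur exactly once in each file, then keep the longest run of them that appears in the same order on both sides. This is only the anchor selection; a full implementation would recurse between anchors and fall back to another algorithm where none exist. The function name is ours.

```python
from bisect import bisect_left
from collections import Counter


def patience_anchors(a, b):
    """Return (index_in_a, index_in_b) pairs for unique lines that keep their relative order."""
    count_a, count_b = Counter(a), Counter(b)
    pos_in_b = {line: j for j, line in enumerate(b) if count_b[line] == 1}
    # Candidate pairs, already sorted by their position in a.
    pairs = [(i, pos_in_b[line]) for i, line in enumerate(a)
             if count_a[line] == 1 and line in pos_in_b]

    # Longest increasing subsequence over the b-positions (patience sorting).
    tail_values = []  # tail_values[k]: smallest b-position ending a run of length k+1
    tail_pairs = []   # index into pairs for that tail
    previous = [None] * len(pairs)
    for idx, (_, j) in enumerate(pairs):
        k = bisect_left(tail_values, j)
        previous[idx] = tail_pairs[k - 1] if k > 0 else None
        if k == len(tail_values):
            tail_values.append(j)
            tail_pairs.append(idx)
        else:
            tail_values[k] = j
            tail_pairs[k] = idx

    # Walk the back-pointers from the end of the longest run.
    anchors = []
    idx = tail_pairs[-1] if tail_pairs else None
    while idx is not None:
        anchors.append(pairs[idx])
        idx = previous[idx]
    anchors.reverse()
    return anchors
```

Repeated lines such as a lone brace never become anchors, which is what keeps the alignment from latching onto boilerplate.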

Histogram diff extends patience with a frequency histogram to better handle low-occurrence elements while remaining fast (popularized in JGit). If you’ve ever found --histogram producing clearer hunks for noisy files, that’s by design. On modern Git, you can pick the algorithm globally or per invocation: git config diff.algorithm myers|patience|histogram, or git diff --patience.

Word-level clarity, whitespace control, and moved-code highlighting

Line diffs are concise but can obscure tiny edits. Word-level diffs (--word-diff) color intra-line changes without flooding the review with whole-line insertions/deletions—great for prose, long strings, or one-liners.
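The idea behind a word-level diff can be sketched by tokenizing each line into words and whitespace and diffing the tokens instead of whole lines. The sketch below uses difflib's SequenceMatcher as a stand-in for Git's own machinery and borrows the [-removed-]{+added+} notation that git diff --word-diff prints; the function name is ours.

```python
import difflib
import re


def word_diff(old, new):
    """Mark word-level changes between two strings as [-removed-]{+added+}."""
    # Split into alternating runs of non-space and space so spacing is preserved.
    a = re.findall(r"\S+|\s+", old)
    b = re.findall(r"\S+|\s+", new)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.append("".join(a[i1:i2]))
            continue
        if i1 < i2:
            pieces.append("[-" + "".join(a[i1:i2]) + "-]")
        if j1 < j2:
            pieces.append("{+" + "".join(b[j1:j2]) + "+}")
    return "".join(pieces)


print(word_diff("the quick fox", "the slow fox"))
# prints: the [-quick-]{+slow+} fox
```

A line diff would report the whole sentence as removed and re-added; here only the one changed word is marked.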

Whitespace can swamp diffs after reformatting. Git and GNU diff both let you ignore whitespace changes to varying degrees. In GNU diff, -b ignores changes in the amount of whitespace, -w ignores all whitespace, and -B ignores changes that only insert or delete blank lines; after a formatter runs, you’ll see logical edits instead of alignment noise.
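One way to picture these flags is as a normalization applied to each line before comparison. The sketch below approximates the documented behavior of -b and -w; it is not how GNU diff or Git implement them, and the function name and mode strings are ours.

```python
import re


def normalize(line, mode):
    """Approximate how a line is compared under -b or -w (illustrative only)."""
    if mode == "w":
        # -w: ignore all whitespace.
        return re.sub(r"\s+", "", line)
    if mode == "b":
        # -b: runs of whitespace are equivalent, trailing whitespace is ignored.
        return re.sub(r"\s+", " ", line).rstrip()
    return line


def lines_equal(old, new, mode=None):
    return normalize(old, mode) == normalize(new, mode)
```

Under this model, "a  =  1  " and "a = 1" compare equal with -b, while "a=1" and "a = 1" compare equal only with -w, because -b still requires whitespace to be present where the other side has some.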

When code moves wholesale, Git can highlight moved blocks with --color-moved, visually separating “moved” from “modified,” which helps reviewers audit that a move didn’t hide unintended edits. Persist it via diff.colorMoved.

Diffs in the service of merges: two-way vs. three-way and `diff3`

A two-way diff compares exactly two versions; it can’t tell whether both sides edited the same base line, so it often over-conflicts. Three-way merging (used by modern VCSs) computes diffs from a common ancestor to each side and then reconciles the two change sets. This dramatically reduces spurious conflicts and provides better context. The classic algorithmic core here is diff3, which merges changes from “O” (base) to “A” and “B” and marks conflicts where necessary.
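The heart of three-way merging is a small decision rule applied to each region where at least one side differs from the base. The sketch below shows only that rule; aligning the three files into regions is the job of the underlying diffs and is assumed to have happened already. The function name and marker labels are ours.

```python
def merge_region(base, ours, theirs):
    """Merge one aligned region (lists of lines). Returns (lines, is_conflict)."""
    if ours == theirs:
        return ours, False    # both sides agree (possibly both unchanged)
    if ours == base:
        return theirs, False  # only their side changed
    if theirs == base:
        return ours, False    # only our side changed
    # Both sides changed the region differently: emit diff3-style markers
    # that include the base text.
    conflict = (["<<<<<<< ours"] + ours +
                ["||||||| base"] + base +
                ["======="] + theirs +
                [">>>>>>> theirs"])
    return conflict, True
```

A two-way comparison of ours against theirs cannot distinguish the second and third cases from a real conflict; the base is what makes it possible to tell which side changed.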

Academic and industrial work continues to formalize and improve merge correctness; for example, verified three-way merges propose semantic notions of conflict-freedom. In day-to-day Git, the modern ort merge strategy builds on diffing and rename detection to produce merges with fewer surprises. For users, key tips are: show base lines in conflicts with merge.conflictStyle=diff3, and integrate frequently so diffs stay small.

Rename detection and its thresholds

Traditional diffs can’t “see” renames because content addressing treats files as blobs; they only see a deletion and an addition. Rename detection heuristics bridge that gap by comparing similarity across added/removed pairs. In Git, enable or tune via -M/--find-renames[=<n>] (default is ~50% similarity). Lower it for noisier moves. You can cap candidate comparisons with diff.renameLimit (and merge.renameLimit during merges). To follow history across renames, use git log --follow -- <path>. Recent Git also performs directory-rename detection to propagate folder moves during merges.
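The pairing heuristic can be sketched as: score every deleted file against every added file and accept the best match at or above a threshold. The sketch uses difflib's line-based ratio as a stand-in similarity measure; Git computes its similarity index differently, so treat the scores as illustrative. The function name and the sample paths are hypothetical.

```python
import difflib


def detect_renames(deleted, added, threshold=0.5):
    """Pair deleted and added files whose contents are similar enough.

    deleted and added map path -> file text. Returns (old_path, new_path, score) tuples.
    """
    renames = []
    unmatched = dict(added)
    for old_path, old_text in deleted.items():
        best = None
        for new_path, new_text in unmatched.items():
            score = difflib.SequenceMatcher(
                None, old_text.splitlines(), new_text.splitlines(), autojunk=False
            ).ratio()
            if score >= threshold and (best is None or score > best[1]):
                best = (new_path, score)
        if best is not None:
            renames.append((old_path, best[0], best[1]))
            del unmatched[best[0]]  # each added file can be the target of one rename
    return renames


deleted = {"util.py": "a\nb\nc\nd\n"}
added = {"helpers.py": "a\nb\nc\nD\n", "readme.md": "x\ny\n"}
print(detect_renames(deleted, added))
```

The nested loop also shows why a cap like diff.renameLimit exists: the number of comparisons grows with the product of deleted and added files.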

Binary and delta diffs: rsync, VCDIFF/xdelta, bsdiff

Text isn’t the only thing that changes. For binaries, you typically want delta encoding—emit copy/add instructions to reconstruct a target from a source. The rsync algorithm pioneered efficient remote differencing using rolling checksums to align blocks across a network, minimizing bandwidth.
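What makes a checksum “rolling” is that sliding the window by one byte updates it in constant time instead of rescanning the block. Here is a sketch of a weak checksum in the style described for rsync (two 16-bit running sums); the modulus and the function names are our simplifications, and the real tool pairs this with a strong hash to confirm candidate matches.

```python
MOD = 1 << 16


def weak_checksum(block):
    """Compute the two running sums for a block of bytes from scratch."""
    a = sum(block) % MOD
    # The first byte gets the largest weight (the block length); the last byte gets 1.
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % MOD
    return a, b


def roll(a, b, out_byte, in_byte, block_len):
    """Slide the window one byte to the right in O(1)."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - block_len * out_byte + a) % MOD
    return a, b


data = b"the quick brown fox"
size = 4
a, b = weak_checksum(data[:size])
for start in range(1, len(data) - size + 1):
    a, b = roll(a, b, data[start - 1], data[start + size - 1], size)
    # The rolled value matches a fresh computation over the new window.
    assert (a, b) == weak_checksum(data[start:start + size])
```

Because every byte offset can be checked cheaply, matching blocks are found even when an insertion has shifted the rest of the file.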

The IETF standardized a generic delta format, VCDIFF (RFC 3284), describing a bytecode of ADD, COPY, and RUN, with implementations like xdelta3 using it for binary patching. For compact patches on executables, bsdiff often produces very small deltas via suffix arrays and compression; choose it when patch size dominates and generation can happen offline.
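To show what those three instructions do, here is a toy decoder. It mirrors the semantics of ADD, COPY, and RUN but uses plain Python tuples rather than the RFC 3284 byte encoding, so it cannot read real VCDIFF files. The tuple layout and function name are ours.

```python
def apply_delta(source, instructions):
    """Rebuild a target from source bytes and a list of toy delta instructions.

    ("ADD", data)          append literal bytes
    ("RUN", byte, count)   append one byte value repeated count times
    ("COPY", addr, size)   copy size bytes starting at addr, where addresses run
                           over the source followed by the target built so far
    """
    target = bytearray()
    for op, *args in instructions:
        if op == "ADD":
            target += args[0]
        elif op == "RUN":
            byte, count = args
            target += bytes([byte]) * count
        elif op == "COPY":
            addr, size = args
            # Copy byte by byte so a copy may overlap the bytes it is producing.
            for pos in range(addr, addr + size):
                if pos < len(source):
                    target.append(source[pos])
                else:
                    target.append(target[pos - len(source)])
        else:
            raise ValueError(f"unknown instruction: {op}")
    return bytes(target)


source = b"hello world"
delta = [("COPY", 0, 6), ("ADD", b"there"), ("RUN", ord("!"), 3)]
print(apply_delta(source, delta))
# prints: b'hello there!!!'
```

The delta is small whenever most of the target can be expressed as COPY instructions against the source, which is the whole point of delta encoding.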

Text diffs beyond source code: fuzzy matching and patching

When you need robust patching in the face of concurrent edits or slightly misaligned contexts—think editors or collaborative systems—consider diff-match-patch. It marries Myers-style differencing with Bitap fuzzy matching to find near-matches and apply patches “as best effort,” plus pre-diff speedups and post-diff cleanups that trade a tiny bit of minimality for nicer human output. For how to combine diff and fuzzy patch in continuous sync loops, see Fraser’s Differential Synchronization.
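The “find the right place even though the text has drifted” step can be sketched without Bitap. The version below swaps in a brute-force sliding window scored with difflib's similarity ratio, preferring positions close to where the patch expected its context. It illustrates the idea only; it is not diff-match-patch's algorithm or API, and the function name and threshold are ours.

```python
import difflib


def fuzzy_find(text, pattern, expected, threshold=0.6):
    """Return the best approximate position of pattern in text, or -1 if nothing is close."""
    best_pos, best_key = -1, None
    last_start = max(len(text) - len(pattern), 0)
    for pos in range(last_start + 1):
        window = text[pos:pos + len(pattern)]
        similarity = difflib.SequenceMatcher(None, pattern, window, autojunk=False).ratio()
        if similarity < threshold:
            continue
        # Prefer higher similarity; break ties by closeness to the expected position.
        key = (similarity, -abs(pos - expected))
        if best_key is None or key > best_key:
            best_pos, best_key = pos, key
    return best_pos


text = "The quick brown fox jumps"
print(fuzzy_find(text, "quikc brown", expected=4))
# prints: 4
```

A patcher built on this would apply its hunk at the returned position and report failure when the result is -1.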

Structured data diffs: tables and trees

Line diffs on CSV/TSV are brittle because a one-cell change can look like a whole-line edit. Table-aware diff tools (daff) treat data as rows/columns, emitting patches that target specific cells and rendering visualizations that make additions, deletions, and modifications obvious (see the R vignette). For quick checks, specialized CSV differs can highlight cell-by-cell changes and type shifts; they’re not algorithmically exotic, but they increase review signal by comparing the structure you actually care about.
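A cell-level comparison can be sketched with Python's csv module by keying rows on an identifier column. This is a simplified illustration, not daff's algorithm or output format; it assumes both tables share the same columns and that the key column is unique. The function name and sample data are ours.

```python
import csv
import io


def cell_changes(old_csv, new_csv, key):
    """List row- and cell-level changes between two CSV texts, matching rows by key."""
    old = {row[key]: row for row in csv.DictReader(io.StringIO(old_csv))}
    new = {row[key]: row for row in csv.DictReader(io.StringIO(new_csv))}
    changes = []
    for row_id in sorted(old.keys() | new.keys()):
        if row_id not in new:
            changes.append(("row removed", row_id))
        elif row_id not in old:
            changes.append(("row added", row_id))
        else:
            for column, old_value in old[row_id].items():
                new_value = new[row_id].get(column)
                if new_value is not None and new_value != old_value:
                    changes.append(("cell changed", row_id, column, old_value, new_value))
    return changes


old_table = "id,name,qty\n1,apple,3\n2,pear,5\n"
new_table = "id,name,qty\n1,apple,4\n3,plum,7\n"
for change in cell_changes(old_table, new_table, key="id"):
    print(change)
```

A line diff would show row 1 as wholly deleted and re-added; here the report narrows it to the single qty cell that changed.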

Practical Git diff tuning: a reviewer’s checklist

Pick the right algorithm: start with Myers (default), try --patience if reorders or noisy blocks confuse the output, or --histogram for fast, readable diffs on repetitive text. Set a default with git config diff.algorithm ….
Reduce noise: for style-only edits, use whitespace flags (-b, -w, --ignore-blank-lines) to focus on substantive changes. Outside Git, see GNU diff’s whitespace controls.
See inside a line: --word-diff helps for long lines and prose.
Audit moved code: --color-moved (or diff.colorMoved) separates “moved” from “modified.”
Handle renames: when reviewing refactors, add -M or tweak the similarity threshold (-M90%, -M30%) to catch renames; remember the default is about 50%. For deep trees, set diff.renameLimit.
Follow history across renames: git log --follow -- <path>.

How merges actually consume diffs (and what to do when they don’t)

A merge computes two diffs (BASE→OURS, BASE→THEIRS) and tries to apply both to BASE. Strategies like ort orchestrate this at scale, folding in rename detection (including directory-scale moves) and heuristics to minimize conflicts. When conflicts happen, diff3-style markers (set merge.conflictStyle=diff3, or run git checkout --conflict=diff3 -- <path> to regenerate the markers for one file) add the base text to each conflict, which is invaluable for understanding intent. The Pro Git chapter on Advanced Merging walks through resolution patterns, and Git’s docs list knobs like -X ours and -X theirs. To save time on recurring conflicts, enable rerere to record and replay your resolutions.

Beyond files: remote and incremental scenarios

If you’re syncing large assets over a network, you’re closer to the rsync world than to local diff. Rsync computes rolling checksums to discover matching blocks remotely, then transfers only what’s necessary. For packaged deltas, VCDIFF/xdelta gives you a standard bytecode and mature tools; choose it when you control both encoder and decoder. And if patch size is paramount (e.g., over-the-air firmware), bsdiff trades CPU/memory at build time for very small patches.

A quick word on “fuzzy” and “friendly”

Libraries like diff-match-patch accept that, in the real world, the file you’re patching may have drifted. By combining a solid diff (often Myers) with fuzzy matching (Bitap) and configurable cleanup rules, they can find the right place to apply a patch and make the diff more legible—critical for collaborative editing and syncing.

The “table stakes” you should internalize

Know your formats. Unified diffs (-u/-U<n>) are compact and patch-friendly; they’re what code review and CI expect.
Know your algorithms. Myers for minimal edits fast; patience/histogram for readability on reorderings or noisy blocks; Hirschberg for the linear-space trick; Hunt–Szymanski for sparse-match acceleration.
Know your switches. Whitespace controls, word-diff, and color-moved are review multipliers.
Know your merges. Three-way with diff3 style is less confusing; ort plus rename detection reduces churn; rerere saves time.
Pick the right tool for the data. For CSV/tables, use daff; for binaries, use VCDIFF/xdelta or bsdiff.

Appendix: tiny command cookbook

Because muscle memory matters:

# Show a standard unified diff with extra context
git diff -U5
diff -U5 a b

# Get word-level clarity for long lines or prose
git diff --word-diff

# Ignore whitespace noise after reformatting
git diff -b -w --ignore-blank-lines
diff -b -w -B a b

# Highlight moved code during review
git diff --color-moved
git config --global diff.colorMoved default

# Tame refactors with rename detection and follow history across renames
git diff -M
git log --follow -- <file>

# Prefer algorithm for readability
git diff --patience
git diff --histogram
git config --global diff.algorithm patience

# See base lines in conflict markers
git config --global merge.conflictStyle diff3

Closing thought

Great diffs are less about proving minimality and more about maximizing reviewer understanding at minimum cognitive cost. That’s why the ecosystem evolved multiple algorithms (Myers, patience, histogram), multiple presentations (unified, word-diff, color-moved), and domain-aware tools (daff for tables, xdelta/bsdiff for binaries). Learn the trade-offs, tune the knobs, and you’ll spend more time reasoning about intent and less time reassembling context from red and green lines.

Selected references & further reading

GNU diffutils manual: overview • unified format • diff3 • whitespace options
Git docs: git-diff • diff.algorithm • --word-diff • --color-moved • rename detection • diff.renameLimit • merge.renameLimit • --follow • Advanced Merging (Pro Git) • git-rerere • merge-ort
Algorithms: Myers (1986) • Hirschberg (1975) • Hunt–Szymanski (1977) • Patience diff • Histogram diff
Fuzzy patching & sync: diff-match-patch • Bitap • Fraser (diff notes) • Differential Synchronization
Binary/remote delta: rsync algorithm • RFC 3284 (VCDIFF) • xdelta3 • bsdiff
Tables/data: daff (GitHub) • daff R vignette

Frequently Asked Questions

What is a diff?

A diff is both the comparison of two versions of a file and the output of that comparison: a listing of what was added, removed, or changed. Version control systems use diffs to track how files evolve over time.

How does a diff compare two files?

Most diff tools compare files line by line. Rather than pairing line 1 with line 1, they look for the longest sequence of lines that appears in both files in the same order and report everything outside it as additions or deletions; a modified line shows up as a deletion plus an addition.

What is a patch in the context of diffs?

A patch is a file that contains the differences between two files, as produced by the diff tool. It can be applied to a version of a file with the 'patch' command to update it to a newer version.

What are unified diffs?

Unified diffs are a diff format that shows changes inline with a few lines of surrounding context. Lines removed from the original file are prefixed with a '-', added lines are prefixed with a '+', and each group of changes (a hunk) starts with an @@ header giving the affected line ranges.

Why are diffs crucial in version control systems?

Diffs are crucial in version control systems because they allow teams to track changes made to a file over time. This tracking makes it easier to maintain consistency, prevent duplicating work, spot errors or discrepancies, and manage multiple versions of files efficiently.

What is the LCS algorithm in diff tools?

The Longest Common Subsequence (LCS) algorithm finds the longest sequence of elements (usually lines) that appears in the same order, though not necessarily contiguously, in both the original and modified files. Everything outside that sequence is reported as a deletion or an addition.

Can diff tools compare binary files?

Most basic diff tools can only compare text files. However, specialized diff tools are designed to compare binary files, displaying the differences in a readable format.

What are some common diff tools in use today?

Some of the most popular diff tools include GNU diff, DiffMerge, KDiff3, WinMerge (Windows), and FileMerge (Mac). Many Integrated Development Environments (IDEs) also include built-in diff utilities.

How can I create a diff in Git?

In Git, run `git diff` to compare your working tree with the staging area, `git diff <commit> <commit>` to compare two commits, or `git diff --no-index <file1> <file2>` to compare two arbitrary files. The output shows the differences in unified format.

Can I use diff tools with directories, not just files?

Yes, many diff tools have the capability to compare directories in addition to individual files. This feature can be particularly useful when comparing versions of a large project with multiple files.