Rewriting Your Git History and JS Source for Fun and Profit

This is a post in the Codebase Conversion series.


Or, how I spent a few weeks obsessing over a task that only I cared about

Intro 🔗︎

I just completed a large-scale rewrite and cleanup of our team's Git repository. I learned a lot working on this task and came up with some nifty and useful techniques in the process, so I'd like to share what I've learned.

In order to keep this post from getting any longer than it already is, I'll be referencing a bunch of external articles and assuming that you, the reader, have taken the time to read them and understand most of the concepts involved. That way I can go into more detail on the stuff I actually did.

To summarize the rest of the cleanup task:

  • I filtered out junk files from the repo's history, shrinking the base repo size by over 70%
  • I automatically rewrote our ES5 JavaScript source files to ES6 syntax throughout the entire history, as if they had "always been written that way"

I wrote a bunch of scripts and code for this task. I've created a repo with slightly cleaned-up copies of these scripts for reference. I'll show some code snippets in this post, but point to that repo for the full details.

Note: This post is absurdly technical and deep, even for me :) Hopefully people find this info useful, but I also don't expect it to be widely read. This is mostly a chance for me to document all the stuff I did, as a public service.

Background 🔗︎

Why Rewrite History? 🔗︎

My current project got started about six years ago. We used Mercurial for the first year, then migrated to Git. The repo's .git folder currently takes up about 2.15GB, with about 15,000 commits. There are several reasons for that. We've historically "vendored" third-party libs by committing them directly to the repo, including a lot of Java libraries. We've also had some random junk files that were accidentally committed (like a 135MB JAR file full of test images).

Unfortunately, because of how Git works, any file that exists in the historical commits has to be kept around permanently, as long as at least one commit references that file. That means that if you accidentally commit a large file, merge that commit to master, and then merge a commit that deletes the file, Git still keeps the file's contents around.

So, we had junk files in the history that should never have been committed, and we had old libraries and other binaries that were not going to be needed for future development. We just wrapped up a major development cycle, and I wanted to clean up the repo in preparation for the next dev cycle. That way everyone's clones would be smaller, which would also help with CI jobs.

Dealing with History Changes 🔗︎

However, Git commits form an immutable history. Every commit references its parents by their hashes. Every commit references its file tree by a hash. Every file tree references its files by the hashes of their contents. That means that if you literally change a single bit in one file at the start of the repo's history, every commit after that would have a different hash, and thus effectively form an "alternate history" line that has no relation to the original history. (This is one of the reasons why you should never rebase branches that have already been pushed - it creates a new history chain, and someone else might be relying on the existing history.)
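
To make that concrete, here's a minimal sketch using pygit2 (a Python binding for libgit2 that comes up again later in this post) that walks the hash chain from a repo's tip commit. The repo path is just a placeholder, and attribute names vary slightly between pygit2 versions:

from pygit2 import Repository

repo = Repository("path/to/someRepo")            # placeholder path
commit = repo[repo.head.target]                  # the tip commit of the current branch

print(commit.id)                                 # this commit's own hash
print([str(p) for p in commit.parent_ids])       # hashes of its parent commit(s)
print(commit.tree_id)                            # hash of the root file tree
for entry in commit.tree:                        # every file/subfolder is referenced by its hash
    print(entry.name, entry.id)

Change any one of those objects, and every hash downstream of it changes too.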

My plan was to create a fresh clone of our repo, and rewrite its history to filter out the files we didn't need for future work. We'd archive the old repo, and when the next dev cycle starts, everyone would clone the new repo and use that for development going forward.

The Codebase 🔗︎

Our repo has two separate JS client codebases which talk to the same set of services. The services are written in a mixture of Python and Java.

The older JS client codebase, which I'll call "App1", started in 2013, and the initial dev cycle resulted in a tangle of jQuery spaghetti and global script tags. I was able to convert those global scripts to AMD modules about halfway through that first dev cycle, and spent the second half of the year refactoring the codebase to use Backbone. We continued to use Backbone for new features until early 2017, when we began adding new features in React+Redux, and refactoring existing features from Backbone to React.

We didn't have a true "compile" step in our build process, so we were limited in what JS syntax we could use based on our target browser environment. I finally upgraded the build system from Require.js to Webpack+Babel in late 2017, and that allowed us to start using ES6 modules and ES6+ syntax. Since then, all of our new files have been written as ES6 modules, and we've had to do back-and-forth imports between files in AMD format and files in ES6 module format.

"App2" was originally written using Google Web Toolkit (GWT), a Java-to-JS compiler. We completed a ground-up rewrite of that client in React+Redux during this dev cycle (and I took great joy in deleting the GWT codebase, particularly since I'd written almost all of that myself). This codebase was written using Wepack, Babel, and ES6+ from day 1.

Because of its longer and varied dev history, App1's codebase is a classic example of the "lava layer anti-pattern" (and I will freely admit that I'm responsible for most of those layers). It's currently about 80% Backbone and 20% React+Redux, and we hope to finish rewriting all the remaining Backbone code to React over the next year or two. In the meantime, the mixture of AMD and ES6 modules is a bit of a pain. Webpack will let you use both, but you have to do some annoying workarounds when importing and exporting between files in different module formats (like adding SomeDefaultImport.default when using an ES6 default export in an AMD file).

The Plan Grows 🔗︎

Our team hasn't exactly been consistent with our code styling. In theory we have a formatting guide we ought to be following, but in practice... eh, whatever :)

I've been planning to set up automatic formatting tools like Prettier for our JS code and Black for our Python code. However, the downside of adding a code formatter to an existing codebase is that you inevitably wind up with a "big bang" commit that touches almost every line and obscures the actual change history of a given file. If you run a git blame (or "annotate"), it looks like every line was last changed by "Mark: REFORMATTED ALL THE THINGS", which isn't helpful. There are ways to skip past that, but it's annoying.

At some point I realized that if I was going to be rewriting the entire commit history anyway by filtering out junk files, then I could also apply auto-formatting of the code at every step in the history, to make it look as if all our code had "always been formatted correctly". That led me to another, bigger realization: I could do more than just reformat the code - I could rewrite the code!

I'd seen mentions of "codemods" before - automated tools that look for specific patterns in your code and transform them into other patterns. The React team is especially fond of these, and has provided codemods for things like renaming componentWillReceiveProps to UNSAFE_componentWillReceiveProps across an entire codebase.

It occurred to me that I could automatically rewrite all of our AMD modules to ES6 modules, and upgrade other syntax to ES6 as well. And, as with the formatting, I could do this for the entire Git history, as if the files had been written that way since the beginning.

Naturally, I ran into a bunch of complications along the way, but in the end I accomplished what I set out to do (yay!). Here's the details of how I did it.

Filtering Junk Files from History 🔗︎

Short answer: use The BFG Repo-Cleaner. Done.

Slightly longer answer: it does take work to figure out which files and folders you want to delete, and set up a command for the BFG to do its work.

Note: most of these commands are Bash-based and require a Unix-type environment. Fortunately, Git Bash will suffice on Windows.

Finding Files to Delete 🔗︎

There are a couple of ways to approach this.

The first is to look for specific large files in the history. Per this Stack Overflow answer, here's a Bash script that will spit out a list of the largest files and their sizes:

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| awk '$2 >= 2^20' \
| sort --numeric-sort --key=2 \
| cut -c 1-12,41- \
| numfmt --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

The 2^20 condition matches files of 1MiB or larger, and can be adjusted to look for other sizes.

Another approach is to look at the names of every file path that has ever existed in the repo, to get an idea of what might not be relevant anymore. Thanks again to Stack Overflow for this answer:

git log --pretty=format: --name-only --diff-filter=A | sort -u > allFilesInRepoHistory.txt

Skimming through the list of all files turned up a bunch of junk from early in the development history that wasn't in our current tree, and could safely be removed.

Preparing the Filtering Command 🔗︎

The BFG Repo-Cleaner supports deleting top-level folders by name, but not nested folders. For that, I had to write a script that looked for all files matching a given folder/path prefix, and write all matching blob hashes to a file that could be read by the BFG. (Pretty sure this started from another SO answer, but can't find which one atm.)

First, write a text file called nestedFoldersToRemove.txt with each path prefix to delete on a separate line:

ParentFolder1/nestedFolder1/
ParentFolder2/nestedFolder2/nestedFolder2a/
ParentFolder3/someFilePrefix

Then, run this script inside a repo to generate a file containing just the blob IDs that match those prefixes:

readarray -t folders < "../nestedFoldersToRemove.txt"

for f in "${folders[@]}"
do
    echo "Finding blobs for ${f}..."
    git rev-list --all --objects | grep -P "^\w+ ${f}*" | cut -d" " -f1 >> ../foundFilesToDelete.txt
done

If you want to see which files are getting deleted, remove the | cut -d" " -f1 portion to generate lines like ABCD1234 Path/to/some/file.ext

Finally, put together a list of the top-level folders you want deleted as well.

In some cases, I wanted to nuke all the old files in a folder, and only keep what was in there currently. In those cases I went ahead and specified the whole folder as a search path, because the BFG by default will preserve all files in the current HEAD.

Running the BFG 🔗︎

Once you know all the files you want cleaned up, you need to make sure that your original repo no longer has any of them at the tip of its history. If any of them still exist, add commits that delete those files.

Once that's done, it's time to nuke stuff!

# Clone the existing repo, without checking out files
git clone --bare originalRepoPath filteredRepo

# Run the BFG, deleting specific top-level folders and nested folders/files
java -jar bfg-1.13.0.jar --delete-folders "{TopLevelFolder1,TopLevelFolder2}" --strip-blobs-with-ids ./fileIdsToDelete.txt filteredRepo

After the BFG has rewritten the filtered repo, you need to have Git remove any blobs that are no longer referenced (run these commands inside the filtered repo):

git reflog expire --expire=now --all && git gc --prune=now --aggressive

You should now have a much smaller repo with the same commit authors and times, but a different line of history.

Rewriting JS Source with Codemods 🔗︎

Codemods for Converting to ES6 🔗︎

There's plenty of tools out there for rewriting JS source code automatically. My main concern was finding codemod transforms to do what I needed: convert our AMD modules to ES6 modules, and then fix up a few more bits of ES6 syntax on top of that.

I found most of what I needed in two repos:

  • 5to6/5to6-codemod
    • amd: converts AMD module syntax to ES6 module syntax
    • named-export-generation: Adds named exports corresponding to default export object keys. Only valid for ES6 modules exporting an object as the default export.
  • cpojer/js-codemod
    • no-vars: Conservatively converts var to const or let
    • object-shorthand: Transforms object literals to use ES6 shorthand for properties and methods.
    • trailing-commas: Adds trailing commas to array and object literals

There's many other transforms I could have used, but this was sufficient for what I wanted to do.

Writing a Custom Babel-Based Codemod 🔗︎

I mentioned that we had some funkiness in the JS code as a result of cross-importing between AMD and ES6 modules. One common pattern was that an AMD module would do:

define(["./someEs6Module"], 
function(someEs6Module) {
    const {named1, named2} = someEs6Module;
});

Transformed into ES6, this would be:

import someEs6Module from "./someEs6Module";

const {named1, named2} = someEs6Module;

However, the ES6 module in question might actually only have named exports, and no default export. This only worked because of Webpack magic.

When I tried running the transformed code, Webpack gave me a bunch of errors saying that these imports didn't exist. So, I opted to write a codemod that found all named exports, and if there was no default export, generated a fake default export object containing those, as a compatibility hack.

I couldn't find anything related to this that worked with jscodeshift. However, I did find jfeldstein/babel-plugin-export-default-module-exports, which almost did what I wanted. I figured I could hack together some custom changes to it. Since it was a Babel plugin, I needed a different tool to run that codemod. Fortunately, square/babel-codemod lets you run Babel plugins as codemods.

Thanks to the AST Explorer tool and some assistance from Twitter, I was able to hack together a plugin that did what I needed.

Initial Conversion Testing 🔗︎

I started off by trying to run each of these transforms as a separate step. I created a shell script that called jscodeshift multiple times, each with the path to a single transform file. I also called Prettier to do some formatting.

While doing that, I ran across some issues with Babel not recognizing the dynamic import() syntax, so I added a couple sed commands to rewrite those temporarily:

# Before anything else, replace uses of dynamic import
sed -i s/import/require.ensure/g App1/src/entryPoint.js

# Do other transforms
yarn jscodeshift -t path/to/some-transform.js App1/src

yarn codemod -p my-custom-babel-plugin.js App1/src

# Format the code
yarn prettier --config .prettierrc --write "App1/src/**/*.{js,jsx}"

# Undo replacement
sed -i s/require.ensure/import/g App1/src/entryPoint.js

Using this script, I was able to process a current checkout of our codebase automatically.

Formatting Python Code 🔗︎

While the majority of my focus was on our JS code, we've also got a bunch of Python code on our backend. I figured I'd take this chance to do some auto-formatting on that as well. I reviewed the available tools, and settled on Black, largely because its highly-opinionated style is almost identical to how we were writing our Python anyway.

Iterating through Git History 🔗︎

My ultimate goal was to iterate through every commit in the history of the already-filtered Git repo, find any relevant JS and Python files, transform the versions of the files as they existed in that commit, and create new commits with the same metadata but updated file trees containing the transformed files.

I'd done some prior research and read through a bunch of articles. The most relevant was A tale of three filter-branches, which compared three different ways to iterate using the git filter-branch command, and showed that the third way was the fastest.

Understanding the Index Filter Logic 🔗︎

In general, git filter-branch will iterate over the commit history, and run whatever additional commands you want at each commit. These could be "inline" shell commands, or separate scripts / tools.

The third approach shown in that post involved using git filter-branch --index-filter, and checking each commit to see which files had actually been added/changed/deleted. Fortunately, the author of that post chose to write the per-commit logic as a Ruby script. Reading that was extremely helpful in understanding what was going on.

I'll summarize the steps:

  • Retrieve the current commit ID from the filter-branch environment
  • Look up the original parent commit ID
  • Look up the ID for the rewritten form of the parent commit
  • Reset the Git index to the tree contents of the rewritten parent commit
  • Diff the original parent tree and original commit tree to see which files changed
  • For each added/changed/removed file:
    • If it's a file we're interested in, transform it, and add the transformed version to the Git index
    • Any other added/changed files we don't care about should just be added to the index as-is
    • If it was removed, delete it from the index
  • Create a new commit with the original metadata and the transformed tree

Note that the Ruby script was specifically interacting with Git via low-level "plumbing" commands like git cat-file blob and git update-index. In addition, note that it was shelling out to call Git's commands as external binaries.
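
To make those steps concrete, here's a rough Python sketch of that per-commit logic, shelling out to the same kinds of plumbing commands. This is not the original Ruby script and not a working filter: it skips the "reset the index to the rewritten parent" step and root-commit handling, hard-codes a 100644 file mode, and the transform_source() hook is a made-up placeholder.

import os
import subprocess

def git(*args, data=None):
    # Run a git plumbing command and return its stdout as bytes
    return subprocess.run(["git", *args], input=data, stdout=subprocess.PIPE, check=True).stdout

def transform_source(path, source):
    # Placeholder for the actual codemod/formatting step
    return source

commit = os.environ["GIT_COMMIT"]   # set by git filter-branch for each commit

# Which files did this commit add/change/delete relative to its first parent?
diff = git("diff-tree", "-r", "--no-renames", commit + "^", commit).decode()

for line in diff.splitlines():
    meta, path = line.split("\t", 1)
    fields = meta.split()            # [:oldmode, newmode, oldblob, newblob, status]
    status, new_blob = fields[4], fields[3]
    if status == "D":
        git("update-index", "--force-remove", path)
    elif path.endswith((".js", ".jsx", ".py")):
        # Transform the file contents and write the new blob into the object database
        source = git("cat-file", "blob", new_blob)
        blob_id = git("hash-object", "-w", "--stdin", data=transform_source(path, source))
        git("update-index", "--add", "--cacheinfo", "100644,%s,%s" % (blob_id.strip().decode(), path))
    else:
        # Any other changed file just gets staged as-is
        git("update-index", "--add", "--cacheinfo", "100644,%s,%s" % (new_blob, path))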

Speeding Up the Filtering Process: Iterating Commits in Python 🔗︎

I started by trying to port some of the Ruby script's logic to Python. My first attempt was just to run the same Git commands, capture the list of changed files, filter them based on the source paths of the JS and Python files I was interested in, and print those. I used the great plumbum toolkit to let me easily call external binaries from Python.

When I tried running git filter-branch --index-filter myScript.py, it worked. But, Git estimated that it would take upwards of 16 hours just to iterate through 15K commits and print the list of changed files. My guess was that a large part of that had to do with kicking off so many external processes (especially since I was doing this in Git Bash on Windows 10).

I knew that the libgit2 library existed, and that there were Python bindings for libgit2. I figured the process would run faster if I could somehow do all of the Git commands in a single process, using pygit2 to iterate over the history.

I experimented with pygit2 and figured out how to iterate over commits using commands like:

for commit in repo.walk(repo.head.target, GIT_SORT_TOPOLOGICAL | GIT_SORT_REVERSE):
    # do something with each commit

Since pygit2 also has APIs to manipulate the index, I was able to put together a script that replicated the logic from the Ruby script, but all done in-process. I think the initial script I wrote was able to loop over the history and print every JS/Python file that matched my criteria, in about 30 minutes or so. Clearly a huge improvement.
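
The core of that first in-process version looked something like this sketch (the repo path and the list of "interesting" path prefixes here are hypothetical stand-ins):

from pygit2 import Repository, GIT_SORT_TOPOLOGICAL, GIT_SORT_REVERSE

INTERESTING_PREFIXES = ("App1/src/", "App2/src/")    # hypothetical source paths

repo = Repository("path/to/filteredRepo")            # hypothetical repo path

for commit in repo.walk(repo.head.target, GIT_SORT_TOPOLOGICAL | GIT_SORT_REVERSE):
    if commit.parents:
        # Diff this commit's tree against its first parent's tree
        diff = commit.parents[0].tree.diff_to_tree(commit.tree)
    else:
        # Root commit: diff against the empty tree (swap so new_file points at this commit)
        diff = commit.tree.diff_to_tree(swap=True)

    for delta in diff.deltas:
        path = delta.new_file.path
        if delta.status_char() == "D":
            continue
        if path.startswith(INTERESTING_PREFIXES) and path.endswith((".js", ".jsx", ".py")):
            blob = repo[delta.new_file.id]            # the file's contents at this commit
            print(commit.short_id, path, len(blob.data), "bytes")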

Speeding Up the Filtering Process: Using pylter-branch 🔗︎

I was curious if anyone else had done something like this already. I dug through Github, and came across sergelevin/pylter-branch. Happily, it already did what I wanted in terms of iterating over commits, doing some kind of rewrite step, and saving the results, and provided a base "repo transformation" class that could be subclassed to define the actual transformation step.

I switched over to using that, and it actually seemed to iterate a bit faster.

Optimizing the Transformation Process 🔗︎

My original plan for the rewrite was to run these steps for each commit:

  • Filter the list of added/changed files for the JS and Python files I was interested in
  • Write each original file blob to disk in a temp folder, with names like ABCD1234_actualFilename.js
  • Run all of the JS codemods and Python formatting on all files in that folder
  • Write the changed files to the Git index and commit

However, I knew that all of the file access and external commands would slow things down, so I began trying to find ways to optimize this process.

Speeding Up JS Transforms: Combining Transforms 🔗︎

I had six different JS codemods I wanted to run. Five of them required jscodeshift, the other required babel-codemod. Originally, this would have required six separate external processes being kicked off for every commit.

A comment in the jscodeshift repo pointed out that you could write a custom transform that just imported the others and called them each in sequence. Using that idea, I was able to run all five jscodeshift transforms in one step.

That left the one Babel plugin-based transform as the outlier. There was a jscodeshift issue discussing how to use Babel plugins as transforms, and I was able to adapt Henry's example to run my plugin inside of jscodeshift. That meant all six transforms could run in a single step.

I ultimately copied the transform files locally so I didn't have to try installing them as separate dependencies.

Speeding Up Python Formatting 🔗︎

That cut down on a lot of the external processes, but I was still writing files to disk for every commit. Since my script was written in Python, and the Black formatter is also Python, I realized I could probably just call it directly.

I set up some logic to put all matching Python files into an array that looked like [{"name" : "ABCD1234_someFile.py", "source" : "actual source here"}], and just directly call Black's format_str() function on each source string. That meant I could read the source blobs, format them, and write the formatted blobs back to the repo, all in memory without ever writing any temp files to disk.
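
Conceptually, that in-memory formatting step boils down to something like this sketch. Note that format_str's signature has changed across Black versions (newer releases take a mode argument), and InvalidInput is raised for source Black can't parse:

import black

def format_python_entries(entries):
    # entries look like [{"name": "ABCD1234_someFile.py", "source": "..."}]
    results = []
    for entry in entries:
        try:
            formatted = black.format_str(entry["source"], mode=black.FileMode())
        except black.InvalidInput:
            formatted = entry["source"]   # leave unparseable source as-is
        results.append(dict(entry, source=formatted))
    return results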

Speeding Up JS Transforms: Creating a Persistent JS Transform Server 🔗︎

Since the JS transforms were all done using tools written in JS, calling that code directly wasn't an option. To solve that, I threw together a tiny Express server that accepted the same kind of file sources array as a POST, directly called the jscodeshift and prettier APIs to transform each file in memory, and returned the transformed array. This meant I didn't have to have any temp files written to disk. It also meant there were no other external processes starting up for every commit. All I had to do was start the JS transform server, and kick off the commit iteration script.
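
On the Python side, each batch of JS files was just an HTTP round trip to that server; something like this sketch (the port and the /transform endpoint are made up for illustration):

import requests

def transform_js_entries(entries):
    # entries: [{"name": "ABCD1234_someFile.js", "source": "..."}]
    response = requests.post("http://localhost:3000/transform", json=entries, timeout=600)
    response.raise_for_status()
    return response.json()   # same shape back, with transformed "source" strings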

I later realized I could speed up things even further by parallelizing the JS transforms step to handle multiple files at once. The workerpool library made that trivial to add. I set up the pool of workers, and for every request, mapped the array to transform calls that returned promises, and did await Promise.all(fileTransforms). This was another great improvement.

Handling Transform Problems 🔗︎

I ran into a bunch of problems along the way. Here are some of the issues I hit and the solutions I settled on.

Python String Conversions 🔗︎

Both Black and pylter-branch required Python 3.6, so I was using that. pygit2 reads blobs as bytes, but some of the work I needed to do required strings. I also found a few files that had some kind of a "non-breaking space" character instead of actual ASCII spaces. Fixing this required doing some unicode normalization:

from unicodedata import normalize

def normalizeEntry(fileEntry):
    # Decode the blob bytes as UTF-8 (ignoring undecodable bytes) and normalize away non-breaking spaces
    newText = normalize("NFKD", str(fileEntry["source"], 'utf-8', 'ignore'))
    return update(fileEntry, {"source": newText})

Handling JS Syntax Errors 🔗︎

Turns out our team had made numerous commits over the years with invalid JS syntax. This included things like missing closing curly braces, extra parentheses, wrong variable names, Git conflict markers, and more. Because the JS transforms are all parser-based, both jscodeshift and prettier could throw errors if they ran across bad syntax, causing the transforms for that file to fail.

I first had to know which files were broken, at which commits. I added error handling to the JS transform server to catch any errors, return the original (untransformed) source, and continue onwards. I then added handling to write both the current string contents and the error message to disk, like:

- /js-transform-errors
    - /some-commit-id
        - ABCD1234_someFile.js    // bad source file contents
        - ABCD1234_someFile.json  // serialized error message

I'd run the conversion script for a while, let it write a bunch of errors, then kill it and review them.

I wound up writing dozens of hand-tested regular expressions to try to fix those files whenever they came up. Using an interactive Python interpreter (in my case, Dreampie), I'd read the original blob into a string, fiddle with regexes until I got something that matched the problematic source, create a substitution that fixed the issue, and then paste the search regex and the replacement into a search table in my conversion script. The conversion logic would then check to see if any file in a given commit matched the bad file paths, and run each provided regex in sequence on the source in memory. This would fix up the source to be syntactically valid before it was sent to the JS transform server, allowing it to be transformed correctly.
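
The fix table itself was conceptually simple; something like this sketch, where the paths and patterns are made-up stand-ins rather than the real fixes:

import re

# Maps a problematic file path to an ordered list of (pattern, replacement) pairs
SYNTAX_FIXES = {
    "App1/src/views/someView.js": [
        (re.compile(r"\}\)\)\s*;\s*$", re.M), "});"),                     # stray extra paren
        (re.compile(r"^(<<<<<<<|=======|>>>>>>>).*$\n?", re.M), ""),      # leftover conflict markers
    ],
}

def apply_syntax_fixes(path, source):
    for pattern, replacement in SYNTAX_FIXES.get(path, []):
        source = pattern.sub(replacement, source)
    return source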

I eventually gave up on fixing every last issue. So, a few files in the history wound up with commit sequences like:

FIXED_SOURCE_C    // ES6 syntax
BROKEN_SOURCE_B   // original AMD syntax
FIXED_SOURCE_A    // ES6 syntax

Fortunately, most of those were far enough back in the history to not really cause issues with the blames.

This was probably the most annoying part of the whole task.

Mistaken Optimization of ES6 Files 🔗︎

All of the source files for App2 were already ES6 modules, and about 20% of App1's files were also ES6. Five of the six codemods were about converting AMD/ES5 to ES6, so I figured I could speed things up by not running those on files that were already ES6.

I modified the JS transform server to accept a {formatOnly : true} flag in the file entries, in which case it would skip the transforms and just run Prettier on the source. That was fine.

I then tried to have the Python script detect which files were already ES6, and did so by checking to see if the strings "import " or "export" were in the JS source. That turned out to be a mistake, as we had a bunch of AMD files with those words in comments or source already. I did one conversion run that I thought would be "final", but realized afterwards that many files hadn't been transformed at all.

I eventually settled for just checking to see if the file was part of App2's source, and ran the complete transform process on everything in App1.
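
In the end, the check was just a path test, roughly like this (the path prefix and dict keys are placeholders):

def prepare_js_entry(path, source):
    # App2 was ES6 from day one, so it only gets formatted; App1 gets the full codemod pass
    return {"name": path, "source": source, "formatOnly": not path.startswith("App1/src/")}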

Running the Conversion 🔗︎

I did lots of partial and complete trial runs to iron out the issues. The actual final conversion run, with all of the transforms and formatting, took right about 5 hours to complete. That's certainly a huge improvement over the base git filter-branch command, which probably would have taken upwards of 20-some hours. (This proved to be particularly helpful when I realized that I'd screwed up a "final" run with the bad "skip ES6" optimization, and had to re-run the whole process again.)

I had added a bunch of logging to the conversion script, including running values of elapsed time, time per processed commit, and estimated time remaining. This was really useful, and it was extremely satisfying to watch the commit details fly by in the terminal. (Fun side note: a couple hours after I kicked off the actual final conversion run in the background, a popup window informed me that IT had pushed through a forced Windows reboot due to updates. That meant it was a race for the conversion to complete before the reboot, and at one point it was 3.5 hours remaining for the conversion, 3 hours left until the reboot. Fortunately, the conversion sped up considerably about halfway through thanks to smaller files to process.)

However, we were still doing some fixes and work in our existing repo, and had made several commits since I made the clone to start the file filtering process. Those needed to get ported to the new repo, and that turned out to be trickier than I expected.

Attempt #1: Bundles 🔗︎

I've used Git "bundles" before, which let you export a series of commits as a single file. That file can then be transferred to another machine, and used as a source for cloning or pulling commits into another repo.

I had assumed I could export the latest commits into a bundle and pull those directly into the newly-rewritten repo. I was wrong :( Turns out that git bundle always verifies that the target repo already has the parent commit before the first commit in the bundle, and of course since the rewritten repo had a completely different history line, that specific parent commit ID didn't exist in the new repo.

I poked at various forms of pulling, cloning, and banging my head against a wall before giving up on this approach.

Attempt #2: Cherry-Picking 🔗︎

I then figured I could add the original repo as a remote, git fetch the original commits into the new repo, and cherry-pick them over into the new history. Didn't go as I planned.

First, the new repo had to copy over every old blob and commit into itself. Then, when I tried cherry-picking the first "new" commit, it brought along not just the changed files from that commit, but the old versions of every other file in the old commit's file tree.

I probably could have done some kind of surgery to make that work, but I gave up.

Attempt #3: Patch Files 🔗︎

Git was originally intended for use with an email-based workflow, since that's what the Linux kernel team does. It has built-in commands for generating patch files, writing emails with those patches, and applying patches from emails.

I was able to generate a series of patch files with git format-patch COMMIT_BEFORE_FIRST_I_WANT..LAST_COMMIT_I_WANT. I then copied those to the new repo, and tried to apply them.

Git has two related commands. git apply takes a single standalone patch file and tries to update your working copy. git am reads an email-formatted patch file, applies the diff, and then creates a new commit using the metadata that was encoded in the email header.

When I tried to use git am, it failed. The generated patch files have lines like index OLD_BLOB_HASH NEW_BLOB_HASH for each changed file, and those lines caused Git to try to find the old file blob IDs in the new repo. Again, those didn't exist.

I finally resorted to manually deleting those index lines from each patch file, then running git am --reject --ignore-whitespace --committer-date-is-author-date *.patch. Git would try to apply a patch file, and if any hunks failed to apply correctly, write them to disk as SomeFile.js.rej and pause. I could then do manual hand-fixes to match the changes in the patch, and run git am --reject --ignore-whitespace --committer-date-is-author-date --continue, and it would pick up where it left off.

So, patch files aren't ideal, but they at least allow copying over newer changes semi-automatically while reusing the commit metadata. I probably could have found other ways to do this programmatically, but oh well.

Conversion Results 🔗︎

Nuking unneeded files from the repo history knocked the new repo down from 2.15GB as a baseline to "only" 600MB, a savings of 1.5GB. We still unfortunately have a bunch of vendored libs with the current codebase, but that's still a significant improvement. (Yeah, yeah, we'll look at maybe using something like Git-LFS down the road.)

The JS transformation process worked out exactly as I'd hoped. All of the versions of the JS files in our history were transformed and formatted, as if they'd originally been written using ES6 module syntax and consistent formatting from day 1. This meant that each commit retained the original metadata and relative diffs to its parent.

As an example, App1's entry point was initially a global script, but a couple months into development I converted it into an AMD module. The original diff looked like:

+define(["a", "b"],
+function(a, b) {

// Actual app setup code here

+   return {
+       export1 : variable1,
+       export2 : function2
+   }
+});

Afterwards, the equivalent commit was still by me, still on the same day, but the diff looked like:

+import a from "a";
+import b from "b";

// Actual app setup code here

+export default { export1: variable1, export2: function2 };

I did have to do a few hand-fix commits after the conversion process was done, mostly around our mixed use of AMD/ES6 modules (like removing any use of SomeDefaultImport.default). Once I did that, the code ran just fine.

Final Thoughts 🔗︎

As I said, that was incredibly technical, but hopefully it was informative.

I had a fairly good grasp on how Git worked and how it stored data before this task began, but this really solidified my understanding.

I suspect there may have been some other approaches I could have used, particularly rewriting each file blob in parallel. I'm pretty happy with how this worked out, though.

The sanitized conversion scripts are available on Github. If you've got questions, leave a comment or ping me @acemarke on Twitter or Reactiflux.

Further Information 🔗︎

  • Git internal data structures: blobs, trees, commits, and hashes
  • Git history rewriting
  • JS Codemods
  • Git Commit Transfers

