GitHub Wrapped
I wrote a decent amount of code this year:
I was curious to see a breakdown of all that activity, so I got to work last night with a Jupyter REPL and Claude in hand. Frustratingly, I was only able to recover 1,062 of 1,412 contributions through the API alone, and I wasn’t able to debug a discrepancy of several hundred commits from over the summer. Whatever.
Also, if you’re expecting more pretty pictures like the one above, then you’ll probably be disappointed. This isn’t that type of blog. Without further ado, my 2024 GitHub Wrapped!
By The Numbers
Commits by Repository (>1%)
Repository | Percentage | Type |
---|---|---|
tjbai/evolver | 26.6% | Research |
tjbai/ddm | 11.8% | Research |
tjbai/avlm | 10.5% | Class Project |
tjbai/argo | 9.8% | Research |
tjbai/llmr | 8.6% | Class Project |
tjbai/aoc | 4.1% | Personal |
tjbai/neurstat | 3.5% | Class Project |
tjbai/bstat | 3.3% | Class |
tjbai/ji | 3.1% | Personal |
tjbai/cv | 2.9% | Class |
tjbai/blog | 2.4% | Personal |
tjbai/front | 2.0% | Personal |
tjbai/cogai | 1.8% | Class |
Unsurprisingly, the top of this list is dominated by private research repos (evolver, ddm, argo) along with a couple course projects from this past semester (avlm, llmr). Collectively, these are exclusively language model-related research projects—quite a shift from just over a year ago when I was posting things like PyTorch is Not Pleasant. Sign of the times.
Past that, there’s a steep drop-off, and the rest is a mix of random homework repos (bstat, cogai, cv) and personal websites/productivity tools (blog, front, ji). It’s funny to see my Advent of Code repo (aoc) in 6th place, considering all those commits came from a 2-3 week period at the start of December.
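For the curious, a table like the one above could be produced along these lines. This is a minimal sketch, not the actual notebook code: it assumes each commit has already been reduced to its repository name, and `repo_shares` and the sample data are hypothetical.

```python
from collections import Counter


def repo_shares(commit_repos, min_share=0.01):
    """Return (repo, percent) pairs, largest first, for repos strictly above min_share."""
    counts = Counter(commit_repos)
    total = sum(counts.values())
    if total == 0:
        return []
    rows = []
    for repo, n in counts.most_common():
        pct = 100 * n / total
        if pct > 100 * min_share:  # mirrors the ">1%" cutoff in the heading
            rows.append((repo, round(pct, 1)))
    return rows


# Hypothetical input: one repository name per commit.
commits = ["tjbai/evolver"] * 6 + ["tjbai/ddm"] * 3 + ["tjbai/aoc"] * 1
for repo, pct in repo_shares(commits):
    print(f"{repo} | {pct}% |")
```

The cutoff is applied before rounding, so a repo sitting at exactly 1.0% would be excluded.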
Work Patterns
Commits by Month
I love this chart. My activity steadily ramped up last spring as I got deeper into a project with Jason, briefly dipped during finals in May, then came back during a couple of intense months in Austin where I was juggling both work and research. Coming back to school, I immediately got really burnt out for pretty much the entire fall, exacerbated by a bad case of “signed an offer letter”-itis. I did manage to get my swagger back to close out the year, though, between Advent of Code, final projects, and getting back into research.
Commits by Day
The trend here is really funny to me too. I come into each Monday with a full head of steam, regress to my normal levels of productivity mid-week, then take a couple days of “deserving” rest. Eventually, the Sunday scaries come around to jump start me back into motion.
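Both charts boil down to bucketing commit timestamps. Here is a rough sketch under a couple of assumptions: timestamps arrive as ISO 8601 strings without a trailing `Z`, and they are already in local time (a commit near midnight could otherwise land on the wrong day). `bucket_commits` and the sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def bucket_commits(timestamps):
    """Count commits per calendar month ("YYYY-MM") and per day of the week."""
    by_month, by_weekday = Counter(), Counter()
    for ts in timestamps:
        dt = datetime.fromisoformat(ts)
        by_month[dt.strftime("%Y-%m")] += 1
        by_weekday[WEEKDAYS[dt.weekday()]] += 1  # weekday(): Monday is 0
    return by_month, by_weekday


# Hypothetical timestamps, assumed to be local time.
sample = ["2024-12-02T09:15:00", "2024-12-01T22:40:00", "2024-07-04T13:00:00"]
by_month, by_weekday = bucket_commits(sample)
print(dict(by_month))
print(dict(by_weekday))
```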
Under The Hood
Commits By Programming Language
File Extension | Lines Added | Lines Removed |
---|---|---|
.py | 47,153 | 22,638 |
.sh | 1,762 | 1,084 |
.scala | 1,081 | 223 |
.m | 1,861 | 286 |
.R | 273 | 7 |
In 2024 I wrote a lot of… Python. I remarked to a friend last spring that using this language is as natural as speaking English to me. In the year since, nothing’s really changed, and there really isn’t a better option/lesser evil than Python for all my research work. If we account for all the lost commits and notebook development (.ipynb), I’m probably over 100k lines changed on the year.
I did at least pick up Scala late in the year during Advent of Code for a change of scenery and because it’s the language of choice at my future employer. Plus, I probably wrote around 10,000 lines of Java over the summer at work. The most shocking part is that I committed exactly 0 lines of JavaScript to any of my websites, including a complete rewrite of this one.
At least 90% of the Bash was autogenerated to queue SLURM jobs on various supercomputers and 100% of the Matlab and R code was written against my will for various homework assignments. We are definitely leaving proprietary scientific computing languages in 2024.
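One way to get per-extension totals like these is to sum the tab-separated `added`, `removed`, `path` lines that `git log --numstat` prints for each file in a commit. The sketch below assumes that text has already been captured into a string; `lines_by_extension` is a hypothetical helper, and renamed paths (which git prints with `=>`) are not treated specially.

```python
import os
from collections import defaultdict


def lines_by_extension(numstat_text):
    """Sum added/removed line counts per file extension from numstat-style text."""
    totals = defaultdict(lambda: [0, 0])
    for line in numstat_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # blank separators, commit headers, anything else
        added, removed, path = parts
        if not (added.isdigit() and removed.isdigit()):
            continue  # binary files are reported as "-" instead of counts
        ext = os.path.splitext(path)[1] or None  # no extension -> None
        totals[ext][0] += int(added)
        totals[ext][1] += int(removed)
    return {ext: (a, r) for ext, (a, r) in totals.items()}


# Hypothetical numstat-style input.
sample = "10\t2\tsrc/train.py\n-\t-\tassets/plot.png\n\n5\t0\tMakefile\n3\t1\tsrc/eval.py\n"
for ext, (added, removed) in lines_by_extension(sample).items():
    print(f"{ext} | {added:,} | {removed:,} |")
```

Files without an extension fall into a `None` bucket, and binary files are skipped entirely.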
Commits By Configuration Language
File Extension | Lines Added | Lines Removed |
---|---|---|
.json | 755,096 | 363,210 |
.md | 98,885 | 3,211 |
.yml | 2,502 | 560 |
.toml | 1,124 | 4 |
This comparison isn’t entirely fair, since a sizeable portion of that JSON came from constant data collection during extended experiments—I’m always really paranoid about not being able to reproduce some result.
Past that, I used JSON quite a bit to manage and version control training/model hyperparameters separately from the source code.
Since the end of the summer I’ve transitioned entirely to YAML because of its massively improved ergonomics, like in-line commenting.
All the TOML is a byproduct of recently adopting uv, which uses pyproject.toml, as my package manager of choice.
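To make the "hyperparameters separate from source code" idea concrete, here is a small sketch of loading a JSON config into a typed object. The `TrainConfig` fields and the values are invented for illustration and are not taken from any of the repos above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Hypothetical hyperparameters, for illustration only.
    lr: float
    batch_size: int
    num_layers: int


def load_config(text):
    """Parse a JSON document into a TrainConfig.

    Unknown or missing keys raise TypeError, so a typo in the config
    fails loudly instead of silently falling back to a default.
    """
    return TrainConfig(**json.loads(text))


config = load_config('{"lr": 0.0003, "batch_size": 32, "num_layers": 6}')
print(config)
```

Keeping the config in its own file means a change to a hyperparameter shows up as its own diff, separate from code changes. The same dataclass would work unchanged with a YAML loader in place of `json.loads`.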
The Whole Kitchen Sink (>100 Lines Changed)
File Extension | Lines Added | Lines Removed |
---|---|---|
.json | 755,096 | 363,210 |
.vocab | 572,522 | 36,710 |
.conllu | 590,027 | 0 |
.ipynb | 189,866 | 169,277 |
.jsonl | 249,180 | 77,642 |
.md | 98,885 | 3,211 |
.py | 47,153 | 22,638 |
.html | 400 | 68,489 |
.txt | 24,027 | 15,638 |
.out | 11,872 | 1,797 |
.csv | 4,152 | 4 |
.lock | 3,498 | 370 |
.yml | 2,502 | 560 |
.sh | 1,762 | 1,084 |
.m | 1,861 | 286 |
.log | 1,620 | 1,573 |
.css | 1,252 | 94 |
.scala | 1,081 | 223 |
.toml | 1,124 | 4 |
.gen | 906 | 906 |
.astro | 644 | 126 |
.dat | 744 | 0 |
.R | 273 | 7 |
None | 454 | 175 |
The full picture is a lot less interesting because of the massive amount of data input/output stuff and generated files. This does take me back to some interesting experiments though, like when I was messing around with dependency parses (.conllu) or novel tokenizers (.vocab). I also apparently can’t make up my mind between .log and .out for logging.
Commit Poetry
Most Frequent Tokens
Type | Count |
---|---|
fix | 98 |
add | 87 |
some | 57 |
stuff | 56 |
update | 52 |
config | 43 |
added | 36 |
eval | 35 |
updated | 35 |
init | 34 |
If I had been more consistent with my tenses, then this table would show that I “add” and “added” more than I “fixed,” thus proving that I am a good programmer after all. Apparently, I also did a lot of “stuff” in 2024, which is good to finally know for sure.
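The tense problem is easy to patch after the fact. Below is a minimal sketch of counting commit-message tokens with an optional mapping that folds past-tense forms into their base form. The `BASE_FORM` mapping is hand-written and hypothetical, the sample messages are invented, and the tokenizer is deliberately crude (lowercase runs of letters only).

```python
import re
from collections import Counter

# Hypothetical, hand-written mapping from past tense to base form.
BASE_FORM = {"added": "add", "updated": "update", "fixed": "fix"}


def count_tokens(messages, base_form=None):
    """Count lowercase alphabetic tokens, optionally folding tenses together."""
    base_form = base_form or {}
    counts = Counter()
    for message in messages:
        for token in re.findall(r"[a-z]+", message.lower()):
            counts[base_form.get(token, token)] += 1
    return counts


# Invented commit messages.
messages = ["fix config", "added eval stuff", "add some stuff", "Fixed init"]
print(count_tokens(messages).most_common(3))
print(count_tokens(messages, BASE_FORM).most_common(3))
```

Folding the rows of the table above the same way gives add + added = 123 and update + updated = 87, against 98 for fix.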