TJ's Blog

GitHub Wrapped


I wrote a decent amount of code this year:

[Image: GitHub Contributions]

I was curious to see a breakdown of all that activity, so I got to work last night with a Jupyter REPL and Claude in hand. Frustratingly, I was only able to recover 1062/1412 contributions through the API alone, and wasn’t able to debug a several-hundred-commit discrepancy from over the summer. Whatever.

Also, if you’re expecting more pretty pictures like the one above, then you’ll probably be disappointed. This isn’t that type of blog. Without further ado, my 2024 GitHub Wrapped!

By The Numbers

Commits by Repository (>1%)

| Repository | Percentage | Type |
| --- | ---: | --- |
| tjbai/evolver | 26.6% | Research |
| tjbai/ddm | 11.8% | Research |
| tjbai/avlm | 10.5% | Class Project |
| tjbai/argo | 9.8% | Research |
| tjbai/llmr | 8.6% | Class Project |
| tjbai/aoc | 4.1% | Personal |
| tjbai/neurstat | 3.5% | Class Project |
| tjbai/bstat | 3.3% | Class |
| tjbai/ji | 3.1% | Personal |
| tjbai/cv | 2.9% | Class |
| tjbai/blog | 2.4% | Personal |
| tjbai/front | 2.0% | Personal |
| tjbai/cogai | 1.8% | Class |

Unsurprisingly, the top of this list is dominated by private research repos (evolver, ddm, argo) along with a couple course projects from this past semester (avlm, llmr). Collectively, these are exclusively language model-related research projects—quite a shift from just over a year ago when I was posting things like PyTorch is Not Pleasant. Sign of the times.

Past that, there’s a steep drop-off, and the rest is a mix of random homework repos (bstat, cogai, cv) and personal websites/productivity tools (blog, front, ji). It’s funny to see my Advent of Code repo (aoc) in 6th place, considering all those commits came from a 2-3 week period at the start of December.
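For anyone who wants to build a similar table, here is a minimal sketch of the percentage calculation. It is not the notebook code behind this post: the function name `repo_shares`, the `min_pct` parameter, and the sample data are all hypothetical, and it assumes you already have one repository name per recovered commit.

```python
from collections import Counter

def repo_shares(commit_repos, min_pct=1.0):
    """Return (repo, percent) pairs whose share exceeds min_pct, largest first."""
    counts = Counter(commit_repos)
    total = sum(counts.values())
    # most_common() already sorts by count, so the output is in rank order.
    shares = [(repo, 100 * n / total) for repo, n in counts.most_common()]
    return [(repo, round(pct, 1)) for repo, pct in shares if pct > min_pct]

# Hypothetical input: one repository name per commit.
sample = ["tjbai/evolver"] * 6 + ["tjbai/ddm"] * 3 + ["tjbai/aoc"] * 1

for repo, pct in repo_shares(sample):
    print(f"{repo:<16}{pct:>6}%")
```

The `min_pct` cutoff plays the role of the ">1%" filter in the heading above, so the long tail of tiny repos drops out of the table.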

Work Patterns

Commits by Month

[Chart: Monthly Breakdown]

I love this chart. My activity steadily ramped up last spring as I got deeper into a project with Jason, briefly dipped during finals in May, then came back during a couple of intense months in Austin where I was juggling both work and research. Coming back to school, I immediately got really burnt out for pretty much the entire fall, exacerbated by a bad case of “signed an offer letter”-itis. I did manage to get my swagger back to close out the year, though, between Advent of Code, final projects, and getting back into research.

Commits by Day

[Chart: Daily Breakdown]

The trend here is really funny to me too. I come into each Monday with a full head of steam, regress to my normal levels of productivity mid-week, then take a couple days of “deserving” rest. Eventually, the Sunday scaries come around to jump-start me back into motion.
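The monthly and daily breakdowns both boil down to bucketing commit timestamps. Below is a minimal sketch of that bucketing, assuming each commit comes with an ISO 8601 timestamp string; the function names and sample timestamps are made up for illustration, and this is not necessarily how the charts above were produced.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def commits_by_weekday(timestamps):
    """Count commits per weekday, keeping days with zero commits."""
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    counts = Counter(datetime.fromisoformat(ts).weekday() for ts in timestamps)
    return {name: counts.get(i, 0) for i, name in enumerate(WEEKDAYS)}

def commits_by_month(timestamps):
    """Count commits per calendar month, keyed as YYYY-MM in sorted order."""
    counts = Counter(datetime.fromisoformat(ts).strftime("%Y-%m") for ts in timestamps)
    return dict(sorted(counts.items()))

# Hypothetical timestamps with an explicit UTC offset.
sample_timestamps = [
    "2024-06-17T10:00:00+00:00",
    "2024-12-02T09:15:00+00:00",
    "2024-12-02T22:40:00+00:00",
    "2024-12-04T13:05:00+00:00",
    "2024-12-08T20:30:00+00:00",
]

print(commits_by_weekday(sample_timestamps))
print(commits_by_month(sample_timestamps))
```

One caveat: the weekday is taken in whatever offset the timestamp carries, so a late-night Sunday commit recorded in UTC can land on Monday. Converting to local time first would give a more honest picture of the Sunday scaries.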

Under The Hood

Commits By Programming Language

| File Extension | Lines Added | Lines Removed |
| --- | ---: | ---: |
| .py | 47,153 | 22,638 |
| .sh | 1,762 | 1,084 |
| .scala | 1,081 | 223 |
| .m | 1,861 | 286 |
| .R | 273 | 7 |

In 2024 I wrote a lot of… Python. I remarked to a friend last spring that using this language is as natural as speaking English to me. In the year since, nothing’s really changed, and there really isn’t a better option (or lesser evil) for all my research work. If we account for all the lost commits and notebook development (.ipynb), I’m probably over 100k lines changed on the year.

I did at least pick up Scala late in the year during Advent of Code for a change of scenery and because it’s the language of choice at my future employer. Plus, I probably wrote around 10,000 lines of Java over the summer at work. The most shocking part is that I committed exactly 0 lines of JavaScript to any of my websites, including a complete rewrite of this one.

At least 90% of the Bash was autogenerated to queue SLURM jobs on various supercomputers and 100% of the MATLAB and R code was written against my will for various homework assignments. We are definitely leaving proprietary scientific computing languages in 2024.

Commits By Configuration Language

| File Extension | Lines Added | Lines Removed |
| --- | ---: | ---: |
| .json | 755,096 | 363,210 |
| .md | 98,885 | 3,211 |
| .yml | 2,502 | 560 |
| .toml | 1,124 | 4 |

This comparison isn’t entirely fair, since a sizeable portion of that JSON came from constant data collection during extended experiments—I’m always really paranoid about not being able to reproduce some result. Past that, I used JSON quite a bit to manage and version control training/model hyperparameters separately from the source code. Since the end of the summer I’ve transitioned entirely to YAML because of its massively improved ergonomics, like in-line commenting. All the TOML is a byproduct of recently adopting uv, which uses pyproject.toml, as my package manager of choice.

The Whole Kitchen Sink (>100 Lines Changed)

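| File Extension | Lines Added | Lines Removed |
| --- | ---: | ---: |
| .json | 755,096 | 363,210 |
| .vocab | 572,522 | 36,710 |
| .conllu | 590,027 | 0 |
| .ipynb | 189,866 | 169,277 |
| .jsonl | 249,180 | 77,642 |
| .md | 98,885 | 3,211 |
| .py | 47,153 | 22,638 |
| .html | 400 | 68,489 |
| .txt | 24,027 | 15,638 |
| .out | 11,872 | 1,797 |
| .csv | 4,152 | 4 |
| .lock | 3,498 | 370 |
| .yml | 2,502 | 560 |
| .sh | 1,762 | 1,084 |
| .m | 1,861 | 286 |
| .log | 1,620 | 1,573 |
| .css | 1,252 | 94 |
| .scala | 1,081 | 223 |
| .toml | 1,124 | 4 |
| .gen | 906 | 906 |
| .astro | 644 | 126 |
| .dat | 744 | 0 |
| .R | 273 | 7 |
| None | 454 | 175 |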

The full picture is a lot less interesting because of the massive amount of data input/output stuff and generated files. This does take me back to some interesting experiments though, like when I was messing around with dependency parses (.conllu) or novel tokenizers (.vocab). I also apparently can’t make up my mind between .log and .out for logging.
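The per-extension tables in this section can be approximated from local clones with `git log --numstat`, which prints one tab-separated `added<TAB>removed<TAB>path` line per changed file. Here is a minimal sketch of the tallying step, assuming you have already captured that output as text; the function name, the `min_changed` parameter, and the sample input are hypothetical, and the numbers above may well have been collected differently (for example, through the API).

```python
import os
from collections import defaultdict

def lines_by_extension(numstat_text, min_changed=100):
    """Sum added/removed lines per file extension from `git log --numstat` text."""
    totals = defaultdict(lambda: [0, 0])
    for line in numstat_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # commit headers and blank lines
        added, removed, path = parts
        if not (added.isdigit() and removed.isdigit()):
            continue  # binary files are reported as "-" instead of counts
        # Files without an extension (e.g. Makefile) are grouped under None.
        ext = os.path.splitext(path)[1] or None
        totals[ext][0] += int(added)
        totals[ext][1] += int(removed)
    rows = [(ext, a, r) for ext, (a, r) in totals.items() if a + r > min_changed]
    return sorted(rows, key=lambda row: row[1] + row[2], reverse=True)

# Hypothetical numstat output for illustration.
sample_numstat = "\n".join([
    "120\t30\tsrc/train.py",
    "40\t5\tsrc/eval.py",
    "-\t-\tfigures/loss.png",
    "900\t0\tdata/runs.json",
    "3\t1\tMakefile",
])

for ext, added, removed in lines_by_extension(sample_numstat):
    print(ext, added, removed)
```

This sketch sorts by total lines changed and does not try to untangle rename entries (paths containing `=>`), so treat its output as a rough tally rather than an exact accounting.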

Commit Poetry

Most Frequent Tokens

| Type | Count |
| --- | ---: |
| fix | 98 |
| add | 87 |
| some | 57 |
| stuff | 56 |
| update | 52 |
| config | 43 |
| added | 36 |
| eval | 35 |
| updated | 35 |
| init | 34 |

If I had been more consistent with my tenses, then this table would show that I “add” and “added” more than I “fixed,” thus proving that I am a good programmer after all. Apparently, I also did a lot of “stuff” in 2024, which is good to finally know for sure.
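To make the tense argument concrete: folding the table's counts together gives 87 + 36 = 123 for "add" against 98 for "fix" (plus however many instances of "fixed" fell below the top ten). Below is a minimal sketch of a token counter with that folding built in. The `TENSE_MAP` dictionary, the function name, and the sample messages are hypothetical, and the tokenizer is a plain lowercase word split rather than whatever produced the table above.

```python
import re
from collections import Counter

# Hypothetical mapping that folds past-tense forms into their base verb.
TENSE_MAP = {"added": "add", "fixed": "fix", "updated": "update"}

def token_counts(messages, normalize=False):
    """Count lowercase word tokens across commit messages."""
    counts = Counter()
    for message in messages:
        for token in re.findall(r"[a-z]+", message.lower()):
            if normalize:
                token = TENSE_MAP.get(token, token)
            counts[token] += 1
    return counts

# Hypothetical commit messages.
sample_messages = ["fix eval config", "added some stuff", "add config", "Fixed stuff"]

print(token_counts(sample_messages).most_common(3))
print(token_counts(sample_messages, normalize=True).most_common(3))
```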