
Note to Jason


I sent the following note to my research advisor this afternoon. As context, I'd previously taken a roughly month-long hiatus from research.

So for the better part of the summer and this semester, I've been focused on trying to get Evolver to perform competitively on standard datasets like WMT-14. I was initially bullish on this direction after getting decent results on small/toy datasets. There are many parts of our design that seem promising: the deep-shallow encoder-decoder, deep encodings carried between diffusion steps, etc. This led me to spend a lot of time trying different datasets, engineering tweaks, training schemes, and so on.

After taking time off and coming back to this with fresh eyes, I realize I might have been hill-climbing around a local optimum rather than thinking about the more fundamental aspects of the design. I'm reminded of Plato's allegory of the cave 😊.

For example, I've always had an issue with how Evolver relies on a token-level alignment between adjacent sequences in the diffusion trajectory. This seemed reasonable initially (it's similar to other edit-based models like LevT), but the assumption has permeated the entire project. The project started with iterative summarization provided by a black-box LLM, but training then required inferring alignments through heuristics or MCEM. That pushed me towards handwritten noising functions or parse trees that gave easy alignments (though I did like how this let the model "sketch out" subsequences by generating nonterminals to fill in later).

The INS/CPY/SUB parametrization has also felt too complex and "hard" to learn, though I understand the rebuttal that the neural model should just "figure it out." It's possible we just haven't found the best training distribution to "teach" these actions well, or that the more complex ideas from our meeting notes could work. But I'm starting to think we need something simpler and easier to train, rather than the other way around (we discussed this at some point and you called it the "curse of intelligence," if I recall correctly).
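
To make that concrete, here's roughly the kind of edit program I mean. This is an illustrative sketch only; the op names and helper are placeholders, not our actual Evolver code:

```python
# Illustrative sketch of an INS/CPY/SUB-style edit program (placeholder code,
# not the real Evolver implementation).
from enum import Enum

class Op(Enum):
    INS = "insert"      # emit a brand-new token
    CPY = "copy"        # copy the next source token unchanged
    SUB = "substitute"  # replace the next source token with a new one

def apply_edits(source, program):
    """Apply a list of (op, token) actions to `source`, producing the next
    sequence in the trajectory. CPY and SUB each consume one source token."""
    out, i = [], 0
    for op, tok in program:
        if op is Op.INS:
            out.append(tok)
        elif op is Op.CPY:
            out.append(source[i]); i += 1
        else:  # Op.SUB
            out.append(tok); i += 1
    return out

# Refining a coarse sketch (with nonterminals) into a finer sequence:
coarse = ["NP", "ate", "NP"]
program = [(Op.SUB, "the"), (Op.INS, "cat"), (Op.CPY, None),
           (Op.SUB, "the"), (Op.INS, "fish")]
print(apply_edits(coarse, program))  # ['the', 'cat', 'ate', 'the', 'fish']
```

The catch is that supervising these actions means knowing which tokens in one sequence correspond to which tokens in the next, which is exactly the alignment problem above.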

I was reading this tweet recently and was reminded that the MT community spent years on "hard alignment" designs, but then Transformers came in and blew everything up because they could just learn "soft alignments" from data. I'm probably relearning old lessons here, but it highlights why parts of our design feel a bit awkward/inelegant to me.

I don't doubt that we could get some version of this model to work with the right dataset, noising model, and design choices. But I also think we've diverged quite a bit from the initial motivations we had for this project back in the spring. I recall being much more interested in how we could design a model that does coarse-to-fine generation for language, in the spirit of the "next-scale prediction" paper we read at the last lab meeting, before getting bogged down in countless details. I've been spending time reading and just trying to look at things from first principles again.

Here's my pitch for a slightly new direction: in short, we want to build a model that generates sequences not just left-to-right, but also through multiple passes "bottom-up," moving from abstract to specific, coarse to fine, a la denoising diffusion models. This naturally connects to the current popular work on reasoning, chain-of-thought, test-time compute, etc., but there are other obvious applications like creative writing (outline -> draft), planning/scheduling (goals -> steps), and code generation (skeleton -> implementation). I'm going to focus on reasoning here because it's very relevant with all the o1 model stuff and related work going on right now; lots of datasets and ideas are floating around.
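
To make the pitch a little less hand-wavy, the generation loop I'm imagining looks something like the sketch below; `refine_step` is a stand-in for whatever model we end up building, not an existing component:

```python
# Rough sketch of coarse-to-fine generation over multiple bottom-up passes.
# `refine_step` is a placeholder for the model: given the prompt, the previous
# draft, and embeddings carried over from the last pass, it emits a more
# detailed draft. Each pass regenerates the whole draft, so no token-level
# alignment between adjacent sequences is needed.

def generate(prompt, refine_step, num_passes=4):
    draft = []      # pass 0: empty / most abstract sketch
    state = None    # deep embeddings threaded between passes
    for level in range(num_passes):
        draft, state = refine_step(prompt, draft, state, level)
    return draft    # final, most fine-grained sequence
```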

Current approaches to reasoning just go left-to-right. I think there's some intuition that these models implicitly learn a propose -> reflect -> refine loop, or at least some kind of iterative structure; otherwise I don't see how performance would actually end up scaling with test-time compute. We could benefit from making this process explicit in the architecture. A coherent diffusion trajectory also gives us nice properties: we could do things like process-level supervision during training, and at test time we get all the usual diffusion model guidance tricks.

The bottom-up structure also lets us build up progressively deeper embeddings, like we had in the Evolver. You mentioned before that this kind of falls out naturally from doing iterative refinement, but I think it could be a real strength. The new continuous chain-of-thought paper has a very similar idea going left-to-right. There might even be an interesting way to trade off interpretability for expressiveness by not decoding tokens at each step and instead feeding the top-level embeddings directly into the next step.

Architecture-wise, I think this would look very similar to the Evolver (deep-shallow structure, deep embeddings), but probably moving away from the INS/CPY/SUB idea. Not having to worry about alignments gives us a bit more flexibility. I imagine training would look something like 1) initial supervised learning on a small dataset of LLM-generated trajectories (like what I was doing with GPT-3.5 last spring) and 2) bootstrap approaches like STaR, process-level reward, etc.?
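
If it helps, here's the rough shape of the training recipe I have in mind. Everything in this sketch (the method names, `reward_fn`, the trajectory format) is a placeholder rather than a concrete plan:

```python
# Placeholder sketch of the two-phase training recipe; no real APIs here.

def train(model, llm_trajectories, prompts, reward_fn, bootstrap_rounds=3):
    # Phase 1: supervised learning on LLM-generated coarse-to-fine trajectories,
    # with a loss on every intermediate step (process-level supervision).
    for trajectory in llm_trajectories:              # [sketch_0, ..., final]
        for prev, nxt in zip(trajectory, trajectory[1:]):
            model.fit_step(prev, nxt)

    # Phase 2: STaR-style bootstrapping. Sample trajectories from the model,
    # keep the ones a verifier/reward accepts, and fine-tune on those.
    for _ in range(bootstrap_rounds):
        sampled = [model.sample_trajectory(p) for p in prompts]
        kept = [t for t in sampled if reward_fn(t[-1])]
        for trajectory in kept:
            for prev, nxt in zip(trajectory, trajectory[1:]):
                model.fit_step(prev, nxt)
```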