Note to Jason
I sent the following note to my research advisor this afternoon. As context, I'd previously taken a roughly month-long hiatus from research.
So for the better part of the summer and this semester, I've been focused on trying to get Evolver to perform competitively on standard datasets like WMT-14. I was initially bullish on this direction after getting decent results on small/toy datasets. There are many parts of our design that seem promising: deep-shallow encoder-decoder, deep encodings between diffusion steps, etc. This led me to spend a lot of time trying different datasets, engineering tweaks, training schemes, etc.
After taking time off and coming back to this with fresh eyes, I realize I might have been hill-climbing toward a local optimum rather than thinking about the more fundamental aspects of the design. I'm reminded of Plato's allegory of the cave.
For example, I realize that I've always had an issue with how Evolver relies on a token-level alignment between adjacent sequences in the diffusion trajectory. This seemed reasonable initially (being similar to other edit-based models like LevT), but the assumption seems to have permeated the entire project. The project started with iterative-summarization trajectories produced by a black-box LLM, but training required inferring alignments through heuristics or MCEM. This pushed me towards handwritten noising functions or parse trees that gave easy alignments (though I did like how this let the model "sketch out" subsequences by generating nonterminals to fill in later).
The INS/CPY/SUB parametrization has also felt too complex and "hard" to learn, though I understand the rebuttal that the neural model should just "figure it out." It's possible we just haven't found the best training distribution to "teach" these actions well, or that the more complex ideas I have from our meeting notes could work. I'm starting to think we need something simpler and easier to train, rather than the other way around (we discussed this at some point and you referred to it as the "curse of intelligence" if I recall).
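To make the alignment issue concrete, here's a tiny sketch of how I think about the parametrization (the op semantics are simplified and everything here is just for illustration, not our actual code). The point is that an INS/CPY/SUB program between y_t and y_{t+1} is itself a token-level alignment, so supervision requires knowing which op explains each target token:

```python
# Minimal, simplified sketch of an INS/CPY/SUB edit program (illustrative only).
def apply_edits(source, edits):
    """Apply a list of (op, token) edits to `source` and return the new sequence.

    CPY copies the next source token, SUB replaces it, INS inserts a new token
    without consuming anything from the source.
    """
    out, i = [], 0
    for op, tok in edits:
        if op == "CPY":
            out.append(source[i]); i += 1
        elif op == "SUB":
            out.append(tok); i += 1
        elif op == "INS":
            out.append(tok)
        else:
            raise ValueError(f"unknown op: {op}")
    out.extend(source[i:])  # any unconsumed source tokens are copied through
    return out

# y_t -> y_{t+1}: the edit program *is* the alignment.
y_t = ["the", "cat", "sat"]
edits = [("CPY", None), ("SUB", "dog"), ("INS", "quietly"), ("CPY", None)]
print(apply_edits(y_t, edits))  # ['the', 'dog', 'quietly', 'sat']
```

When the trajectory comes from a black-box LLM, that edit program isn't observed, which is exactly where the heuristic alignments and MCEM came from.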
I was reading this tweet recently and was reminded that the MT community spent years on "hard alignment" designs, but then Transformers came in and blew everything up because they could just learn "soft alignments" from data. I'm probably relearning old lessons here, but it highlights why parts of our design feel a bit awkward/inelegant to me.
I don't doubt that we could get some version of this model to work with the right dataset, noising model, and design choices. But I also think we've diverged quite a bit from the initial motivations we had for this project back in the spring. I recall being much more interested in how we could design a model that did coarse-to-fine generation for language, in the spirit of the "next-scale prediction" paper we read at the last lab meeting, before getting bogged down in countless details. I've been spending time reading and just trying to look at things from first principles again.
Here's my pitch for a slightly new direction: In short, we want to build a model that generates sequences not just left-to-right, but also through multiple passes "bottom-up," moving from abstract to specific details, coarse-to-fine, à la denoising diffusion models. This naturally connects to current popular work on reasoning, chain-of-thought, test-time compute, etc., but there are other obvious applications like creative writing (outline -> draft), planning/scheduling (goals -> steps), and code generation (skeleton -> implementation). I'm going to focus on reasoning here because it's very relevant with all the o1 model stuff and related work going on right now, and there are lots of datasets and ideas floating around.
Current approaches to reasoning just go left-to-right. I think there might be some intuition that these models implicitly learn some propose -> reflect -> refine loop, or at least some kind of iterative structure; otherwise I don't see how performance would actually scale with test-time compute. We could benefit from making this process explicit in our architecture. The coherent diffusion trajectory also gives us nice properties: we could do things like process-level supervision during training, and at test time we get all the usual diffusion-model guidance tricks.
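As a strawman of what "making the loop explicit" could look like at inference time, here's a sketch. The `refine` and `score` functions are hypothetical stand-ins for the refinement model and a process-level verifier/guidance signal; the only real content is the control flow: start coarse, refine over several passes, and use the score to pick among candidates at each pass (a crude version of the guidance tricks). More passes means more test-time compute.

```python
# Sketch of an explicit coarse-to-fine inference loop (interfaces are hypothetical).
from typing import Callable, List

def coarse_to_fine(prompt: str,
                   refine: Callable[[str, str], List[str]],   # (prompt, draft) -> candidate next drafts
                   score: Callable[[str, str], float],        # process-level verifier / guidance signal
                   num_passes: int = 4) -> str:
    draft = ""  # pass 0: maximally coarse (empty) draft
    for _ in range(num_passes):
        candidates = refine(prompt, draft)                       # propose refinements of the current draft
        draft = max(candidates, key=lambda c: score(prompt, c))  # guided selection among candidates
    return draft

# Toy stand-ins just to show the control flow:
refine = lambda prompt, draft: [draft + " <detail A>", draft + " <detail B>"]
score = lambda prompt, draft: len(draft)  # placeholder "verifier"
print(coarse_to_fine("prove the claim", refine, score, num_passes=3))
```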
The bottom-up structure also lets us build up progressively deeper embeddings, like we had in the Evolver. You mentioned before that this kind of falls out naturally from doing iterative refinement, but I think it could be a real strength. The new continuous chain-of-thought paper has a very similar idea going left-to-right. There might even be an interesting way to trade off interpretability for expressiveness by not decoding tokens at each step and instead feeding the top-level embeddings directly into the next step.
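Roughly what I have in mind, as a toy PyTorch sketch (module names, sizes, and the argmax read-out are all made up for illustration): a refinement block applied over several passes, where we can either force the trajectory through tokens at every pass (interpretable) or let the embeddings flow straight into the next pass (more expressive, less inspectable).

```python
import torch
import torch.nn as nn

class LatentRefiner(nn.Module):
    """Toy sketch: refine a sequence of embeddings over several passes,
    optionally forcing the trajectory through tokens at each pass."""
    def __init__(self, vocab_size: int = 1000, d_model: int = 256, n_passes: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.refine = nn.TransformerEncoder(layer, num_layers=2)  # shared across passes
        self.decode = nn.Linear(d_model, vocab_size)
        self.n_passes = n_passes

    def forward(self, tokens: torch.Tensor, decode_each_pass: bool = False) -> torch.Tensor:
        h = self.embed(tokens)                     # (batch, seq, d_model)
        for _ in range(self.n_passes):
            h = self.refine(h)                     # one refinement pass in embedding space
            if decode_each_pass:
                # interpretable variant: bottleneck through tokens each pass
                # (argmax is non-differentiable; just illustrating the idea)
                tokens = self.decode(h).argmax(dim=-1)
                h = self.embed(tokens)
            # otherwise the refined embeddings feed straight into the next pass
        return self.decode(h)                      # final token logits

model = LatentRefiner()
logits = model(torch.randint(0, 1000, (2, 16)), decode_each_pass=False)  # (2, 16, 1000)
```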
Architecture-wise, I think this would look very similar to the Evolver (deep-shallow structure, deep embeddings), but would probably move away from the INS/CPY/SUB idea. Not having to worry about alignments gives a bit more flexibility. I imagine training would look something like 1) initial supervised learning on a small dataset of LLM-generated trajectories (like the things I was doing with GPT-3.5 last spring), and then 2) bootstrap approaches like STaR, process-level reward, etc.?
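To make the training plan slightly more concrete, here's how I picture the two stages fitting together, in rough Python (every name and interface here is a placeholder, not a commitment): stage 1 is plain supervised learning on LLM-distilled coarse-to-fine trajectories, and stage 2 is a STaR-style bootstrap where we keep self-generated trajectories whose final draft checks out (possibly filtered or weighted by a process-level reward) and fine-tune on them.

```python
# Rough sketch of the two training stages (all names/interfaces are hypothetical).
from typing import Callable, List, Tuple

Trajectory = List[str]  # a coarse-to-fine sequence of drafts, coarsest first

def stage1_supervised(train_on: Callable[[List[Trajectory]], None],
                      llm_trajectories: List[Trajectory]) -> None:
    """Stage 1: supervised learning on trajectories distilled from an LLM."""
    train_on(llm_trajectories)

def stage2_bootstrap(generate: Callable[[str], Trajectory],
                     is_correct: Callable[[str, str], bool],
                     train_on: Callable[[List[Trajectory]], None],
                     problems: List[Tuple[str, str]],       # (prompt, reference answer) pairs
                     num_rounds: int = 3) -> None:
    """Stage 2: STaR-style bootstrap: keep self-generated trajectories whose
    final draft is correct, fine-tune on them, and repeat."""
    for _ in range(num_rounds):
        keep = []
        for prompt, answer in problems:
            traj = generate(prompt)                # model's own coarse-to-fine attempt
            if is_correct(traj[-1], answer):       # could also weight by a process-level reward
                keep.append(traj)
        if keep:
            train_on(keep)                         # fine-tune on the kept trajectories
```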