Fused Linear Cross-Entropy: Why fusing the LM head projection with cross-entropy is the single biggest memory win for training LLMs at long context.
Research notes, implementation details, and technical writeups. Mostly language models, evaluation, training, and systems.