Date: Friday, March 27, 2026
Time: 13:30 – 14:30
Location: Heinzel Seminar Room, Office Building West
Speaker: Frederik Kunstner (INRIA – Paris)
Title: Why language models are difficult to train without Adam
Abstract:
Adam is the default optimizer for training language models, because gradient descent is too slow. In this talk, we will try to understand why. We revisit common interpretations of Adam and show why they are insufficient to explain the observed performance gap; instead, we show that Adam fixes a problem arising from text data. In text, a few words are very frequent, but there is also a long tail of infrequent words. We show experimentally that the performance gap is related to this frequency imbalance, and we study a simplified language model in which this phenomenon can be formalized.
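As background for the talk (this sketch is illustrative and not the speaker's code): Adam's per-coordinate normalization means that, at the first step, every coordinate moves by roughly the learning rate regardless of its gradient magnitude, so parameters tied to rare words receive steps comparable to those tied to frequent words.

```python
import numpy as np

# Standard Adam update (Kingma & Ba), shown to illustrate the
# per-coordinate rescaling: a "frequent" coordinate with a large
# gradient and a "rare" coordinate with a tiny gradient get
# comparably sized steps.
def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g           # first-moment estimate
    v = b2 * v + (1 - b2) * g ** 2      # second-moment estimate
    m_hat = m / (1 - b1 ** t)           # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Gradient with a 10^4 ratio between a frequent and a rare coordinate:
w = np.zeros(2)
g = np.array([1.0, 1e-4])
w, m, v = adam_step(w, g, np.zeros(2), np.zeros(2), t=1, lr=0.001)
print(w)  # both coordinates move by about lr = 0.001
```

Plain gradient descent would instead scale each step by the raw gradient, leaving the rare coordinate nearly untouched.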