Description:

The fundamental principles underlying learning and intelligent systems have yet to be identified. What makes our world and its data inherently learnable? How do natural or artificial brains learn? Physicists are well positioned to address these questions. They seek fundamental understanding and construct effective models without being bound by the strictures of mathematical rigor or the need for state-of-the-art engineering performance. This mindset, recognized by the 2024 Nobel Prize in Physics, is what is needed to uncover the fundamental principles of learning.

Group and synergies:

Professors Brice Ménard, Matthieu Wyart, and Jared Kaplan, together with their team members, are pushing the boundaries of AI theory. They are interested in questions such as:

  • How is learning encoded in neural representations? Are they universal?
  • How does performance scale with size? 
  • How do surprising capabilities of AI systems emerge? 

Answering these questions is crucial for building a comprehensive and unifying theory of neural learning and computation. 

The group will expand significantly in the next couple of years, with the addition of several new faculty members and their respective groups. Research on the physics of learning is highly synergistic: we interact strongly with colleagues in the departments of cognitive science, neuroscience, computer science, and applied mathematics and statistics.

Other faculty members using AI in their research include Alex Szalay, Ben Wandelt, Petar Maksimovic, Yi Li, and Tyrel McQueen.

Students and postdocs: APPLY NOW!

  • PhD in the physics of learning: undergraduate students can apply now to join the first cohort of graduate students working on the physics of learning, starting in Fall 2025. [link]
  • Postdoctoral positions: several postdoc positions and fellowships are available. Applications require a CV, a research statement, and three letters of recommendation by January 6, 2025. [link]

Research highlights:

Opening the black box of neural networks (B. Ménard)

What do neural networks learn? Do different networks learn to perform a task in the same way? What can we say about the learned encoding? We explore the universality of neural encodings in convolutional neural networks trained on image classification tasks. Our results reveal the existence of a universal neural encoding for natural images. They explain, at a more fundamental level, the success of transfer learning and the origin of foundation models. In collaboration with neuroscientists, we have shown that signatures of universality in learning are also found in the visual cortex of humans and mice.
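
As an illustration of how one can test whether two independently trained networks share an encoding, here is a minimal sketch using linear centered kernel alignment (CKA), a standard representational-similarity measure; it is shown for illustration only and is not necessarily the analysis used in [1,2]. The activation matrices below are random stand-ins for features extracted at the same layer depth of two trained networks.

    import numpy as np

    def linear_cka(x, y):
        """Linear centered kernel alignment between two activation matrices
        of shape (n_stimuli, n_features); returns a similarity in [0, 1]."""
        x = x - x.mean(axis=0)   # center each feature
        y = y - y.mean(axis=0)
        hsic = np.linalg.norm(y.T @ x, "fro") ** 2
        return hsic / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))

    rng = np.random.default_rng(0)
    feats_a = rng.normal(size=(2000, 64))             # stand-in for activations of network A
    q, _ = np.linalg.qr(rng.normal(size=(64, 64)))    # a random rotation of feature space
    feats_b = feats_a @ q                             # network B: same code in a rotated basis
    print(linear_cka(feats_a, feats_b))               # ~1: identical encodings up to a rotation
    print(linear_cka(feats_a, rng.normal(size=(2000, 64))))  # small: unrelated encodings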

[figure]

Caption: This figure compares two six-layer networks trained on ImageNet. Our analysis shows that, despite different random initializations, the two networks learn similar features.

[1] On the universality of encodings in CNNs
F. Guth, B. Ménard, https://arxiv.org/abs/2409.19460
[2] Universal scale-free representations in human visual cortex
Raj Magesh Gauthaman, Brice Ménard, Michael F. Bonner, https://arxiv.org/abs/2409.06843


Creativity by compositionality in machine learning (M. Wyart)

Generative models, such as large language models and diffusion models, manage to learn high-dimensional distributions. This feat is generically impossible unless the data are highly structured. What is the nature of this structure? In a sequence of works, the Wyart group has shown that if data hierarchically compose features at different levels (as illustrated in panel A of the figure below), then deep networks (but not shallow ones) can learn a classification task with a number of training examples that is polynomial, rather than exponential, in the dimension of the problem [1]. The same holds true for transformers that learn the data distribution by training on next-token prediction [2]. An intriguing consequence of this viewpoint is the existence of a phase transition in diffusion models at a characteristic noise level [3,4]: in forward-backward experiments (see panel B), only low-level features of an image change below this noise level, while its class changes above it. Overall, this analysis supports the view that the success of generative models lies in their ability to compose a new whole from previously observed low-level elements, an ability relevant to tasks ranging from reasoning to the composition of a new image.
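
To make "hierarchically composed" concrete, here is a minimal sketch, in the spirit of the Random Hierarchy Model of [1], of how such data can be generated: a class label is expanded level by level through randomly chosen production rules until a string of low-level features is produced. The vocabulary size, branching factor, number of productions per symbol, and depth below are illustrative choices, not those of the paper.

    import random

    V = 8        # vocabulary size at every level (illustrative)
    S = 2        # branching factor: each symbol expands into S lower-level symbols
    M = 3        # number of alternative productions ("synonyms") per symbol
    DEPTH = 4    # number of levels in the hierarchy
    N_CLASSES = 4

    random.seed(0)

    # One fixed random grammar: at each level, every symbol has M possible expansions.
    rules = [
        {sym: [tuple(random.randrange(V) for _ in range(S)) for _ in range(M)]
         for sym in range(V)}
        for _ in range(DEPTH)
    ]

    def sample(label):
        """Expand a class label into a string of S**DEPTH low-level features."""
        sequence = [label]
        for level_rules in rules:
            sequence = [token for sym in sequence
                        for token in random.choice(level_rules[sym])]
        return sequence

    for label in range(N_CLASSES):
        print(label, sample(label))   # one label generates many strings sharing latent structure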

[figure]

Caption: A/ Sketch of the latent structure of hierarchically composed data. B/ Forward-backward experiments in diffusion models, where noise is added to the initial (leftmost) image and then removed. For small noise, low-level elements of the snow leopard (such as eye color) are affected. At larger noise, the theory predicts that the class will change but will still reuse low-level elements of the initial picture: the fox shares the nose, eyes, and ears of the snow leopard. For even larger noise, a butterfly hijacks the leopard spots [3].
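
A forward-backward experiment of the kind shown in panel B can be sketched with an off-the-shelf diffusion model. The snippet below is a generic illustration using the Hugging Face diffusers library, not the exact protocol of [3,4]; the checkpoint name, the inversion time, and the random stand-in for the input image are all illustrative assumptions.

    import torch
    from diffusers import DDPMPipeline

    # Illustrative public checkpoint; any unconditional image diffusion model would do.
    pipe = DDPMPipeline.from_pretrained("google/ddpm-cat-256")
    unet, scheduler = pipe.unet, pipe.scheduler
    scheduler.set_timesteps(1000)

    x0 = torch.randn(1, 3, 256, 256)   # stand-in for a real input image scaled to [-1, 1]
    t_invert = 400                     # inversion time: sets the noise level of the experiment

    # Forward: add noise up to time t_invert.
    noise = torch.randn_like(x0)
    x_t = scheduler.add_noise(x0, noise, torch.tensor([t_invert]))

    # Backward: denoise from t_invert back to 0 and inspect what changed.
    with torch.no_grad():
        for t in scheduler.timesteps:
            if t > t_invert:
                continue
            eps = unet(x_t, t).sample
            x_t = scheduler.step(eps, t, x_t).prev_sample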

[1] How Deep Neural Networks Learn Compositional Data: The Random Hierarchy Model 
Francesco Cagnetta, Leonardo Petrini, Umberto Tomasini, Alessandro Favero, Matthieu Wyart, Physical Review X 14, 031001 (2024)
[2] Towards a theory of how the structure of language is acquired by deep neural networks 
Francesco Cagnetta, Matthieu Wyart, NeurIPS (2024)
[3] A Phase Transition in Diffusion Models Reveals the Hierarchical Nature of Data 
Antonio Sclocchi, Alessandro Favero, Matthieu Wyart, arXiv:2402.16991 (2024)
[4] Probing the Latent Hierarchical Structure of Data via Diffusion Models 
A. Sclocchi, A. Favero, N. I. Levi, M. Wyart, arXiv:2410.13770 (2024)


Neural scaling laws (J. Kaplan)

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details, such as network width or depth, have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model and dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, so optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
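
The functional forms behind these statements, with the approximate exponents reported in [1] below (here N is the number of non-embedding parameters, D the dataset size in tokens, and C_min the optimally allocated compute budget):

    % Power-law forms from Scaling Laws for Neural Language Models (arXiv:2001.08361);
    % the exponents are the approximate values quoted there for transformer language models.
    \begin{align}
      L(N)        &\approx (N_c / N)^{\alpha_N},            & \alpha_N &\approx 0.076 \\
      L(D)        &\approx (D_c / D)^{\alpha_D},            & \alpha_D &\approx 0.095 \\
      L(C_{\min}) &\approx (C_c / C_{\min})^{\alpha_C},     & \alpha_C &\approx 0.050 \\
      L(N, D)     &\approx \big[ (N_c / N)^{\alpha_N / \alpha_D} + D_c / D \big]^{\alpha_D}
    \end{align}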

[figure]

[1] Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei, https://arxiv.org/abs/2001.08361