Information Geometry in Learning and Optimization - PhD Course - September 22-26, 2014 - University of Copenhagen
Lectures overview
Contents of Lectures by Shun-ichi Amari
I. Introduction to Information Geometry (without prior knowledge of differential geometry)
- Divergence function on a manifold
- Flat divergence and dual affine structures with Riemannian metric derived from it
- Two types of geodesics and orthogonality
- Pythagorean theorem and projection theorem
- Examples of dually flat manifolds: manifolds of probability distributions (exponential families), positive measures and positive-definite matrices
II. Geometrical Structure Derived from Invariance
- Invariance and information monotonicity in manifold of probability distributions
- f-divergence: the unique invariant divergence
- Dual affine connections with Riemannian metric derived from divergence: Tangent space, parallel transports and duality
- Alpha-geometry induced from invariant geometry
- Geodesics, curvatures and dually flat manifolds
- Canonical divergence: KL- and alpha-divergence
III. Applications of Information Geometry to Statistical Inference
- Higher-order asymptotic theory of statistical inference – estimation and hypothesis testing
- Neyman-Scott problem and semiparametric model
- em (EM) algorithm and hidden variables
IV. Applications of Information Geometry to Machine Learning
- Belief propagation and CCCP algorithm in graphical models
- Support vector machine and Riemannian modification of kernels
- Bayesian information geometry and geometry of restricted Boltzmann machine: Towards deep learning
- Natural gradient learning and its dynamics: singular statistical models and manifolds
- Clustering with divergence
- Sparse signal analysis
- Convex optimization
Suggested reading:
Amari, Shun-ichi. Natural gradient works efficiently in learning. Neural Computation 10, 2 (1998): 251-276.
Amari, Shun-ichi, and Hiroshi Nagaoka. Methods of information geometry. Vol. 191. American Mathematical Society, 2007.
Contents of Lectures by Nihat Ay
I. Differential Equations:
- Vector and Covector Fields
- Fisher-Shahshahani Metric, Gradient Fields
- m- and e-Linearity of Differential Equations
II. Applications to Evolution:
- Lotka-Volterra and Replicator Differential Equations
- "Fisher's Fundamental Theorem of Natural Selection"
- The Hypercycle Model of Eigen and Schuster
III. Applications to Learning:
- Information Geometry of Conditional Models
- Amari's Natural Gradient Method
- Information-Geometric Design of Learning Systems
Contents of Lectures by Nikolaus Hansen
I. A short introduction to continuous optimization
II. Continuous optimization using natural gradients
III. The Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
IV. A short introduction to Python (practice session, see also here)
V. A practical approach to continuous optimization using cma.py (practice session)
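The following is a minimal usage sketch of the cma package's standard ask-and-tell interface, as a pointer for the practice session; the objective function, dimension and parameter values are illustrative choices, not taken from the course material.

    # Minimal CMA-ES loop with the cma package (pip install cma).
    import cma

    def sphere(x):
        # Illustrative objective: minimize the squared norm of x.
        return sum(xi * xi for xi in x)

    # Initial mean 5 * [1.0] and initial step size 0.5 (arbitrary choices).
    es = cma.CMAEvolutionStrategy(5 * [1.0], 0.5)
    while not es.stop():
        xs = es.ask()                        # sample candidate solutions
        fs = [sphere(x) for x in xs]         # evaluate them
        es.tell(xs, fs)                      # update the search distribution
    print(min(fs))                           # best fitness in the final population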
Suggested reading:
Hansen, Nikolaus. The CMA Evolution Strategy: A Tutorial. 2011.
Ollivier, Yann, Ludovic Arnold, Anne Auger, and Nikolaus Hansen. Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles. arXiv:1106.3708.
Contents of Lectures by Jan Peters
Suggested reading:
Peters, Jan, and Stefan Schaal. Natural Actor-Critic. Neurocomputing 71, 7-9 (2008): 1180-1190.
Contents of Lectures by Luigi Malagò
Stochastic Optimization in Discrete Domains
I. Stochastic Relaxation of Discrete Optimization Problems
II. Information Geometry of Hierarchical Models
III. Stochastic Natural Gradient Descent
IV. Graphical Models and Model Selection
V. Examples of Natural Gradient-based Algorithms in Stochastic Optimization
For the gradient flow movie click here.
Suggested reading:
Amari, Shun-ichi. Information geometry on hierarchy of probability distributions. IEEE Transactions on Information Theory 47, 5 (2001): 1701-1711.
Contents of Lectures by Aasa Feragen and François Lauze
I. Aasa's lectures
- Recap of Differential Calculus
- Differential manifolds
- Tangent space
- Vector fields
- Submanifolds of R^n
- Riemannian metrics
- Invariance of Fisher information metric
- If time: Metric geometry view of Riemannian manifolds, their curvature and consequences thereof
II. François' lectures
- Riemannian metrics
- Gradient, gradient descent, duality
- Distances
- Connections and Christoffel symbols
- Parallelism
- Levi-Civita Connections
- Geodesics, exponential and log maps
- Fréchet Means and Gradient Descent
Suggested reading:
Costa, Sueli I. R., Sandra A. Santos, and Joao E. Strapasson. Fisher information distance: a geometrical reading. arXiv:1210.2354.
Contents of Tutorial by Stefan Sommer
In the tutorial on numerics for Riemannian geometry on Tuesday morning, we will discuss computational representations and numerical solutions of some differential geometry problems. The goal is to be able to implement geodesic equations numerically for simple probability distributions, to visualize the computed geodesics, to compute Riemannian logarithms, and to find mean distributions. We will follow the presentation in the paper Fisher information distance: a geometrical reading from a computational viewpoint.
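To give an impression of what such an implementation can look like, here is a minimal sketch (not the tutorial's notebook) that integrates the geodesic equation of the univariate Gaussian family under the Fisher information metric; it assumes the (mu, sigma) parametrization and SciPy's solve_ivp, and the initial point and velocity are arbitrary.

    # Geodesics of N(mu, sigma^2) under the Fisher metric g = diag(1/sigma^2, 2/sigma^2).
    import numpy as np
    from scipy.integrate import solve_ivp

    def geodesic_ode(t, state):
        # First-order form of mu'' = 2 mu' sigma' / sigma,
        # sigma'' = (sigma'^2 - mu'^2 / 2) / sigma.
        mu, sigma, dmu, dsigma = state
        return [dmu, dsigma,
                2.0 * dmu * dsigma / sigma,
                (dsigma ** 2 - 0.5 * dmu ** 2) / sigma]

    # Start at N(0, 1) with an arbitrary initial velocity in (mu, sigma).
    sol = solve_ivp(geodesic_ode, (0.0, 1.0), [0.0, 1.0, 1.0, 0.5], dense_output=True)

    ts = np.linspace(0.0, 1.0, 50)
    mu_t, sigma_t = sol.sol(ts)[0], sol.sol(ts)[1]   # geodesic path, ready for plotting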
The tutorial is based on an IPython notebook that is available here. Please click here for details.
Background
Principles of Information Geometry have been successfully applied in all major areas of machine learning, including supervised, unsupervised, and reinforcement learning, as well as in stochastic optimization. Information Geometry comes into play when we consider parametrized probabilistic models (e.g., in the context of stochastic behavioral policies, search distributions, stochastic neural networks, ...) and their adaptation. Technically speaking, in Information Geometry the space of probability distributions that can be represented by a parametrized probabilistic model is described as a manifold, on which the Fisher information metric defines a Riemannian structure. Through the geometry of the Riemannian manifold of distributions, optimization and statistics can be done directly on the space of distributions.
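For reference (a standard definition, added here rather than quoted from the course description), the Fisher information metric of a parametric family p(x; theta) is, in LaTeX notation,

    g_{ij}(\theta) = \mathbb{E}_{p(x;\theta)}\!\left[
        \frac{\partial \log p(x;\theta)}{\partial \theta^{i}}\,
        \frac{\partial \log p(x;\theta)}{\partial \theta^{j}}
    \right]

For the univariate Gaussian N(mu, sigma^2) with theta = (mu, sigma), this gives g(theta) = diag(1/sigma^2, 2/sigma^2).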
Information geometry was founded and pioneered by Shun-ichi Amari in the 1980s, with statistical learning as one of the first applications. Due to the nonlinear nature of the space of distributions, the steepest-ascent direction for adapting a probability distribution parametrized by a set of real-valued parameters (e.g., the mean and the covariance of a Gaussian distribution) is not the ordinary gradient in Euclidean space, but the so-called natural gradient, defined with respect to the Riemannian structure of the space of distributions. The natural gradient is natural in the sense that it renders the adaptation invariant under reparametrization and changes of representation, and it is closely linked to the Kullback-Leibler divergence often used for quantifying the similarity of distributions.
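As a small illustrative sketch (not part of the course material), the following code performs natural-gradient ascent on the average log-likelihood of a univariate Gaussian, using the closed-form Fisher matrix in (mu, log sigma) coordinates; the data, step size and iteration count are arbitrary.

    # Natural-gradient ascent: step direction = F(theta)^{-1} * Euclidean gradient.
    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=2.0, scale=3.0, size=1000)

    def grad_log_likelihood(theta, x):
        # Euclidean gradient of the mean log-likelihood w.r.t. (mu, log sigma).
        mu, log_sigma = theta
        sigma2 = np.exp(2 * log_sigma)
        d_mu = np.mean(x - mu) / sigma2
        d_log_sigma = np.mean((x - mu) ** 2) / sigma2 - 1.0
        return np.array([d_mu, d_log_sigma])

    def fisher(theta):
        # Fisher information of N(mu, sigma^2) in (mu, log sigma) coordinates.
        sigma2 = np.exp(2 * theta[1])
        return np.diag([1.0 / sigma2, 2.0])

    theta = np.array([0.0, 0.0])                    # initial guess: mu = 0, sigma = 1
    for _ in range(100):
        g = grad_log_likelihood(theta, data)
        nat_g = np.linalg.solve(fisher(theta), g)   # natural gradient
        theta = theta + 0.1 * nat_g                 # parametrization-invariant ascent step

    print(theta[0], np.exp(theta[1]))               # approaches the sample mean and std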
The natural gradient for adapting probabilistic models has been used successfully in all major areas of machine learning, from supervised learning of neural networks and independent component analysis to reinforcement learning. This PhD course will, in particular, include lectures on supervised learning, reinforcement learning and stochastic optimization. Reinforcement learning refers to machine learning algorithms that improve their behavior through interaction with the environment, whereas stochastic optimization refers to randomized solution methods for complex optimization problems for which no analytical description is available. In both stochastic optimization and reinforcement learning, (intermediate) solutions are best described by probability distributions: in reinforcement learning, distributions over potential actions to be taken in a given situation; in stochastic optimization, the search distribution describing which candidate solutions to probe next. Thus, both the learning and the optimization process are best described by an iterative update of probability distributions.
Confirmed Speakers
- Shun-ichi Amari, RIKEN Brain Science Institute
- Nihat Ay, Max Planck Institute for Mathematics in the Sciences and Universität Leipzig
- Nikolaus Hansen, Université Paris-Sud and Inria Saclay – Île-de-France
- Jan Peters, Technische Universität Darmstadt and Max Planck Institute for Intelligent Systems
- Luigi Malagò, Shinshu University, Nagano
- Aasa Feragen, University of Copenhagen
- François Lauze, University of Copenhagen
- Stefan Sommer, University of Copenhagen
Scientific content
The course will consist of 5 days of lectures and exercises. In addition, students will be expected to read a pre-defined set of scientific articles on information geometry prior to the course, and write a report on information geometry and its potential use in their own research field after the course. The course will consist of three modules:
- A crash course on Riemannian geometry and numerical tools for applications of Riemannian geometry
- Introduction to Information Geometry and its role in Machine Learning and Stochastic Optimization
- Applications of Information Geometry
Learning goals
After participating in this course, the participant should
- Understand basic differential geometric concepts (manifolds, Riemannian metrics, geodesics, manifold statistics) to the point where they can apply them in their own research;
- Be able to implement basic numerical tools for differential geometric computations;
- Have a strong knowledge of information geometry and its role in machine learning and stochastic optimization;
- Be able to apply information theoretic approaches to machine learning and stochastic optimization in their own research;
- Have a basic knowledge of existing applications of information geometry.
Organizers
- Christian Igel, University of Copenhagen
- Aasa Feragen, University of Copenhagen
Place
Københavns Universitet, Njalsgade 128, Bygning (building) 27, Lokale (room) 27.0.17
The lectures are at the south campus of the University of Copenhagen, very close to the Metro station Islands Brygge. Room 27.0.17 in building 27 is on the ground floor. Click here for a map. See also Google Maps.