Many important challenges in data science can be reduced to the problem of minimising a high-dimensional function, typically a negative log-likelihood function for the parameters of a neural network designed to encode tasks such as classification, clustering, recognition and anomaly detection. In practice, given the vast number of parameters involved (for deep networks this can be millions or more), the available data is often sparse (there may even be fewer data points than parameters) and a full exploration of the parameter space is never possible. The obvious question this raises is how machine learning can ultimately be effective at all.
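As a concrete (and deliberately tiny) illustration of the minimisation problem described above, the sketch below fits a logistic-regression model by plain gradient descent on its negative log-likelihood. The model, synthetic data, step size and iteration count are all illustrative assumptions, not drawn from the text.

```python
import numpy as np

# Minimal sketch: minimising a negative log-likelihood by gradient descent.
# Data, model and hyperparameters below are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # 200 samples, 5 features
w_true = rng.normal(size=5)
y = (X @ w_true + 0.1 * rng.normal(size=200) > 0).astype(float)

def nll(w):
    """Negative log-likelihood of a logistic-regression model."""
    z = X @ w
    return np.sum(np.log1p(np.exp(-z)) + (1.0 - y) * z)

def grad_nll(w):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))     # predicted class probabilities
    return X.T @ (p - y)                   # gradient of the NLL

w = np.zeros(5)
losses = [nll(w)]
for _ in range(500):
    w -= 1e-3 * grad_nll(w)                # plain gradient-descent step
    losses.append(nll(w))
```

In a realistic deep network the same loop runs over millions of parameters, which is precisely why the geometry of the loss landscape matters.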

Most answers to this question focus on the observation that there are large, dense clusters of minimisers with similar properties, so that many alternative parameterisations can be relied upon to obtain a suitable exploration of a loosely formulated “robust ensemble” [1].

Much work is needed to explain the natural geometric structure of landscapes arising in machine learning [2, 3] and to use knowledge of the landscape to derive efficient algorithms.

In close analogy to the problems of data science, the fundamental paradigm for much simulation in chemistry and physics is the energy landscape, which encodes the relative weight of different states, for example the configurations of atoms. Even if the energy function U, and the associated Gibbs-Boltzmann density, are theoretically calculable, the complexity and high dimensionality of the underlying system often make exploration of the most probable states intractable. This challenge has given rise to the Monte Carlo (MC) method [4, 5] and molecular dynamics (MD) [6], as well as a tremendous enterprise of advanced sampling schemes including simulated tempering [7, 8], temperature-accelerated methods [9, 10], replica exchange [11, 12], and many others. The development of molecular sampling methods has gone hand in hand with the design of algorithms for optimisation: the protein folding problem is usually formulated as a global optimisation problem, but optimisation also plays a key role in the refinement of experimental (e.g. spectroscopy) data and the search for structural motifs for new materials. Molecular dynamics-like sampling methods can often be adapted to data science, where, for example, they circumvent problems due to overfitting in parameter inference (see, e.g., the stochastic gradient Langevin dynamics method [14]) or allow exploration of the parameter and input spaces defining a feature specification [15].
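The MC method of [4, 5] can be sketched in a few lines. The example below samples the Gibbs-Boltzmann density exp(-U(x)/kT) for an illustrative one-dimensional double-well potential; the potential, proposal width and temperature are assumptions chosen purely for demonstration.

```python
import numpy as np

# Minimal Metropolis Monte Carlo sketch in the spirit of [4, 5].
# Potential, proposal width and temperature are illustrative assumptions.
rng = np.random.default_rng(1)

def U(x):
    return (x**2 - 1.0)**2        # double well with minima at x = +/-1

kT = 0.3                          # fictitious heat-bath temperature
x = 1.0
samples = []
for _ in range(20000):
    x_new = x + 0.5 * rng.normal()                   # symmetric random-walk proposal
    if rng.random() < np.exp(-(U(x_new) - U(x)) / kT):
        x = x_new                                    # Metropolis accept/reject
    samples.append(x)

samples = np.asarray(samples)
```

At low temperature the chain spends most of its time near the two wells at |x| = 1, and crossings of the barrier at x = 0 become rare events, which is exactly the regime that motivates the advanced sampling schemes cited above.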

Schemes based on molecular “thermostats” [16, 17] provide a flexible and robust alternative for control of the invariant distribution in the presence of stochastically perturbed gradients, e.g. when noise arises from subsampling of the data set.
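A minimal sketch of such a thermostat, in the spirit of the stochastic gradient Nose-Hoover scheme of [16], is given below: an auxiliary variable xi adapts so that the correct kinetic temperature is maintained even though the gradient is corrupted by noise. The target (a standard Gaussian), the artificial gradient noise, and the step size are illustrative assumptions, not the method exactly as published.

```python
import numpy as np

# Sketch of a stochastic gradient Nose-Hoover thermostat in the spirit of [16].
# Target, noise level and step size are illustrative assumptions.
rng = np.random.default_rng(2)

def noisy_grad(theta):
    # gradient of U(theta) = theta^2 / 2 (standard Gaussian target),
    # deliberately perturbed to mimic subsampling noise
    return theta + 0.5 * rng.normal()

h, A = 0.01, 1.0                      # step size and injected-noise amplitude
theta, p, xi = 0.0, 0.0, A
thetas = []
for _ in range(100000):
    p += -xi * p * h - noisy_grad(theta) * h + np.sqrt(2 * A * h) * rng.normal()
    theta += p * h
    xi += (p * p - 1.0) * h           # drive kinetic energy toward kT = 1
    thetas.append(theta)

thetas = np.asarray(thetas[20000:])   # discard burn-in
```

The point of the thermostat variable xi is that the invariant distribution of theta remains (approximately) the target Gaussian even though the gradient noise amplitude is unknown to the sampler.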

Sampling methodologies are typically formulated in terms of the temperature of an ambient fictitious “heat bath” in which the target system is embedded, and in terms of a set of distinguished collective variables or “reaction coordinates” which parameterise the progress of a sampling task. In practice, identification of suitable collective variables becomes the most important task; the problem can then be reduced to calculating barriers in the “free-energy” landscape obtained by integrating out the coordinates transverse to the targeted reaction coordinates. Such a free-energy perspective provides a unified viewpoint which may help to make precise the notion of robust ensemble and allow one to characterise the role of basin geometry (entropic structure) in the rate of progressive exploration of the landscape.
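The "integrating out" step above can be illustrated directly: given samples of a two-dimensional system, marginalising over the transverse coordinate y recovers the free-energy profile F(s) = -kT log p(s) along a chosen collective variable s = x. The sampled density here is an assumed analytic stand-in (a correlated Gaussian), not a molecular system.

```python
import numpy as np

# Illustrative free-energy profile along a collective variable s = x,
# obtained by marginalising samples over the transverse coordinate y.
# The 2D Gaussian density is an assumed stand-in for a molecular system.
rng = np.random.default_rng(3)
kT = 1.0

cov = np.array([[1.0, 0.6],
                [0.6, 2.0]])
xy = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=200000)

# histogram of the collective variable integrates out y
hist, edges = np.histogram(xy[:, 0], bins=60, range=(-4, 4), density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
F = -kT * np.log(hist + 1e-12)        # free-energy profile F(s) = -kT log p(s)
```

For a multimodal system the same construction exposes the free-energy barriers between basins, which govern the rate of progressive exploration discussed above.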

Because the choice of collective variables is formally arbitrary but critical to the efficiency of the reduced description, it is important to focus algorithm development on this task. In molecular sampling, one powerful recent approach to the automatic determination of collective variables is based on diffusion maps, an idea that in fact originated in harmonic analysis and which provides a systematic procedure for manifold learning [18, 19].
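A minimal diffusion-map construction in the spirit of [18] is sketched below: build a Gaussian kernel on the data, normalise it into a Markov transition matrix, and take the leading non-trivial eigenvectors as learned collective variables. The data set (noisy points on a circle) and the kernel bandwidth are illustrative assumptions.

```python
import numpy as np

# Minimal diffusion-map sketch in the spirit of [18].
# Data set and kernel bandwidth are illustrative assumptions.
rng = np.random.default_rng(4)

# noisy samples from a circle: the intrinsic coordinate is the angle
t = rng.uniform(0, 2 * np.pi, size=300)
X = np.column_stack([np.cos(t), np.sin(t)]) + 0.02 * rng.normal(size=(300, 2))

eps = 0.1                                        # kernel bandwidth
d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-d2 / eps)                            # Gaussian kernel matrix
P = K / K.sum(axis=1, keepdims=True)             # row-stochastic Markov matrix

vals, vecs = np.linalg.eig(P)
order = np.argsort(-vals.real)                   # sort by decreasing eigenvalue
vals, vecs = vals.real[order], vecs.real[:, order]
# vals[0] = 1 with a constant eigenvector; the next two eigenvectors
# parameterise the circle, behaving like cos(angle) and sin(angle)
```

The near-degenerate pair of eigenvalues below 1 reflects the rotational symmetry of the data; in a molecular setting the corresponding eigenvectors would serve as automatically determined reaction coordinates.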

**References**

[1] Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes, C. Baldassi et al., PNAS, E7655-E7662, 2016. www.pnas.org/cgi/doi/10.1073/pnas.1608103113

[2] Explorations on high dimensional landscapes, L. Sagun, V. Guney, G. Ben Arous, and Y. LeCun, Workshop presentation at ICLR (2015). https://arxiv.org/abs/1412.6615

[3] Perspective: energy landscapes for machine learning, A.J. Ballard et al., Phys. Chem. Chem. Phys., in press (2017). https://arxiv.org/pdf/1703.07915.pdf

[4] Equation of State Calculations by Fast Computing Machines, N. Metropolis et al, J. Chemical Physics, 21, 1087-1092, 1953

[5] Monte Carlo sampling methods using Markov chains and their applications, Hastings, W. K., Biometrika, 57, 97-109, 1971

[6] Molecular Dynamics, B. Leimkuhler and C. Matthews, Springer, 2015

[7] Simulated tempering: a new Monte Carlo scheme, E. Marinari et al., Europhys. Lett., 19, 451-458, 1992

[8] Numerical comparisons of three recently proposed algorithms in the protein folding problem, U.H.E. Hansmann and Y. Okamoto, J. Computational Chemistry, 18, 920-933, 1997

[9] An adiabatic molecular dynamics method for the calculation of free energy profiles, L. Rosso and M.E. Tuckerman, Molecular Simulation, 91-112, 2010

[10] A temperature accelerated method for sampling free energy and determining reaction pathways in rare events simulations, L. Maragliano and E. Vanden-Eijnden, Chem. Phys. Lett., 426, 168-175, 2006

[11] Replica Monte Carlo simulation of spin-glasses, R. Swendsen and J. Wang, Phys. Rev. Lett., 57, 2607-2609, 1986

[12] Markov Chain Monte Carlo maximum likelihood, C.J. Geyer, Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, 156-163, 1991

[13] Sampling from multimodal distributions using tempered transitions, R.M. Neal, Statistics and Computing, 6, 353-366, 1996

[14] Bayesian learning via stochastic gradient Langevin dynamics, M. Welling and Y. Teh, Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. https://www.ics.uci.edu/~welling/publications/papers/stoclangevin_v6.pdf

[15] Asymptotically exact inference in differentiable generative models, M. Graham and A. Storkey, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, Florida, USA. JMLR: W&CP 54, 2017. http://proceedings.mlr.press/v54/graham17a/graham17a.pdf

[16] Bayesian sampling using stochastic gradient thermostats, N. Ding et al., Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence and K.Q. Weinberger, Eds., 3203-3211, 2014. http://papers.nips.cc/paper/5592-bayesian-sampling-using-stochastic-gradient-thermostats.pdf

[17] Covariance-controlled adaptive Langevin thermostat for large-scale Bayesian sampling, X. Shang et al., Advances in Neural Information Processing Systems 28, 37-45, 2015. https://arxiv.org/abs/1510.08692

[18] Diffusion maps, R. Coifman and S. Lafon, Applied and Computational Harmonic Analysis, 21, 5-30, 2006

[19] Data-driven model reduction and transfer operator approximation, S. Klus et al., arXiv preprint, 2017. http://arxiv.org/abs/1703.10112