The problem of computing expectations with respect to a probability distribution is at the heart of many problems in modern applied mathematics. Although simple to state, it can be surprisingly difficult in practice, especially when the dimension of the space is high or when the underlying probability distribution is a posterior arising from a Bayesian inverse problem with abundant data.

Traditional methods of computational statistics, such as Markov chain Monte Carlo, have difficulty with the scenarios described above, in part because the emphasis is on providing unbiased statistical estimation. However, if one is willing to allow for some bias in the underlying calculations, then the range of methods that can be used to tackle the original problem widens: a prime example is the family of methods inspired by the numerical analysis of stochastic differential equations.

At the heart of these new methods lies the idea that, by carefully managing the bias-variance trade-off, one can design computational methods that are optimal in the sense that they provide the “best” answer for a given computational budget.
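
To make the notion of bias concrete, here is a minimal sketch. It assumes one particular discretisation-based sampler, the unadjusted Langevin algorithm (an Euler-Maruyama discretisation of the Langevin stochastic differential equation), applied to a standard Gaussian target. The choice of sampler and target, and the function names `ula_samples`, `ula_stationary_variance` and `sample_variance`, are illustrative assumptions rather than anything prescribed above. For this toy target the chain is a linear recursion, so its stationary variance can be written in closed form as 1 / (1 - step / 2), which differs from the true variance of 1 for any positive step size.

```python
import math
import random


def ula_samples(step, n_steps, seed=0):
    """Unadjusted Langevin chain targeting N(0, 1) (illustrative assumption)."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_steps):
        # Euler-Maruyama step for dX = -X dt + sqrt(2) dW;
        # the gradient of the log-density of N(0, 1) is -x.
        x = x - step * x + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


def ula_stationary_variance(step):
    """Exact stationary variance of the chain above, valid for 0 < step < 2."""
    return 1.0 / (1.0 - step / 2.0)


def sample_variance(xs):
    """Plain (biased-normalisation) sample variance."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


step = 0.5
chain = ula_samples(step, n_steps=200_000)
estimate = sample_variance(chain[1_000:])  # discard a short burn-in

print("target variance:          1.0")
print("predicted chain variance:", ula_stationary_variance(step))
print("estimated chain variance:", estimate)
```

Shrinking `step` reduces the bias, but the chain then explores the target more slowly, so more steps are needed for the same variance. This is the trade-off that has to be balanced against a fixed computational budget.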

**A few relevant applications**

Cox processes provide useful and frequently applied models for aggregated spatial point patterns where the aggregation is due to stochastic environmental heterogeneity. The class of Cox processes most widely used in applications is that of Log Gaussian Cox processes, i.e. Cox processes where the logarithm of the intensity surface is a Gaussian process. In the stationary case, the distribution is completely characterised by the intensity and the pair correlation function of the Cox process. Estimating these quantities is therefore essential for making predictions from such a model.
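
As a small illustration of the two summary quantities, the sketch below evaluates them for a stationary Log Gaussian Cox process whose underlying Gaussian process has mean `mu`, variance `sigma2` and covariance function c(r). The standard closed forms are intensity = exp(mu + sigma2 / 2) and pair correlation g(r) = exp(c(r)). The exponential covariance c(r) = sigma2 * exp(-r / scale), the parameter values and the function names are assumptions made purely for this example.

```python
import math


def lgcp_intensity(mu, sigma2):
    """Intensity of a stationary LGCP: the mean of the log-normal field exp(Z)."""
    return math.exp(mu + 0.5 * sigma2)


def lgcp_pair_correlation(r, sigma2, scale):
    """Pair correlation g(r) = exp(c(r)), assuming an exponential covariance."""
    covariance = sigma2 * math.exp(-r / scale)  # assumed covariance model
    return math.exp(covariance)


# Hypothetical parameter values, chosen only for illustration.
mu, sigma2, scale = 4.0, 1.0, 0.1

print("intensity:", lgcp_intensity(mu, sigma2))
for r in (0.0, 0.05, 0.1, 0.5):
    print("g(%.2f) = %.4f" % (r, lgcp_pair_correlation(r, sigma2, scale)))
```

Values of g(r) above 1 at short distances reflect the clustering induced by the random intensity, and g(r) decays towards 1 as the covariance vanishes at large distances.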

**Theory of stochastic gradient algorithms**

One of the most common problems in machine learning is the maximization or minimization of a loss function. However, this optimization procedure can become very expensive when calibrating a model on large datasets. To reduce the computational overhead, one replaces the true gradient of the underlying loss function with a cheaper, stochastic estimate of it. Similar ideas can be used when one wants to study the full posterior distribution rather than just the maximum a posteriori mode targeted by standard optimization approaches. Understanding these methods from the point of view of numerical analysis has recently attracted a lot of interest and is proving useful in revealing the strengths and weaknesses of such approaches, as well as in helping to design new and more efficient methods.
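
The idea of swapping the full gradient for a cheap stochastic estimate can be sketched as follows. The example assumes a deliberately simple setting, a one-parameter least-squares fit on synthetic data with the gradient estimated from a random minibatch. The model, the data generator, the step size and the names `make_data` and `minibatch_sgd` are all hypothetical choices for illustration.

```python
import random


def make_data(n=10_000, true_theta=2.0, noise=0.1, seed=1):
    """Synthetic data y = true_theta * x + Gaussian noise (illustrative)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [true_theta * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def minibatch_sgd(xs, ys, batch_size=32, step=0.05, n_steps=2_000, seed=0):
    """Minimise the mean of 0.5 * (theta * x - y)**2 with minibatch gradients."""
    rng = random.Random(seed)
    n = len(xs)
    theta = 0.0
    for _ in range(n_steps):
        # Draw a minibatch uniformly with replacement.
        batch = [rng.randrange(n) for _ in range(batch_size)]
        # Unbiased estimate of the full gradient, using batch_size points
        # instead of all n.
        grad = sum((theta * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        theta -= step * grad
    return theta


xs, ys = make_data()
theta_hat = minibatch_sgd(xs, ys)
print("estimated theta:", theta_hat)
```

Each update touches 32 data points rather than 10,000, at the price of noise in the iterates. With a constant step size the iterates fluctuate around the minimiser rather than converging to it exactly, which is one reason the step size matters when such schemes are analysed as numerical methods.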