Modelling the joint effects of temporal, heterogeneous datasets

Developing statistical methodology to jointly model temporal, heterogeneous datasets, with a focus on the effects of pollution and lifestyle on health


Joint models for temporal datasets that come from different sources, and are therefore heterogeneous, have great potential for revealing information that is not available from each dataset separately. A framework of functional models will be developed that can deal simultaneously with sparse and dense temporal predictors, as well as with different time lags in their effects on outcomes. One example where such models are needed is the evaluation and interpretation of the accumulated effects of pollution and lifestyle on health outcomes.

Explaining the science

Few functional linear regression models are available to characterise the association between functional outcomes and functional predictors. A methodological challenge in this project is that the data are collected retrospectively, so time is modelled backwards. Existing models cannot quantify the influence of multiple functional factors that are subject to measurement error, nor the joint effect of dense and sparse temporal factors on general outcomes (continuous, counts). Furthermore, no methods exist to select the relevant temporal factors.

The project researchers have developed a method to analyse dense functional datasets under general measurement error. However, the work described on this page considers just one dataset. The researchers have submitted a manuscript on a prediction model based on sparse functional datasets, with an application to scleroderma (a group of autoimmune diseases). They have also developed a method for estimating the effect lag of predictors in a functional linear model, and are currently evaluating this method on real problems.

In this project the researchers will finalise the estimation procedure, develop a selection method for temporal factors, and consider models beyond the standard linear model. Most importantly, collaboration with industrial partners will provide expert input to fine-tune the models and ensure appropriate use of the datasets. The results of the analysis will be interpreted jointly with the industrial partners in the context of urban analytics.

Project aims

Data often come from different sources and are therefore heterogeneous in scale, measurement error and size. Models have been built for various heterogeneous static datasets, but few methods are available for analysing heterogeneous temporal datasets, even though time is an aspect that must be addressed in many applied questions.

For example, in a city environment, frailty, disease onset and mortality are influenced by many interrelated time-varying factors. Epidemiological studies suggest that pollution (densely measured) and lifestyle factors (sparsely measured) influence health. However, the models and data used in these studies ignore the complexity and temporal aspects of these relationships. For instance, a study might aggregate exposure to pollution over time, but this fails to capture important features such as whether exposure is constant over time or irregular with large peaks, and whether there is a time lag between the exposure and its effect on the response variable.
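The loss of information under aggregation can be illustrated with a small Python sketch. Everything in it is hypothetical and chosen only for illustration: the two exposure profiles, the 3-day lag, the 0.5 effect size and the helper `best_lag`. It uses a simple lagged correlation, which is a stand-in and not the functional model developed in this project.

```python
import numpy as np

# Hypothetical daily pollution exposure over 30 days (arbitrary units).
constant = np.full(30, 10.0)      # steady exposure every day
peaky = np.zeros(30)
peaky[[5, 20]] = 150.0            # two large peaks, nothing in between

# Aggregating over time hides the difference between the two profiles.
total_constant = constant.sum()
total_peaky = peaky.sum()

# Hypothetical outcome that responds to the peaky exposure 3 days later.
TRUE_LAG = 3
outcome = np.zeros(30)
outcome[TRUE_LAG:] = 0.5 * peaky[:-TRUE_LAG]


def best_lag(exposure, outcome, max_lag):
    """Return the lag (in time steps) at which exposure[t - lag]
    correlates most strongly with outcome[t]."""
    scores = []
    for lag in range(max_lag + 1):
        x = exposure[: len(exposure) - lag]   # exposure shifted back by `lag`
        y = outcome[lag:]                     # outcome aligned with it
        scores.append(np.corrcoef(x, y)[0, 1])
    return int(np.argmax(scores))


estimated_lag = best_lag(peaky, outcome, max_lag=7)

print(total_constant, total_peaky)   # identical aggregate exposure
print(constant.max(), peaky.max())   # very different peak exposure
print(estimated_lag)                 # lag recovered from the full time series
```

Both profiles have the same total exposure, so an analysis based on the aggregate alone cannot tell them apart, whereas the full time series reveals both the peaks and the delay before the outcome responds.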

This project will establish a research line in the integration and modelling of heterogeneous data sources, and will build a network within UK academia and industry. The project will involve building generalised functional models that address all of the challenges mentioned above, and developing the accompanying methods for model fitting and variable selection.

Specifically, the proposed work will consider the time lag in the influence of factors on general outcomes; develop methods for estimating parameters; investigate statistical inference theory; and release software packages. In collaboration with industrial partners, the tools developed will be used to answer their questions using data from several sources. Aggregate values of the electronic Frailty Index (eFI) from general practitioner (GP) databases will be used to represent overall health in a region (postcode).
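As a rough illustration of the kind of model this describes, one generic textbook-style form of a generalised functional linear model with lagged effects is sketched below. The notation is an assumption made here for illustration and is not taken from the project: Y_i is the outcome for region i observed at time t_i, g is a link function, X_ij is the j-th temporal factor, beta_j is its coefficient function, Delta_j is the length of the window over which past exposure is allowed to act, and W_ij are the noisy observations of X_ij at possibly sparse time points t_ijk.

```latex
g\bigl(\mathbb{E}[Y_i]\bigr)
  = \alpha + \sum_{j=1}^{p} \int_{0}^{\Delta_j} \beta_j(s)\, X_{ij}(t_i - s)\, ds,
\qquad
W_{ij}(t_{ijk}) = X_{ij}(t_{ijk}) + \varepsilon_{ijk}
```

In this form, estimating the lag corresponds to learning Delta_j and the shape of beta_j, and selecting temporal factors corresponds to deciding which beta_j are identically zero.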


The motivating application of this project is to develop a data-driven model of the Leeds city area. Such a model will help the local authority make targeted interventions to improve health in the city and reduce costs. This task requires knowledge of the specific contributions of lifestyle and pollution to health. More precisely, local authorities need to predict health-related outcomes based on potential (temporal) risk factors and to select the relevant (temporal) factors.

Here, advanced statistical and machine learning methods are required. The models need to address specific issues such as modelling a time lag in the effect of exposure on the outcome, and the correlation between temporal risk factors. For example, the effect of pollution on health might not be immediately observable, and exposure to pollution might be correlated with lifestyle.


Contact info

Haiyan Liu