Clustering signed networks and time series data

Project goal

Implementing and comparing several recent algorithms, and potentially developing new ones, for clustering signed networks, with a focus on correlation matrices arising from real-world multivariate time series data sets.

People

Interns

Aldo Glielmo and Peter Davies

Supervisors

Mihai Cucuringu, Turing Research Fellow, University of Oxford
Hemant Tyagi, Turing Research Fellow, University of Edinburgh

Project detail

Clustering is one of the most widely used techniques in data analysis, and aims to identify groups of nodes that exhibit similar features. Spectral clustering methods have become a fundamental tool with a broad range of applications in areas including network science, machine learning, and data mining. The analysis of signed networks – with negative weights denoting dissimilarity or distance between a pair of nodes in a network – has become an increasingly important research topic in recent times.

Examples include social networks that contain both friend and foe links, and shopping bipartite networks that encode like and dislike relationships between users and products. When analysing time series data, the most popular measure of linear dependence between variables is the Pearson correlation taking values in [−1, 1], and clustering such correlation matrices is important in certain applications.

This proposal will develop k-way clustering in signed weighted graphs, motivated by social balance theory, where the task of clustering aims to decompose the network into disjoint groups. These will be such that individuals within the same group are connected by as many positive edges as possible, while those from different groups by as many negative edges as possible.

We expect that the low-dimensional embeddings obtained via the various approaches we will investigate could be of independent interest in the context of robust dimensionality reduction in multivariate time series analysis. Of particular interest is learning nonlinear mappings from time series data which are able to exploit (even weak) temporal correlations inherent in sequential data, with the end goal of improving out-of-sample prediction.

A strong emphasis will be placed on assessing the performance of the algorithms on real-world, publicly available data sets arising in economic data science, meteorology, medicine monitoring, or finance.