Buzzwords such as ‘big data’, ‘artificial intelligence’ (AI) and ‘deep learning’ have generated a lot of excitement and hype over the last ten years. These important research areas are usually powered by large datasets. However, such datasets are often time-consuming to collect, computationally demanding to process, and hard to distil key insights from. The enormous growth of, and interest in, generating and capitalising on large data resources present new challenges: distilling useful information, drawing meaningful interpretations, and developing state-of-the-art methods that transcend fields from medicine to finance and manufacturing to improve human lives. As data scientists and statisticians like to say: “we are drowning in data and starving for information!”
For example, in healthcare, scientists believe that changes in just a few key genes may indicate a genetic predisposition to various diseases. The challenge is to identify the correct genes to focus on from the 20,000 or so in every human cell. Similarly, to understand the stock market, traders, financial experts and researchers may need to prune the vast majority of incoming information and focus on the key characteristics that affect specific assets. In practice, therefore, we need to discern which information is useful and which should be discarded, so that we can understand which genes could be causing a condition or which factors might be influencing the stock market. There are many settings where algorithms that identify the key characteristics in complex datasets would be valuable, and there will be many applications that we do not even know about yet.
To identify these key characteristics, we need to develop versatile algorithms. These will help us to improve AI performance, computational speed (how quickly the problem can be worked out) and interpretability (our ability to understand why a system reaches a particular prediction). A key step towards this within AI is ‘feature selection’: algorithms that identify the key characteristics in a dataset needed to address a given problem (that is, a specific outcome we want to assess, e.g. whether a tumour is malignant, or an investment is risky). Contemporary feature selection algorithms are limited in their application by the type of outcome or domain; they may also require a very large dataset, or all characteristics to be of the same type, to operate effectively. For example, many existing feature selection algorithms do not perform well when presented with characteristics of mixed types, combining binary entries (e.g. a yes/no entry in a dataset) with continuous entries (where any number is a possible value); similarly, many existing algorithms fail to perform well in datasets where an outcome is rare, e.g. an event such as an earthquake.
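To make the idea concrete, the short Python sketch below shows what feature selection looks like in practice, using a generic, off-the-shelf selector from the scikit-learn library and its built-in breast cancer dataset (this is purely an illustration of the general technique, not the algorithm described in our paper):

```python
# Generic illustration of feature selection (not the RRCT algorithm from the paper):
# rank the characteristics of a tumour dataset by how much information each one
# carries about the outcome (malignant vs. benign), then keep only the top few.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# 569 tumours described by 30 continuous characteristics, with a binary outcome
data = load_breast_cancer()
X, y = data.data, data.target

# Score every characteristic by its mutual information with the outcome
# and keep the five most informative ones.
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_selected = selector.fit_transform(X, y)

top_features = data.feature_names[selector.get_support()]
print("Top 5 characteristics:", list(top_features))
print("Reduced dataset shape:", X_selected.shape)  # (569, 5)
```

The downstream model then only needs to work with the handful of retained characteristics, which is what makes the resulting system faster and easier for a human expert to inspect.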
We recently developed a novel, versatile feature selection algorithm that can identify key characteristics in diverse datasets, applicable across different types of outcomes and domains. We benchmarked the algorithm against 19 competing state-of-the-art algorithms across 12 diverse datasets to thoroughly assess its performance, and found that it performs strongly across these varied settings. The implications of these findings are hard to overestimate, particularly given how widely feature selection is applied.
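For readers curious what such a comparison involves, the sketch below gives the flavour of a benchmark: each feature selection method keeps its top-ranked characteristics, and the methods are then compared by the cross-validated accuracy of a downstream classifier. It again uses generic off-the-shelf selectors and a single illustrative dataset, not the algorithms, datasets or evaluation protocol from our paper:

```python
# Illustrative benchmarking sketch (not the protocol from the paper): compare two
# simple selection criteria by the accuracy of the classifier built on their choices.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

methods = {
    "ANOVA F-test": f_classif,
    "Mutual information": mutual_info_classif,
}

for name, score_func in methods.items():
    # Select the top 5 characteristics, then train a simple classifier on them.
    pipeline = make_pipeline(
        StandardScaler(),
        SelectKBest(score_func=score_func, k=5),
        LogisticRegression(max_iter=1000),
    )
    scores = cross_val_score(pipeline, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```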
In medicine, for example, this type of algorithm has the potential to make systems more transparent to clinical experts by focusing their attention on the key characteristics and facilitating their decision-making: this could improve prognosis estimates, diagnosis, symptom assessment and rehabilitation. In finance, these algorithms could reduce computational complexity and speed up transactions in automatic stock trading. In a legal context, they could provide a more expedient, cost-effective framework for analysing dense documents.
Practical challenges with data complexity mandate that we develop more versatile, transparent and generalisable algorithms to address unmet needs. The scientific community is embracing these challenges, and the development of robust algorithmic tools can maximise the benefit and utility we can garner from diverse datasets.
It is important to remember that data science is a complex field, and while versatile algorithms will soon find increasing use in everyday applications, from optimising battery usage, to weather forecasting, to personalised recommendations in our wearable and smart home devices, they are unlikely to entirely replace the need for more specialised data science. However, developing generalisable algorithms will save data scientists hours of work on a particular dataset and allow analyses to happen much more quickly – potentially leading to a more timely diagnosis of a disease, for example. They will also be better integrated into decision-making systems to improve transparency and help experts make better-informed decisions, including in medical, legal and financial contexts. So, while large datasets are vital to improving research, creating versatile algorithms like ours will help us to more easily digest and use the colossal amounts of data we collect every day, helping to supercharge the AI systems of the future.
Read the paper (published in Patterns):
Relevance, redundancy, and complementarity trade-off (RRCT): A principled, generic, robust feature-selection tool