Datathons, hackathons, and Data Study Groups (DSGs) are events that bring together teams of individuals proficient in computer programming, data science, and artificial intelligence to tackle a set challenge. One of the challenges of running such events is that they require access to high-quality data in order to realise the best results. Their compressed timeframe, typically ranging from one day to one week, means participants must craft innovative solutions on a constrained timescale.
The paper titled "How to Data in Datathons," authored by the Applied Skills Science Team, has recently been accepted to the 2023 NeurIPS conference as part of the Datasets and Benchmarks track. This paper introduces a framework for assessing the potential of data in datathon-style events.
“The most challenging aspect for industry partners approaching us is understanding how to effectively prepare their data for the Data Study Group,” says Jules Manser, Applied Skills Programmes Manager.
In a typical research project, researchers have ample time to gather, clean, and prepare their data, all whilst calibrating their experimental approaches. In contrast, participants in hackathons, datathons, or DSGs operate under severe time constraints, and are therefore limited in their ability to explore and utilise the provided data effectively. For organisations hosting such challenges, ensuring that the data is well-prepared is vital for achieving optimal results. Data quality spans multiple dimensions, such as appropriateness, readiness, reliability, sensitivity and sufficiency, and shortcomings in any of them can undermine a challenge.
While there have been studies and papers written about data in the past, the Applied Skills Science Team found a lack of guidance specific to the datathon context. Without the luxury of time to extensively test datasets and anticipate the kinds of experiments or applications participants will develop, the state of the data becomes a critical factor in the success of a challenge, and of the event itself. Drawing upon their experience in organising over 80 DSG challenges, the team has devised a data assessment framework to evaluate the potential of existing or prospective datasets. They then illustrate the framework's application through ten past DSG challenges, demonstrating how it can be used to determine the quality of the data available.
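As a purely illustrative sketch, the five dimensions named above could be recorded as a simple checklist when triaging a candidate dataset. The dimension names come from the text; the 1–5 scoring scale, the comments, and the mean aggregation below are assumptions for illustration, not the paper's own scoring method.

```python
from dataclasses import dataclass, fields


@dataclass
class DataAssessment:
    """Hypothetical checklist over the five dimensions named in the text.

    The 1-5 scale (1 = poor, 5 = excellent) and the aggregation are
    assumptions, not taken from the paper.
    """
    appropriateness: int  # does the data actually fit the challenge question?
    readiness: int        # how much cleaning/preparation remains to be done?
    reliability: int      # provenance, collection method, known biases
    sensitivity: int      # privacy, legal, or ethical constraints on use
    sufficiency: int      # volume and coverage for the intended analyses

    def overall(self) -> float:
        """Mean score across all five dimensions (illustrative only)."""
        scores = [getattr(self, f.name) for f in fields(self)]
        return sum(scores) / len(scores)


# Example triage of a hypothetical dataset: well-suited and safe to share,
# but still needing substantial cleaning before the event.
assessment = DataAssessment(appropriateness=4, readiness=2,
                            reliability=3, sensitivity=5, sufficiency=3)
print(assessment.overall())
```

A structured checklist like this makes it easy to spot, before the event starts, which dimension (here, readiness) needs attention from the data provider.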