Open minds: How open-source tools are broadening the horizons for data science

From understanding plankton to upgrading planes, the Turing is helping to spread the benefits of open AI and data science

Last updated: Monday 31 Oct 2022

Too often, the code and software systems that AI and data-driven technologies are based on remain behind closed doors, for company profit or competitive advantage. As a result, we all miss out on the best of what these technologies have to offer. Now, though, a raft of projects funded by The Alan Turing Institute’s AI for science and government (ASG) research programme is demonstrating how adopting the most open and collaborative approaches is leading to tools that are more easily adapted to solving diverse problems.

Concept illustration of open tools being shared between researchers
Open-source tools can be adapted for diverse research problems (illustration: Jonny Lighthands)

In the North Sea, the Cefas Endeavour research vessel, a floating laboratory, now uses software developed via the ASG programme to study zooplankton – microscopic marine animals including tiny crustaceans and worms that drift through the oceans, feeding everything higher up the food chain. Whilst they are barely visible to the naked eye, closer inspection reveals their various forms; telling one type from another is key to keeping tabs on our oceans’ health. Counting and categorising zooplankton by eye is labour-intensive and slow, so scientists at the government’s Centre for Environment, Fisheries and Aquaculture Science (Cefas) wanted to automate the process, using computer software to identify plankton from images taken by the Endeavour’s new high-speed camera system. But their computational expertise was limited.

“We didn’t know how to improve the algorithm ourselves to make it more effective,” explains Sophie Pitois, a marine ecologist at Cefas. So when she found out about the Turing’s Data Study Groups, expert ‘hackathons’ that bring together data scientists from different institutions to work on real-world data science challenges, she was immediately on board.

It was a fruitful collaboration: within a fortnight, the assembled team built a classifier that could take an image and correctly say what type of plankton it showed over 90% of the time. The new software uses artificial neural networks – algorithms inspired by the way the human brain works – trained on 58,000 images of plankton hand-labelled by Cefas, and is capable of counting and classifying 50 plankton per second. It’s a “step change” in performance, according to Robert Blackwell, a Cefas scientist who took part.

The Cefas Endeavour research vessel navigating the Thames in London, Tower Bridge in the background
The Cefas Endeavour research vessel now uses Turing-developed software to classify plankton in camera images (image: Crown copyright)

The group tested its algorithm during the Data Study Group in December 2021, using an early-stage, open-source tool called scivision, a platform for exploring computer vision data designed by ASG-funded researchers. By May 2022, the new software was at sea, and Cefas is now developing the system so that scientists can access plankton data to guide and adapt sampling, via the research vessel, as it’s happening.

“This is what we’re working towards,” says Pitois. “You press that button, the instruments start collecting the information, process it at the same time, and someone in an office somewhere can visualise how things are evolving.” What’s more, she adds, all the code the team wrote is open source and could be modified to identify a multitude of “very small things in the seas”, potentially transforming how scientists study our oceans. One possible application is in classifying the trillions of tiny pieces of plastic (‘microplastics’) that pervade our oceans.

Time for a change

The team behind another open-source tool, Raphtory, is planning its own Data Study Group early next year. Raphtory analyses how connections in any network – like a social network or transport network – evolve over time. Effectively a time machine for network data, it allows users to travel back to any point in a network’s history, so they can, for instance, analyse misinformation about COVID-19 on Twitter over the last two years, or the shifting interactions within social media platforms popular among the alt-right. The upcoming Data Study Group will bring together prospective new Raphtory users – along with their datasets – to help identify new and creative ways to use it.
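The "time machine" idea can be illustrated with a toy example. The sketch below is not Raphtory's API; the TemporalGraph class and its add_edge and at methods are hypothetical names invented for illustration. It assumes only that every connection is stored with the time it appeared, so that the network can be rebuilt as it stood at any earlier moment.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TemporalGraph:
    """Toy temporal graph (hypothetical, not Raphtory's API)."""

    # Each entry is a (timestamp, source, destination) tuple.
    edges: list = field(default_factory=list)

    def add_edge(self, timestamp, src, dst):
        """Record that `src` connected to `dst` at `timestamp`."""
        self.edges.append((timestamp, src, dst))

    def at(self, timestamp):
        """Return the network as it stood at `timestamp`.

        Only edges added at or before `timestamp` are included.
        """
        snapshot = defaultdict(set)
        for t, src, dst in self.edges:
            if t <= timestamp:
                snapshot[src].add(dst)
        return dict(snapshot)


graph = TemporalGraph()
graph.add_edge(1, "alice", "bob")
graph.add_edge(2, "bob", "carol")
graph.add_edge(5, "alice", "carol")

early = graph.at(2)  # the network before alice connected to carol
late = graph.at(5)   # the network after all three edges exist
```

Comparing early and late shows which connections appeared in between. A production tool would index the history rather than rescan it for every query.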

Since the project’s inception in 2017, the team has focused on making Raphtory open source and free for all. But the researchers are increasingly taking a two-pronged approach thanks to the recent launch of a Turing spin-off, Pometry, that provides tailor-made versions of the software for users with specific needs. As CTO and founder Ben Steer notes, Pometry can tackle ever more diverse and challenging problems by building on the existing platform.

In the cryptocurrency space, Pometry has provided custom tools to help blockchain analysis companies quickly detect ‘bad guys’ operating within complex webs of crypto-transactions. The time travel element is critical, as it’s only by looking at entire transaction histories that users who unknowingly accept tainted currency can be differentiated from criminals who propagate lengthy chains of fraudulent transactions and should therefore be blacklisted or prosecuted. “What we’re enabling those companies to do,” explains Steer, “is massively scale up the analytics they’re running, and provide a much clearer and, a lot of times, more accurate answer.” Increasingly, these kinds of analytics are necessary for companies dealing in cryptocurrencies to help them comply with financial regulations.
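To see why the order of transactions matters, consider a minimal sketch. It is not Pometry's method: the tainted_accounts function, the account names and the simple rule used here are all assumptions made for illustration. The rule is that funds only flow forward in time, so a payment can pass on tainted currency only if its sender had already received some.

```python
def tainted_accounts(transactions, source):
    """Return accounts that received funds traceable to `source`.

    `transactions` is a list of (timestamp, sender, receiver) tuples.
    Payments are processed in time order, so a payment spreads taint
    only if its sender was already tainted when the payment was made.
    This is a deliberately simplified rule, assumed for illustration.
    """
    tainted = {source}
    for _timestamp, sender, receiver in sorted(transactions):
        if sender in tainted:
            tainted.add(receiver)
    return tainted - {source}


# Hypothetical history: mallory is the known bad actor.
history = [
    (1, "dave", "erin"),     # dave pays erin before receiving anything tainted
    (2, "mallory", "dave"),  # dave receives tainted funds
    (3, "dave", "frank"),    # dave passes funds on afterwards
]

flagged = tainted_accounts(history, "mallory")
```

Here flagged contains dave and frank but not erin. A static view of the same network, with timestamps ignored, would show a path from mallory to erin and wrongly implicate her. Telling unwitting recipients apart from deliberate offenders would need further analysis of each account's full history.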

A man analyses the price chart for bitcoin in a smartphone app
Identifying cryptocurrency fraud is one of many potential applications for the Raphtory and Pometry tools (image: Mykhaylo Kozelko / Shutterstock)

Very recently, Pometry started working with nPlan, a company that uses AI in construction scheduling. Here, the tool views complex construction projects as networks of interdependent processes – its benefit is in extracting more detailed insights from past projects to inform future ones. “We started our collaboration in September 2022 and have already seen great progress on the initial problem,” says nPlan’s Research Director, Vahan Hovhannisyan.

Tools of all trades

Meanwhile, another set of ASG-funded tools is finding diverse uses in engineering, manufacturing and beyond. Pranay Seshadri, a Data-Centric Engineering Group Leader at the Turing, has developed computational tools that are now being used by Rolls-Royce to help streamline design processes for its jet engines. Though the tools weren’t initially openly available, Seshadri realised that sharing the code and making it more adaptable could benefit people in widely different industries and research spaces.

“We want [the projects we work on] to be open source,” he says. “But perhaps more than that, we want a certain generalisability, in the sense that we don’t want to develop tools just for one company, or indeed one sector.”

Photograph of the interior of a Rolls-Royce Trent XWB jet engine
Computational tools are helping Rolls-Royce to improve the efficiency of its jet engines (image: Duc Huy Nguyen / Shutterstock)

Now downloaded over 52,000 times as the equadratures toolbox, Seshadri’s algorithms help simplify the complex computer models used in manufacturing and engineering. However, their ability to zero in on the most important variables in any complex system makes them valuable for other types of model, too: for example, they are helping to identify the most important environmental threats to coastal marshlands as part of a United States Geological Survey project.

To offer support to diverse users of equadratures, Seshadri has given introductory workshops attended by over 120 people, including at the University of Cambridge and the University of Warwick, as well as for companies including Siemens and McLaren. But he’s also keen that the toolbox exists as something people can pick up and use without support, or modify if they want to. So there’s now an online forum where users can share feedback and code.

The world is our data

If developing open tools is about placing agency into the hands of users, then a perfect example is the innovative, ASG-funded project Colouring Cities. Designed as a knowledge exchange platform for buildings data, it is boldly setting out to conquer the world, one city at a time. Project lead Polly Hudson, a Senior Research Fellow at the Turing, started by collecting buildings data for London and putting it on a permanent, data-rich map. The idea is now being adopted and tested by buildings researchers internationally, from Colombia to Indonesia, thanks to the project’s open-source code.

Map of central London with the buildings coloured according to current use
Colouring London collates data about the city's buildings, in this case coloured according to the buildings' current usage 

The platform joins up data – everything from building age and historical usage to up-to-date information about energy efficiency – that is often fragmented and difficult to access in the buildings space, and maps it to individual building footprints. Of course, open maps like Google Maps already provide the same zoom-in capability, but they don’t provide data on the variables needed to tackle the tricky research problems faced by local policy makers. Problems like: how do we minimise materials and emissions when updating and improving our city’s buildings? Colouring Cities also uses automated approaches to supplement the data – these can, for example, spot high streets based on the characteristics of building footprints attached to them. Through crowdsourcing, the results can then be verified by people living near those streets.

The Australian offshoot of this project has already spread to all the country’s major cities, with a focus on transforming urban planning. Meanwhile, Colouring Athens plans to collect data that will contribute to mitigating risk from natural disasters and climate change, according to coordinator Athina Vlachou at the National Technical University of Athens. In Germany, the Leibniz Institute of Ecological Urban and Regional Development is developing Colouring Dresden in collaboration with the Association of German Architects, and local partners including the state library. “We believe that the data is equally relevant for policy makers, urban planners, civil society, and city residents who, through their active participation in research, better understand Dresden’s buildings and their cultural value, and actively participate in solutions for a better future in the city,” says researcher Robert Hecht, suggesting that the data could be used for research that informs, for example, planning for more energy-efficient buildings. Colouring Dresden has received funding from the German Federal Ministry of Education and Research.

It takes some big-picture thinking to envisage what this project, once fully fledged, could achieve, and Hudson believes this kind of thinking is only possible outside traditional research settings. “You have to do this kind of research in a more multidisciplinary, collaborative institution,” she says, “where you can answer very big questions by producing resources that enable potentially hundreds of thousands of people to begin to connect stuff they couldn’t connect.” That, she adds, is why Colouring Cities is so at home at the Turing.

To get the most out of AI and data science, we need to be doing more than just using open-source code: we need to be opening our minds about how we work. That means creating collaboratively, choosing open-science principles, and seeking to tackle more diverse problems that can benefit us all. Each of these ASG projects broadens the possibilities, and raises the bar, for what can be accomplished through innovative, open AI and data science.


Header illustration: Jonny Lighthands