The collection, processing, analysis, integration and understanding of all of Earth’s data is one of science’s grand challenges.
The three major data science challenges lie in the systematic acquisition and processing of very large amounts of geoscientific data, their integration across modalities and scales, and their use in deriving and validating models to answer scientific questions and make timely decisions.
The merging of data-centric techniques with modelling and simulation will be of great long-term importance for the development of data science as a key driver of economic growth and improved living conditions (including safety of person and property) on this planet. The geoscience community as a whole is already tackling both deep science and societal/business issues, for both long-term value and immediate translational benefits.
The three major data science challenges to be addressed are:
1. Acquisition and processing: A wide variety of geophysical data (potential fields, electromagnetic data, seismic data, weather data, etc.) is acquired over very broad wavelength ranges, from surface sensor arrays, drilled wells, satellites and many other sources. These data sets are collectively among the largest science data sets in use, comparable in size and complexity only to those from astronomy and particle physics.
2. Integration: Jointly understanding the different types of data is a major challenge. Methodologically, there is a major gap between statistical modelling and machine learning on one side and numerical or physical modelling on the other. Hence a systematic approach to consistent data integration and model building is of the highest value and priority.
3. Deriving and validating models: Both the data sources and the models come with recognized issues that existing methodology has difficulty coping with – such as features for which exact physical models are unknown (e.g., sub-surface geology, earthquakes), or models that are difficult to reconcile (e.g., seismic measurements vs social media alerts) – but which novel data-science-based approaches can address.
Research will be conducted together with translational stakeholders and world-leading domain experts, focusing on the following interrelated topic areas identified by the scoping activity:
I. Geohazards and geo-risk:
Earthquakes, landslides, tsunamis, flood risks, risks to mechanical structures, predictions and risk assessment, risk mitigation, early warning, emergency response.
II. Geomodelling and model inversion:
Shared Earth models, parameter fusion for deep Earth physics, assimilation of disparate and distributed data sources, multi-scale modelling, data integration, upscaling and downscaling.
III. Resources, energy, carbon capture/storage and avoidance:
Optimal, responsible and sustainable resource use and exploitation; water, mineral and hydrocarbon exploration and production; carbon cycle understanding; carbon capture and storage for energy-cost and climate sustainability.
IV. Advanced analytics, statistics, machine learning:
Statistical modelling, machine learning and modern data analytics methodology for the geosciences, e.g., spatio-temporal modelling, probabilistic modelling and uncertainty quantification; open methodological questions in model building, model assessment, prediction and forecasting workflows.
V. Data acquisition and integrated models:
Obtaining and integrating Earth-related data of different types, provenance, scales and resolutions; reconciling and integrating different types of models, e.g., numerical vs statistical.
VI. Integrated data federation for geoscience:
Practical challenges in sustaining collaboration across different communities, data sources, areas of scientific expertise, computational resources and translational stakeholders.
Part of The Alan Turing Institute-Lloyd’s Register Foundation Programme for Data-Centric Engineering.