Earth Systems Lab 2025

Foundation Models in Extreme Environments

PUBLISHED 24 FEB 2026

Earth Systems Lab (ESL) by FDL applies AI technologies to space science, pushing the frontiers of research and developing new tools to help solve some of the biggest challenges that humanity faces. These include the effects of climate change, predicting space weather, improving disaster response and identifying meteorites that could hold the key to the history of the universe.

FDL is a public-private partnership with the European Space Agency (ESA) and Trillium Technologies. It works with commercial partners such as Scan, NVIDIA, Google Cloud, IBM and Airbus, amongst others, to provide expertise and the computing resources necessary for rapid experimentation and iteration in data-intensive areas.

Project Background

Geospatial Foundation Models (GFMs) for Earth observation are rapidly gaining traction, as they are trained on vast datasets and can learn transferable representations of the Earth's surface that can be adapted to various downstream tasks. This makes them very powerful when attempting to predict disaster likelihood or timelines for events such as wildfires and flooding, and when planning an appropriate and timely emergency response.

Geospatial Foundation Models training data sources including satellite imagery and sensor data

However, as the use of GFMs increases, a growing number of examples call their reliability with real-world data into question. This could be particularly problematic in high-stakes scenarios such as disaster monitoring, where unseen data is common. GFMs can fail when faced with out-of-distribution (OOD) data, such as spatial or temporal extremes caused by unprecedented geography or weather that are poorly represented in the training data.

Geospatial Foundation Models elements and components diagram

This causes issues as it becomes unclear when any given model will generalise well or not, and ultimately affects ESA's value chain of 'science to society' - GFMs are extremely powerful tools, but only if they are trustworthy.

Project Approach

To enhance the reliability of GFMs, the ESL team sought to develop a framework to help foundation models flag when they may fail. To develop the model, a collection of datasets was used, as detailed below:

SSL4EO-S12

This is a large-scale dataset derived from ESA's Sentinel-1 and Sentinel-2 satellites, covering 250,000 locations worldwide. It provides 13 spectral bands at a 10m resolution, sampled throughout the year, capturing temporal dynamics including vegetation cycles, land cover changes and climate variations.

ExEBench Burn Scars

This dataset consists of harmonised imagery from NASA's Landsat and Sentinel-2 satellites taken from 2018 to 2021. It covers multiple spectral bands (visible, infrared, near infrared and shortwave infrared) and contains 804 images of 512 x 512 pixels at a resolution of 30m per pixel.

WorldFloods

This dataset is derived from 509 Sentinel-1 full image scenes depicting flood events from 2016 to 2019. These are then patched into 256 x 256 pixel and 224 x 224 pixel subsets and pre-trained using SSL4EO.

HydroATLAS

This dataset was included in the team's work to provide complementary semantic analysis, using the BasinATLAS annotations to give hydro-environmental attributes aggregated within a multi-level hierarchy of hydrological units.

These datasets were combined into a foundation model based on a vision transformer (ViT) architecture with ~23 million parameters, using K-means clustering. K-means is a popular unsupervised machine learning algorithm that partitions data points into distinct, non-overlapping groups (clusters) by minimising the distance to cluster centres (centroids).

When a model classifies normal images as ‘in-distribution’ (ID), it learns to make features of the same class cluster tightly, minimising distance to their class centroid and maximising distance from others. Out-of-distribution (OOD) examples, being unseen, remain scattered in the feature space, often equidistant or poorly clustered relative to all known class centroids. This results in a Nearest Centroid Distance Deficit (NCDD) score, calculated as the difference between an image's distance to its nearest ID centroid and its distance to the next nearest centroid, effectively identifying these dispersed patterns.
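The NCDD idea described above can be illustrated with a minimal sketch. This is not the team's implementation: the function names `fit_centroids` and `ncdd_scores`, the use of Euclidean distance, the plain Lloyd-style K-means loop and the sign convention (nearest distance minus next-nearest distance, so that scores closer to zero suggest a more OOD sample) are all assumptions made for illustration.

```python
import numpy as np


def fit_centroids(embeddings, k, n_iter=50, seed=0):
    """Plain Lloyd-style K-means on in-distribution embeddings (illustrative only)."""
    embeddings = np.asarray(embeddings, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct training embeddings.
    centroids = embeddings[rng.choice(len(embeddings), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every embedding to every centroid: shape (n, k).
        dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = embeddings[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    return centroids


def ncdd_scores(embeddings, centroids):
    """Assumed NCDD: distance to nearest centroid minus distance to next nearest.

    Requires at least two centroids. Scores are always <= 0; a tightly
    clustered sample is strongly negative, an equidistant one is close to 0.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
    # After partitioning at index 1, columns 0 and 1 hold the two smallest distances.
    two_smallest = np.partition(dists, 1, axis=1)[:, :2]
    return two_smallest[:, 0] - two_smallest[:, 1]
```

Under this convention, an image whose embedding sits on top of one centroid scores well below zero, while one lying midway between two centroids scores zero, which matches the description of OOD examples as equidistant from the known centroids.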

Embeddings visualization for the ESL 2025 project

AI model development and training were carried out on a Scan Cloud GPU-accelerated server and Google Cloud Platform GPU instances, requiring more than 800GB of RAM for the demanding OOD training stage.

Project Results

The resulting framework was named SHRUG-FM (Systematic Handling of Real-world Uncertainty for Geospatial FMs). The framework addressed two primary types of uncertainty - firstly 'not knowing because of the data' and secondly 'not knowing because of the model'.

To assess data-related uncertainty, SHRUG-FM compared input images to the foundation model's training data, both in raw input space and in the model's embedding space. It was discovered that OOD signals, particularly NCDD, correlate strongly with F1 scores (a balanced combination of precision - how many selected items are relevant; and recall - how many relevant items are selected) from the HydroATLAS dataset.
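For readers unfamiliar with the metric, the sketch below shows how an F1 score could be computed for a binary segmentation mask, and how a correlation between per-sample OOD scores and F1 might be measured. The function names `f1_score` and `pearson_r` are hypothetical, and the choice of Pearson correlation is an assumption; the source does not state which correlation measure was used.

```python
import numpy as np


def f1_score(pred, truth):
    """F1 for binary masks: harmonic mean of precision and recall."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)    # selected and relevant
    fp = np.sum(pred & ~truth)   # selected but not relevant
    fn = np.sum(~pred & truth)   # relevant but not selected
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences (assumed measure)."""
    return float(np.corrcoef(x, y)[0, 1])
```

In use, one would collect one OOD score and one F1 value per test sample or region and pass the two lists to `pearson_r`; a strongly negative value would mean that performance drops as the OOD signal rises.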

For model-related uncertainty, SHRUG-FM employs ensemble techniques, training multiple models with randomness and analysing their agreement or disagreement on predictions. Higher predictive variance among ensemble members indicates greater uncertainty. This uncertainty-based flagging effectively discards unreliable predictions, especially those with high predictive variance, thereby improving the trustworthiness of the foundation model's outputs.
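A minimal sketch of ensemble-based flagging follows, assuming each ensemble member outputs a per-pixel probability map for the same image. The function name `ensemble_uncertainty` and the variance threshold of 0.05 are illustrative assumptions, not values from the project.

```python
import numpy as np


def ensemble_uncertainty(prob_maps, var_threshold=0.05):
    """Combine per-member probability maps of shape (members, height, width).

    Returns the mean prediction, the per-pixel predictive variance and a
    boolean mask of pixels whose variance exceeds the (hypothetical) threshold.
    """
    probs = np.asarray(prob_maps, dtype=float)
    mean = probs.mean(axis=0)        # consensus prediction
    variance = probs.var(axis=0)     # disagreement between ensemble members
    unreliable = variance > var_threshold
    return mean, variance, unreliable
```

Pixels where the members agree have near-zero variance and are kept, while pixels where they disagree are flagged so that the corresponding predictions can be discarded or reviewed.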

The three complementary uncertainty signals - input OOD, embedding OOD, and task-specific predictive uncertainty - were integrated into the SHRUG-FM system. This system provides a reliability-aware prediction mechanism that can either provide a prediction, raise a warning, or 'shrug' (indicating it doesn't know) when uncertainty is high. The metrics can be integrated into a dashboard that visualises predictions, probability maps and reliability scores.
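One way such a three-outcome mechanism could be wired together is sketched below. The source does not describe how the signals are combined, so the rule used here (count how many signals exceed their threshold), the class names `UncertaintySignals` and `Thresholds`, and every threshold value are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UncertaintySignals:
    input_ood: float            # OOD score in raw input space
    embedding_ood: float        # OOD score in embedding space (e.g. NCDD-based)
    predictive_variance: float  # disagreement between ensemble members


@dataclass
class Thresholds:
    # Hypothetical cut-offs; real values would need calibrating per task.
    input_ood: float = 0.5
    embedding_ood: float = 0.5
    predictive_variance: float = 0.05


def decide(signals: UncertaintySignals, thresholds: Optional[Thresholds] = None) -> str:
    """Return 'predict', 'warn' or 'shrug' from the number of raised signals."""
    t = thresholds or Thresholds()
    raised = sum([
        signals.input_ood > t.input_ood,
        signals.embedding_ood > t.embedding_ood,
        signals.predictive_variance > t.predictive_variance,
    ])
    if raised == 0:
        return "predict"  # all signals look in-distribution
    if raised == 1:
        return "warn"     # one signal is elevated
    return "shrug"        # two or more signals are elevated
```

A counting rule keeps the three signals independent and easy to explain on a dashboard; a weighted score or a learned combiner would be equally plausible alternatives.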

SHRUG-FM indicated that areas with low elevation, low pasture extent and large river areas are associated with lower GFM performance and stronger OOD signals (higher NCDD scores). Using burn scars as context, the team demonstrated that only 14.8% of their FM test set met the criteria of 'Accept' (a good prediction), versus 38.8% meeting 'Review' (a correctable prediction) and the remaining 46.4% falling under 'Fail' (an unacceptable prediction).

This low 14.8% acceptance level may indicate that the criteria were too strict or that the readiness of many GFMs is in doubt.

Conclusions

SHRUG-FM has demonstrated that it can offer practical steps to enhance the reliability of GFMs for Earth observation in high-stakes environmental monitoring. It showcases how to use complementary uncertainty signals to identify where and why models may fail and provide misleading feedback. This adaptable framework is ready for use in critical climate-sensitive applications such as burn scar segmentation, with future work planned to extend its utility to flood mapping and landslide detection.

You can learn more about Earth Systems Lab 2025 research and this Foundation Models in Extreme Environments project by reading the ESL 2025 RESULTS BOOKLET, where a summary, poster and full technical memorandum can be viewed and downloaded.

The Scan Partnership

Scan is a major supporter of ESL 2025 and FDL, building on its participation in the previous five years' events. As an NVIDIA Elite Solution Provider, Scan contributes multiple DGX supercomputers via Scan Cloud to facilitate much of the machine learning and deep learning development and training required during the research sprint period.

Project Wins

Successful demonstration of practical steps to improve the reliability of GFMs for Earth observation

Development of a ready-to-use framework for reliable burn scar segmentation

Time savings generated during the eight-week research sprint due to access to GPU-accelerated DGX systems

James Parr

Founder, FDL / CEO, Trillium Technologies

"FDL has established an impressive success rate for applied AI research output at an exceptional pace. Research outcomes are regularly accepted to respected journals, presented at scientific conferences and have been deployed on NASA and ESA initiatives - and in space."

Glyn Merga

Head of Cloud Architecture, Scan

"We are proud to be continuing our work with FDL and NVIDIA to support the ESL 2025 event for the sixth year running. It is a huge privilege to be associated with such ground-breaking research efforts in light of the challenges we all face when it comes to life-changing events like climate change and extreme weather."

Speak to an expert

You’ve seen how Scan continues to help the Earth Systems Lab and FDL further their research into climate change and space. Contact our expert AI team to discuss your project requirements.

Phone: 01204 474210

Email: [email protected]
