Joint Seminar: Sampling Strategies to overcome set imbalance: A case study of Machine Learning Emulators of Atmospheric Gravity Wave Momentum Transport

With the ultimate goal of developing a data-driven parameterization of gravity waves (GW) for use in general circulation models (GCMs), we investigate various machine learning (ML) architectures that emulate the Alexander-Dunkerton ’99 (AD99) scheme, an existing GW parameterization. We analyze the distribution of errors as functions of shear-related metrics in an effort to diagnose the disparity between online and offline performance of the trained emulators, and develop a sampling algorithm to treat biases due to underrepresented areas of the phase space. 


Similar previous efforts [Espinosa GRL 2022] have shown that stellar offline performance does not guarantee adequate online performance and can even lead to instabilities. A thorough error analysis reveals that the majority of the samples are learned quickly, whereas some stubborn samples remain poorly represented. We find that the more error-prone samples are those whose wind profiles have large shears. This is consistent with physical intuition: large shears indicate many breaking levels, so parameterizing GWs for samples with large-shear wind profiles is a more difficult, complex task. To remedy this, we develop a sampling strategy that performs a parameterized histogram equalization.

The sampling algorithm uses a linear mapping from the original histogram to the uniform histogram, parameterized by $t \in [0,1]$. A given value of $t$ and a predetermined "maximum repeat" together assign each bin a new probability. The new probability is applied in two different implementations: 1) by sampling the bins to adjust the distribution the learning algorithm encounters; 2) by weighting the loss function to achieve the same effect. We find that this strategy reduces the errors in the tail of the distribution (which is oversampled), except at the extreme end, at the cost of a minimal loss of accuracy at the peak of the distribution (which is undersampled).
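
As a rough illustration of the idea, the following is a minimal sketch of one plausible reading of the parameterized histogram equalization, not the implementation presented in the talk. It assumes the new bin probability is the linear blend $p_{new} = (1-t)\,p_{orig} + t\,p_{unif}$, and that the "maximum repeat" acts as a cap on how strongly any bin is up-weighted. The function names, the bin count, and the synthetic exponential "shear" data are all hypothetical.

```python
import numpy as np


def equalization_weights(metric, n_bins=20, t=0.5, max_repeat=10.0):
    """Per-sample weights that move the binned histogram a fraction t toward uniform.

    Assumption: "maximum repeat" is read here as a cap on the per-bin weight.
    """
    metric = np.asarray(metric, dtype=float)
    edges = np.linspace(metric.min(), metric.max(), n_bins + 1)
    # Interior edges only, so every sample gets a bin index in 0..n_bins-1.
    bin_idx = np.digitize(metric, edges[1:-1])
    counts = np.bincount(bin_idx, minlength=n_bins)

    occupied = counts > 0
    p_orig = counts / counts.sum()
    p_unif = occupied / occupied.sum()  # uniform over occupied bins only
    p_new = (1.0 - t) * p_orig + t * p_unif  # t=0: original, t=1: uniform

    ratio = np.ones(n_bins)
    ratio[occupied] = p_new[occupied] / p_orig[occupied]
    ratio = np.minimum(ratio, max_repeat)  # limit how often rare bins repeat
    return ratio[bin_idx], bin_idx


def resample_indices(weights, rng):
    """Implementation 1: draw training indices with probability proportional to weight."""
    p = weights / weights.sum()
    return rng.choice(weights.size, size=weights.size, replace=True, p=p)


def weighted_mse(pred, target, weights):
    """Implementation 2: weight each sample's squared error instead of resampling."""
    return np.average((pred - target) ** 2, weights=weights)


# Synthetic long-tailed stand-in for a shear metric (not real data).
rng = np.random.default_rng(0)
shear = rng.exponential(scale=1.0, size=5000)
weights, bin_idx = equalization_weights(shear, n_bins=20, t=0.7, max_repeat=10.0)
train_idx = resample_indices(weights, rng)
```

In this sketch, $t = 0$ leaves every weight at 1 (the original distribution), while $t = 1$ with no cap gives every occupied bin the same total weight. The same weights can be passed either to the resampler or to the weighted loss, mirroring the two implementations described above.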

Although we study the performance of this algorithm in the context of training a GW parameterization emulator, the strategy can be used for learning from datasets with long-tailed distributions in which the rare samples are associated with low accuracy. Datasets of this type are prevalent in Earth system dynamics: the launching of GWs and extreme events such as hurricanes and heat waves are just a few examples.





13:30 h


Bundesstr. 53, room 022/023
Seminar Room 022/023, Ground Floor, Bundesstrasse 53, 20146 Hamburg, Hamburg


Minah Yang


Chetankumar Jalihal
