Training Pipeline
From raw hydrometeorological data to a trained error-correction model. This page documents the complete pipeline: data assembly, feature engineering, optimization strategy, and experiment design.
Model Architecture

Hydra v3 architecture: Feature Importance Gate, GRU encoder, Multi-Scale Temporal Convolutions, Transformer encoder with attention pooling, and regime-conditioned bias correction.
Data Assembly
NWM v2.1 Retrospective
Hourly CHRTOUT streamflow analysis from the National Water Model retrospective run (1979-2020). Provides the baseline forecast that Hydra corrects.
nwm_cms
USGS Streamflow
Hourly observed discharge from USGS gauging stations. Serves as ground truth for computing residuals and evaluating model skill.
usgs_cms
ERA5 / ERA5-Land
Meteorological reanalysis providing atmospheric and land-surface variables. 6-hourly data reindexed to hourly with nearest-neighbor (tolerance = 3h, no future leakage).
15 features
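The 6-hourly-to-hourly reindexing described above can be sketched with pandas. This is an illustration, not the project's actual code; the series name `t2m` and the sample values are hypothetical. Note that a plain nearest-neighbor reindex can match a *future* timestamp, so a strictly leakage-safe variant would use `method="ffill"` instead.

```python
import pandas as pd

# Hypothetical 6-hourly ERA5 series (name and values are illustrative).
six_hourly = pd.date_range("2019-01-01", periods=4, freq="6h")
era5 = pd.Series([0.0, 1.0, 2.0, 3.0], index=six_hourly, name="t2m")

# Reindex to the hourly grid with nearest-neighbor matching and a 3 h
# tolerance; timestamps farther than 3 h from any source point become NaN.
hourly = pd.date_range("2019-01-01 00:00", "2019-01-01 18:00", freq="1h")
era5_hourly = era5.reindex(hourly, method="nearest",
                           tolerance=pd.Timedelta("3h"))
```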
Target Variables
Residual Mode (default)
y_residual = USGS - NWM
Model predicts the NWM error; corrected flow = NWM + predicted residual
Direct Mode
y_corrected = USGS
Model directly predicts observed streamflow without explicit residual decomposition
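The two target modes can be expressed in a few lines. This is a minimal numpy sketch; the function names are illustrative, not taken from the project's codebase.

```python
import numpy as np

def make_target(usgs_cms: np.ndarray, nwm_cms: np.ndarray,
                mode: str = "residual") -> np.ndarray:
    """Build the training target for either mode."""
    if mode == "residual":
        return usgs_cms - nwm_cms   # y_residual = USGS - NWM
    return usgs_cms                 # direct mode: y_corrected = USGS

def corrected_flow(nwm_cms: np.ndarray,
                   predicted_residual: np.ndarray) -> np.ndarray:
    """Recover corrected streamflow from a residual-mode prediction."""
    return nwm_cms + predicted_residual
```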
Input Features
ERA5 Meteorological Features
Normalization
1. Z-score normalization: computed on the training split only, then applied identically to the validation and test splits. Prevents information leakage.
2. asinh target transform: applied to residual and corrected targets. Stabilizes variance across low-flow and high-flow regimes without log-domain issues at zero.
3. Per-site training: each gauge is trained independently to prevent cross-site sequence leakage.
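The leakage-safe normalization pattern can be sketched as follows. A minimal numpy illustration, assuming the actual pipeline follows the same fit-on-train, apply-everywhere convention; function names are hypothetical.

```python
import numpy as np

def fit_zscore(train_features: np.ndarray):
    """Fit normalization statistics on the training split only."""
    mu = train_features.mean(axis=0)
    sd = train_features.std(axis=0) + 1e-8  # guard against constant columns
    return mu, sd

def apply_zscore(x: np.ndarray, mu, sd) -> np.ndarray:
    """Apply the *training* statistics to any split (train/val/test)."""
    return (x - mu) / sd

def transform_target(y: np.ndarray) -> np.ndarray:
    """asinh transform: roughly linear near zero, log-like in the tails."""
    return np.arcsinh(y)

def inverse_target(z: np.ndarray) -> np.ndarray:
    """Invert the asinh transform to recover flows in physical units."""
    return np.sinh(z)
```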
Data Augmentation
During training, each input sequence has a 50% chance of being perturbed with additive Gaussian noise (sigma = 0.05). This regularizes against overfitting to exact feature values and improves generalization to unseen weather patterns.
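The augmentation step described above amounts to a coin flip per sequence. A small numpy sketch of that logic (the function name is illustrative):

```python
import numpy as np

def augment_sequence(seq: np.ndarray, rng: np.random.Generator,
                     p: float = 0.5, sigma: float = 0.05) -> np.ndarray:
    """With probability p, add Gaussian noise (std = sigma) to the
    normalized input sequence; otherwise return it unchanged."""
    if rng.random() < p:
        return seq + rng.normal(0.0, sigma, size=seq.shape)
    return seq
```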
Data Splits
Training
2010-01-01 to 2017-12-31
8 years of hourly data per site (~70,000 samples)
Validation
2018-01-01 to 2018-12-31
1 year for early stopping and hyperparameter selection
Test
2019-01-01 to 2020-12-31
2 years held out for final evaluation (never seen during training)
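The chronological splits above can be applied to a time-indexed frame with simple label slicing. A pandas sketch under the assumption that each site's data carries a DatetimeIndex; the end timestamps include the final hour of each split.

```python
import pandas as pd

# Split boundaries from the table above (both endpoints inclusive).
SPLITS = {
    "train": ("2010-01-01", "2017-12-31 23:00"),
    "val":   ("2018-01-01", "2018-12-31 23:00"),
    "test":  ("2019-01-01", "2020-12-31 23:00"),
}

def split_by_date(df: pd.DataFrame) -> dict:
    """Slice a time-indexed frame into non-overlapping chronological splits."""
    return {name: df.loc[start:end] for name, (start, end) in SPLITS.items()}
```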
Training Configuration
- Epochs: maximum training epochs
- Batch size: sequences per gradient step
- Sequence length: 7-day input window
- Learning rate: initial learning rate
- Patience: early-stopping epochs
Optimizer
- Primary: Ranger (RAdam + Lookahead)
- Fallback: AdamW
- Weight decay: 5e-5
- LR schedule (Ranger): ReduceOnPlateau (patience=3, factor=0.5)
- LR schedule (AdamW): CosineAnnealing (T_max = epochs)
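The plateau schedule's semantics can be illustrated without any framework code. This pure-Python sketch mirrors the stated patience=3, factor=0.5 behavior (the real pipeline presumably uses PyTorch's `ReduceLROnPlateau`); the class name is hypothetical.

```python
class PlateauLR:
    """Sketch of ReduceOnPlateau semantics: halve the learning rate once
    the validation loss fails to improve for more than `patience`
    consecutive epochs."""

    def __init__(self, lr: float, patience: int = 3, factor: float = 0.5):
        self.lr, self.patience, self.factor = lr, patience, factor
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> float:
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```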
Training Details
- Mixed precision: FP16 via torch.autocast (GPU)
- Gradient clipping: max_norm = 1.0
- Best model: checkpoint with lowest validation loss
- Target transform: asinh(x) for variance stabilization
- Sites: trained independently per gauge
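Global-norm gradient clipping with max_norm = 1.0 works as follows. A numpy sketch of the same rule `torch.nn.utils.clip_grad_norm_` applies: if the L2 norm over all parameter gradients exceeds the threshold, every gradient is rescaled proportionally.

```python
import numpy as np

def clip_grad_norm(grads: list, max_norm: float = 1.0) -> list:
    """Rescale all gradients by max_norm / total_norm when the global
    L2 norm exceeds max_norm; leave them untouched otherwise."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads]
```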
Multi-Objective Loss
The training loss combines multiple objectives. Core terms are always active; optional terms are enabled per-experiment. A LossAutoNormalizer uses exponential moving averages to keep all components at unit scale, so weights act as pure priority signals.
Gaussian NLL (core)
Heteroscedastic negative log-likelihood on residual and corrected predictions. Learns per-sample uncertainty.
Consistency Loss (core)
MSE between the corrected prediction and observed streamflow. Ensures residual + NWM aligns with the direct correction.
Non-Negativity Penalty (physics)
Physics constraint penalizing negative streamflow: relu(-Q)^2. Streamflow cannot be negative.
NSE Surrogate (hydrology)
Differentiable Nash-Sutcliffe Efficiency proxy. Directly optimizes the standard hydrological skill metric.
KGE Stabilizer (hydrology)
Kling-Gupta decomposition into correlation, variability-ratio, and bias-ratio components.
Quantile Pinball (uncertainty)
Pinball loss for probabilistic prediction intervals. Calibrates uncertainty quantiles.
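Several of these components, and the EMA-based auto-normalizer, can be sketched compactly. A numpy illustration only: the actual losses are presumably implemented in PyTorch, and the `LossAutoNormalizer` internals shown here (EMA of each term's magnitude, divided out so weights become pure priorities) are an assumption consistent with the description above.

```python
import numpy as np

def gaussian_nll(obs, mu, log_var):
    """Heteroscedastic Gaussian NLL (constant term dropped)."""
    return np.mean(0.5 * (log_var + (obs - mu) ** 2 / np.exp(log_var)))

def nonneg_penalty(q_pred):
    """Physics constraint: relu(-Q)^2 penalizes negative streamflow."""
    return np.mean(np.maximum(-q_pred, 0.0) ** 2)

def nse_surrogate(obs, pred, eps=1e-8):
    """Differentiable 1 - NSE proxy; minimizing drives NSE toward 1."""
    return np.sum((obs - pred) ** 2) / (np.sum((obs - obs.mean()) ** 2) + eps)

def pinball(obs, pred_q, tau):
    """Pinball (quantile) loss for the tau-th prediction-interval bound."""
    d = obs - pred_q
    return np.mean(np.maximum(tau * d, (tau - 1.0) * d))

class LossAutoNormalizer:
    """Track an EMA of each component's magnitude and divide it out,
    keeping every term near unit scale so weights act as priorities."""

    def __init__(self, beta: float = 0.99):
        self.beta, self.ema = beta, {}

    def __call__(self, name: str, value: float) -> float:
        prev = self.ema.get(name, abs(value))
        self.ema[name] = self.beta * prev + (1.0 - self.beta) * abs(value)
        return value / (self.ema[name] + 1e-8)
```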
Experiment Design
19 experiments across 3 study sites systematically ablate architecture, training strategy, and input features. Each experiment uses identical data splits and evaluation protocol.
Architecture Ablation
Compare encoder architectures while holding inputs and training procedure constant.
- LSTM
- Transformer-only
- GRU-Transformer v2
- Hydra v3
Training Ablation
Vary training constraints and data sampling strategies on the best architecture.
- Causal attention mask
- Non-negativity constraint
- Event oversampling (3x Q90+)
- Combined configurations
Input Feature Ablation
Test which input sources drive predictive skill.
- NWM + ERA5 (standard)
- ERA5-only (no NWM)
- USGS + NWM + ERA5
- USGS + ERA5 (no NWM)
Now that you understand the training pipeline, see how different configurations perform.
Explore Experiments