...

A machine learning approach to model channel hydraulic geometry

Arash Modaresi Rad, Mike Johnson

AHG/FHG Data

Using the USGS “HYDRoacoustic dataset in support of the Surface Water Oceanographic Topography satellite mission” (HYDRoSWOT), Johnson et al. (2023) provided a robust dataset of fitted AHG parameters, derived using the best combination of linear fits, nonlinear least-squares fits, and a constrained genetic algorithm. We use the derived FHG coefficients and exponents to build a machine learning model that translates catchment and station-surrounding surface, subsurface, climate, and anthropogenic characteristics into FHG coefficients and exponents. These coefficients represent in-channel geometric characteristics that can be used to inform hydrological models and enhance flood mapping.

This is done by considering the continuity relations that can be derived from the power-law forms above.
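For context, assuming the standard at-a-station hydraulic geometry forms of Leopold and Maddock (1953), which the fitted parameters a, b, c, f, k, and m below correspond to, the power laws and the continuity constraints they imply are:

```latex
W = aQ^{b}, \qquad D = cQ^{f}, \qquad V = kQ^{m},
\qquad \text{and since } Q = W\,D\,V: \qquad a\,c\,k = 1, \quad b + f + m = 1.
```

Here W, D, and V are width, mean depth, and mean velocity at discharge Q.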

In-channel shape

Dingman (2007) derived a symmetrical channel cross section as

z(x) = ΨBF (2|x| / WBF)^r

where z is the height of the bed above the lowest point of the channel, x is the horizontal distance from the center, ΨBF is the bankfull maximum depth, and WBF is the bankfull width. The exponent r reflects the cross-section shape:

A triangle is represented by r = 1, a parabola by r = 3, and forms with increasingly flatter bottoms and steeper banks by increasing values of r; in the limit as r → ∞, the channel is rectangular.
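As a quick illustration of how r controls the shape, the sketch below evaluates the bed profile assuming Dingman's (2007) form z(x) = ΨBF (2|x| / WBF)^r; the bankfull depth and width values are illustrative, not from the paper.

```python
# Illustrative sketch: how the exponent r controls cross-section shape,
# assuming z(x) = psi_bf * (2|x| / w_bf)**r. Values are stand-ins.
import numpy as np

def bed_elevation(x, r, psi_bf=2.0, w_bf=10.0):
    """Height of the bed above the channel's lowest point at offset x."""
    return psi_bf * (2.0 * np.abs(x) / w_bf) ** r

x = np.linspace(-5.0, 5.0, 101)
for r in (1, 3, 10):  # larger r -> flatter bottom, steeper banks
    z = bed_elevation(x, r)
    print(f"r={r:>2}: z at bank = {z[0]:.2f}, z at center = {z[50]:.2f}")
```

At the banks (|x| = WBF/2) the profile reaches ΨBF for every r; what changes with r is how quickly the bed rises away from the center.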

Here we want to establish a relationship between watershed/site characteristics and the six parameters of the above equations. A map of the spatial distribution of the data is shown.


We can look at the density of stations accumulated in different stream networks: the higher the concentration of stations within a river network, the bigger the dot, and vice versa. This resulted in 1432 distinct river systems out of 3543 total.

Larger dots represent a higher concentration of stations within a single river network; smaller dots, a lower concentration.

Soil dataset

We also looked into soil texture, water retention, and hydraulic conductivity characteristics as point estimates at each station. These data were collected from the POLARIS 30 m resolution dataset covering the entire CONUS.

Landcover and DEM dataset

...

Data Visualization

We can plot the distributions of the predictor variables from the reference fabric attributes dataset. These values are power transformed and standardized. The three colors represent low, mid, and high classes of the corresponding exponent or coefficient values, with the two thresholds derived from the 33rd and 66th percentiles of those values.
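The low/mid/high classification described above can be sketched as follows; the values here are synthetic stand-ins for a fitted FHG exponent, and the 33rd/66th percentile thresholds are as stated in the text.

```python
# Minimal sketch of the tercile classification: split an exponent's values
# into "low", "mid", "high" at its 33rd and 66th percentiles.
import numpy as np

rng = np.random.default_rng(0)
b_exponent = rng.lognormal(mean=-1.5, sigma=0.5, size=500)  # synthetic stand-in

q33, q66 = np.quantile(b_exponent, [0.33, 0.66])
classes = np.select(
    [b_exponent <= q33, b_exponent <= q66],  # conditions checked in order
    ["low", "mid"],
    default="high",
)
print({c: int((classes == c).sum()) for c in ("low", "mid", "high")})
```

By construction each class holds roughly a third of the stations, which keeps the three colors balanced in the distribution plots.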

Parameters related to river width

a and b

Parameters related to river depth

c and f

Parameters related to river velocity

k and m

 

Machine learning approach


Here we implement different machine learning approaches to model the six FHG coefficients and exponents using different datasets.

This dataset comprises upstream drainage area characteristics.

During data preprocessing, due to NaN values in the hwnodesqkm parameter, a dummy variable "hwnodesqkm_dummy" is introduced and the NaN values are converted to 0.
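This missing-value flagging step can be sketched with pandas as below; the example frame is hypothetical, only the hwnodesqkm / hwnodesqkm_dummy column names come from the text.

```python
# Flag missing hwnodesqkm values with a dummy indicator, then fill with 0,
# so the model can distinguish "truly zero" from "missing".
import numpy as np
import pandas as pd

df = pd.DataFrame({"hwnodesqkm": [0.8, np.nan, 1.3, np.nan]})  # toy values
df["hwnodesqkm_dummy"] = df["hwnodesqkm"].isna().astype(int)
df["hwnodesqkm"] = df["hwnodesqkm"].fillna(0)
print(df)
```

Keeping the indicator column lets a tree-based model learn a separate split for stations where the attribute was unavailable.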

The POLARIS soil dataset is comprised of the following features:

the air-entry value (cm⁻¹)

We extract point, buffer, and catchment elevation characteristics from the DEM dataset:

The EPA StreamCat dataset has over 600 metrics, including local catchment (Cat), watershed (Ws), and special metrics, available for ~2.65 million streams.


We also extracted various characteristics by spatially and temporally aggregating data over the entire period within a 500 m buffer around each station.


Here we use different flow percentiles from the National Water Model (NWM) v2.1:

We also use different flood frequency values from NWIS:

Model Performance

XGBoost Model 

We used 15% of the data for testing, 12.75% for validation, and the rest for training.

We implemented a grid search during k-fold cross-validation to find the best hyperparameters over the space described below:

'max_depth': [3, 4, 5, 6, 7],

'learning_rate': [0.001, 0.01, 0.05],

'n_estimators': [1000, 2000, 3000, 5000, 6000],

'colsample_bytree': [0.3, 0.5, 0.7]

Each k-fold cross-validation is repeated three times to ensure randomness plays little part in the results.

Validation fit and accuracy of "b" parameter

Testing fit and accuracy of "b" parameter

Feature Importance

A comprehensive analysis is provided on the Importance page.

We use three techniques to assess feature importance:

1- Permutation approach: this provides relative importance scores for the training dataset and is model-agnostic, i.e., independent of the particular model used.

2- XGBoost approach: this provides relative importance scores based on the trained XGBoost model, giving deeper insight into the features the model actually selects.

3- SHAP tree-based approach: we use a tree-based, game-theoretic algorithm (TreeSHAP) to determine XGBoost feature importances. This also allows us to examine interactions between variables.
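Technique (1) can be sketched with scikit-learn's model-agnostic implementation; the data and the random-forest stand-in model below are hypothetical, used only to show the call pattern.

```python
# Permutation importance: shuffle one feature at a time and measure the
# drop in model score; larger drops mean more important features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)  # only feature 0 matters

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # feature 0 should dominate
```

The same call works unchanged with a fitted XGBoost regressor in place of the random forest, which is what makes the technique model-agnostic.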

Permutation 

XGBoost