...

A machine learning approach to model channel hydraulic geometry

Arash Modaresi Rad, Mike Johnson

AHG/FHG Data

Using the USGS “HYDRoacoustic dataset in support of the Surface Water Oceanographic Topography satellite mission” (HYDRoSWOT), Johnson et al. (2023) provided a robust dataset of fitted AHG parameters, derived using the best combination of linear fits, nonlinear least-squares fits, and a constrained genetic algorithm. We use the derived FHG coefficients and exponents to build a machine learning model that translates catchment and station-surrounding surface, subsurface, climate, and anthropogenic characteristics into FHG coefficients and exponents. These coefficients represent in-channel geometric characteristics that can be used to inform hydrological models and enhance flood mapping.

This is done by considering the continuity relations that can be derived from the power-law forms above.
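For context, assuming the standard at-a-station hydraulic geometry forms of Leopold and Maddock (1953), which the fitted parameters a, b, c, f, k, and m below correspond to, the power laws and the continuity constraints they imply are:

```latex
W = aQ^{b}, \qquad D = cQ^{f}, \qquad V = kQ^{m},
\qquad \text{and since } Q = W\,D\,V: \qquad a\,c\,k = 1, \quad b + f + m = 1.
```

Here W, D, and V are width, mean depth, and mean velocity at discharge Q.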

In-channel shape

Dingman (2007) derived a symmetrical channel cross section as

z(x) = ΨBF (2|x| / WBF)^r

where z is the height of the bed above the lowest point of the channel, x is the horizontal distance from the center, ΨBF is the bankfull maximum depth, and WBF is the bankfull width. The exponent r reflects the cross-section shape:

A triangle is represented by r = 1, a parabola by r = 3, and forms with increasingly flatter bottoms and steeper banks by increasing values of r; in the limit as r → ∞, the channel is rectangular.
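As a quick illustration of how r controls the shape, the sketch below evaluates the bed profile assuming Dingman's (2007) form z(x) = ΨBF (2|x| / WBF)^r; the bankfull depth and width values are illustrative, not from the paper.

```python
# Illustrative sketch: how the exponent r controls cross-section shape,
# assuming z(x) = psi_bf * (2|x| / w_bf)**r. Values are stand-ins.
import numpy as np

def bed_elevation(x, r, psi_bf=2.0, w_bf=10.0):
    """Height of the bed above the channel's lowest point at offset x."""
    return psi_bf * (2.0 * np.abs(x) / w_bf) ** r

x = np.linspace(-5.0, 5.0, 101)
for r in (1, 3, 10):  # larger r -> flatter bottom, steeper banks
    z = bed_elevation(x, r)
    print(f"r={r:>2}: z at bank = {z[0]:.2f}, z at center = {z[50]:.2f}")
```

At the banks (|x| = WBF/2) the profile reaches ΨBF for every r; what changes with r is how quickly the bed rises away from the center.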

Here we want to establish a relationship between watershed/site characteristics and the six parameters of the above equations. A map of the spatial distribution of the data is shown.


We can look at the density of stations accumulated in different stream networks: the higher the concentration of stations within a river network, the bigger the dot, and vice versa. This resulted in 1432 distinct river systems out of 3543 total.

Larger dots represent a higher concentration of stations within a single river network; smaller dots, a lower concentration.

Soil dataset

We also looked into soil texture, water retention, and hydraulic conductivity characteristics as point estimates at each station. These data were collected from the POLARIS 30 m resolution dataset covering the entire CONUS.

Landcover and DEM dataset

...

Data Visualization

We can plot the distributions of the predictor variables from the reference fabric attributes dataset. These values are power transformed and standardized. The three colors represent low, mid, and high classes of the corresponding exponent or coefficient values, with the two thresholds derived from the 33rd and 66th percentiles of those values.
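The low/mid/high classification described above can be sketched as follows; the values here are synthetic stand-ins for a fitted FHG exponent, and the 33rd/66th percentile thresholds are as stated in the text.

```python
# Minimal sketch of the tercile classification: split an exponent's values
# into "low", "mid", "high" at its 33rd and 66th percentiles.
import numpy as np

rng = np.random.default_rng(0)
b_exponent = rng.lognormal(mean=-1.5, sigma=0.5, size=500)  # synthetic stand-in

q33, q66 = np.quantile(b_exponent, [0.33, 0.66])
classes = np.select(
    [b_exponent <= q33, b_exponent <= q66],  # conditions checked in order
    ["low", "mid"],
    default="high",
)
print({c: int((classes == c).sum()) for c in ("low", "mid", "high")})
```

By construction each class holds roughly a third of the stations, which keeps the three colors balanced in the distribution plots.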

Parameters related to river width

a and b

Parameters related to river depth

c and f

Parameters related to river velocity

k and m

 

Machine learning approach


Here we implement different machine learning approaches to model the six FHG coefficients and exponents using different datasets.

This dataset comprises upstream drainage area characteristics.

During data preprocessing, due to NaN values in the hwnodesqkm parameter, a dummy variable "hwnodesqkm_dummy" is introduced and the NaN values are converted to 0.
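This missing-value flagging step can be sketched with pandas as below; the example frame is hypothetical, only the hwnodesqkm / hwnodesqkm_dummy column names come from the text.

```python
# Flag missing hwnodesqkm values with a dummy indicator, then fill with 0,
# so the model can distinguish "truly zero" from "missing".
import numpy as np
import pandas as pd

df = pd.DataFrame({"hwnodesqkm": [0.8, np.nan, 1.3, np.nan]})  # toy values
df["hwnodesqkm_dummy"] = df["hwnodesqkm"].isna().astype(int)
df["hwnodesqkm"] = df["hwnodesqkm"].fillna(0)
print(df)
```

Keeping the indicator column lets a tree-based model learn a separate split for stations where the attribute was unavailable.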

The POLARIS soil dataset is comprised of the following features:

the air-entry value (cm⁻¹)

We extract point, buffer, and catchment elevation characteristics from the DEM dataset:

The EPA StreamCat dataset has over 600 metrics, including local catchment (Cat), watershed (Ws), and special metrics, available for ~2.65 million streams.


We also extracted various characteristics by spatially and temporally aggregating data over the entire period within a 500 m buffer around each station.


Here we use different flow percentiles from the National Water Model (NWM) v2.1:

We also use different flood frequency values from NWIS:

Model Performance

XGBoost Model 

We used 15% of the data for testing, 12.75% for validation, and the rest for training.

We implemented a grid search during k-fold cross-validation to find the best hyperparameters over the space described below:

'max_depth': [3, 4, 5, 6, 7],

'learning_rate': [0.001, 0.01, 0.05],

'n_estimators': [1000, 2000, 3000, 5000, 6000],

'colsample_bytree': [0.3, 0.5, 0.7]

Each k-fold cross-validation is repeated three times to ensure randomness plays little part in the results.

Validation fit and accuracy of "b" parameter

Testing fit and accuracy of "b" parameter

Feature Importance

A comprehensive analysis is provided on the Importance page.

We use three techniques to assess feature importance:

1- Permutation approach: this provides relative importance scores for the training dataset and is model-agnostic, i.e., independent of the particular model used.

2- XGBoost approach: this provides relative importance scores based on the trained XGBoost model, giving deeper insight into the features the model actually selects.

3- SHAP tree-based approach: we use a tree-based, game-theoretic algorithm (TreeSHAP) to determine XGBoost feature importances. This also allows us to examine interactions between variables.
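Technique (1) can be sketched with scikit-learn's model-agnostic implementation; the data and the random-forest stand-in model below are hypothetical, used only to show the call pattern.

```python
# Permutation importance: shuffle one feature at a time and measure the
# drop in model score; larger drops mean more important features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)  # only feature 0 matters

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # feature 0 should dominate
```

The same call works unchanged with a fitted XGBoost regressor in place of the random forest, which is what makes the technique model-agnostic.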

Permutation 

XGBoost