...
A machine learning approach to model channel hydraulic geometry
Arash Modaresi Rad, Mike Johnson
AHG/FHG Data
Using the USGS “HYDRoacoustic dataset in support of the Surface Water Oceanographic Topography satellite” mission (HYDRoSWOT), Johnson et al. (2023) provided a robust dataset of fitted AHG parameters that uses the best combination of linear fits, nonlinear least-squares fits, and a constrained genetic algorithm. We use the derived FHG coefficients and exponents to build a machine learning model that translates catchment and station-surrounding surface, subsurface, climate, and anthropogenic characteristics into FHG coefficients and exponents. These coefficients represent in-channel geometric characteristics that can be used to inform hydrological models and enhance flood mapping.
This is done by considering the continuity relations that can be derived from the above power-law forms.
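The power-law forms referred to here are not reproduced in this section. As a reading aid, and assuming the classical at-a-station hydraulic geometry relations of Leopold and Maddock (1953) with the parameter pairs used later on this page (a and b for width, c and f for depth, k and m for velocity), the relations and the continuity constraints that follow from them can be written as below. The symbols W, D, V, and Q (top width, mean depth, mean velocity, discharge) are notation assumed here, not taken from the source.

```latex
W = a\,Q^{b}, \qquad D = c\,Q^{f}, \qquad V = k\,Q^{m}

Q = W\,D\,V = (a\,c\,k)\,Q^{\,b+f+m}
\;\Longrightarrow\;
a\,c\,k = 1, \qquad b + f + m = 1
```

The second line simply substitutes the three power laws into Q = W D V; for the identity to hold at every discharge, the coefficients must multiply to one and the exponents must sum to one.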
In-channel shape
Dingman (2007) derived a symmetrical channel cross section as
$$ z = \Psi_{BF} \left( \frac{2}{W_{BF}} \right)^{r} x^{r}, \qquad 0 \le x \le \frac{W_{BF}}{2} $$
where z is the height of the bed above the lowest channel elevation point, x is the horizontal distance from the center, ΨBF is the bankfull maximum depth, and WBF is the bankfull width. The exponent r reflects the cross-section shape:
A triangle is represented by r = 1, a parabola by r = 2, and forms with increasingly flatter bottoms and steeper banks by increasing values of r; in the limit as r → ∞, the channel is rectangular.
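The effect of r on the cross-section shape can be checked numerically. The following is a minimal sketch, assuming the power-law form z = ΨBF (2|x| / WBF)^r on |x| ≤ WBF/2; the function name `bed_height` and the example channel dimensions are hypothetical and only serve the illustration.

```python
def bed_height(x, psi_bf, w_bf, r):
    """Bed height z above the lowest channel point at distance x from the center.

    Assumes the power-law cross section z = psi_bf * (2|x| / w_bf) ** r,
    valid for |x| <= w_bf / 2 (symmetric about the center line).
    """
    half_width = w_bf / 2.0
    if abs(x) > half_width:
        raise ValueError("x lies outside the bankfull channel")
    return psi_bf * (abs(x) / half_width) ** r


# Hypothetical channel: 2 m bankfull maximum depth, 20 m bankfull width
for r in (1, 2, 10):
    profile = [round(bed_height(x, 2.0, 20.0, r), 3) for x in (0, 2.5, 5, 7.5, 10)]
    print(f"r = {r:>2}: {profile}")
```

With r = 1 the bed rises linearly from the center to the bank (triangle), with r = 2 it rises quadratically (parabola), and with large r it stays near zero until close to the bank, approaching a rectangle.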
Here we want to establish a relationship between watershed/site characteristics and the 6 parameters of the above equations. A map of the spatial distribution of the data is shown.
We can also look at how densely these stations are concentrated within individual stream networks. The more stations a river network contains, the bigger its dot. This resulted in 1432 distinct river systems out of 3543 total.
Soil dataset
We also looked into soil texture, water retention, and hydraulic conductivity characteristics as point estimates at each station. These data were collected from the POLARIS 30 m resolution dataset for the entire CONUS.
Landcover and DEM dataset
...
Data Visualization
The reference fabric
We can plot the distribution of predictor variables from the reference fabric attributes dataset. These values are power-transformed and standardized. The three colors represent a low, mid, and high classification of the values of the corresponding exponent or coefficient. The two thresholds are the 33rd and 66th percentiles of the exponent or coefficient values.
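The low/mid/high coloring can be sketched as follows. This is an illustrative example using only the Python standard library; the helper names `standardize` and `tercile_labels` and the sample exponent values are hypothetical, the power transform is omitted for brevity, and the percentile interpolation method is an assumption since the source does not state one.

```python
import statistics


def standardize(values):
    """Z-score standardization: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def tercile_labels(values):
    """Label each value low / mid / high using the 33rd and 66th percentiles."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low_cut, high_cut = cuts[32], cuts[65]  # cuts[i] is the (i + 1)-th percentile
    labels = []
    for v in values:
        if v <= low_cut:
            labels.append("low")
        elif v <= high_cut:
            labels.append("mid")
        else:
            labels.append("high")
    return labels


# Hypothetical values of one fitted exponent at nine stations
b_values = [0.05, 0.12, 0.18, 0.22, 0.27, 0.31, 0.36, 0.44, 0.52]
labels = tercile_labels(b_values)
z_scores = standardize(b_values)
print(list(zip(b_values, labels)))
```

The labels are computed on the target (exponent or coefficient), while standardization applies to the predictors being plotted; the two steps are independent of each other.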
Parameters related to river width
a and b
Parameters related to river depth
c and f
Parameters related to river velocity
k and m
Machine learning approach
Here we implement different machine learning approaches to model the 6 FHG parameters using different datasets.
- The reference fabric data
This dataset is comprised of upstream drainage area characteristics
Definitions
hwnodesqkm --> Area that drains to the headwater node, in square kilometers (contains NaN values)
slopelenkm --> Flow-line length used to calculate slope, in kilometers
slope --> Slope of the flowline from smoothed elevation (unitless)
streamorde --> Modified Strahler stream order
streamleve --> Stream level
totdasqkm --> Total cumulative area, in square kilometers
pathlength --> Distance downstream to network end
arbolatesu --> Arbolate sum, the sum of the lengths of all digitized flowlines upstream from the downstream end of the immediate flowline, in kilometers
areasqkm --> Catchment area, in square kilometers
lengthkm --> From NHDFlowline feature
roughness --> Manning's roughness
During data preprocessing, because the hwnodesqkm parameter contains NaN values, a dummy variable "hwnodesqkm_dummy" is introduced and the NaN values are converted to 0.
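The missing-value handling described above can be sketched as below. This is a minimal standard-library illustration, not the project's preprocessing code; the helper name `add_nan_dummy`, the sample rows, and the choice of 1 to flag a missing value are assumptions.

```python
import math


def add_nan_dummy(rows, column):
    """Add '<column>_dummy' (assumed: 1 where the value was missing, else 0)
    and replace the missing values of `column` with 0."""
    dummy = f"{column}_dummy"
    cleaned = []
    for row in rows:
        value = row.get(column)
        missing = value is None or (isinstance(value, float) and math.isnan(value))
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row[dummy] = 1 if missing else 0
        new_row[column] = 0.0 if missing else value
        cleaned.append(new_row)
    return cleaned


# Hypothetical records for three stations
rows = [
    {"hwnodesqkm": 12.4, "slope": 0.002},
    {"hwnodesqkm": float("nan"), "slope": 0.010},
    {"hwnodesqkm": 3.1, "slope": 0.004},
]
cleaned = add_nan_dummy(rows, "hwnodesqkm")
print(cleaned[1])
```

Keeping the indicator column lets a tree-based model distinguish a true zero from an imputed one. With pandas, the same idea is usually expressed with `isna()` followed by `fillna(0)`.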
- The soil dataset
The POLARIS soil dataset is comprised of the following features:
Definitions
alpha_mean_0_5 --> parameter of the van Genuchten equation corresponding approximately to the inverse of the air-entry value, (cm-1)
bd_mean_0_5 --> the soil bulk density, (g cm-3)
clay_mean_0_5 --> % clay
hb_mean_0_5 --> Brooks-Corey parameter related to the air-entry pressure (cm)
ksat_mean_0_5 --> the effective saturated hydraulic conductivity, (cm hr-1)
lambda_mean_0_5 --> Brooks-Corey pore size distribution index, (dimensionless)
n_mean_0_5 --> the empirical shape-defining parameter in the van Genuchten equation, (dimensionless)
om_mean_0_5 --> the organic matter content, (%)
ph_mean_0_5 --> soil pH
sand_mean_0_5 --> % sand
silt_mean_0_5 --> % silt
theta_r_mean_0_5 --> the residual soil water content, (cm3 cm-3)
theta_s_mean_0_5 --> the saturated soil water content, (cm3 cm-3)
During data preprocessing, because the soil parameters contain NaN values in some coastal areas, a dummy variable is introduced and the NaN values are converted to 0.
- The DEM dataset
We extract point, buffer, and catchment elevation characteristics from the DEM dataset:
Definitions
elevation --> elevation (m)
aspect --> aspect
slope --> slope
- The StreamCat dataset
The EPA StreamCat dataset has over 600 metrics, including local catchment (Cat), watershed (Ws), and special metrics, available for ~2.65 million streams.
This dataset contains climate, land surface, subsurface, and anthropogenic variables.
- Spatially and temporally aggregating data
We also extracted additional characteristics by aggregating data spatially, within a 500 m buffer around each station, and temporally, over the entire period.
Definitions
SM_ave --> average soil moisture, 0-10 cm underground (m3 m-3) (source: FLDAS)
SM_max --> maximum soil moisture, 0-10 cm underground (m3 m-3) (source: FLDAS)
SM_min --> minimum soil moisture, 0-10 cm underground (m3 m-3) (source: FLDAS)
ST_ave --> average soil temperature, 0-10 cm underground (K) (source: FLDAS)
ST_max --> maximum soil temperature, 0-10 cm underground (K) (source: FLDAS)
ST_min --> minimum soil temperature, 0-10 cm underground (K) (source: FLDAS)
Q_mean --> average storm surface runoff (kg m-2 s-1) (source: FLDAS)
Q_max --> maximum storm surface runoff (kg m-2 s-1) (source: FLDAS)
Q_min --> minimum storm surface runoff (kg m-2 s-1) (source: FLDAS)
Qb_mean --> average baseflow-groundwater runoff (kg m-2 s-1) (source: FLDAS)
Qb_max --> maximum baseflow-groundwater runoff (kg m-2 s-1) (source: FLDAS)
Qb_min --> minimum baseflow-groundwater runoff (kg m-2 s-1) (source: FLDAS)
ET_ave --> total evapotranspiration (kg m-2 per 8 days) (source: MODIS)
AI --> aridity index
LAI_ave --> average Leaf Area Index (source: MODIS)
LAI_min --> minimum Leaf Area Index (source: MODIS)
LAI_max --> maximum Leaf Area Index (source: MODIS)
Precip_ave --> average of the 30-year mean monthly total precipitation (including rain and melted snow) (source: PRISM)
Precip_min --> minimum of the 30-year mean monthly total precipitation (including rain and melted snow) (source: PRISM)
Precip_max --> maximum of the 30-year mean monthly total precipitation (including rain and melted snow) (source: PRISM)
NDVI_ave --> average Normalized difference vegetation index (source: Landsat)
NDVI_min --> minimum Normalized difference vegetation index (source: Landsat)
NDVI_max --> maximum Normalized difference vegetation index (source: Landsat)
elevation_ave --> average elevation (source: NASA SRTM Digital Elevation 30m)
slope_ave --> average slope (source: NASA SRTM Digital Elevation 30m)
aspect_ave --> median aspect (source: NASA SRTM Digital Elevation 30m)
During data preprocessing, because the hwnodesqkm parameter contains NaN values, a dummy variable "hwnodesqkm_dummy" is introduced and the NaN values are converted to 0.
- The USGS and NWM streamflow dataset
Here we use different flow percentiles from the National Water Model (NWM) 2.1:
Definitions
nwm21_min --> Minimum flow value from NWM2.1
nwm21_25 --> 25th percentile flow value from NWM2.1
nwm21_50 --> 50th percentile flow value from NWM2.1
nwm21_75 --> 75th percentile flow value from NWM2.1
nwm21_max --> Maximum flow value from NWM2.1
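The five flow statistics listed above can be derived from a flow series as sketched below. This is only an illustration: the helper name `flow_summary` and the sample flows are hypothetical, and the interpolation method used for the percentiles is an assumption because the source does not specify one.

```python
import statistics


def flow_summary(flows):
    """Return the minimum, quartiles, and maximum of a flow series,
    keyed by the feature names used in the definitions above."""
    q25, q50, q75 = statistics.quantiles(flows, n=4, method="inclusive")
    return {
        "nwm21_min": min(flows),
        "nwm21_25": q25,
        "nwm21_50": q50,
        "nwm21_75": q75,
        "nwm21_max": max(flows),
    }


# Hypothetical flow values for one reach
flows = [1.0, 2.0, 3.0, 4.0, 5.0]
summary = flow_summary(flows)
print(summary)
```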
We also use different flood frequency values from NWIS:
Definitions
ff_1.5 --> 1.5 year flood frequency
ff_2 --> 2 year flood frequency
ff_5 --> 5 year flood frequency
ff_10 --> 10 year flood frequency
ff_15 --> 15 year flood frequency
ff_25 --> 25 year flood frequency
ff_35 --> 35 year flood frequency
ff_50 --> 50 year flood frequency
ff_60 --> 60 year flood frequency
ff_75 --> 75 year flood frequency
ff_85 --> 85 year flood frequency
ff_90 --> 90 year flood frequency
ff_95 --> 95 year flood frequency
ff_98 --> 98 year flood frequency
ff_99 --> 99 year flood frequency
ff_100 --> 100 year flood frequency
ff_200 --> 200 year flood frequency
ff_500 --> 500 year flood frequency
ff_1000 --> 1000 year flood frequency
Model Performance
XGBoost Model
We used 15% of the data for testing, 12.75% for validation, and the remaining 72.25% for training.
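The 12.75% validation share is what results from holding out 15% of the data for testing and then 15% of the remaining 85% for validation (0.85 × 0.15 = 0.1275). Whether the split was produced this way is an assumption; the sketch below only shows that a two-stage split reproduces the stated proportions. The function name `split_indices` and the sample size are hypothetical.

```python
import random


def split_indices(n, test_frac=0.15, val_frac_of_rest=0.15, seed=0):
    """Two-stage split: hold out the test set first, then carve the
    validation set out of what remains."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_frac)
    n_val = round((n - n_test) * val_frac_of_rest)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test


train, val, test = split_indices(10_000)
print(len(train) / 10_000, len(val) / 10_000, len(test) / 10_000)
```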
We implemented a grid search during k-fold cross-validation to find the best hyperparameters over the search space described below:
'max_depth': [3, 4, 5, 6, 7],
'learning_rate': [0.001, 0.01, 0.05],
'n_estimators': [1000, 2000, 3000, 5000, 6000],
'colsample_bytree': [0.3, 0.5, 0.7]
Each k-fold cross-validation is repeated 3 times to reduce the influence of random splits on the results.
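To give a sense of the size of this search, the sketch below enumerates the grid and generates repeated k-fold splits using only the standard library. The value k = 5, the sample size, and the helper names `grid_candidates` and `repeated_kfold` are assumptions made for illustration; the source states neither the number of folds nor the tooling used.

```python
import itertools
import random

# Search space as listed above
param_grid = {
    "max_depth": [3, 4, 5, 6, 7],
    "learning_rate": [0.001, 0.01, 0.05],
    "n_estimators": [1000, 2000, 3000, 5000, 6000],
    "colsample_bytree": [0.3, 0.5, 0.7],
}


def grid_candidates(grid):
    """Yield every combination of hyperparameter values as a dict."""
    keys = list(grid)
    for combo in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, combo))


def repeated_kfold(n_samples, n_splits, n_repeats, seed=0):
    """Yield (train, held_out) index lists; each repeat reshuffles the data."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[i::n_splits] for i in range(n_splits)]
        for i, held_out in enumerate(folds):
            train = [j for k, fold in enumerate(folds) if k != i for j in fold]
            yield train, held_out


candidates = list(grid_candidates(param_grid))
splits = list(repeated_kfold(n_samples=100, n_splits=5, n_repeats=3))  # k = 5 is assumed
print(len(candidates), "candidates x", len(splits), "splits =",
      len(candidates) * len(splits), "model fits")
```

In practice a search like this is commonly run with scikit-learn's `GridSearchCV` combined with `RepeatedKFold`; the manual version here is only meant to make the bookkeeping explicit.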
Validation fit and accuracy of "b" parameter
Testing fit and accuracy of "b" parameter
Feature Importance
A comprehensive analysis is provided on the Importance page.
We use three techniques to assess feature importance:
1- Permutation approach: This approach provides relative importance scores for the training dataset and is model-agnostic, so it can be applied to any fitted model.
2- XGBoost approach: This approach provides relative importance scores for the training dataset based on the trained XGBoost model, giving deeper insight into how the model itself selects features.
3- SHAP tree-based approach: We use a tree algorithm based on game theory to determine XGBoost feature importances. This also allows us to look at interactions between variables.
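The permutation approach in item 1 can be illustrated with a small, self-contained sketch: shuffle one feature at a time and measure how much the prediction error grows. The toy model, the synthetic data, and the function names below are hypothetical stand-ins; this is not the analysis code behind the Importance page.

```python
import random


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            score = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(score - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Stand-in for a fitted model: strong use of feature 0, weak use of 1, none of 2."""
    return 3.0 * row[0] + 0.5 * row[1]


data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random(), data_rng.random()] for _ in range(200)]
y = [toy_model(row) for row in X]
scores = permutation_importance(toy_model, X, y)
print(scores)
```

The ranking follows how strongly the model relies on each feature, and a feature the model ignores scores zero. scikit-learn offers the same technique as `sklearn.inspection.permutation_importance`.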