It’s been pretty quiet over here for a long time. But I have been busy beavering away at many interesting projects at NIWA, including one where we developed a new method for identifying when your regression model is starting to make things up (or, more technically, when it is extrapolating beyond the bounds of the dataset).
Regression models are used across the environmental sciences to find patterns between a response and its potential predictors. These patterns can then be used to predict the response across broad areas or under new environmental conditions. Our paper compares the performance of two flexible regression techniques, random forests and multivariate adaptive regression splines, when predicting across a deliberately induced spectrum from interpolation to extrapolation. Various data sets were each divided into two groups in three ways: geographically, environmentally and at random. Models were trained on one half of the data and tested on the other. Both methods incorporate nonlinear and interacting relationships, but both suffer from unquantified uncertainty when extrapolating. Random forests always performed better than multivariate adaptive regression splines when interpolating within environmental space and when extrapolating in geographical space. Random forest models were transferable in geographic space but not to environmental conditions outside the training data. Neither technique was successful when extrapolating across environmental gradients. The paper also describes and tests a new method to calculate degree of extrapolation: a value quantifying interpolation versus extrapolation for each prediction from either regression technique. The method can be used to indicate the risk of spurious predictions when predicting at new locations (e.g., nationally) or under new environmental conditions (e.g., climatic change).
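To give a feel for what "inside or outside" means in practice, here is a minimal Python sketch of a much cruder check than the one in the paper. It is not our degree-of-extrapolation method. It simply asks how far a new point falls outside the range of each predictor in the training data. The function names, the two predictors (temperature and rainfall) and all of the numbers are made up for illustration.

```python
def training_ranges(rows):
    """Return the (min, max) of each predictor column in the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def extrapolation_score(point, ranges):
    """Largest distance outside the training range across all predictors.

    Distances are scaled by the width of each predictor's training range,
    so 0.0 means the point is inside the range of every predictor and 0.5
    means it sits half a range-width beyond the nearest training limit.
    """
    score = 0.0
    for value, (lo, hi) in zip(point, ranges):
        width = hi - lo
        if value < lo:
            gap = lo - value
        elif value > hi:
            gap = value - hi
        else:
            gap = 0.0
        if gap > 0.0:
            # A predictor with no spread in training gives an unbounded score.
            excess = gap / width if width > 0 else float("inf")
            score = max(score, excess)
    return score


# Hypothetical training sites: (mean temperature, annual rainfall).
train = [(10.0, 200.0), (15.0, 800.0), (20.0, 500.0)]
ranges = training_ranges(train)

inside = extrapolation_score((12.0, 300.0), ranges)   # within both ranges
outside = extrapolation_score((25.0, 500.0), ranges)  # warmer than any training site
```

A range check like this treats each predictor separately, so it will miss new combinations of conditions that sit inside every individual range but far from any training site. Treat it as a first-pass warning flag only.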
Booker, D.J. & A.L. Whitehead. (2018). Inside or outside: quantifying extrapolation across river networks. Water Resources Research. doi:10.1029/2018WR023378 [online]