Amy Whitehead's Research

the ecological musings of a conservation biologist


New paper – Inside or outside: quantifying extrapolation across river networks

It’s been pretty quiet over here for a long time. But I have been busy beavering away at many interesting projects at NIWA, including a project where we developed a new method for identifying when your regression model is starting to make things up (or more technically, extrapolating beyond the bounds of the dataset).

Regression models are used across the environmental sciences to find patterns between a response and its potential predictors. These patterns can then be used to predict the response across broad areas or under new environmental conditions. Our paper compares the performance of two flexible regression techniques when predicting across a deliberately induced spectrum of interpolation to extrapolation. Various datasets were each divided in two using geographical, environmental and random groupings; models were trained on one half of the data and tested on the other. Both methods can incorporate nonlinear and interacting relationships, but suffer from unquantified uncertainty when extrapolating. Random forests always performed better than multivariate adaptive regression splines when interpolating within environmental space and when extrapolating in geographical space. Random forests models were transferable in geographic space, but not to environmental conditions outside the training data; neither technique was successful when extrapolating across environmental gradients. The paper also describes and tests a new method to calculate a degree of extrapolation: a value quantifying interpolation versus extrapolation for each prediction from either regression technique. This method can be used to indicate the risk of spurious predictions when predicting at new locations (e.g. nationally) or under new environmental conditions (e.g. climate change).

Booker, D.J. & A.L. Whitehead. (2018). Inside or outside: quantifying extrapolation across river networks. Water Resources Research. doi:10.1029/2018WR023378 [online]
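The data-splitting idea at the heart of the comparison can be sketched in base R. The site data below are invented stand-ins (the actual analysis used river-network datasets, with random forests and MARS models fitted to each training half).

```r
set.seed(1)
# hypothetical sites with a location and an environmental gradient
sites <- data.frame(easting   = runif(100, 0, 100),
                    elevation = runif(100, 0, 2000))

# random split: predictions for the held-out half mostly interpolate
random.train <- sample(nrow(sites)) <= nrow(sites) / 2

# geographical split: train on the western half, predict the east
geographic.train <- sites$easting < median(sites$easting)

# environmental split: train on low elevations, predict high ones,
# deliberately forcing extrapolation along the gradient
environmental.train <- sites$elevation < median(sites$elevation)
```

Each split trains on half the data, but they induce very different amounts of extrapolation when predicting the other half.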


Extracting raster data using a shapefile

I recently had an email from a PhD student in Austria who had a raster of the distribution of Douglas-fir in Europe, with cells coded as presence (1) or absence (0), and wanted to know what proportion of each European country was covered by this species. In other words, they needed to count the number of cells with 1 and 0 within each country. I’ve put together a dummy example below that shows how to use an R script to extract the number of raster cells in each country that meet a certain condition.


Douglas Fir (source: Pixabay)

Essentially the script works through the following steps:

  1. Loads the relevant shapefile and raster datasets.
  2. Identifies all of the countries within the shapefile.
  3. Within a loop, masks the presence-absence raster by each country and counts the number of cells that meet the required condition.
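The steps above can be sketched with a self-contained dummy version, assuming the raster and sp packages; the two square “countries” here stand in for the real European shapefile, and the presence-absence layer is simulated.

```r
library(raster)
library(sp)

set.seed(1)
# dummy presence-absence raster covering a 2 x 1 degree area
pa.raster <- raster(nrows = 10, ncols = 20, xmn = 0, xmx = 2,
                    ymn = 0, ymx = 1, vals = rbinom(200, 1, 0.3))

# two square "countries" standing in for the European shapefile
make.country <- function(xmin, xmax, id) {
  Polygons(list(Polygon(cbind(c(xmin, xmax, xmax, xmin),
                              c(0, 0, 1, 1)))), ID = id)
}
countries <- SpatialPolygons(list(make.country(0, 1, "A"),
                                  make.country(1, 2, "B")))

# mask the raster by each country and count absence (0) and
# presence (1) cells
cell.counts <- sapply(1:length(countries), function(i) {
  country.raster <- mask(pa.raster, countries[i])
  table(factor(values(country.raster), levels = c(0, 1)))
})
colnames(cell.counts) <- c("A", "B")
cell.counts
```

With a real shapefile you would loop over the country names in the attribute table instead of polygon indices, but the mask-then-count pattern is the same.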



Copying files with R

Following on from my recent experience with deleting files using R, I found myself needing to copy a large number of raster files from a folder on my computer to a USB drive so that I could post them to a colleague (yes, snail mail – how old and antiquated!). While this is not typically a difficult task to do manually, I didn’t want to copy all of the files within the folder, and there was no way to sort the folder in a sensible manner that would let me select the files I wanted without individually clicking on all 723 files (out of ~4,300) and copying them over. Not only would this have been incredibly tedious(!), it’s highly likely that I would have made a mistake and missed something important or copied over files that my colleague didn’t need. So enter my foray into copying files using R.
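The approach boils down to building a vector of the wanted file names and handing it to file.copy(). A small sketch with made-up file names and temporary folders (the `_flow` pattern is hypothetical; it would be whatever distinguishes your files):

```r
# create a dummy source folder with a mix of wanted and unwanted files
src <- file.path(tempdir(), "rasters")
dst <- file.path(tempdir(), "usb")
dir.create(src, showWarnings = FALSE)
dir.create(dst, showWarnings = FALSE)
file.create(file.path(src, c("site1_flow.tif", "site2_flow.tif",
                             "site1_temp.tif", "readme.txt")))

# select only the files we want, here by matching a naming pattern
wanted <- list.files(src, pattern = "_flow\\.tif$", full.names = TRUE)

# copy them across, keeping the original modification dates
file.copy(wanted, dst, copy.date = TRUE)

list.files(dst)
```

If your wanted files don’t share a pattern, you can equally supply an explicit vector of names (e.g. read from a csv) to file.copy().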


Converting shapefiles to rasters in R

I’ve been doing a lot of analyses recently that need rasters representing features in the landscape. In most cases, these data have been supplied as shapefiles, so I needed to quickly extract parts of a shapefile dataset and convert them to a raster in a standardised format. Preferably with as little repetitive coding as possible. So I created a simple and relatively flexible function to do the job for me.

The function requires two main input files: the shapefile (shp) that you want to convert and a raster that represents the background area (mask.raster), with your desired extent and resolution. The value of the background raster should be set to a constant value that will represent the absence of the data in the shapefile (I typically use zero).

The function steps through the following:

  1. Optional: If shp is not in the same projection as the mask.raster, set its current projection (proj.from) and then transform the shapefile to the projection of mask.raster, using transform=TRUE.
  2. Convert shp to a raster based on the specifications of mask.raster (i.e. same extent and resolution).
  3. Set the value of the cells of the raster that represent the polygon to the desired value.
  4. Merge the raster with mask.raster, so that the background values are equal to the value of mask.raster.
  5. Export as a tiff file in the working directory with the label specified in the function call.
  6. If desired, plot the new raster using map=TRUE.
  7. Return as an object in the global R environment.
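The steps above can be sketched as a function like the one below. This is a reconstruction assuming the raster and sp packages (and rgdal for the optional reprojection), not the original code; argument names mirror the description above.

```r
library(raster)

shp2raster <- function(shp, mask.raster, label, value = 1,
                       transform = FALSE, proj.from = NA, map = TRUE) {
  # 1. optional: set the current projection, then transform the
  #    shapefile to match mask.raster
  if (transform) {
    require(rgdal)
    proj4string(shp) <- proj.from
    shp <- spTransform(shp, CRS(projection(mask.raster)))
  }

  # 2 & 3. convert shp to a raster on mask.raster's extent and
  # resolution, with polygon cells set to the desired value
  r <- rasterize(shp, mask.raster, field = value)

  # 4. merge with the background so all other cells keep
  # mask.raster's value
  r <- merge(r, mask.raster)

  # 5. label the raster and export it as a tiff
  names(r) <- label
  writeRaster(r, paste0(label, ".tif"), overwrite = TRUE)

  # 6. optionally plot, and 7. return the new raster
  if (map) plot(r, main = label)
  return(r)
}
```

A call might look like shp2raster(my.shp, background.raster, label = "roads", value = 1), with transform = TRUE and proj.from supplied when the projections differ.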

The function is relatively quick, although it is somewhat dependent on how complicated your shapefile is. The more individual polygons that need to be filtered through and extracted, the longer it will take.


Remotely deleting files from R

Sometimes programs generate a LOT of files while running scripts. Usually these are important (why else would you be running the script?). However, sometimes scripts generate mountains of temporary files to create summary outputs that aren’t really useful in their own right. Manually deleting such temporary files can be a very time-consuming and tedious process, particularly if they are mixed in with the important ones. Not to mention the risk of accidentally deleting things you need because you’ve got bored and your mind has wandered off to more exciting things…

…like watching orca swim past the hut window!

I had exactly this problem a few months ago when I had ~65,000 temp files from a modelling process that were no longer needed, inconveniently mixed in with the things I needed to keep. Clearly, deleting these files manually wasn’t really going to be an option. There are a number of ways to tackle this problem, but R provided a simple two-line solution.
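The two-line idea, demonstrated on dummy files in a temporary folder; the `temp_` naming pattern here is hypothetical and would be whatever distinguishes your disposable files from the keepers.

```r
# dummy folder with temporary files mixed in among the keepers
work.dir <- file.path(tempdir(), "model_output")
dir.create(work.dir, showWarnings = FALSE)
file.create(file.path(work.dir, c("results.csv", "temp_001.txt",
                                  "temp_002.txt", "summary.csv")))

# the two-line solution: list the files matching the temporary
# pattern, then remove them
temp.files <- list.files(work.dir, pattern = "^temp_", full.names = TRUE)
file.remove(temp.files)

list.files(work.dir)
```

As always with file.remove(), it pays to print the matched file list and check it before pulling the trigger.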


Creating a presence-absence raster from point data

I’m working on generating species distribution models at the moment for a few hundred species. Which means that I’m trying to automate as many steps as possible in R to avoid having to click buttons hundreds of times in ArcView.

One of the tasks that I need to do is to convert presence-only latitude and longitude data into a presence-absence raster for each species. It seems like this should be relatively simple, but it took me longer than it should have to figure it out. So I’m posting my code here 1) so I don’t forget how I did it; and 2) because I had someone ask me how to do exactly this thing this afternoon and it took me ages to hunt through my poorly organised files to find this piece of code! So here it is:

Because I’m a function kinda girl, I wrote this as a function. It basically goes through three steps:

1. Take an existing raster of the area you are interested in (mask.raster) and set the background cells to zero (absences).

2. rasterize the presence points for your species ( and set those cells to one (presences).

3. Label the new raster with your species name (raster.label) and save it as a new raster.

presence.absence.raster <- function(mask.raster,, raster.label = "") {
  require(raster)

  # set the background cells in the raster to 0 (absences)
  mask.raster[!] <- 0

  # set the cells that contain points to 1 (presences)
  speciesRaster <- rasterize(, mask.raster, field = 1)
  speciesRaster <- merge(speciesRaster, mask.raster)

  # label the raster and save it to the working directory
  names(speciesRaster) <- raster.label
  writeRaster(speciesRaster, paste0(raster.label, ".tif"), overwrite = TRUE)

  return(speciesRaster)
}

Below is an example of how the function works using data on the global distribution of foxes from the biomod2 package.


# read in species point data and extract data for foxes
mySpecies <- read.csv(system.file("external/species/mammals_table.csv", package="biomod2"), row.names = 1)
species <- "VulpesVulpes"

# extract fox data from larger dataset and keep only the x and y coordinates <- mySpecies[, c("X_WGS84", "Y_WGS84", species)] <-[$VulpesVulpes == 1, c("X_WGS84", "Y_WGS84")]

# read in a raster of the world
myRaster <- raster(system.file( "external/bioclim/current/bio3.grd",package="biomod2"))

# create presence absence raster for foxes
pa.raster <- presence.absence.raster(mask.raster=myRaster,, raster.label=species)
plot(pa.raster, main=names(pa.raster))


In this plot, the presences (1) are shown in green and the absences (0) in light grey.

Helpful things to remember (or things I learnt the hard way)

  1. Make sure your species point data and raster are in the same projection and that they actually overlap!
  2. Set your desired raster extent and resolution in the mask.raster before you get started.
  3. The species point data that you feed into the function should just be a list of x and y co-ordinates – no species names or abundances or you’ll confuse the poor beast and it won’t work!

And yes, foxes are also present in Australia where they are a pest. I guess this map shows their natural range before people started doing silly things.

Brought to you by the powers of knitr & RWordpress

(Well it was and then things went a little awry, so I had to do some tinkering!)


Combining dataframes when the columns don’t match

Most of my work recently has involved downloading large datasets of species occurrences from online databases and attempting to smoodge1 them together to create distribution maps for parts of Australia. Online databases typically have a ridiculous number of columns with obscure names which can make the smoodging process quite difficult.

For example, I was trying to combine data from two different regions into one file, where one region had 72 columns of data and the other had 75. If you try to do this using rbind, you get an error; going through and identifying the non-matching columns manually would be quite tedious and error-prone.

Here's an example of the function in use with some imaginary data. You'll note that databases one and two have unequal numbers of columns (5 versus 6), a number of shared columns (species, latitude, longitude, database) and some unshared columns (method, data.source).

  species latitude longitude        method     database
1       p   -33.87     150.5   camera trap
2       a   -33.71     151.3 live trapping
3       n   -33.79     151.8   camera trap
4       w   -34.35     151.3 live trapping
5       h   -31.78     151.8   camera trap
6       q   -33.17     151.2 live trapping
      database species latitude longitude data.source accuracy
1 database.two       d   -33.95     152.7   herbarium    3.934
2 database.two       f   -32.60     150.2      museum    8.500
3 database.two       z   -32.47     150.7   herbarium    3.259
4 database.two       f   -30.67     150.6      museum    2.756
5 database.two       e   -32.73     149.4   herbarium    4.072
6 database.two       x   -33.49     153.3      museum    8.169
rbind(, database.two)
Error: numbers of columns of arguments do not match

So I created a function that can be used to combine the data from two dataframes, keeping only the columns that have the same names (I don't care about the other ones). I'm sure there are other fancier ways of doing this but here's how my function works.

The basic steps
1. Specify the input dataframes
2. Calculate which dataframe has the greatest number of columns
3. Identify which columns in the smaller dataframe match the columns in the larger dataframe
4. Create a vector of the column names that occur in both dataframes
5. Combine the data from both dataframes matching the listed column names using rbind
6. Return the combined data

rbind.match.columns <- function(input1, input2) {
    n.input1 <- ncol(input1)
    n.input2 <- ncol(input2)

    if (n.input2 < n.input1) {
        TF.names <- which(names(input2) %in% names(input1))
        column.names <- names(input2[, TF.names])
    } else {
        TF.names <- which(names(input1) %in% names(input2))
        column.names <- names(input1[, TF.names])
    }

    return(rbind(input1[, column.names], input2[, column.names]))
}

rbind.match.columns(, database.two)
   species latitude longitude     database
1        p   -33.87     150.5
2        a   -33.71     151.3
3        n   -33.79     151.8
4        w   -34.35     151.3
5        h   -31.78     151.8
6        q   -33.17     151.2
7        d   -33.95     152.7 database.two
8        f   -32.60     150.2 database.two
9        z   -32.47     150.7 database.two
10       f   -30.67     150.6 database.two
11       e   -32.73     149.4 database.two
12       x   -33.49     153.3 database.two

Running the function gives us a new dataframe with the four shared columns and twelve records, reflecting the combined data. Awesome!

Edited to add:

Viri asked a good question in the comments – what if you want to keep all of the columns in both data frames? The easiest solution to this problem is to add dummy columns to each dataframe that represent the columns missing from the other data frame and then use rbind to join them together. Of course, you won't actually have any data for these additional columns, so we simply set the values to NA. I've wrapped this up into a function as well.

rbind.all.columns <- function(x, y) {

    x.diff <- setdiff(colnames(x), colnames(y))
    y.diff <- setdiff(colnames(y), colnames(x))

    x[, c(as.character(y.diff))] <- NA

    y[, c(as.character(x.diff))] <- NA

    return(rbind(x, y))
}

rbind.all.columns(, database.two)

And here you can see that we now have one dataframe containing all seven columns from our two sources, with NA values present where we are missing data from one of the dataframes. Nice!

   species latitude longitude        method     database data.source accuracy
1        p   -33.87     150.5   camera trap         <NA>       NA
2        a   -33.71     151.3 live trapping         <NA>       NA
3        n   -33.79     151.8   camera trap         <NA>       NA
4        w   -34.35     151.3 live trapping         <NA>       NA
5        h   -31.78     151.8   camera trap         <NA>       NA
6        q   -33.17     151.2 live trapping         <NA>       NA
7        d   -33.95     152.7          <NA> database.two   herbarium    3.934
8        f   -32.60     150.2          <NA> database.two      museum    8.500
9        z   -32.47     150.7          <NA> database.two   herbarium    3.259
10       f   -30.67     150.6          <NA> database.two      museum    2.756
11       e   -32.73     149.4          <NA> database.two   herbarium    4.072
12       x   -33.49     153.3          <NA> database.two      museum    8.169

Happy merging everyone!

1 A highly technical and scientific term!

Brought to you by the powers of knitr & RWordpress


My academic history (in a word-cloud)

Earlier this week, the postdocs in the QAECO lab presented a short talk about their academic life: past research, current research and future aspirations.  Limited to three powerpoint slides and eight minutes, it was an interesting exercise in brevity and finding ways to clearly communicate the major themes of your research.

My research history hasn’t really followed any obvious themes. In fact, it’s really been driven by a series of happy accidents and crap-portunities*. I started out as a field ecologist, studying the habitat use of freshwater fish and working as a ranger for the Department of Conservation managing threatened bird species. This was followed by more freshwater fish research, this time on rainbow trout in the United States, before I veered off (due to a crap-portunity) into the mysterious world of modelling (of the mathematical kind). This fortuitous deviation led to an investigation of the sustainability of harvesting shovelnose sturgeon for caviar for my Masters and greatly influenced my research interests. Returning to NZ, more ranger work for DOC led to a PhD looking at ways to improve the effectiveness of whio conservation: a glorious combination of fieldwork with spatial, demographic and population modelling.

Another happy accident saw me end up at Landcare Research, where I became the person that analysed all the random data living in the bottom of people’s filing cabinets. I investigated the effects of removing livestock from high country farmland, modelled disease transmission in Tasmanian devils, and looked for relationships between rabbit breeding and grass growth. A short break in contracts provided an opportunity to do volunteer work in Antarctica. This led to three seasons monitoring Adélie penguins and skua, and research looking at the effects of environmental conditions on Adélie penguin chick condition and the relationship between skua populations and penguin density. I also spent time estimating the number of burrow-nesting petrels across a group of islands by identifying relationships between burrow density and habitat, burrow occupancy and breeding success. And I briefly branched out into scary maths when I constructed a virtual model of masting trees to assess the impacts of herbivory.

I summarised all this in my talk by producing a word-cloud (in R because that’s how I roll) that showed the relative importance of keywords from each of these projects (and some pretty pictures of my study subjects to keep people interested).

academic wordcloud
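For anyone wanting to build something similar, a minimal sketch using the wordcloud package; the keywords and weights below are invented stand-ins for the scored project themes described above.

```r
library(wordcloud)
library(RColorBrewer)

# hypothetical keywords and their relative importance across projects
keywords <- c("conservation", "modelling", "penguins", "fish",
              "population", "spatial", "threatened", "R")
weights  <- c(10, 9, 7, 6, 6, 5, 4, 8)

# plot the word-cloud, sizing each keyword by its weight
set.seed(10)
wordcloud(words = keywords, freq = weights, min.freq = 1,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
```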

So I guess you could say the central themes of my research to date are the management and conservation of populations threatened by invasive pests.  There will almost certainly be modelling involved, probably in (but not limited to) R, and I’d be keen on doing some fieldwork if possible.  I’m good at sorting out your horribly stored dataset (although I’d rather not) and I can put my hand to most forms of analysis (provided it is google-able and preferably done in R).  I’m currently branching out into yet another new field: conservation planning in the face of regional and urban development in the QAECO lab at the University of Melbourne but that’s another story for another time.

* When new opportunities arise out of the shitty things that happen.

Photo credits
I’d like to claim credit for these amazing photographs but, the truth is, I never actually saw some of my study species (one of the downsides of being a modeller). So thanks to the following contributors of the photographs I acquired from the internet.
Anticlockwise from bottom left: kokopu – Steve Moore; tussocks; rabbits; devils – Ian Waldie/Getty; drylands; masting; Adélie penguin – Amy Whitehead; whio – Amy Whitehead; oi – Mike Danzenbaker; kaki – Glenda Rees; shovelnose sturgeon – Konrad P. Schmidt; rainbow trout; kakapo – Mark Carwardine/; skua – Amy Whitehead; takahe


Randomly deleting duplicate rows from a dataframe

I use R a lot in my day-to-day workflow, particularly for manipulating raw data files into a format that can be used for analysis. This is often a brain-taxing exercise and, sometimes, it would be totally quicker to do it in Excel. But I like to make sure that my manipulations are reproducible: 1) it helps me remember what I actually did, and 2) it is hugely helpful when the raw data change for some reason (new data are added, corrections are made, …) as I can simply rerun the code.

One of the things I battled with a few nights ago was randomly deleting duplicate records from a dataframe without using some horrendous for loop. I’m working on a dataset of Adélie penguin chick weights where we’ve measured approximately 50 chicks selected randomly at three sites once a week for the past 16 years. That’s a lot of data. But some of the chicks come from the same nest, so those data points aren’t really independent.

Two chicks from the same nest - clearly somebody has been eating all the pies!

Two chicks from the same nest – clearly somebody has been eating all the pies!

I wanted to randomly remove one chick from nests where there were two. The data look something like the data below, with nests labelled with a unique identifier and chicks within each nest labelled sequentially (a or b) in the order they were measured.

##    nest chick weight
## 1     1     a 1020.9
## 2     1     b 1042.2
## 3     2     a  844.5
## 4     2     b  829.2
## 5     3     a  871.2
## 6     4     a 1133.1
## 7     4     b 1070.6
## 8     5     a 1159.7
## 9     6     a  692.5
## 10    6     b  786.8

I’ve used the duplicated function before to remove duplicate rows but it always retains the first record. While this is appropriate in some cases, it’s possible that we have a bias towards weighing larger or smaller chicks first, so I wanted to remove rows randomly. But I kept getting stuck with using a for loop, which isn’t very efficient.

So I googled “How to randomly delete rows in R”, because this is my strategy for figuring this out in R, and I found an answer on this forum. The function duplicated.random does exactly what I wanted, so I’m reproducing it here so that I can find it again. Thanks to Magnus Torfason-2 for providing the solution (I wish I was this clever).

This function returns a logical vector, the elements of which are FALSE, unless there are duplicated values in x, in which case all but one elements are TRUE (for each set of duplicates). The only difference between this function and the duplicated() function is that rather than always returning FALSE for the first instance of a duplicated value, the choice of instance is random.

duplicated.random = function(x, incomparables = FALSE, ...) {
     if ( is.vector(x) ) {
         permutation = sample(length(x))
         x.perm      = x[permutation]
         result.perm = duplicated(x.perm, incomparables, ...)
         result      = result.perm[order(permutation)]
     } else if ( is.matrix(x) ) {
         permutation = sample(nrow(x))
         x.perm      = x[permutation, ]
         result.perm = duplicated(x.perm, incomparables, ...)
         result      = result.perm[order(permutation)]
     } else {
         stop(paste("duplicated.random() only supports vectors",
                "and matrices for now."))
     }
     result
}

I applied this function to my nest dataset to give me a logical column indicating which chick was randomly selected for the analysis.$duplicated.chick <- duplicated.random($nest)

##    nest chick weight duplicated.chick
## 1     1     a 1020.9             TRUE
## 2     1     b 1042.2            FALSE
## 3     2     a  844.5            FALSE
## 4     2     b  829.2             TRUE
## 5     3     a  871.2            FALSE
## 6     4     a 1133.1            FALSE
## 7     4     b 1070.6             TRUE
## 8     5     a 1159.7            FALSE
## 9     6     a  692.5             TRUE
## 10    6     b  786.8            FALSE

In this case, I retain all the chicks labelled FALSE, as they either were the only chick in the nest or they have been randomly selected by the function. The chicks labelled TRUE are the ones that weren’t selected for analysis. Slightly counter-intuitive, but then it’s an easy step to filter out the data that I want to keep and run the analyses.

##    nest chick weight duplicated.chick
## 2     1     b 1042.2            FALSE
## 3     2     a  844.5            FALSE
## 5     3     a  871.2            FALSE
## 6     4     a 1133.1            FALSE
## 8     5     a 1159.7            FALSE
## 10    6     b  786.8            FALSE

You can see that there is now only one record per nest and, where there were two chicks, the selected chicks include a random smattering of both a and b. This seems like a useful function for all sorts of situations.
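As a footnote, base R can also do this random one-per-group selection with ave(). A small self-contained example (the nest data here are made up, not my penguin dataset):

```r
set.seed(33)
nests <- data.frame(nest  = c(1, 1, 2, 2, 3, 4, 4),
                    chick = c("a", "b", "a", "b", "a", "a", "b"))

# within each nest, randomly flag all but one row as a duplicate
nests$duplicated.chick <- ave(seq_len(nrow(nests)), nests$nest,
    FUN = function(i) !(i %in% i[sample(length(i), 1)]))

# keep the one randomly chosen chick per nest
keepers <- nests[nests$duplicated.chick == 0, ]
```

The trick is the same as duplicated.random: within each group, pick one row index at random and flag the rest.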

On a slight geekery aside, this is the first blog post that I’ve written using RMarkdown and knitr in Rstudio. I’m liking it. I still need to iron out some kinks in the process but expect to be bored silly with more R-oriented blog posts in the future.