Amy Whitehead's Research

the ecological musings of a conservation biologist



Copying files with R

Following on from my recent experience with deleting files using R, I found myself needing to copy a large number of raster files from a folder on my computer to a USB drive so that I could post them to a colleague (yes, snail mail – how old and antiquated!).  While this is not typically a difficult task to do manually, I didn’t want to copy all of the files within the folder and there was no way to sort the folder in a sensible manner that meant I could select out the files that I wanted without individually clicking on all 723 files (out of ~4,300) and copying them over.  Not only would this have been incredibly tedious(!), it’s highly likely that I would have made a mistake and missed something important or copied over files that they didn’t need. So enter my foray into copying files using R.
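A minimal sketch of this kind of selective copy in base R, assuming the wanted files can be picked out by a pattern in their names. The folder names, file names and pattern below are made up for illustration and are not taken from the original script.

```r
# Illustrative only: build a throwaway source folder holding a mix of files
from.dir <- file.path(tempdir(), "rasters")
to.dir   <- file.path(tempdir(), "usb.drive")
dir.create(from.dir, showWarnings = FALSE)
dir.create(to.dir, showWarnings = FALSE)
file.create(file.path(from.dir, c("speciesA_2013.tif",
                                  "speciesB_2013.tif",
                                  "speciesA_notes.txt")))

# List only the files whose names match the (hypothetical) pattern
files.to.copy <- list.files(from.dir, pattern = "\\.tif$", full.names = TRUE)

# Copy them into the destination folder; returns TRUE/FALSE for each file
copied <- file.copy(from = files.to.copy, to = to.dir)
```

Checking that every element of copied is TRUE is a quick way to confirm that nothing was missed.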



Converting shapefiles to rasters in R

I’ve been doing a lot of analyses recently that need rasters representing features in the landscape. In most cases, these data have been supplied as shapefiles, so I needed to quickly extract parts of a shapefile dataset and convert them to a raster in a standardised format. Preferably with as little repetitive coding as possible. So I created a simple and relatively flexible function to do the job for me.

The function requires two main input files: the shapefile (shp) that you want to convert and a raster that represents the background area (mask.raster), with your desired extent and resolution. The value of the background raster should be set to a constant value that will represent the absence of the data in the shapefile (I typically use zero).

The function steps through the following:

  1. Optional: If shp is not in the same projection as the mask.raster, set the current projection (proj.from) and then transform the shapefile to the new projection (proj.to) using transform=TRUE.
  2. Convert shp to a raster based on the specifications of mask.raster (i.e. same extent and resolution).
  3. Set the value of the cells of the raster that represent the polygon to the desired value.
  4. Merge the raster with mask.raster, so that the background values are equal to the value of mask.raster.
  5. Export as a tiff file in the working directory with the label specified in the function call.
  6. If desired, plot the new raster using map=TRUE.
  7. Return as an object in the global R environment.
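The steps above could be sketched roughly as follows. This is not the original function: it assumes the terra package is installed, that shp is a terra SpatVector and mask.raster a terra SpatRaster, and the names shp2raster.sketch, label and value are hypothetical.

```r
# Sketch only: requires the terra package; all calls are namespace-qualified
shp2raster.sketch <- function(shp, mask.raster, label, value = 1,
                              transform = FALSE, proj.from = NA, proj.to = NA,
                              map = TRUE) {
    # 1. Optionally set the current projection and transform to the new one
    if (transform) {
        terra::crs(shp) <- proj.from
        shp <- terra::project(shp, proj.to)
    }
    # 2-3. Rasterise on the grid of mask.raster, giving polygon cells 'value'
    r <- terra::rasterize(shp, mask.raster, field = value)
    # 4. Fill the remaining (NA) cells with the background values
    r <- terra::cover(r, mask.raster)
    # 5. Export as a tiff file in the working directory
    terra::writeRaster(r, filename = paste0(label, ".tif"), overwrite = TRUE)
    # 6. Optionally plot the new raster
    if (map) terra::plot(r, main = label)
    # 7. Return the raster to the caller
    return(r)
}
```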

The function is relatively quick, although this is somewhat dependent on how complicated your shapefile is. The more individual polygons that need to be filtered through and extracted, the longer it will take.



Remotely deleting files from R

Sometimes programs generate a LOT of files while running scripts. Usually these are important (why else would you be running the script?). However, sometimes scripts generate mountains of temporary files to create summary outputs that aren’t really useful in their own right. Manually deleting such temporary files can be a very time consuming and tedious process, particularly if they are mixed in with the important ones. Not to mention the risk of accidentally deleting things you need because you’ve got bored and your mind has wandered off to more exciting things…

…like watching orca swim past the hut window!

I had exactly this problem a few months ago when I had ~65,000 temp files from a modelling process that were no longer needed, inconveniently mixed in with the things I needed to keep. Clearly deleting these files manually wasn’t really going to be an option. There are a number of ways to tackle this problem but R provided a simple two-line solution to the problem.
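A sketch of what such a two-line solution might look like in base R, assuming the temporary files share a recognisable extension. The folder, the file names and the ".tmp" pattern are made up for illustration.

```r
# Illustrative only: a throwaway folder with temp files mixed in with an output to keep
work.dir <- file.path(tempdir(), "model.run")
dir.create(work.dir, showWarnings = FALSE)
file.create(file.path(work.dir, c("run1.tmp", "run2.tmp", "summary.csv")))

# The two lines that do the work: list the unwanted files, then delete them
temp.files <- list.files(work.dir, pattern = "\\.tmp$", full.names = TRUE)
removed <- file.remove(temp.files)
```

It is worth printing temp.files and checking it before running file.remove, as deleted files do not go to the recycle bin.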



Combining dataframes when the columns don’t match

Most of my work recently has involved downloading large datasets of species occurrences from online databases and attempting to smoodge¹ them together to create distribution maps for parts of Australia. Online databases typically have a ridiculous number of columns with obscure names, which can make the smoodging process quite difficult.

For example, I was trying to combine data from two different regions into one file, where one region had 72 columns of data and another region had 75 columns. If you try and do this using rbind, you get an error but going through and identifying non-matching columns manually would be quite tedious and error-prone.

Here's an example with some imaginary data. You'll note that Database One and Two have an unequal number of columns (5 versus 6), a number of shared columns (species, latitude, longitude, database) and some unshared columns (method, data.source, accuracy).

  species latitude longitude        method     database
1       p   -33.87     150.5   camera trap database.one
2       a   -33.71     151.3 live trapping database.one
3       n   -33.79     151.8   camera trap database.one
4       w   -34.35     151.3 live trapping database.one
5       h   -31.78     151.8   camera trap database.one
6       q   -33.17     151.2 live trapping database.one

      database species latitude longitude data.source accuracy
1 database.two       d   -33.95     152.7   herbarium    3.934
2 database.two       f   -32.60     150.2      museum    8.500
3 database.two       z   -32.47     150.7   herbarium    3.259
4 database.two       f   -30.67     150.6      museum    2.756
5 database.two       e   -32.73     149.4   herbarium    4.072
6 database.two       x   -33.49     153.3      museum    8.169

rbind(database.one, database.two)
Error: numbers of columns of arguments do not match

So I created a function that can be used to combine the data from two dataframes, keeping only the columns that have the same names (I don't care about the other ones). I'm sure there are other fancier ways of doing this but here's how my function works.

The basic steps
1. Specify the input dataframes
2. Calculate which dataframe has the greatest number of columns
3. Identify which columns in the smaller dataframe match the columns in the larger dataframe
4. Create a vector of the column names that occur in both dataframes
5. Combine the data from both dataframes matching the listed column names using rbind
6. Return the combined data

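rbind.match.columns <- function(input1, input2) {
    # Count the columns in each dataframe
    n.input1 <- ncol(input1)
    n.input2 <- ncol(input2)

    # Take the names from the smaller dataframe that also occur in the larger one
    if (n.input2 < n.input1) {
        TF.names <- which(names(input2) %in% names(input1))
        column.names <- names(input2)[TF.names]
    } else {
        TF.names <- which(names(input1) %in% names(input2))
        column.names <- names(input1)[TF.names]
    }

    # drop = FALSE keeps a dataframe even when only one column is shared
    return(rbind(input1[, column.names, drop = FALSE],
                 input2[, column.names, drop = FALSE]))
}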

rbind.match.columns(database.one, database.two)
   species latitude longitude     database
1        p   -33.87     150.5 database.one
2        a   -33.71     151.3 database.one
3        n   -33.79     151.8 database.one
4        w   -34.35     151.3 database.one
5        h   -31.78     151.8 database.one
6        q   -33.17     151.2 database.one
7        d   -33.95     152.7 database.two
8        f   -32.60     150.2 database.two
9        z   -32.47     150.7 database.two
10       f   -30.67     150.6 database.two
11       e   -32.73     149.4 database.two
12       x   -33.49     153.3 database.two

Running the function gives us a new dataframe with the four shared columns and twelve records, reflecting the combined data. Awesome!
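One of the other ways of doing this is to let intersect find the shared column names directly. The following is a minimal sketch, not the function above: rbind.shared.columns is a hypothetical name and the two small dataframes are made up for illustration.

```r
# Hypothetical compact variant: keep only the columns present in both dataframes
rbind.shared.columns <- function(input1, input2) {
    column.names <- intersect(names(input1), names(input2))
    rbind(input1[, column.names, drop = FALSE],
          input2[, column.names, drop = FALSE])
}

# Small made-up dataframes with partly overlapping columns
one <- data.frame(species  = c("p", "a"),
                  latitude = c(-33.87, -33.71),
                  method   = c("camera trap", "live trapping"))
two <- data.frame(latitude = c(-33.95, -32.60),
                  species  = c("d", "f"),
                  accuracy = c(3.934, 8.500))

combined <- rbind.shared.columns(one, two)
```

The shared columns come back in the order in which they appear in the first dataframe, whichever of the two has more columns.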

Edited to add:

Viri asked a good question in the comments – what if you want to keep all of the columns in both data frames? The easiest solution to this problem is to add dummy columns to each dataframe that represent the columns missing from the other data frame and then use rbind to join them together. Of course, you won't actually have any data for these additional columns, so we simply set the values to NA. I've wrapped this up into a function as well.

rbind.all.columns <- function(x, y) {

    # Columns that occur in one dataframe but not the other
    x.diff <- setdiff(colnames(x), colnames(y))
    y.diff <- setdiff(colnames(y), colnames(x))

    # Add the missing columns to each dataframe, filled with NA
    x[, c(as.character(y.diff))] <- NA

    y[, c(as.character(x.diff))] <- NA

    return(rbind(x, y))
}

rbind.all.columns(database.one, database.two)

And here you can see that we now have one dataframe containing all seven columns from our two sources, with NA values present where we are missing data from one of the dataframes. Nice!

   species latitude longitude        method     database data.source accuracy
1        p   -33.87     150.5   camera trap database.one        <NA>       NA
2        a   -33.71     151.3 live trapping database.one        <NA>       NA
3        n   -33.79     151.8   camera trap database.one        <NA>       NA
4        w   -34.35     151.3 live trapping database.one        <NA>       NA
5        h   -31.78     151.8   camera trap database.one        <NA>       NA
6        q   -33.17     151.2 live trapping database.one        <NA>       NA
7        d   -33.95     152.7          <NA> database.two   herbarium    3.934
8        f   -32.60     150.2          <NA> database.two      museum    8.500
9        z   -32.47     150.7          <NA> database.two   herbarium    3.259
10       f   -30.67     150.6          <NA> database.two      museum    2.756
11       e   -32.73     149.4          <NA> database.two   herbarium    4.072
12       x   -33.49     153.3          <NA> database.two      museum    8.169

Happy merging everyone!

1 A highly technical and scientific term!

Brought to you by the powers of knitr & RWordPress



Randomly deleting duplicate rows from a dataframe

I use R a lot in my day-to-day workflow, particularly for manipulating raw data files into a format that can be used for analysis. This is often a brain-taxing exercise and, sometimes, it would be totally quicker to do it in Excel. But I like to make sure that my manipulations are reproducible: (1) this helps me remember what I actually did and (2) it is hugely helpful when the raw data change for some reason (new data are added, corrections are made, …), as I can simply rerun the code.

One of the things I battled with a few nights ago was randomly deleting duplicate records from a dataframe without using some horrendous for loop. I’m working on a dataset of Adélie penguin chick weights where we’ve measured approximately 50 chicks selected randomly at three sites once a week for the past 16 years. That’s a lot of data. But some of the chicks come from the same nest, so those data points aren’t really independent.

Two chicks from the same nest – clearly somebody has been eating all the pies!

I wanted to randomly remove one chick from nests where there were two. The data look something like the data below, with nests labelled with a unique identifier and chicks within each nest labelled sequentially (a or b) in the order they were measured.

##    nest chick weight
## 1     1     a 1020.9
## 2     1     b 1042.2
## 3     2     a  844.5
## 4     2     b  829.2
## 5     3     a  871.2
## 6     4     a 1133.1
## 7     4     b 1070.6
## 8     5     a 1159.7
## 9     6     a  692.5
## 10    6     b  786.8

I’ve used the duplicated function before to remove duplicate rows but it always retains the first record. While this is appropriate in some cases, it’s possible that we have a bias towards weighing larger or smaller chicks first, so I wanted to remove rows randomly. But I kept getting stuck with using a for loop, which isn’t very efficient.
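To illustrate the first-record behaviour, here is a small sketch using made-up data in the same layout as the table above (only the nest and chick columns are needed).

```r
# Made-up data in the same layout as the nest table
nest.data <- data.frame(nest  = c(1, 1, 2, 2, 3),
                        chick = c("a", "b", "a", "b", "a"))

# duplicated() flags the second and later occurrences of each nest,
# so negating it keeps the first chick measured in every nest
first.only <- nest.data[!duplicated(nest.data$nest), ]
```

Every row kept in first.only is chick a, which is exactly the potential bias described above.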

So I googled “How to randomly delete rows in R”, because this is my strategy for figuring things out in R, and I found an answer on this forum. The function duplicated.random does exactly what I wanted, so I’m reproducing it here so that I can find it again. Thanks to Magnus Torfason-2 for providing the solution (I wish I was this clever).

This function returns a logical vector, the elements of which are FALSE, unless there are duplicated values in x, in which case all but one of the elements are TRUE (for each set of duplicates). The only difference between this function and the duplicated() function is that rather than always returning FALSE for the first instance of a duplicated value, the choice of instance is random.


duplicated.random = function(x, incomparables = FALSE, ...)
{
     if ( is.vector(x) )
     {
         # Shuffle the elements, flag duplicates in the shuffled order,
         # then map the flags back to the original positions
         permutation = sample(length(x))
         x.perm      = x[permutation]
         result.perm = duplicated(x.perm, incomparables, ...)
         result      = result.perm[order(permutation)]
         return(result)
     }
     else if ( is.matrix(x) )
     {
         # Same idea, shuffling whole rows of the matrix
         permutation = sample(nrow(x))
         x.perm      = x[permutation,]
         result.perm = duplicated(x.perm, incomparables, ...)
         result      = result.perm[order(permutation)]
         return(result)
     }
     else
     {
         stop(paste("duplicated.random() only supports vectors",
                "and matrices for now."))
     }
}
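The trick inside the function is easier to see when the vector branch is run step by step. The sketch below does that on the nest identifiers from the table above; the call to set.seed is an addition made here so that the random choice can be repeated, and the seed value is arbitrary.

```r
# Fix the random number generator so the shuffle is repeatable (arbitrary seed)
set.seed(42)

nest <- c(1, 1, 2, 2, 3, 4, 4, 5, 6, 6)

# Shuffle the positions, then flag duplicates in the shuffled order
permutation <- sample(length(nest))
nest.perm   <- nest[permutation]
dup.perm    <- duplicated(nest.perm)

# order(permutation) is the inverse shuffle, so this puts each flag
# back at the position of the record it belongs to
dup <- dup.perm[order(permutation)]

# Count the records kept (flag FALSE) in each nest
kept.per.nest <- tapply(!dup, nest, sum)
```

Whatever the shuffle, exactly one record per nest ends up flagged FALSE; only which record it is changes from run to run.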

I applied this function to my nest dataset to give me a logical column indicating which chick was randomly selected for the analysis.

nest.data$duplicated.chick <- duplicated.random(nest.data$nest)

##    nest chick weight duplicated.chick
## 1     1     a 1020.9             TRUE
## 2     1     b 1042.2            FALSE
## 3     2     a  844.5            FALSE
## 4     2     b  829.2             TRUE
## 5     3     a  871.2            FALSE
## 6     4     a 1133.1            FALSE
## 7     4     b 1070.6             TRUE
## 8     5     a 1159.7            FALSE
## 9     6     a  692.5             TRUE
## 10    6     b  786.8            FALSE

In this case, I retain all the chicks labelled FALSE as they either were the only chick in the nest or they have been randomly selected by the function. The chicks labelled TRUE are the chicks that weren’t selected for analysis. Slightly counter-intuitive. But then it’s an easy step to filter out the data that I want to keep and run the analyses.
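One way to do that filtering step is to negate the logical column when subsetting. This is a minimal sketch: the dataframe is rebuilt from the table above so that the example runs on its own, and selected.chicks is a name made up for illustration.

```r
# Rebuild the flagged dataset shown above
nest.data <- data.frame(
    nest   = c(1, 1, 2, 2, 3, 4, 4, 5, 6, 6),
    chick  = c("a", "b", "a", "b", "a", "a", "b", "a", "a", "b"),
    weight = c(1020.9, 1042.2, 844.5, 829.2, 871.2,
               1133.1, 1070.6, 1159.7, 692.5, 786.8),
    duplicated.chick = c(TRUE, FALSE, FALSE, TRUE, FALSE,
                         FALSE, TRUE, FALSE, TRUE, FALSE))

# Keep the rows that are NOT flagged as the randomly chosen duplicate
selected.chicks <- nest.data[!nest.data$duplicated.chick, ]
```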

##    nest chick weight duplicated.chick
## 2     1     b 1042.2            FALSE
## 3     2     a  844.5            FALSE
## 5     3     a  871.2            FALSE
## 6     4     a 1133.1            FALSE
## 8     5     a 1159.7            FALSE
## 10    6     b  786.8            FALSE

You can see that there is now only one record per nest and, where there were two chicks, the selected chicks include a random smattering of both a and b. This seems like a useful function for all sorts of situations.

On a slight geekery aside, this is the first blog post that I’ve written using RMarkdown and knitr in RStudio. I’m liking it. I still need to iron out some kinks in the process but expect to be bored silly with more R-oriented blog posts in the future.