I use R a lot in my day-to-day workflow, particularly for manipulating raw data files into a format that can be used for analysis. This is often a brain-taxing exercise and, sometimes, it would be much quicker to do it in Excel. But I like to make sure that my manipulations are reproducible, for two reasons: it helps me remember what I actually did, and it is hugely helpful when the raw data change for some reason (new data are added, corrections are made, …), as I can simply rerun the code.

One of the things I battled with a few nights ago was randomly deleting duplicate records from a dataframe without using some horrendous `for` loop. I’m working on a dataset of Adélie penguin chick weights where we’ve measured approximately 50 chicks selected randomly at three sites once a week for the past 16 years. That’s a lot of data. But some of the chicks come from the same nest, so those data points aren’t really independent.

I wanted to randomly remove one chick from nests where there were two. The data look something like the data below, with nests labelled with a unique identifier and chicks within each nest labelled sequentially (*a* or *b*) in the order they were measured.

```
## nest chick weight
## 1 1 a 1020.9
## 2 1 b 1042.2
## 3 2 a 844.5
## 4 2 b 829.2
## 5 3 a 871.2
## 6 4 a 1133.1
## 7 4 b 1070.6
## 8 5 a 1159.7
## 9 6 a 692.5
## 10 6 b 786.8
```

I’ve used the `duplicated` function before to remove duplicate rows but it always retains the first record. While this is appropriate in some cases, it’s possible that we have a bias towards weighing larger or smaller chicks first, so I wanted to remove rows randomly. But I kept getting stuck with using a `for` loop, which isn’t very efficient.
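To make the "always retains the first record" behaviour concrete, here is a minimal sketch using the ten example rows from the table above. The object name `first.kept` is only illustrative.

```r
# Toy version of the nest data (values copied from the table above).
nest.data <- data.frame(
  nest   = c(1, 1, 2, 2, 3, 4, 4, 5, 6, 6),
  chick  = c("a", "b", "a", "b", "a", "a", "b", "a", "a", "b"),
  weight = c(1020.9, 1042.2, 844.5, 829.2, 871.2,
             1133.1, 1070.6, 1159.7, 692.5, 786.8)
)

# duplicated() flags every repeat of a nest id after its first appearance,
# so negating it keeps the first row of each nest.
first.kept <- nest.data[!duplicated(nest.data$nest), ]

# Every retained chick is "a", i.e. the first one measured in each nest.
table(first.kept$chick)
```

If there were any systematic tendency in which chick gets weighed first, it would carry straight through to the retained rows.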

So I googled “How to randomly delete rows in R”, because this is my strategy for figuring things out in R, and I found an answer on this forum. The function `duplicated.random` does exactly what I wanted, so I’m reproducing it here so that I can find it again. Thanks to Magnus Torfason-2 for providing the solution (I wish I was this clever).

This function returns a logical vector, the elements of which are FALSE, unless there are duplicated values in x, in which case all but one elements are TRUE (for each set of duplicates). The only difference between this function and the duplicated() function is that rather than always returning FALSE for the first instance of a duplicated value, the choice of instance is random.

```
duplicated.random <- function(x, incomparables = FALSE, ...) {
  if (is.vector(x)) {
    # Shuffle the elements, flag duplicates in the shuffled order,
    # then map the flags back to the original positions.
    permutation <- sample(length(x))
    x.perm <- x[permutation]
    result.perm <- duplicated(x.perm, incomparables, ...)
    result <- result.perm[order(permutation)]
    return(result)
  } else if (is.matrix(x)) {
    # Same idea, shuffling whole rows; drop = FALSE keeps a one-row
    # matrix from collapsing to a vector.
    permutation <- sample(nrow(x))
    x.perm <- x[permutation, , drop = FALSE]
    result.perm <- duplicated(x.perm, incomparables, ...)
    result <- result.perm[order(permutation)]
    return(result)
  } else {
    stop(paste("duplicated.random() only supports vectors and",
               "matrices for now."))
  }
}
```

I applied this function to my nest dataset to give me a logical column indicating which chick was randomly selected for the analysis.

`nest.data$duplicated.chick <- duplicated.random(nest.data$nest)`

```
## nest chick weight duplicated.chick
## 1 1 a 1020.9 TRUE
## 2 1 b 1042.2 FALSE
## 3 2 a 844.5 FALSE
## 4 2 b 829.2 TRUE
## 5 3 a 871.2 FALSE
## 6 4 a 1133.1 FALSE
## 7 4 b 1070.6 TRUE
## 8 5 a 1159.7 FALSE
## 9 6 a 692.5 TRUE
## 10 6 b 786.8 FALSE
```
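Because the choice is random, rerunning the code flags a different set of chicks each time. Since reproducibility was the whole point, one option is to fix the random seed with `set.seed()` before the call. The sketch below is not part of the original solution: the helper name `dup.random.vec` is hypothetical (a vector-only restatement of the function above, included so the block runs on its own), and the seed value is arbitrary.

```r
# Vector-only restatement of duplicated.random(); hypothetical helper name.
dup.random.vec <- function(x) {
  permutation <- sample(length(x))
  duplicated(x[permutation])[order(permutation)]
}

# Nest identifiers from the example table.
nest <- c(1, 1, 2, 2, 3, 4, 4, 5, 6, 6)

set.seed(2013)  # arbitrary seed, chosen only for illustration
first.run <- dup.random.vec(nest)

set.seed(2013)  # same seed, so the same chicks are flagged again
second.run <- dup.random.vec(nest)

# Check that both runs flagged the same rows.
identical(first.run, second.run)

# Whatever the seed, exactly one record per nest is left unflagged.
table(nest[!first.run])
```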

In this case, I retain all the chicks labelled `FALSE`, as they were either the only chick in the nest or randomly selected by the function. The chicks labelled `TRUE` are the ones that weren’t selected for analysis, which is slightly counter-intuitive. From there it’s an easy step to filter out the data that I want to keep and run the analyses.
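The filtering code itself isn't shown in the post. A minimal sketch of one way to do it, with the flags hard-coded from the output above so that the result is deterministic (the object name `selected` is an assumption):

```r
# Data and flags copied from the output shown above.
nest.data <- data.frame(
  nest   = c(1, 1, 2, 2, 3, 4, 4, 5, 6, 6),
  chick  = c("a", "b", "a", "b", "a", "a", "b", "a", "a", "b"),
  weight = c(1020.9, 1042.2, 844.5, 829.2, 871.2,
             1133.1, 1070.6, 1159.7, 692.5, 786.8),
  duplicated.chick = c(TRUE, FALSE, FALSE, TRUE, FALSE,
                       FALSE, TRUE, FALSE, TRUE, FALSE)
)

# Keep the rows flagged FALSE: single-chick nests plus the chick that
# was randomly chosen from each two-chick nest.
selected <- nest.data[!nest.data$duplicated.chick, ]
selected
```

With these flags, `selected` holds rows 2, 3, 5, 6, 8 and 10, one per nest.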

```
## nest chick weight duplicated.chick
## 2 1 b 1042.2 FALSE
## 3 2 a 844.5 FALSE
## 5 3 a 871.2 FALSE
## 6 4 a 1133.1 FALSE
## 8 5 a 1159.7 FALSE
## 10 6 b 786.8 FALSE
```

You can see that there is now only one record per nest and, where there were two chicks, the selected chicks include a random smattering of both *a* and *b*. This seems like a useful function for all sorts of situations.

As a slightly geeky aside, this is the first blog post that I’ve written using RMarkdown and knitr in RStudio. I’m liking it. I still need to iron out some kinks in the process but expect to be bored silly with more R-oriented blog posts in the future.

23 January 2013 at 3:09 pm

Nice post Amy!

Great to see others using knitr to write their blogs! Look out for an update to my post on the matter coming soon. It’ll even include a way to upload images and embed them straight into blog posts all from within the Rmarkdown file.

Welcome back to Melbourne 🙂

23 January 2013 at 3:32 pm

Thanks! I was inspired by your post and was going to bug you about some stuff, so looking forward to seeing what you have to say 🙂

28 June 2013 at 5:19 am

Thanks Amy!

I ran into exactly the same issue and was getting stuck into really ugly coding… One google search later, I was reading your blog: Nice, clear and to the point. Problem solved!

Looking forward to reading you again.

28 June 2013 at 9:52 am

Thanks Tony. Glad I could be of help.

29 June 2013 at 2:18 am

Hey Amy,

Thinking again about this random deletion issue this morning with a fresh mind and some caffeine, I think I found another straightforward way to do it.

1) Randomly shuffle the rows of your data set: `data <- data[sample(1 : nrow(data)), ]`

2) Use the regular `duplicated()` function

It seems to do the trick.

30 June 2013 at 1:06 pm

True…that would be an easier way of doing things. Thanks Tony.
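For reference, Tony's two steps can be sketched on the example nest data as follows. The seed value and the object names `shuffled` and `kept` are assumptions added for illustration; the final reordering is optional and only there for readability.

```r
# Example rows copied from the post.
nest.data <- data.frame(
  nest   = c(1, 1, 2, 2, 3, 4, 4, 5, 6, 6),
  chick  = c("a", "b", "a", "b", "a", "a", "b", "a", "a", "b"),
  weight = c(1020.9, 1042.2, 844.5, 829.2, 871.2,
             1133.1, 1070.6, 1159.7, 692.5, 786.8)
)

set.seed(42)  # arbitrary seed so the shuffle can be rerun

# 1) Randomly shuffle the rows.
shuffled <- nest.data[sample(nrow(nest.data)), ]

# 2) duplicated() keeps whichever chick comes first after the shuffle.
kept <- shuffled[!duplicated(shuffled$nest), ]

# Optional: put the nests back in their original order.
kept <- kept[order(kept$nest), ]
kept
```

This is the same permutation idea that `duplicated.random` uses internally, except that the data frame itself is shuffled and no flags are mapped back to the original row order.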