Most of my work recently has involved downloading large datasets of species occurrences from online databases and attempting to smoodge¹ them together to create distribution maps for parts of Australia. Online databases typically have a ridiculous number of columns with obscure names, which can make the smoodging process quite difficult.
For example, I was trying to combine data from two different regions into one file, where one region had 72 columns of data and the other had 75. If you try to do this using rbind, you get an error, and going through and identifying the non-matching columns manually would be quite tedious and error-prone.
Here's an example of the function in use with some imaginary data. You'll note that Databases One and Two have unequal numbers of columns (5 versus 6), a number of shared columns (species, latitude, longitude and database) and some unshared columns (method, data.source and accuracy).
  species latitude longitude        method     database
1       p   -33.87     150.5   camera trap database.one
2       a   -33.71     151.3 live trapping database.one
3       n   -33.79     151.8   camera trap database.one
4       w   -34.35     151.3 live trapping database.one
5       h   -31.78     151.8   camera trap database.one
6       q   -33.17     151.2 live trapping database.one
      database species latitude longitude data.source accuracy
1 database.two       d   -33.95     152.7   herbarium    3.934
2 database.two       f   -32.60     150.2      museum    8.500
3 database.two       z   -32.47     150.7   herbarium    3.259
4 database.two       f   -30.67     150.6      museum    2.756
5 database.two       e   -32.73     149.4   herbarium    4.072
6 database.two       x   -33.49     153.3      museum    8.169
rbind(database.one, database.two)
Error: numbers of columns of arguments do not match
So I created a function that can be used to combine the data from two dataframes, keeping only the columns that have the same names (I don't care about the other ones). I'm sure there are fancier ways of doing this, but here's how my function works.
The basic steps
1. Specify the input dataframes
2. Calculate which dataframe has the greatest number of columns
3. Identify which columns in the smaller dataframe match the columns in the larger dataframe
4. Create a vector of the column names that occur in both dataframes
5. Combine the data from both dataframes matching the listed column names using rbind
6. Return the combined data
rbind.match.columns <- function(input1, input2) {
  n.input1 <- ncol(input1)
  n.input2 <- ncol(input2)
  if (n.input2 < n.input1) {
    TF.names <- which(names(input2) %in% names(input1))
    column.names <- names(input2)[TF.names]
  } else {
    TF.names <- which(names(input1) %in% names(input2))
    column.names <- names(input1)[TF.names]
  }
  return(rbind(input1[, column.names], input2[, column.names]))
}

rbind.match.columns(database.one, database.two)
   species latitude longitude     database
1        p   -33.87     150.5 database.one
2        a   -33.71     151.3 database.one
3        n   -33.79     151.8 database.one
4        w   -34.35     151.3 database.one
5        h   -31.78     151.8 database.one
6        q   -33.17     151.2 database.one
7        d   -33.95     152.7 database.two
8        f   -32.60     150.2 database.two
9        z   -32.47     150.7 database.two
10       f   -30.67     150.6 database.two
11       e   -32.73     149.4 database.two
12       x   -33.49     153.3 database.two
Running the function gives us a new dataframe with the four shared columns and twelve records, reflecting the combined data. Awesome!
Edited to add:
Viri asked a good question in the comments: what if you want to keep all of the columns in both dataframes? The easiest solution to this problem is to add dummy columns to each dataframe that represent the columns missing from the other dataframe, and then use rbind to join them together. Of course, you won't actually have any data for these additional columns, so we simply set the values to NA. I've wrapped this up into a function as well.
rbind.all.columns <- function(x, y) {
  x.diff <- setdiff(colnames(x), colnames(y))
  y.diff <- setdiff(colnames(y), colnames(x))
  x[, c(as.character(y.diff))] <- NA
  y[, c(as.character(x.diff))] <- NA
  return(rbind(x, y))
}
rbind.all.columns(database.one, database.two)
And here you can see that we now have one dataframe containing all seven columns from our two sources, with NA
values present where we are missing data from one of the dataframes. Nice!
   species latitude longitude        method     database data.source accuracy
1        p   -33.87     150.5   camera trap database.one        <NA>       NA
2        a   -33.71     151.3 live trapping database.one        <NA>       NA
3        n   -33.79     151.8   camera trap database.one        <NA>       NA
4        w   -34.35     151.3 live trapping database.one        <NA>       NA
5        h   -31.78     151.8   camera trap database.one        <NA>       NA
6        q   -33.17     151.2 live trapping database.one        <NA>       NA
7        d   -33.95     152.7          <NA> database.two   herbarium    3.934
8        f   -32.60     150.2          <NA> database.two      museum    8.500
9        z   -32.47     150.7          <NA> database.two   herbarium    3.259
10       f   -30.67     150.6          <NA> database.two      museum    2.756
11       e   -32.73     149.4          <NA> database.two   herbarium    4.072
12       x   -33.49     153.3          <NA> database.two      museum    8.169
Happy merging everyone!
¹ A highly technical and scientific term!
11 October 2013 at 4:13 am
How about just doing this? No need to write a function.
common.names <- intersect(colnames(database.one), colnames(database.two))
combined.database <- rbind(database.one[, common.names], database.two[, common.names])
11 October 2013 at 10:09 am
That is a much more elegant and simple solution. I didn’t know about intersect, so thanks for the tip!
10 July 2014 at 1:57 am
If I have three or more dataframes, how can I do this? Can I also use intersect, or something else? Thanks!
10 July 2014 at 10:26 am
Hi, thanks for your question. When I join three or more dataframes, I typically do it iteratively: attach two together, then attach the third to the merged dataframe (see below). This works pretty well, although it could be a bit slow and tedious if you had many dataframes. Hope that helps, but I'd be interested to know if you come up with a better solution for combining multiple dataframes.
all.data <- rbind.match.columns(data.one,data.two)
all.data <- rbind.match.columns(all.data,data.three)
….
20 January 2017 at 2:23 am
Hi 李麗花 and Amy,
Yes, Amy is right with the iterative approach.
However, no need to write it out manually. This is exactly what is called Reduce in functional programming.
What does Reduce do (see ?Reduce for the details)? It takes a list of arguments and applies a specified function, always to the result of the previous application and a new element. The first application is usually simply the first two elements. (There are some more details, such as the order of the reductions, e.g. f(f(f(a,b),c),d) is not the same as f(f(a,b),f(c,d)), but in the example here this doesn't make a difference.)
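As a tiny standalone illustration of that folding order (using toy values rather than our dataframes):

```r
# Reduce() folds a two-argument function across a list, left to right:
# Reduce(f, list(a, b, c, d)) evaluates as f(f(f(a, b), c), d)
Reduce(`+`, list(1, 2, 3, 4))        # ((1 + 2) + 3) + 4 = 10
Reduce(paste0, list("a", "b", "c"))  # paste0(paste0("a", "b"), "c") = "abc"
```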
With the given example it would be:
all.data <- Reduce( rbind.all.columns, list( data.one, data.two, data.three ) )
Fabian
20 January 2017 at 7:34 am
Thanks Fabian – I didn’t know about the Reduce function but it looks like it would be super handy for these types of applications.
27 March 2014 at 11:52 am
The function works great for my data, thanks!! Opposite to the WoA suggestion… I would also like to do the same, but keeping the empty columns from the arguments that don't match. Do you know how I can do that?
18 April 2014 at 3:24 pm
Thanks Viri. I had a think about your question and it’s actually not that difficult. It seems that the easiest option would be to add additional columns to your data frames so that you capture all the missing ones and set the values to NA. I’ve written this into another function and updated the post.
10 August 2014 at 6:54 pm
Hi Amy, thanks so much for this blog post, this is exactly what I was trying to find! Especially the edit from Viri’s suggestions. Happy days!
10 August 2014 at 7:05 pm
Glad you found it helpful Emeline – have fun mashing dataframes together!
12 November 2014 at 9:14 am
I like this, but I end up with the result in my function's environment and not as a dataframe! I think I am doing something wrong. Also, my species data runs along the top (x) as opposed to down the side (y) of my two dataframes. Any tips for this?
12 November 2014 at 12:03 pm
Hi Rob,
That is a little strange that your output ends up in the function's environment. Without seeing your code, I don't really have a solution to that. With your second question, are you interested in merging the datasets by columns (i.e. species) or rows (whatever your other variables are)? Depending on how your dataframes are structured, you could either transpose them both (using t(dataframe1)) so that they are in a format that matches my example, or you may need to look at using cbind (or a hack thereof, similar to what I have done here), which is a way of combining dataframes by columns. I'm happy to give you a more specific answer to both your questions if you want to send me a coded example. Good luck!
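A quick sketch of the transpose idea, using a hypothetical site-by-species frame (note that t() returns a matrix, so you need as.data.frame() to get a dataframe back):

```r
# hypothetical dataframe with species as columns and sites as rows
df <- data.frame(sp1 = c(1, 0), sp2 = c(0, 3), row.names = c("siteA", "siteB"))

# t() flips rows and columns but returns a matrix, not a dataframe
flipped <- as.data.frame(t(df))

# species now run down the side and sites along the top
flipped
```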
13 November 2014 at 2:25 pm
Hey Amy, some solutions and things to add to this nice blog piece! http://stackoverflow.com/questions/26874710/how-does-one-combine-two-uneven-dataframes-to-create-a-full-species-matrix-for-a/26900774#26900774
11 February 2015 at 12:34 am
Say I have 10 files (.txt) and each file has 101 (rows) × 6 (columns) of data. The columns are identical across all the files. I want to merge the 1st column of each file to create a 101 (rows) × 10 (columns) merged dataset. How would I do it?
Bernard
12 February 2015 at 10:47 am
Hi Bernard,
This is a really simple problem, which I’ve outlined in the code below using some dummy data. You can either do it manually or I’ve shown how you could do it in a loop, which might be useful if you have a large number of files.
# Ten dummy datasets named "data1" to "data10" filled with random data
for(i in 1:10){
assign(paste0("data", i),
       data.frame(column1 = runif(101), column2 = runif(101), column3 = runif(101),
                  column4 = runif(101), column5 = runif(101), column6 = runif(101)))
}
# The manual way
output.manual <- data.frame(data1[,1],data2[,1],data3[,1],data4[,1],data5[,1],data6[,1],data7[,1],data8[,1],data9[,1],data10[,1])
colnames(output.manual) <- c("data1","data2","data3","data4","data5","data6","data7","data8","data9","data10")
## Doing it in a loop
# create an empty data frame to put your results into
output.loop <- as.data.frame(matrix(NA,101,10))
for(i in 1:10){
# get the data from each dataset iteratively
d <- get(paste0("data",i))
# add to the output data
output.loop[,i] <- d[,1]
# rename the column based on the name of the data origin
colnames(output.loop)[i] <- paste0("data",i)
}
Hope that helps.
12 February 2015 at 9:42 pm
Thanks Amy,
The magic of someone who knows R well. Just as an additional 2 questions:
1) I have 6 columns in each data file, and the above solution is great for appending the same column from each file. I want to append the 1st column of all the files (into one merged file), the 2nd column of all the files (into another), etc. I am preparing my dataset for PCA.
2) In addition, if I want to concatenate one column from each file one on top of the other (instead of the above), to create 101 × 6 rows of data, what must I change? I am doing research in gait biomechanics.
Very grateful,
Bernard
18 February 2015 at 2:56 pm
Hi Bernard,
Question 1:
Essentially you just need to wrap the piece of code I presented before in another loop that goes through each column in the dataframe.
for(j in 1:6){
output <- output.loop
for(i in 1:10){
d <- get(paste0("data",i))
output[,i] <- d[,j]
colnames(output)[i] <- paste0("data",i)
}
assign(paste0("output.",j),output)
rm(output)
}
This will produce six dataframes called output.1 to output.6.
27 May 2015 at 1:22 am
Very helpful! Thank you!
19 June 2015 at 3:38 pm
Use dplyr's bind_rows function.
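For anyone reading later, a minimal sketch of that suggestion (with hypothetical dataframes df1 and df2; bind_rows keeps the union of the columns and pads the gaps with NA, much like rbind.all.columns above):

```r
library(dplyr)

df1 <- data.frame(species = c("p", "a"), method = c("camera trap", "live trapping"))
df2 <- data.frame(species = c("d", "f"), accuracy = c(3.9, 8.5))

# bind_rows() stacks the rows, keeping all columns and filling the rest with NA
combined <- bind_rows(df1, df2)
```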
28 August 2015 at 6:15 am
Check out rbind.fill which does your second version (keeps all columns, inserts NAs where data is missing)
9 November 2015 at 2:43 am
No need for a new function if trying to merge two dataframes:
mergeddfs <- data.frame(df1,df2)
Nice website tho
3 September 2016 at 5:29 am
library(data.table)
consolid<-rbindlist(list(newdt,newdt2), fill=TRUE, use.names = TRUE)
3 January 2017 at 7:36 pm
An extended version of the above rbind.all.columns is implemented as RbindAll in package SSBtools.
https://CRAN.R-project.org/package=SSBtools
3 January 2017 at 8:41 pm
Thanks for the tip Øyvind – I love how R is a constantly evolving environment.
6 January 2017 at 8:52 am
I was looking for this today and I like your method just as much as anything else (e.g. smartbind, rbind.fill).
I did notice that some condition checking helped with some edge cases:
if (length(x) > 0 & length(y.diff) > 0) {
  x[, c(as.character(y.diff))] <- NA
}
if (length(y) > 0 & length(x.diff) > 0) {
  y[, c(as.character(x.diff))] <- NA
}
10 March 2017 at 8:31 pm
Hi Amy,
I found this post really helpful, exactly what I was looking for
I have around 2000 files in a folder which i have to merge
Could you please help with this
15 March 2017 at 12:06 pm
Hi Padhu,
Glad you found it helpful.
Your problem should be reasonably easy to solve with the following code – you’ll note that I’m not using the functions outlined in this post as there are new functions in dplyr that achieve the same outcome.
require(dplyr)
require(pbapply) # for pblapply, which adds a progress bar

ThisPath <- "The path where your data lives"
TheseFiles <- dir(ThisPath, pattern = "\\.csv$") # identify the files you want to read in

# read in all the files into a list
dataList <- pblapply(TheseFiles, function(myFile) read.csv(paste0(ThisPath, myFile)))
names(dataList) <- TheseFiles

# merge the data files
mergedData <- dataList %>% bind_rows(.id = "TheseFiles")
Hopefully this works for you but let me know if you have any questions
11 March 2017 at 2:45 am
Thank you Amy. I was searching for this. Added to my bookmarks.
11 March 2017 at 2:51 am
I have a list of dataframes with different colnames and I want to rbind all the dataframes by their intersecting names. Is it possible?
15 March 2017 at 2:56 pm
Hi Sergio,
If I understand your question correctly, you have a list that contains a number of dataframes. These dataframes have some common column names and some column names that are only found in a subset of the dataframes.
The method I've suggested below uses the bind_rows function in dplyr, as this deals with lists better than the functions covered in this post. By default, bind_rows retains all columns and fills the missing data with NAs but, if you identify the common column names across the dataframes within your list (TheseNames), you can use select to only retain the columns that are common across the dataframes.

require(dplyr)
# generate a dummy list with dataframes
dataList <- list()
dataList[["listElement1"]] <- data.frame(column1=runif(10),column2=runif(10),column3=runif(10))
dataList[["listElement2"]] <- data.frame(column1=runif(10),column3=runif(10),letters=letters[1:10])
# identify the common columns
TheseNames <- unlist(lapply(names(dataList),function(x){names(dataList[[x]])}))
TheseNames <- c("Source",TheseNames[duplicated(TheseNames)])
# bind the dataframes and retain only the common columns
mergedData <- dataList %>% bind_rows(.id = "Source") %>% select(one_of(TheseNames))
Hopefully that makes sense and answers your question but give me a yell if you have any questions.
29 March 2017 at 2:10 am
I had a unique problem and found a way to solve it.
I had a structured master dataframe with x columns, and every day I was adding data (rows) to this table. The problem is that sometimes the new data would have either more or fewer columns than the master table. I thought to myself, how would I do this in SQL? Then it dawned on me and I began my research:
fortunately for us, dplyr has a useful function called union_all.
Master_Table <- union_all(Master_Table, New_Data)
This not only aligns the columns that exist, but gives NA for the ones that do not and appends new columns if they don't exist on one side. The perfect solution.
Thought I share
Best,
29 March 2017 at 7:06 am
Thanks Christopher – there are so many useful functions in the dplyr package that can be helpful for solving these types of problems!
25 May 2017 at 10:49 am
This is fantastic!
I changed NA to 0 in the rbind.all.columns function in order to match two species matrices that have both shared and exclusive species. Very helpful. Thanks!
25 May 2017 at 11:11 am
That’s a great use of the function – glad you found it helpful!
11 April 2018 at 11:28 am
Thanks a lot, you saved my life !!!!!!
17 May 2018 at 11:48 pm
Thanks Amy! That just saved me a lot of time!
18 May 2018 at 6:57 am
No worries – glad you found it useful 🙂
15 June 2018 at 5:03 am
The rbind.all.columns function did not quite work for me, so I tried doing it a little differently. I'm running R 3.5.0 and it was giving me an error. I modified it a bit and used the following code:
rbind.all.columns <- function(x, y) {
  # compute both sets of missing column names up front
  x.diff <- setdiff(colnames(x), colnames(y))
  y.diff <- setdiff(colnames(y), colnames(x))
  # pad x with NA columns for anything only found in y
  x1 <- matrix(data = NA, nrow = nrow(x), ncol = length(y.diff))
  colnames(x1) <- as.character(y.diff)
  x <- tryCatch(cbind(x, x1), error = function(c) x)
  # pad y with NA columns for anything only found in x
  y1 <- matrix(data = NA, nrow = nrow(y), ncol = length(x.diff))
  colnames(y1) <- as.character(x.diff)
  y <- tryCatch(cbind(y, y1), error = function(c) y)
  return(rbind(x, y))
}
15 June 2018 at 2:24 pm
Thanks Sachin. Keeping things compatible with new versions is always a pain but glad you managed to get something to work.
20 July 2018 at 5:27 am
What if the column names are completely different and we have to match the records of the two dataframes?
20 July 2018 at 7:21 am
In this case (assuming that by "match" you mean appending the rows of one dataframe to the bottom of the other, as I have done in this example), you will need to rename at least some of the columns to be the same in both dataframes or else it won't work. Renaming can be done with:

names(myDataframe) <- c("a", "vector", "of", "names", "the", "same", "length", "as", "the", "number", "of", "columns", "in", "myDataframe")

If you mean "match" in the sense of joining two dataframes together side by side by a common column, then I'd use one of the join commands in dplyr. For example, if we have two datasets that represent sites (named site in one frame and place in the other), we could join them using full_join as below.
library(dplyr)
data1 <- data.frame(site = letters[1:5], rainfall= runif(5))
data2 <- data.frame(place = letters[3:7], elevation = runif(5) * 100)
allData <- full_join(data1, data2, by = c("site" = "place"))
This will give one final dataframe with three columns, with NA values in columns where data is missing in one of the datasets.
Hope this helps 🙂