Amy Whitehead's Research

the ecological musings of a conservation biologist

Combining dataframes when the columns don’t match

35 Comments

Most of my work recently has involved downloading large datasets of species occurrences from online databases and attempting to smoodge1 them together to create distribution maps for parts of Australia. Online databases typically have a ridiculous number of columns with obscure names which can make the smoodging process quite difficult.

For example, I was trying to combine data from two different regions into one file, where one region had 72 columns of data and another region had 75 columns. If you try and do this using rbind, you get an error but going through and identifying non-matching columns manually would be quite tedious and error-prone.

Here's an example of the function in use with some imaginary data. You'll note that Database One and Two have unequal number of columns (5 versus 6), a number of shared columns (species, latitude, longitude, database) and some unshared columns (method, data.source).

  species latitude longitude        method     database
1       p   -33.87     150.5   camera trap database.one
2       a   -33.71     151.3 live trapping database.one
3       n   -33.79     151.8   camera trap database.one
4       w   -34.35     151.3 live trapping database.one
5       h   -31.78     151.8   camera trap database.one
6       q   -33.17     151.2 live trapping database.one
      database species latitude longitude data.source accuracy
1 database.two       d   -33.95     152.7   herbarium    3.934
2 database.two       f   -32.60     150.2      museum    8.500
3 database.two       z   -32.47     150.7   herbarium    3.259
4 database.two       f   -30.67     150.6      museum    2.756
5 database.two       e   -32.73     149.4   herbarium    4.072
6 database.two       x   -33.49     153.3      museum    8.169
rbind(database.one, database.two)
Error: numbers of columns of arguments do not match

So I created a function that can be used to combine the data from two dataframes, keeping only the columns that have the same names (I don't care about the other ones). I'm sure there are other fancier ways of doing this but here's how my function works.

The basics steps
1. Specify the input dataframes
2. Calculate which dataframe has the greatest number of columns
3. Identify which columns in the smaller dataframe match the columns in the larger dataframe
4. Create a vector of the column names that occur in both dataframes
5. Combine the data from both dataframes matching the listed column names using rbind
6. Return the combined data

rbind.match.columns <- function(input1, input2) {
    n.input1 <- ncol(input1)
    n.input2 <- ncol(input2)

    if (n.input2 < n.input1) {
        TF.names <- which(names(input2) %in% names(input1))
        column.names <- names(input2[, TF.names])
    } else {
        TF.names <- which(names(input1) %in% names(input2))
        column.names <- names(input1[, TF.names])
    }

    return(rbind(input1[, column.names], input2[, column.names]))
}

rbind.match.columns(database.one, database.two)
   species latitude longitude     database
1        p   -33.87     150.5 database.one
2        a   -33.71     151.3 database.one
3        n   -33.79     151.8 database.one
4        w   -34.35     151.3 database.one
5        h   -31.78     151.8 database.one
6        q   -33.17     151.2 database.one
7        d   -33.95     152.7 database.two
8        f   -32.60     150.2 database.two
9        z   -32.47     150.7 database.two
10       f   -30.67     150.6 database.two
11       e   -32.73     149.4 database.two
12       x   -33.49     153.3 database.two

Running the function gives us a new dataframe with the four shared columns and twelve records, reflecting the combined data. Awesome!

Edited to add:

Viri asked a good question in the comments – what if you want to keep all of the columns in both data frames? The easiest solution to this problem is to add dummy columns to each dataframe that represent the columns missing from the other data frame and then use rbind to join them together. Of course, you won't actually have any data for these additional columns, so we simply set the values to NA. I've wrapped this up into a function as well.

rbind.all.columns <- function(x, y) {

    x.diff <- setdiff(colnames(x), colnames(y))
    y.diff <- setdiff(colnames(y), colnames(x))

    x[, c(as.character(y.diff))] <- NA

    y[, c(as.character(x.diff))] <- NA

    return(rbind(x, y))
}
rbind.all.columns(database.one, database.two)

And here you can see that we now have one dataframe containing all seven columns from our two sources, with NA values present where we are missing data from one of the dataframes. Nice!

   species latitude longitude        method     database data.source	accuracy
1        p   -33.87     150.5   camera trap database.one        <NA>	NA
2        a   -33.71     151.3 live trapping database.one        <NA>	NA
3        n   -33.79     151.8   camera trap database.one        <NA>	NA
4        w   -34.35     151.3 live trapping database.one        <NA>	NA	
5        h   -31.78     151.8   camera trap database.one        <NA>	NA
6        q   -33.17     151.2 live trapping database.one        <NA>	NA
7        d   -33.95     152.7          <NA> database.two   herbarium	3.934
8        f   -32.60     150.2          <NA> database.two      museum	8.500
9        z   -32.47     150.7          <NA> database.two   herbarium	3.259
10       f   -30.67     150.6          <NA> database.two      museum	2.756
11       e   -32.73     149.4          <NA> database.two   herbarium	4.072
12       x   -33.49     153.3          <NA> database.two      museum	8.169

Happy merging everyone!

1 A high technical and scientific term!

Bought to you by the powers of knitr & RWordpress

Advertisements

35 thoughts on “Combining dataframes when the columns don’t match

  1. Pingback: Recent Qaecologist blog posts (May 2013, Weeks 1 & 2) | Quantitative & Applied Ecology Group

  2. How about just doing this? No need to write a function.

    common.names <- intersect(colnames(database.one), colnames(database.two))
    combined.database <- rbind(database.one[, common.names], database.two[, common.names])

    • That is a much more elegant and simple solution. I didn’t know about intersect, so thanks for the tip!

    • if I have three or more data how I can do , also use intersect or ?? thanks

      • Hi, thanks for your question. I have typically been joining three or more dataframes together and do it in an iterative process by attaching two together and then the third to the merged dataframe (see below). This works pretty well, although could be a bit slow and tedious if you had many dataframes. Hope that helps but I’d be interested to know if you come up with a better solution for combining multiple dataframes.
        all.data <- rbind.match.columns(data.one,data.two)
        all.data <- rbind.match.columns(all.data,data.three)
        ….

      • Hi 李麗花 and Amy,

        Yes, Amy is right with the iterative approach.
        However, no need to write it out manually. This is exactly what is called Reduce in functional programming.

        What does “?Reduce” do? It takes a list of arguments and applies a specified function always to the result of the first application and a new element. The first application is ususally simply the first two elements. (There are some more details such as the order of the reductions (e.g. f(f(f(a,b),c),d) is not the same as f(f(a,b),f(c,d)) ) , but in the example here this doesn’t make a difference.

        With the given example it would be:

        all.data <- Reduce( rbind.all.columns, list( data.one, data.two, data.three ) )

        Fabian

      • Thanks Fabian – I didn’t know about the Reduce function but it looks like it would be super handy for these types of applications.

  3. The functions works great for my data, thanks!! opposite to the WoA suggestion… I also would like to do the same but showing the empty columns from the arguments that doesn’t match. Do you know how can I do that?

    • Thanks Viri. I had a think about your question and it’s actually not that difficult. It seems that the easiest option would be to add additional columns to your data frames so that you capture all the missing ones and set the values to NA. I’ve written this into another function and updated the post.

  4. Hi Amy, thanks so much for this blog post, this is exactly what I was trying to find! Especially the edit from Viri’s suggestions. Happy days!

  5. I like this but end up with the result in my functions environment and not as a dataframe! I think I am doing something wrong. My species data runs along the top (x) as oppose to down the side (y) of my two dataframes any tips for this?

    • Hi Rob,
      That is a little strange that your output ends up in the functions environment. Without seeing your code, I don’t really have a solution to that. With your second question, are you interested in merging the datasets by columns (ie species) or rows (whatever your other variables are)? Depending on how your dataframes are structured, you could either transpose them both (using t(dataframe1) so that they are in a format that matches my example or you may need to look at using cbind (or a hack thereof similar to what I have done here) which is a way of combining dataframes by columns. I’m happy to give you a more specific answer to both your questions if you want to send me a coded example.
      Good luck!

  6. Say I have 10 files (.txt) and each file has 101(rows) X 6 column of data. All the columns for all the files are identical. I want to merge the 1st column of each file to create a 101 (row) x 10 column merged data. How would I do it?

    Bernard

    • Hi Bernard,
      This is a really simple problem, which I’ve outlined in the code below using some dummy data. You can either do it manually or I’ve shown how you could do it in a loop, which might be useful if you have a large number of files.


      # Ten dummy datasets named "data1" to "data10" filled with random data
      for(i in 1:10){
      assign(paste0("data",i),data.frame(column1=runif(101),column2=runif(101),column3=runif(101),column4=runif(101),column5=runif(101),column6=runif(101)))
      }

      # The manual way
      output.manual <- data.frame(data1[,1],data2[,1],data3[,1],data4[,1],data5[,1],data6[,1],data7[,1],data8[,1],data9[,1],data10[,1])
      colnames(output.manual) <- c("data1","data2","data3","data4","data5","data6","data7","data8","data9","data10")

      ## Doing it in a loop
      # create an empty data frame to put your results into
      output.loop <- as.data.frame(matrix(NA,101,10))

      for(i in 1:10){
      # get the data from each dataset iteratively
      d <- get(paste0("data",i))
      # add to the output data
      output.loop[,i] <- d[,1]
      # rename the column based on the name of the data origin
      colnames(output.loop)[i] <- paste0("data",i)
      }

      Hope that helps.

      • Thanks Amy,

        The magic of someone who knows R well. Just as an additional 2 questions:

        1) I have 6 columns in each data file, and the above solution is great to append the same column from each file. I want to append all the 1st column of the 6 files (into one merged file), the 2nd column of all files (into another), etc…. I am preparing my dataset for PCA.

        2) in addition, if I want to concatenate one column from each file one-on top of the other (instead of the above), to create 101X 6 rows of data, what must I change? I am doing research in gait biomechanics.

        Very grateful,
        Bernard

      • Hi Bernard,

        Question 1:
        Essentially you just need to wrap the piece of code I presented before in another loop that goes through each column in the dataframe.

        for(j in 1:6){
        output <- output.loop

        for(i in 1:10){
        d <- get(paste0("data",i))
        output[,i] <- d[,j]
        colnames(output)[i] <- paste0("data",i)
        }
        assign(paste0("output.",j),output)
        rm(output)
        }

        This will produce six dataframes called output1 – output6

  7. Very helpful! Thank you!

  8. use dplyr’s function, bind_rows.

  9. Check out rbind.fill which does your second version (keeps all columns, inserts NAs where data is missing)

  10. No need for a new function if trying to merge two dataframes:
    mergeddfs <- data.frame(df1,df2)

    Nice website tho

  11. library(data.table)
    consolid<-rbindlist(list(newdt,newdt2), fill=TRUE, use.names = TRUE)

  12. An extended version of the above rbind.all.columns is implemented as RbindAll in package SSBtools.

    https://CRAN.R-project.org/package=SSBtools

  13. I was looking for this today and I like your method just as much as anything else (e.g. smartbind, rbind.fill).
    I did notice that some condition checking helped with some edge cases:

    if (length(x) > 0 & length(y.diff) > 0) {
    x[, c(as.character(y.diff))] 0 & length(x.diff) > 0) {
    y[, c(as.character(x.diff))] <- NA
    }

  14. Hi Amy,
    I found this post really helpful, exactly what I was looking for
    I have around 2000 files in a folder which i have to merge
    Could you please help with this

    • Hi Padhu,
      Glad you found it helpful.
      Your problem should be reasonably easy to solve with the following code – you’ll note that I’m not using the functions outlined in this post as there are new functions in dplyr that achieve the same outcome.


      require(dplyr)

      ThisPath <- "The path where your data lives"
      TheseFiles <- dir(myPath, pattern = "\.csv$") # identify the files you want to read in

      # read in all the files into a list
      dataList <- pblapply(TheseFiles,function(myFile) read.csv(paste0(ThisPath,myFile)))
      names(dataList) <- TheseFiles

      # merge the data files
      mergedData % bind_rows(.id="TheseFiles")

      Hopefully this works for you but let me know if you have any questions

  15. Thank yoy Amy. I was searching for this.. Added to my bookmarks

  16. I have a list of dataframes with different colnames and i want to rbind all dataframes intersecting names. Is it posible?

    • Hi Sergio,
      If I understand your question correctly, you have a list that contains a number of dataframes. These dataframes have some common column names and some column names that are only found in a subset of the dataframes.

      The method I’ve suggested below uses the bind_rows function in dplyr as this deals with lists better than the functions that this post deals with. By default, bind_rows retains all columns and fills the missing data with NAs but, if you identify the common column names across the dataframes within your list (TheseNames), you can use select to only retain the columns that are common across the dataframes.


      require(dplyr)
      # generate a dummy list with dataframes
      dataList <- list()
      dataList[["listElement1"]] <- data.frame(column1=runif(10),column2=runif(10),column3=runif(10))
      dataList[["listElement2"]] <- data.frame(column1=runif(10),column3=runif(10),letters=letters[1:10])

      # identify the common columns
      TheseNames <- unlist(lapply(names(dataList),function(x){names(dataList[[x]])}))
      TheseNames <- c("Source",TheseNames[duplicated(TheseNames)])

      # bind the dataframes and retain only the common columns
      mergedData % bind_rows(.id = "Source") %>% select(one_of(TheseNames))

      Hopefully that makes sense and answers your question but give me a yell if you have any questions.

  17. I had a unique problem and found a way to solve it.

    I had a structured master data frame with x amount columns and everyday I was adding data (rows) to this table. The problem is that sometimes the new data would have either more or less columns than the master table. I thought to myself, how would I do this in SQL ? Then it dawned on me and I began my research :

    fortunately for us, dplyr has a useful function called union_all.

    Master_Table <- union_all(Master_Table, New_Data)

    This not only aligns the columns that exist, but gives NA for the ones that do not and appends new columns if they don't exist on one side. The perfect solution.

    Thought I share

    Best,

    • Thanks Christopher – there are so many useful functions in the dplyr package that can be helpful for solving these types of problems!

  18. This is fantastic!

    I changed “NA” by “0” in the function rbind.all.columns in order to match two species matrices that have both shared and exclusive species. Very helpful. Thanks!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s