• Uncategorized

About r : How-to-import-multiple-csv-files-at-once

Question Detail

Suppose we have a folder containing multiple data.csv files, each containing the same number of variables but each from different times.
Is there a way in R to import them all simultaneously rather than having to import them all individually?

My problem is that I have around 2000 data files to import and having to import them individually just by using the code:

read.delim(file="filename", header=TRUE, sep="\t")

is not very efficient.

Question Answer

Something like the following should result in each data frame as a separate element in a single list:

temp = list.files(pattern="*.csv")
myfiles = lapply(temp, read.delim)

This assumes that you have those CSVs in a single directory–your current working directory–and that all of them have the lower-case extension .csv.

If you then want to combine those data frames into a single data frame, see the solutions in other answers using things like do.call(rbind,...), dplyr::bind_rows() or data.table::rbindlist().

If you really want each data frame in a separate object, even though that’s often inadvisable, you could do the following with assign:

temp = list.files(pattern="*.csv")
for (i in 1:length(temp)) assign(temp[i], read.csv(temp[i]))

Or, without assign, and to demonstrate (1) how the file name can be cleaned up and (2) show how to use list2env, you can try the following:

temp = list.files(pattern="*.csv")
list2env(
  lapply(setNames(temp, make.names(gsub("*.csv$", "", temp))), 
         read.csv), envir = .GlobalEnv)

But again, it’s often better to leave them in a single list.

A speedy and succinct tidyverse solution:
(more than twice as fast as Base R’s read.csv)

tbl <-
    list.files(pattern = "*.csv") %>% 
    map_df(~read_csv(.))

and data.table‘s fread() can even cut those load times by half again. (for 1/4 the Base R times)

library(data.table)

tbl_fread <- 
    list.files(pattern = "*.csv") %>% 
    map_df(~fread(.))

The stringsAsFactors = FALSE argument keeps the dataframe factor free, (and as marbel points out, is the default setting for fread)

If the typecasting is being cheeky, you can force all the columns to be as characters with the col_types argument.

tbl <-
    list.files(pattern = "*.csv") %>% 
    map_df(~read_csv(., col_types = cols(.default = "c")))

If you are wanting to dip into subdirectories to construct your list of files to eventually bind, then be sure to include the path name, as well as register the files with their full names in your list. This will allow the binding work to go on outside of the current directory. (Thinking of the full pathnames as operating like passports to allow movement back across directory ‘borders’.)

tbl <-
    list.files(path = "./subdirectory/",
               pattern = "*.csv", 
               full.names = T) %>% 
    map_df(~read_csv(., col_types = cols(.default = "c"))) 

As Hadley describes here (about halfway down):

map_df(x, f) is effectively the same as do.call("rbind", lapply(x, f))….

Bonus Featureadding filenames to the records per Niks feature request in comments below:
* Add original filename to each record.

Code explained: make a function to append the filename to each record during the initial reading of the tables. Then use that function instead of the simple read_csv() function.

read_plus <- function(flnm) {
    read_csv(flnm) %>% 
        mutate(filename = flnm)
}

tbl_with_sources <-
    list.files(pattern = "*.csv", 
               full.names = T) %>% 
    map_df(~read_plus(.))

(The typecasting and subdirectory handling approaches can also be handled inside the read_plus() function in the same manner as illustrated in the second and third variants suggested above.)

### Benchmark Code & Results 
library(tidyverse)
library(data.table)
library(microbenchmark)

### Base R Approaches
#### Instead of a dataframe, this approach creates a list of lists
#### removed from analysis as this alone doubled analysis time reqd
# lapply_read.delim <- function(path, pattern = "*.csv") {
#     temp = list.files(path, pattern, full.names = TRUE)
#     myfiles = lapply(temp, read.delim)
# }

#### `read.csv()`
do.call_rbind_read.csv <- function(path, pattern = "*.csv") {
    files = list.files(path, pattern, full.names = TRUE)
    do.call(rbind, lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE)))
}

map_df_read.csv <- function(path, pattern = "*.csv") {
    list.files(path, pattern, full.names = TRUE) %>% 
    map_df(~read.csv(., stringsAsFactors = FALSE))
}


### *dplyr()*
#### `read_csv()`
lapply_read_csv_bind_rows <- function(path, pattern = "*.csv") {
    files = list.files(path, pattern, full.names = TRUE)
    lapply(files, read_csv) %>% bind_rows()
}

map_df_read_csv <- function(path, pattern = "*.csv") {
    list.files(path, pattern, full.names = TRUE) %>% 
    map_df(~read_csv(., col_types = cols(.default = "c")))
}

### *data.table* / *purrr* hybrid
map_df_fread <- function(path, pattern = "*.csv") {
    list.files(path, pattern, full.names = TRUE) %>% 
    map_df(~fread(.))
}

### *data.table*
rbindlist_fread <- function(path, pattern = "*.csv") {
    files = list.files(path, pattern, full.names = TRUE)
    rbindlist(lapply(files, function(x) fread(x)))
}

do.call_rbind_fread <- function(path, pattern = "*.csv") {
    files = list.files(path, pattern, full.names = TRUE)
    do.call(rbind, lapply(files, function(x) fread(x, stringsAsFactors = FALSE)))
}


read_results <- function(dir_size){
    microbenchmark(
        # lapply_read.delim = lapply_read.delim(dir_size), # too slow to include in benchmarks
        do.call_rbind_read.csv = do.call_rbind_read.csv(dir_size),
        map_df_read.csv = map_df_read.csv(dir_size),
        lapply_read_csv_bind_rows = lapply_read_csv_bind_rows(dir_size),
        map_df_read_csv = map_df_read_csv(dir_size),
        rbindlist_fread = rbindlist_fread(dir_size),
        do.call_rbind_fread = do.call_rbind_fread(dir_size),
        map_df_fread = map_df_fread(dir_size),
        times = 10L) 
}

read_results_lrg_mid_mid <- read_results('./testFolder/500MB_12.5MB_40files')
print(read_results_lrg_mid_mid, digits = 3)

read_results_sml_mic_mny <- read_results('./testFolder/5MB_5KB_1000files/')
read_results_sml_tny_mod <- read_results('./testFolder/5MB_50KB_100files/')
read_results_sml_sml_few <- read_results('./testFolder/5MB_500KB_10files/')

read_results_med_sml_mny <- read_results('./testFolder/50MB_5OKB_1000files')
read_results_med_sml_mod <- read_results('./testFolder/50MB_5OOKB_100files')
read_results_med_med_few <- read_results('./testFolder/50MB_5MB_10files')

read_results_lrg_sml_mny <- read_results('./testFolder/500MB_500KB_1000files')
read_results_lrg_med_mod <- read_results('./testFolder/500MB_5MB_100files')
read_results_lrg_lrg_few <- read_results('./testFolder/500MB_50MB_10files')

read_results_xlg_lrg_mod <- read_results('./testFolder/5000MB_50MB_100files')


print(read_results_sml_mic_mny, digits = 3)
print(read_results_sml_tny_mod, digits = 3)
print(read_results_sml_sml_few, digits = 3)

print(read_results_med_sml_mny, digits = 3)
print(read_results_med_sml_mod, digits = 3)
print(read_results_med_med_few, digits = 3)

print(read_results_lrg_sml_mny, digits = 3)
print(read_results_lrg_med_mod, digits = 3)
print(read_results_lrg_lrg_few, digits = 3)

print(read_results_xlg_lrg_mod, digits = 3)

# display boxplot of my typical use case results & basic machine max load
par(oma = c(0,0,0,0)) # remove overall margins if present
par(mfcol = c(1,1)) # remove grid if present
par(mar = c(12,5,1,1) + 0.1) # to display just a single boxplot with its complete labels
boxplot(read_results_lrg_mid_mid, las = 2, xlab = "", ylab = "Duration (seconds)", main = "40 files @ 12.5MB (500MB)")
boxplot(read_results_xlg_lrg_mod, las = 2, xlab = "", ylab = "Duration (seconds)", main = "100 files @ 50MB (5GB)")

# generate 3x3 grid boxplots
par(oma = c(12,1,1,1)) # margins for the whole 3 x 3 grid plot
par(mfcol = c(3,3)) # create grid (filling down each column)
par(mar = c(1,4,2,1)) # margins for the individual plots in 3 x 3 grid
boxplot(read_results_sml_mic_mny, las = 2, xlab = "", ylab = "Duration (seconds)", main = "1000 files @ 5KB (5MB)", xaxt = 'n')
boxplot(read_results_sml_tny_mod, las = 2, xlab = "", ylab = "Duration (milliseconds)", main = "100 files @ 50KB (5MB)", xaxt = 'n')
boxplot(read_results_sml_sml_few, las = 2, xlab = "", ylab = "Duration (milliseconds)", main = "10 files @ 500KB (5MB)",)

boxplot(read_results_med_sml_mny, las = 2, xlab = "", ylab = "Duration (microseconds)        ", main = "1000 files @ 50KB (50MB)", xaxt = 'n')
boxplot(read_results_med_sml_mod, las = 2, xlab = "", ylab = "Duration (microseconds)", main = "100 files @ 500KB (50MB)", xaxt = 'n')
boxplot(read_results_med_med_few, las = 2, xlab = "", ylab = "Duration (seconds)", main = "10 files @ 5MB (50MB)")

boxplot(read_results_lrg_sml_mny, las = 2, xlab = "", ylab = "Duration (seconds)", main = "1000 files @ 500KB (500MB)", xaxt = 'n')
boxplot(read_results_lrg_med_mod, las = 2, xlab = "", ylab = "Duration (seconds)", main = "100 files @ 5MB (500MB)", xaxt = 'n')
boxplot(read_results_lrg_lrg_few, las = 2, xlab = "", ylab = "Duration (seconds)", main = "10 files @ 50MB (500MB)")

Middling Use Case

Larger Use Case

Variety of Use Cases

Rows: file counts (1000, 100, 10)
Columns: final dataframe size (5MB, 50MB, 500MB)
(click on image to view original size)

The base R results are better for the smallest use cases where the overhead of bringing the C libraries of purrr and dplyr to bear outweigh the performance gains that are observed when performing larger scale processing tasks.

if you want to run your own tests you may find this bash script helpful.

for ((i=1; i<=$2; i++)); do 
  cp "$1" "${1:0:8}_${i}.csv";
done

bash what_you_name_this_script.sh "fileName_you_want_copied" 100 will create 100 copies of your file sequentially numbered (after the initial 8 characters of the filename and an underscore).

Attributions and Appreciations

With special thanks to:

  • Tyler Rinker and Akrun for demonstrating microbenchmark.
  • Jake Kaupp for introducing me to map_df() here.
  • David McLaughlin for helpful feedback on improving the visualizations and discussing/confirming the performance inversions observed in the small file, small dataframe analysis results.
  • marbel for pointing out the default behavior for fread(). (I need to study up on data.table.)

Here are some options to convert the .csv files into one data.frame using R base and some of the available packages for reading files in R.

This is slower than the options below.

# Get the files names
files = list.files(pattern="*.csv")
# First apply read.csv, then rbind
myfiles = do.call(rbind, lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE)))

Edit: – A few more extra choices using data.table and readr

A fread() version, which is a function of the data.table package. This is by far the fastest option in R.

library(data.table)
DT = do.call(rbind, lapply(files, fread))
# The same using `rbindlist`
DT = rbindlist(lapply(files, fread))

Using readr, which is another package for reading csv files. It’s slower than fread, faster than base R but has different functionalities.

library(readr)
library(dplyr)
tbl = lapply(files, read_csv) %>% bind_rows()

With many files and many cores, fread xargs cat (described below) is about 50x faster than the fastest solution in the top 3 answers.

rbindlist lapply read.delim  500s <- 1st place & accepted answer
rbindlist lapply fread       250s <- 2nd & 3rd place answers
rbindlist mclapply fread      10s
fread xargs cat                5s

Time to read 121401 csvs into a single data.table. Each time is an average of three runs then rounded. Each csv has 3 columns, one header row, and, on average, 4.510 rows. Machine is a GCP VM with 96 cores.

The top three answers by @A5C1D2H2I1M1N2O1R2T1, @leerssej, and @marbel and are all essentially the same: apply fread (or read.delim) to each file, then rbind/rbindlist the resulting data.tables. For small datasets, I usually use the rbindlist(lapply(list.files("*.csv"),fread)) form. For medium-sized datasets, I use parallel’s mclapply instead of lapply, which is much faster if you have many cores.

This is better than other R-internal alternatives, but not the best for a large number of small csvs when speed matters. In that case, it can be much faster to first use cat to first concatenate all the csvs into one csv, as in @Spacedman’s answer. I’ll add some detail on how to do this from within R:

x = fread(cmd='cat *.csv', header=F)

However, what if each csv has a header?

x = fread(cmd="awk 'NR==1||FNR!=1' *.csv", header=T)

And what if you have so many files that the *.csv shell glob fails?

x = fread(cmd='find . -name "*.csv" | xargs cat', header=F)

And what if all files have a header AND there are too many files?

header = fread(cmd='find . -name "*.csv" | head -n1 | xargs head -n1', header=T)
x = fread(cmd='find . -name "*.csv" | xargs tail -q -n+2', header=F)
setnames(x,header)

And what if the resulting concatenated csv is too big for system memory? (e.g., /dev/shm out of space error)

system('find . -name "*.csv" | xargs cat > combined.csv')
x = fread('combined.csv', header=F)

With headers?

system('find . -name "*.csv" | head -n1 | xargs head -n1 > combined.csv')
system('find . -name "*.csv" | xargs tail -q -n+2 >> combined.csv')
x = fread('combined.csv', header=T)

Finally, what if you don’t want all .csv in a directory, but rather a specific set of files? (Also, they all have headers.) (This is my use case.)

fread(text=paste0(system("xargs cat|awk 'NR==1||$1!=\"<column one name>\"'",input=paths,intern=T),collapse="\n"),header=T,sep="\t")

and this is about the same speed as plain fread xargs cat 🙂

Note: for data.table pre-v1.11.6 (19 Sep 2018), omit the cmd= from fread(cmd=.

To sum up, if you’re interested in speed, and have many files and many cores, fread xargs cat is about 50x faster than the fastest solution in the top 3 answers.

Update: here is a function I wrote to easily apply the fastest solution. I use it in production in several situations, but you should test it thoroughly with your own data before trusting it.

fread_many = function(files,header=T,...){
  if(length(files)==0) return()
  if(typeof(files)!='character') return()
  files = files[file.exists(files)]
  if(length(files)==0) return()
  tmp = tempfile(fileext = ".csv")
  # note 1: requires awk, not cat or tail because some files have no final newline
  # note 2: parallel --xargs is 40% slower
  # note 3: reading to var is 15% slower and crashes R if the string is too long
  # note 4: shorter paths -> more paths per awk -> fewer awks -> measurably faster
  #         so best cd to the csv dir and use relative paths
  if(header==T){
    system(paste0('head -n1 ',files[1],' > ',tmp))
    system(paste0("xargs awk 'FNR>1' >> ",tmp),input=files)
  } else {
    system(paste0("xargs awk '1' > ",tmp),input=files)
  }
  DT = fread(file=tmp,header=header,...)
  file.remove(tmp)
  DT
}

Update 2: here is a more complicated version of the fread_many function for cases where you want the resulting data.table to include a column for the inpath of each csv. In this case, one must also explicitly specify the csv separator with the sep argument.

fread_many = function(files,header=T,keep_inpath=F,sep="auto",...){
  if(length(files)==0) return()
  if(typeof(files)!='character') return()
  files = files[file.exists(files)]
  if(length(files)==0) return()
  tmp = tempfile(fileext = ".csv")
  if(keep_inpath==T){
    stopifnot(sep!="auto")
    if(header==T){
      system(paste0('/usr/bin/echo -ne inpath"',sep,'" > ',tmp))
      system(paste0('head -n1 ',files[1],' >> ',tmp))
      system(paste0("xargs awk -vsep='",sep,"' 'BEGIN{OFS=sep}{if(FNR>1)print FILENAME,$0}' >> ",tmp),input=files)
    } else {
      system(paste0("xargs awk -vsep='",sep,"' 'BEGIN{OFS=sep}{print FILENAME,$0}' > ",tmp),input=files)
    }
  } else {
    if(header==T){
      system(paste0('head -n1 ',files[1],' > ',tmp))
      system(paste0("xargs awk 'FNR>1' >> ",tmp),input=files)
    } else {
      system(paste0("xargs awk '1' > ",tmp),input=files)
    }
  }
  DT = fread(file=tmp,header=header,sep=sep,...)
  file.remove(tmp)
  DT
}

Caveat: all of my solutions that concatenate the csvs before reading them assumes they all have the same separator. If not all of your csvs use the same delimiter, instead use rbindlist lapply fread, rbindlist mclapply fread, or fread xargs cat in batches, where all csvs in a batch use the same separator.

As well as using lapply or some other looping construct in R you could merge your CSV files into one file.

In Unix, if the files had no headers, then its as easy as:

cat *.csv > all.csv

or if there are headers, and you can find a string that matches headers and only headers (ie suppose header lines all start with “Age”), you’d do:

cat *.csv | grep -v ^Age > all.csv

I think in Windows you could do this with COPY and SEARCH (or FIND or something) from the DOS command box, but why not install cygwin and get the power of the Unix command shell?

This is the code I developed to read all csv files into R. It will create a dataframe for each csv file individually and title that dataframe the file’s original name (removing spaces and the .csv) I hope you find it useful!

path <- "C:/Users/cfees/My Box Files/Fitness/"
files <- list.files(path=path, pattern="*.csv")
for(file in files)
{
perpos <- which(strsplit(file, "")[[1]]==".")
assign(
gsub(" ","",substr(file, 1, perpos-1)), 
read.csv(paste(path,file,sep="")))
}

In my view, most of the other answers are obsoleted by rio::import_list, which is a succinct one-liner:

library(rio)
my_data <- import_list(dir("path_to_directory", pattern = ".csv"), rbind = TRUE)

Any extra arguments are passed to rio::import. rio can deal with almost any file format R can read, and it uses data.table‘s fread where possible, so it should be fast too.

Using plyr::ldply there is roughly a 50% speed increase by enabling the .parallel option while reading 400 csv files roughly 30-40 MB each. Example includes a text progress bar.

library(plyr)
library(data.table)
library(doSNOW)

csv.list <- list.files(path="t:/data", pattern=".csv$", full.names=TRUE)

cl <- makeCluster(4)
registerDoSNOW(cl)

pb <- txtProgressBar(max=length(csv.list), style=3)
pbu <- function(i) setTxtProgressBar(pb, i)
dt <- setDT(ldply(csv.list, fread, .parallel=TRUE, .paropts=list(.options.snow=list(progress=pbu))))

stopCluster(cl)

Using purrr and including file IDs as a column:

library(tidyverse)


p <- "my/directory"
files <- list.files(p, pattern="csv", full.names=TRUE) %>%
    set_names()
merged <- files %>% map_dfr(read_csv, .id="filename")

Without set_names(), .id= will use integer indicators, instead of actual file names.

If you then want just the short filename without the full path:

merged <- merged %>% mutate(filename=basename(filename))

Building on dnlbrk’s comment, assign can be considerably faster than list2env for big files.

library(readr)
library(stringr)

List_of_file_paths <- list.files(path ="C:/Users/Anon/Documents/Folder_with_csv_files/", pattern = ".csv", all.files = TRUE, full.names = TRUE)

By setting the full.names argument to true, you will get the full path to each file as a separate character string in your list of files, e.g., List_of_file_paths[1] will be something like “C:/Users/Anon/Documents/Folder_with_csv_files/file1.csv”

for(f in 1:length(List_of_filepaths)) {
  file_name <- str_sub(string = List_of_filepaths[f], start = 46, end = -5)
  file_df <- read_csv(List_of_filepaths[f])  
  assign( x = file_name, value = file_df, envir = .GlobalEnv)
}

You could use the data.table package’s fread or base R read.csv instead of read_csv. The file_name step allows you to tidy up the name so that each data frame does not remain with the full path to the file as it’s name.
You could extend your loop to do further things to the data table before transferring it to the global environment, for example:

for(f in 1:length(List_of_filepaths)) {
  file_name <- str_sub(string = List_of_filepaths[f], start = 46, end = -5)
  file_df <- read_csv(List_of_filepaths[f])  
  file_df <- file_df[,1:3] #if you only need the first three columns
  assign( x = file_name, value = file_df, envir = .GlobalEnv)
}

This is my specific example to read multiple files and combine them into 1 data frame:

path<- file.path("C:/folder/subfolder")
files <- list.files(path=path, pattern="/*.csv",full.names = T)
library(data.table)
data = do.call(rbind, lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE)))

With readr 2.0.0 onwards, you can read multiple files in at once simply by providing a list of their paths to the file argument. Here is an example showing this with readr::read_csv().

packageVersion("readr")
#> [1] '2.0.1'
library(readr)
library(fs)

# create files to read in
write_csv(read_csv("1, 2 \n 3, 4", col_names = c("x", "y")), file = "file1.csv")
write_csv(read_csv("5, 6 \n 7, 8", col_names = c("x", "y")), file = "file2.csv")

# get a list of files
files <- dir_ls(".", glob = "file*csv")
files
#> file1.csv file2.csv

# read them in at once
# record paths in a column called filename
read_csv(files, id = "filename")
#> # A tibble: 4 × 3
#>   filename      x     y
#>   <chr>     <dbl> <dbl>
#> 1 file1.csv     1     2
#> 2 file1.csv     3     4
#> 3 file2.csv     5     6
#> 4 file2.csv     7     8

Created on 2021-09-16 by the reprex package (v2.0.1)

The following codes should give you the fastest speed for big data as long as you have many cores on your computer:

if (!require("pacman")) install.packages("pacman")
pacman::p_load(doParallel, data.table, stringr)

# get the file name
dir() %>% str_subset("\\.csv$") -> fn

# use parallel setting
(cl <- detectCores() %>%
  makeCluster()) %>%
  registerDoParallel()

# read and bind all files together
system.time({
  big_df <- foreach(
    i = fn,
    .packages = "data.table"
  ) %dopar%
    {
      fread(i, colClasses = "character")
    } %>%
    rbindlist(fill = TRUE)
})

# end of parallel work
stopImplicitCluster(cl)

Updated in 2020/04/16:
As I find a new package available for parallel computation, an alternative solution is provided using the following codes.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(future.apply, data.table, stringr)

# get the file name
dir() %>% str_subset("\\.csv$") -> fn

plan(multiprocess)

future_lapply(fn,fread,colClasses = "character") %>% 
  rbindlist(fill = TRUE) -> res

# res is the merged data.table

I like the approach using list.files(), lapply() and list2env() (or fs::dir_ls(), purrr::map() and list2env()). That seems simple and flexible.

Alternatively, you may try the small package {tor} (to-R): By default it imports files from the working directory into a list (list_*() variants) or into the global environment (load_*() variants).

For example, here I read all the .csv files from my working directory into a list using tor::list_csv():

library(tor)

dir()
#>  [1] "_pkgdown.yml"     "cran-comments.md" "csv1.csv"        
#>  [4] "csv2.csv"         "datasets"         "DESCRIPTION"     
#>  [7] "docs"             "inst"             "LICENSE.md"      
#> [10] "man"              "NAMESPACE"        "NEWS.md"         
#> [13] "R"                "README.md"        "README.Rmd"      
#> [16] "tests"            "tmp.R"            "tor.Rproj"

list_csv()
#> $csv1
#>   x
#> 1 1
#> 2 2
#> 
#> $csv2
#>   y
#> 1 a
#> 2 b

And now I load those files into my global environment with tor::load_csv():

# The working directory contains .csv files
dir()
#>  [1] "_pkgdown.yml"     "cran-comments.md" "CRAN-RELEASE"    
#>  [4] "csv1.csv"         "csv2.csv"         "datasets"        
#>  [7] "DESCRIPTION"      "docs"             "inst"            
#> [10] "LICENSE.md"       "man"              "NAMESPACE"       
#> [13] "NEWS.md"          "R"                "README.md"       
#> [16] "README.Rmd"       "tests"            "tmp.R"           
#> [19] "tor.Rproj"

load_csv()

# Each file is now available as a dataframe in the global environment
csv1
#>   x
#> 1 1
#> 2 2
csv2
#>   y
#> 1 a
#> 2 b

Should you need to read specific files, you can match their file-path with regexp, ignore.case and invert.


For even more flexibility use list_any(). It allows you to supply the reader function via the argument .f.

(path_csv <- tor_example("csv"))
#> [1] "C:/Users/LeporeM/Documents/R/R-3.5.2/library/tor/extdata/csv"
dir(path_csv)
#> [1] "file1.csv" "file2.csv"

list_any(path_csv, read.csv)
#> $file1
#>   x
#> 1 1
#> 2 2
#> 
#> $file2
#>   y
#> 1 a
#> 2 b

Pass additional arguments via … or inside the lambda function.

path_csv %>% 
  list_any(readr::read_csv, skip = 1)
#> Parsed with column specification:
#> cols(
#>   `1` = col_double()
#> )
#> Parsed with column specification:
#> cols(
#>   a = col_character()
#> )
#> $file1
#> # A tibble: 1 x 1
#>     `1`
#>   <dbl>
#> 1     2
#> 
#> $file2
#> # A tibble: 1 x 1
#>   a    
#>   <chr>
#> 1 b

path_csv %>% 
  list_any(~read.csv(., stringsAsFactors = FALSE)) %>% 
  map(as_tibble)
#> $file1
#> # A tibble: 2 x 1
#>       x
#>   <int>
#> 1     1
#> 2     2
#> 
#> $file2
#> # A tibble: 2 x 1
#>   y    
#>   <chr>
#> 1 a    
#> 2 b

It was requested that I add this functionality to the stackoverflow R package. Given that it is a tinyverse package (and can’t depend on third party packages), here is what I came up with:

#' Bulk import data files 
#' 
#' Read in each file at a path and then unnest them. Defaults to csv format.
#' 
#' @param path        a character vector of full path names
#' @param pattern     an optional \link[=regex]{regular expression}. Only file names which match the regular expression will be returned.
#' @param reader      a function that can read data from a file name.
#' @param ...         optional arguments to pass to the reader function (eg \code{stringsAsFactors}).
#' @param reducer     a function to unnest the individual data files. Use I to retain the nested structure. 
#' @param recursive     logical. Should the listing recurse into directories?
#'  
#' @author Neal Fultz
#' @references \url{https://stackoverflow.com/questions/11433432/how-to-import-multiple-csv-files-at-once}
#' 
#' @importFrom utils read.csv
#' @export
read.directory <- function(path='.', pattern=NULL, reader=read.csv, ..., 
                           reducer=function(dfs) do.call(rbind.data.frame, dfs), recursive=FALSE) {
  files <- list.files(path, pattern, full.names = TRUE, recursive = recursive)

  reducer(lapply(files, reader, ...))
}

By parameterizing the reader and reducer function, people can use data.table or dplyr if they so choose, or just use the base R functions that are fine for smaller data sets.

You may also like...

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.