Accessing S3 data in FaaSr

Overview

When FaaSr functions execute in the cloud, they start from a blank slate: no input files are available locally. Likewise, when they finish execution, outputs are not saved automatically, so it is your responsibility to save any output that should persist. This is because FaaS platforms are stateless, i.e., no persistent state (e.g. files) is available at startup or kept after exit unless your code explicitly retrieves or stores it. Hence, FaaSr functions typically follow a three-step pattern: get the inputs from S3, compute on them locally, and put the outputs back to S3.

Getting/putting files from/to S3

The simplest way to get/put files from/to S3 is to use the faasr_get_file() and faasr_put_file() functions. Worked examples of both appear in the companion vignette for a single function and the companion vignette for a simple workflow.
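As a minimal sketch of the get/compute/put pattern, the function below copies two CSV inputs from the default S3 bucket to local files, adds them, and uploads the result. The function name compute_sum_files is hypothetical, and the argument names (remote_folder, remote_file, local_file) are assumed to match your installed FaaSr version; check them against the package documentation. The code only runs inside a FaaSr action, where the faasr_* functions are available.

```r
# Sketch only: relies on faasr_get_file(), faasr_put_file() and faasr_log()
# being provided by the FaaSr runtime when the function is invoked.
compute_sum_files <- function(folder, input1, input2, output) {
  # 1. Get: copy the inputs from the default S3 bucket to local files
  faasr_get_file(remote_folder = folder, remote_file = input1, local_file = "input1.csv")
  faasr_get_file(remote_folder = folder, remote_file = input2, local_file = "input2.csv")

  # 2. Compute: read the local copies and add them element-wise
  frame_input1 <- read.csv("input1.csv")
  frame_input2 <- read.csv("input2.csv")
  frame_output <- frame_input1 + frame_input2

  # 3. Put: write the result to a local file, then upload it to S3
  write.csv(frame_output, "output.csv", row.names = FALSE)
  faasr_put_file(local_file = "output.csv", remote_folder = folder, remote_file = output)

  # Record completion in the FaaSr log
  faasr_log(paste0("Output written to ", folder, "/", output, " in default S3 bucket"))
}
```

The local file names (input1.csv, input2.csv, output.csv) are arbitrary scratch names; only the remote names passed as arguments persist after the function exits.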

Using Arrow and S3

Apache Arrow allows efficient columnar data access for large datasets. FaaSr provides a function, faasr_arrow_s3_bucket(), that returns an Arrow object representing the S3 bucket; your code can then pass paths within it to Arrow's readers and writers. For example, the compute_sum function described in the companion vignette for simple workflow can be rewritten to use Arrow as follows:

library(arrow)

compute_sum_arrow <- function(folder, input1, input2, output) {

  # Read two input files from the bucket, compute the sum of their contents,
  # and write the result back to the bucket.
  #
  # Arguments:
  #   folder: name of the S3 folder where the inputs and outputs reside
  #   input1, input2: names of the input files
  #   output: name of the output file
  #
  # The function uses the default S3 bucket, which is configured in the FaaSr
  # JSON payload as My_S3_Bucket. In this demo code, all inputs/outputs are in
  # the same S3 folder, which is also configured by the user.

  # Set up s3 bucket using arrow
  s3 <- faasr_arrow_s3_bucket()

  # Get file from s3 bucket using arrow
  frame_input1 <- arrow::read_csv_arrow(s3$path(file.path(folder, input1)))
  frame_input2 <- arrow::read_csv_arrow(s3$path(file.path(folder, input2)))
  
  # This demo function computes output <- input1 + input2 and stores the output back into S3.
  # The inputs were read straight from S3 above, so no local files are involved.
  frame_output <- frame_input1 + frame_input2

  # Upload the output file to S3 bucket using arrow
  arrow::write_csv_arrow(frame_output, s3$path(file.path(folder, output)))

  # Print a log message
  log_msg <- paste0('Function compute_sum_arrow finished; output written to ', folder, '/', output, ' in default S3 bucket')
  faasr_log(log_msg)
}