Everyone is always asking the age-old question: R or Python? Why not both? I refuse to choose! Each has its own strengths and weaknesses, and I often find myself weaving seamlessly between the two depending on the task at hand.
Being ambidextrous with R and Python has never been easier, as cross-pollination between the two languages has resulted in many of the same constructs now existing in both. For example, data frame implementations such as pandas and tibble are largely converging on many of the same patterns and tooling. One of the most important features for interoperability is being able to serialize and de-serialize data frames between the two, which both readily do natively via CSV or similar formats. Both can also transmit data via mirrored connectors for all sorts of databases. There are even projects like feather that are specifically targeted at inter-language serialization between R and Python.
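To make the feather point concrete, here is a minimal sketch of a cross-language round trip. It assumes pyarrow is installed on the Python side; the R half of the exchange (via the arrow package) is noted in the comments, and the file name is just a placeholder.

import pandas as pd

# Write a small data frame to a feather file from Python (requires pyarrow).
df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
df.to_feather("example.feather")

# The same file can be read back in Python...
round_trip = pd.read_feather("example.feather")

# ...or loaded directly in R with arrow::read_feather("example.feather"),
# with no CSV parsing or column type-guessing in between.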
But what about S3? Whenever I’m handling even moderately large data sets, it’s incredibly convenient to simply stash files in an S3 bucket. Doing so also helps advance the cause of reproducible research, since S3 objects have universally accessible, unique URLs and most folks are likely to have AWS credentials already set up from the AWS CLI or similar.
So how can one easily use S3 as a conduit for exchanging data sets between sessions and even between languages like Python and R? It would be fairly simple to just read/write a file that can be downloaded/uploaded from S3, but it turns out there are some very convenient libraries to help streamline the process.
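For context, the "manual" route looks something like the sketch below: download the object with boto3, parse it locally, and upload any results back. This assumes boto3 is installed and AWS credentials are configured; the local file names are just placeholders.

import boto3
import pandas as pd

s3_client = boto3.client("s3")

# Download the object to a local file, then parse it with pandas.
s3_client.download_file("acs-blog-data", "iris.csv", "iris.csv")
df = pd.read_csv("iris.csv")

# Write the (possibly modified) data frame back out and upload it again.
df.to_csv("iris_out.csv", index=False)
s3_client.upload_file("iris_out.csv", "acs-blog-data", "iris_out.csv")

The libraries below wrap exactly this download/parse and serialize/upload dance into a single call.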
For R, there’s the aws.s3 package, which is part of the larger cloudyr project (basically boto for R). There are all kinds of nice functions in this package, but one of my favorites is s3read_using, which allows you to utilize your own file parser such as read_csv from tidyverse/readr. The example below illustrates how to load the contents of an S3 object into a data.frame, all with one line of code.
library(tidyverse)
library("aws.s3")

df <- aws.s3::s3read_using(read_csv, object = "s3://acs-blog-data/iris.csv")
Likewise, we can write our data frames to S3 using s3write_using:
aws.s3::s3write_using(iris, write_csv, object = "s3://acs-blog-data/iris.csv")
We can do the same thing in Python by combining pandas with the handy s3fs package, which sits on top of boto3.
import s3fs
import pandas as pd

s3 = s3fs.S3FileSystem(anon=False)

with s3.open("s3://acs-blog-data/iris.csv") as f:
    df = pd.read_csv(f)
Similarly, writing to S3 should be as simple as the following example, although I initially encountered an error.
with s3.open("s3://acs-blog-data/iris.csv", 'wb') as f:
    df.to_csv(f)
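One likely culprit is the file mode: to_csv writes text, so opening the S3 object in binary mode ('wb') trips over a str-versus-bytes mismatch. Here is a minimal sketch of two possible fixes, assuming a reasonably recent pandas and s3fs:

# Option 1: open the object in text mode, since to_csv produces strings.
with s3.open("s3://acs-blog-data/iris.csv", "w") as f:
    df.to_csv(f, index=False)

# Option 2: hand the s3:// URL straight to pandas, which delegates to s3fs.
df.to_csv("s3://acs-blog-data/iris.csv", index=False)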
Being able to conveniently exchange data sets via S3 in either R or Python can really aid reproducibility. Jupyter Notebooks and RMarkdown files can be shared via conventional version control, while potentially large data sets can be shared over S3.
For attribution, please cite this work as
Stewart (2018, May 20). Andrew Stewart: Simple data frame read/write with S3. Retrieved from https://andrewcstewart.github.io/posts/2018-05-20-simple-data-frame-readwrite-with-s3/
BibTeX citation
@misc{stewart2018simple,
  author = {Stewart, Andrew},
  title = {Andrew Stewart: Simple data frame read/write with S3},
  url = {https://andrewcstewart.github.io/posts/2018-05-20-simple-data-frame-readwrite-with-s3/},
  year = {2018}
}