Keeping your datasets in the cloud. A Pythonic guide to AWS S3 integration.

Posted on Mon 29 April 2019 in Data Science, Data Engineering


What are AWS, S3 and S3Fs, and why should you use them?

The cloud gives you multiple benefits:

  • it's mobile (you can access it from anywhere)
  • it's scalable (you can easily increase its size)
  • it's secure (cloud providers offer advanced security systems, and you can add your own on top of these)
  • it's available (the risk of data loss is close to zero).

Finally, just as Git keeps your code organized, a cloud service gives you the same possibility for datasets, making your whole data science project easily reproducible and shareable.

In this post we are going to connect your machine with AWS using a Pythonic API. This setup will allow you to store your datasets in the cloud and read them directly into a pandas DataFrame. In particular, we are going to see how to:

  • create an AWS account and an S3 bucket
  • connect your AWS account with your machine through the AWS CLI
  • write code for reading and writing files to S3
  • create wrapper functions for both

Integrating your machine with S3

The first thing you need to do in order to host your datasets on S3 is to create your AWS account. You can sign up here. A few things to highlight:

  • during signup you will need to provide your credit card info
  • AWS provides you with a one-year free tier. Trust me, it's more than enough to learn it and see its potential. After one year, their pricing is still reasonable for keeping your files there.

The second thing you need to do is to create an IAM user.

  1. Click here and then on Users on the left side.
  2. Click on Add user, provide a username and enable Programmatic access.
  3. Attach the AmazonS3FullAccess policy by clicking on Attach existing policies directly and searching for it.
  4. Optionally, you can add some metadata describing your user (I skip this step).
  5. Create the user and download the .csv file with the access keys.

AWS IAM user creation

Finally, open your command prompt, install the following packages and configure the AWS CLI:

  • pip3 install awscli --upgrade --user
  • pip install s3fs
  • aws configure

You will be asked to type your AWS Access Key ID and AWS Secret Access Key. You can find these in the downloaded .csv file. Optionally, you can also provide a default region name and output format. Configuration example:

$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnGPLI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: json

That's all. You have just configured your machine to work with S3!


Creating your first S3 bucket

Now you're ready to create your first bucket on S3. A bucket is the name for a specific directory on S3. Treat it as a directory on your laptop. In order to create a bucket, you need to enter the AWS S3 console. The procedure is the following:

  1. Enter console link
  2. Click on Create bucket
  3. Set a unique bucket name, click Next and later choose the default configurations
  4. In the Review tab click Create bucket

That's it. The important thing is to remember the name that you gave to your bucket.
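
If you like, you can quickly confirm from Python that both your credentials and the new bucket are in place. A minimal sketch using the s3fs package we installed earlier (my-bucket stands for whatever name you chose):

::python
import s3fs

# Use the credentials configured with `aws configure`
fs = s3fs.S3FileSystem(anon=False)
# Listing the root shows the buckets your keys can access
print(fs.ls(''))  # e.g. ['my-bucket']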


Taking advantage of S3Fs - S3 file interface

According to the simple explanation provided in the S3Fs documentation:

S3Fs is a Pythonic file interface to S3.

Thanks to S3Fs we can easily save and read files from S3 buckets. This comes in handy when we work on any data science project, because we only need an internet connection to access our dataset from anywhere.

Let's jump into examples. Firstly, let's see how we can write a DataFrame to S3:

::python
import s3fs
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({"a": [1,2,3], "b": [4,5,6]})
# Save it to S3
# Encode DataFrame to bytes
bytes_to_write = df.to_csv(None, index=False).encode()
# Use environmental permissions to authorize
fs = s3fs.S3FileSystem(anon=False)
write_str = 's3://<your_bucket_name>/<your_file_name>'
# Open a connection with S3
with fs.open(write_str, 'wb') as f:
    # Save your file 
    f.write(bytes_to_write)

An important thing to note is the write_str variable. It is the string that you build for writing to S3: it contains the name of the bucket that you created and the name of the file that you want to write. An example of such a string is s3://my-bucket/results.csv. Note that you can also create sub-directories in your bucket.
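
For example, writing under a nested path works out of the box, because S3 treats slashes as part of the object key, so there is no need to create the "folders" first. A short sketch (the bucket and path names below are made up):

::python
import s3fs
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
fs = s3fs.S3FileSystem(anon=False)
# Sub-directories are just key prefixes on S3, so this path is created implicitly
write_str = 's3://my-bucket/experiments/2019-04/results.csv'
with fs.open(write_str, 'wb') as f:
    f.write(df.to_csv(None, index=False).encode())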

For reading from S3 we use the following procedure:

::python
# Use environmental permissions to authorize
fs = s3fs.S3FileSystem(anon=False)
# Read a file from S3
df = pd.read_csv('s3://<your_bucket_name>/<your_file_name>')

As you can see, reading is done with pd.read_csv, where we provide the S3 URL as an input parameter.
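
As a quick round-trip check (assuming you wrote the example DataFrame to s3://my-bucket/results.csv as above), you can compare what comes back with the original:

::python
import pandas as pd

# The original frame from the writing example
original = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
# Read it back straight from S3 (the path is just an example)
restored = pd.read_csv('s3://my-bucket/results.csv')
# Verify that the round trip preserved the data
print(original.equals(restored))  # True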


S3Fs reading and writing wrappers

Below I prepared two functions that you can use in your scripts for reading files from and writing files to AWS S3.

For writing a file to S3:

::python
def save_dataframe_to_s3_as_csv(df: pd.DataFrame, filename: str, bucket_path: str) -> None:
    """
    Saves a DataFrame to s3 as a csv file.
    :param df: dataframe that will be saved
    :param filename: name of the file that will be saved
    :param bucket_path: absolute s3 path
    :return: None
    """
    # Encode DataFrame to bytes
    bytes_to_write = df.to_csv(None, index=False).encode()
    # Use environmental permissions to authorize
    fs = s3fs.S3FileSystem(anon=False)
    # Create write string
    write_str = bucket_path + filename
    try:
        # Open a connection with S3 and write the encoded csv
        with fs.open(write_str, 'wb') as f:
            f.write(bytes_to_write)
        print(f"{filename} was written to {write_str}")
    except Exception:
        print(f"There was a problem with writing {filename} to {write_str}.")
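
A quick usage sketch (my-bucket and metrics.csv are placeholder names, adjust them to your setup; note the trailing slash in the bucket path, since the wrapper simply concatenates the two strings):

::python
import pandas as pd

# A small example frame with a timestamp column
metrics = pd.DataFrame({
    "timestamp": pd.date_range("2019-04-01", periods=3, freq="D"),
    "score": [0.91, 0.93, 0.95],
})
save_dataframe_to_s3_as_csv(metrics, "metrics.csv", "s3://my-bucket/")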

For reading a file from S3:

::python
def read_csv_as_dataframe(bucket_path: str, filename: str, timestamp_column_name: str = None) -> pd.DataFrame:
    """
    Reads an s3 csv file as a DataFrame.
    :param filename: name of the file that will be read
    :param bucket_path: absolute s3 path
    :param timestamp_column_name: optional name of a timestamp column to parse as dates
    :return: csv file read as a pandas DataFrame
    """
    # Parse the timestamp column only if one was provided
    parse_dates = [timestamp_column_name] if timestamp_column_name else False
    # Read the stored file directly from S3 (pandas uses s3fs under the hood)
    df = pd.read_csv(bucket_path + filename, parse_dates=parse_dates)
    return df
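
And reading the same file back, parsing the timestamp column on the way in (again, my-bucket and metrics.csv are just the example names used above):

::python
# Read the file written above and parse its timestamp column as datetimes
metrics = read_csv_as_dataframe("s3://my-bucket/", "metrics.csv",
                                timestamp_column_name="timestamp")
print(metrics.dtypes)  # the timestamp column should come back as datetime64[ns]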

Summary

In this post, we configured your machine to work with AWS S3 using the S3Fs package. From now on, you can easily store your files on S3 and work with them just like with files stored on your machine. When your machine crashes (and it will one day), you won't have to worry that you have lost important data or results from your projects.

Which functionalities of AWS would you like to have in your data scientist toolkit? Do you have a project where a cloud implementation is a great choice? Are you using other cloud services? If so, why?

Let me know in the comments and let's discuss it a little!

Happy coding!