Getting Uploaded Data Out of AWS S3


OK, so you know how to get data into AWS S3, but what about getting it out? Previously, we uploaded entries from an imagined photo contest into a bucket, sending a pair of files for each entry: a JSON file with the form data and the image itself. Let's presume there's a Rails app (its details don't matter) with a model ContestEntry that we want to populate from the S3 data. We're going to write a script to do the import. When a script needs to load Rails, you do something like:

#!/usr/bin/env ruby
require File.expand_path('../../config/environment',  __FILE__)

The exact path to config/environment will depend on where the script lives; here I'm presuming a directory one level under Rails.root.

Loading Rails gives us the model. Now we need the S3 files. As before, we use the aws-sdk gem, which should be in your Gemfile.

I covered the basics of authenticating to S3 in an earlier post. The code below assumes credentials come from the environment (or an AWS credentials file in development).
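If you'd rather be explicit, the SDK also accepts credentials directly. Here's a minimal sketch, assuming the standard AWS environment variables are set (the names below are the SDK's conventions, nothing specific to our app):

# Explicit configuration; usually unnecessary if the environment
# variables or ~/.aws/credentials are already in place.
Aws.config.update(
  region: 'us-west-1',
  credentials: Aws::Credentials.new(ENV['AWS_ACCESS_KEY_ID'],
                                    ENV['AWS_SECRET_ACCESS_KEY'])
)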

Getting our bucket is easy:

s3 = Aws::S3::Resource.new(region: 'us-west-1')
bucket = s3.bucket('bucket42')

As is getting the files (objects) in the bucket.

bucket.objects.each do |obj|
  # Do something.
end

But from there it gets a little convoluted. The object is actually an Aws::S3::ObjectSummary, which has metadata about the object and can perform operations like moving, copying, or deleting it, but isn't the S3 object itself. To fetch the actual object, you have to call #get on the ObjectSummary:

object = obj.get

Once you have the actual object (really the Ruby object that wraps the HTTP calls that access the S3 object), you can get its data from #body, which is actually a StringIO object. Confused? Code brings clarity.

We'll find all of the JSON objects in the bucket:

json_files = bucket.objects.select {|o| o.key =~ /\.json$/}

Grab the first one, using get to fetch the actual object:

file = json_files.first
s3_object = file.get

Then get its contents from body with read (since it's an IO-class object):

json = s3_object.body.read

Finally, we parse that JSON and get a hash:

form_data = JSON.parse(json)
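The hash's shape depends entirely on what the form sent. For our imagined contest it might look something like this (the field names are hypothetical):

form_data # => {"name" => "Ansel Adams", "email" => "ansel@example.com", "caption" => "Half Dome"}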

I find that interface a little funky, but Bam! Now we have our form data, which we can save in our model:

entry = ContestEntry.new(form_data)

(You’re going to validate that data and not accept it blindly, right?)
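Here's a minimal sketch of playing it safe, assuming ContestEntry has the usual ActiveRecord validations and that PERMITTED_KEYS (hypothetical) lists the form fields we actually expect:

# Only pass along keys we expect; let the model's validations do the rest.
PERMITTED_KEYS = %w(name email caption)
entry = ContestEntry.new(form_data.slice(*PERMITTED_KEYS))
unless entry.save
  Rails.logger.warn("Skipping invalid entry: #{entry.errors.full_messages.join(', ')}")
end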

We kept the ObjectSummary around so we could get the JSON file's name, which is our original UUID:

uuid = File.basename(file.key, '.json')

And with that we can find the photo we uploaded:

photo_file = bucket.objects.detect {|o| o.key =~ /#{uuid}.*(?<!\.json)$/}

Note the switch to detect (There can be only one.) and the lovely negative lookbehind regexp! Again, we need to get the actual S3 object:

photo_object = photo_file.get
photo_object.content_type # => "image/jpeg"

Which we could save locally with something like:

File.open(photo_file.key, 'wb') {|f| f.write(photo_object.body.read) } # 'wb': binary mode, so image bytes aren't mangled

Or process it with CarrierWave or Paperclip or even leave it in S3 and serve it directly from there.
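For example, with CarrierWave you could hand the downloaded file to a mounted uploader. A sketch, assuming ContestEntry has mount_uploader :photo, PhotoUploader (both hypothetical):

# Write the image locally, then let CarrierWave cache and store it.
File.open(photo_file.key, 'wb') {|f| f.write(photo_object.body.read) }
File.open(photo_file.key) {|f| entry.photo = f }
entry.save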

We have the data in our app and can do whatever it was we wanted with it. All that remains is to mark the entries as having been processed; we don't want to import them multiple times. The simplest way to do that is to delete them, which can be done by calling #delete on the Aws::S3::ObjectSummary objects:

file.delete
photo_file.delete

If you'd rather back up the data instead of throwing it away, you can rename the files with #move_to. The simplest way to use move_to is to pass in a string in the form "target-bucket-name/target-key":

file.move_to("completed-bucket/#{file.key}")
photo_file.move_to("completed-bucket/#{photo_file.key}")

If you don’t want to use a separate bucket, you could put the files in a “subfolder” instead:

file.move_to("#{file.bucket.name}/completed/#{file.key}")

But keep in mind that "folders" in S3 are an illusion; they're really just part of the object's key. As a result, bucket.objects will return all the files no matter how deeply "nested" they are. You can filter using the prefix option:

bucket.objects(prefix: 'completed') # => completed/*

With this approach you’d modify the form to upload with a prefix, say pending. In our original JavaScript, it would be as simple as changing:

var bucket = 'https://s3.amazonaws.com/bucket.example.com/';

to:

var bucket = 'https://s3.amazonaws.com/bucket.example.com/pending/';

and then filter the initial select:

json_files = bucket.objects(prefix: 'pending').select {|o| o.key =~ /\.json$/}
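Putting the pieces together, the whole import loop might read something like this sketch (using the pending/completed prefixes from above; error handling is left as an exercise):

json_files = bucket.objects(prefix: 'pending').select {|o| o.key =~ /\.json$/}

json_files.each do |file|
  form_data = JSON.parse(file.get.body.read)
  uuid = File.basename(file.key, '.json')
  photo_file = bucket.objects(prefix: 'pending').detect {|o| o.key =~ /#{uuid}.*(?<!\.json)$/}

  entry = ContestEntry.new(form_data)
  next unless entry.save # leave invalid entries in pending for inspection

  # Move both files under completed/ so they aren't imported again.
  [file, photo_file].each do |f|
    f.move_to("#{f.bucket.name}/#{f.key.sub(/\Apending/, 'completed')}")
  end
end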

And with that, I end the S3 upload series. You now have the tools to use S3 as a job queue. Use them wisely.
