Command Line AWS S3 Upload/Download Tool using Pharo Smalltalk

Sven Van Caekenberghe

This is a little story about a tool that I needed and that I finally implemented in Pharo Smalltalk. I think it is quite nice and elegant and might be useful to others too. Also, I think it is important to share little hacks like this to show that what you can do is almost unlimited (as many others have shown in all kinds of projects).

This is a bit long, so if you're not interested, please ignore.

Amazon S3 is a web service for storing and retrieving any amount of data from anywhere; it is highly scalable, reliable, secure, fast, and inexpensive. [ http://aws.amazon.com/s3/ ] [ http://en.wikipedia.org/wiki/Amazon_S3 ]

I needed a client to interface with S3 from Linux to store and retrieve archives and backups up to hundreds of MB in size. Although I had already written a client for this web service in Pharo some time ago, I thought it would not be good enough for these amounts of data. Furthermore, it had no command line interface. So I googled and found s3cmd, which installed easily and quickly on Ubuntu. Uploading/downloading some small test files went OK. However, uploading large files gave all kinds of strange errors, even though I was doing this from EC2 directly on the Amazon network. I don't know Python, so I could not debug the tool.

But I do know Smalltalk and had the basic code already. [ Zinc-AWS in http://www.squeaksource.com/ZincHTTPComponents ]

What was missing were streaming uploads and downloads, so that these huge files would not be loaded into memory all at once, and a command line interface. Both turned out to be quite easy and elegant to implement.

I added the following methods to ZnAWSS3Client:

downloadFile: filename fromBucket: bucket
        "Do a streaming download of the key filename from bucket,
        creating it as a file with that name in the current directory."
       
        | streaming response |
        streaming := self httpClient streaming.
        self httpClient streaming: true.
        response := self at: bucket -> filename.
        self httpClient streaming: streaming.
        response isSuccess ifFalse: [ ^ ZnHttpUnsuccessful signal ].
        FileStream
                fileNamed: filename
                do: [ :stream | response entity writeOn: stream ].
        ^ response

The #streaming option of the underlying ZnClient is toggled on for this request and reset to its previous value afterwards. The #at: does the actual request, returning a ZnStreamingEntity that is then written to a file. When a streaming entity writes itself to another stream, it actually does a streaming copy (one buffer at a time, in a loop), which makes the download efficient.
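
To make the buffering concrete, here is a minimal sketch of such a copy loop. This is not the actual ZnStreamingEntity code: the selector name, the 16 KB buffer size and the assumption that the total size is known up front are mine.

copyStream: inputStream to: outputStream size: totalSize
        "Copy totalSize bytes from inputStream to outputStream one fixed-size
        buffer at a time, so the whole payload never sits in memory at once."

        | buffer remaining read |
        buffer := ByteArray new: 16384.
        remaining := totalSize.
        [ remaining > 0 ] whileTrue: [
                read := inputStream readInto: buffer startingAt: 1 count: (buffer size min: remaining).
                outputStream next: read putAll: buffer startingAt: 1.
                remaining := remaining - read ]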

uploadFile: filename withMd5: md5 inBucket: bucket
        "Do a streaming upload of the file filename to bucket.
        When md5 is notNil, use it to validate the ETag of the response."
       
        | entry size mimeType fileStream entity |
        entry := FileDirectory root directoryEntryFor: filename.
        size := entry fileSize.
        mimeType := ZnMimeType forFilenameExtension: (FileDirectory extensionFor: filename).
        fileStream := FileDirectory root readOnlyFileNamed: filename.
        mimeType isBinary ifTrue: [ fileStream binary ].
        (entity := ZnStreamingEntity type: mimeType length: size)
                stream: fileStream.
        self at: bucket -> entry name put: entity.
        (md5 notNil and: [ (md5 sameAs: self eTag) not ])
                ifTrue: [ self error: 'Uploaded ETag does not equal supplied MD5' ].
        ^ self httpClient response

Here the streaming entity is created based on an existing file, with the mime type derived from the file extension. Then #at:put: is used to do the upload. Again, when the streaming entity writes itself to the socket stream, it actually does a streaming copy (one buffer at a time, in a loop), which makes the upload efficient.
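
For reference, a direct invocation from a workspace might look like the sketch below; the credentials, file path and bucket name are placeholders, and passing nil as the md5 simply skips the ETag check, as the method comment describes.

| client |
client := ZnAWSS3Client new.
client
        accessKeyId: 'AKIAXXXXXXXXXXXXXXXX';
        secretAccessKey: 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX';
        checkIntegrity: false.
client uploadFile: '/var/backups/archive.tar.gz' withMd5: nil inBucket: 'my-backups'.
client close.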

Amazon S3 delivers MD5 hashes (ETags) that can be used to check the integrity of uploads and downloads. When its #checkIntegrity option is true, ZnAWSS3Client verifies these using MD5>>#hashMessage:, which does not use the streaming capability of HashFunction. Since making that possible would be a lot of work, I opted for externally supplied MD5 hashes computed with Linux's md5sum utility.

Next, I took a clean Pharo 1.3 image and the standard Pharo-built CogVM and loaded the following to create a server image:

Gofer new
 squeaksource: 'XMLSupport';
 package: 'ConfigurationOfXMLSupport';
 load.
(Smalltalk at: #ConfigurationOfXMLSupport) perform: #loadDefault.

Gofer new
 squeaksource: 'ZincHTTPComponents';
 package: 'Zinc-HTTP';
 package: 'Zinc-AWS';
 load.

Gofer new
 squeaksource: 'ADayAtTheBeach';
 package: 'NonInteractiveTranscript';
 load.

The glue code for each tool consists of a Linux shell script and a Smalltalk startup script. Since this is a tool for personal use, error handling for wrong invocations could be a bit better.

#!/bin/bash
echo S3 Download $*
script_home=$(dirname $0)
script_home=$(cd $script_home && pwd)
image=$script_home/pharo-t3.image
script=$script_home/s3-download.st
bucket=$1
file=$2
vm=$script_home/bin/CogVM
options="-vm-display-null -vm-sound-null"
echo $vm $options $image $script $bucket $file
$vm $options $image $script $bucket $file

NonInteractiveTranscript stdout install.
[

| args bucket filename |
args := Smalltalk commandLine arguments.
bucket := args first.
filename := args second.

(ZnAWSS3Client new)
 accessKeyId: 'AAAAXXYYZZ5HGQZ77EQ';
 secretAccessKey: 'AAAA+D9n7gabcdefghSfevY/mH6KWsvMAcsjzzzz';
 checkIntegrity: false;
 downloadFile: filename fromBucket: bucket;
 close.

Transcript show: 'OK'; cr; endEntry.

] on: Error do: [ :exception |

Smalltalk logError: exception description inContext: exception signalerContext.
Transcript show: 'FAILED'; cr; endEntry.

].
SmalltalkImage current quitPrimitive.

#!/bin/bash
echo S3 Upload $*
script_home=$(dirname $0)
script_home=$(cd $script_home && pwd)
image=$script_home/pharo-t3.image
script=$script_home/s3-upload.st
bucket=$1
file=$2
file_dir=$(dirname $file)
file_name=$(basename $file)
file=$(cd $file_dir && pwd)/$file_name
vm=$script_home/bin/CogVM
options="-vm-display-null -vm-sound-null"
compute_md5="md5sum $file"
md5=`$compute_md5`
echo $vm $options $image $script $bucket $file $md5
$vm $options $image $script $bucket $file $md5

NonInteractiveTranscript stdout install.
[

| args bucket filename md5 |
args := Smalltalk commandLine arguments.
bucket := args first.
filename := args second.
md5 := args third.

(ZnAWSS3Client new)
 accessKeyId: 'AAAAXXYYZZ5HGQZ77EQ';
 secretAccessKey: 'AAAA+D9n7gabcdefghSfevY/mH6KWsvMAcsjzzzz';
 checkIntegrity: false;
 uploadFile: filename withMd5: md5 inBucket: bucket;
 close.

Transcript show: 'OK'; cr; endEntry.

] on: Error do: [ :exception |

Smalltalk logError: exception description inContext: exception signalerContext.
Transcript show: 'FAILED'; cr; endEntry.

].
SmalltalkImage current quitPrimitive.

Both tools are very similar: the shell script uses the VM to start the right image after massaging some of the arguments. The Smalltalk script retrieves the arguments, then instantiates and invokes the tool to do its job. NonInteractiveTranscript is installed on stdout to write OK or FAILED. In case of failure, a PharoDebug.log is written. #quitPrimitive is used for a fast exit (without marking the changes file). Since then I learned from Igor that #quitPrimitive accepts an exit value, so it would be possible to do away with writing OK/FAILED to stdout and just rely on the exit value, for even better integration with *nix.
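
A sketch of what that exit-code variant could look like; the assumption here is that the argument-taking selector is #quitPrimitive: (per Igor's remark), with 0 signalling success and 1 failure to the calling shell script, which can then simply test $? after the VM invocation.

[
        "... download or upload exactly as above ..."
        SmalltalkImage current quitPrimitive: 0 ]
        on: Error
        do: [ :exception |
                Smalltalk logError: exception description inContext: exception signalerContext.
                SmalltalkImage current quitPrimitive: 1 ]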

Uploading/downloading files of hundreds of MB went flawlessly and was more than fast enough (network I/O bound anyway). Repeatedly starting/stopping a VM with a 20 MB image is totally acceptable and fast as well.

A big thanks to everybody who is working on all the little details all the time to make things like this possible!

Sven


Re: Command Line AWS S3 Upload/Download Tool using Pharo Smalltalk

Mariano Martinez Peck
Looks awesome. Thanks for sharing and congrats.

--
Mariano
http://marianopeck.wordpress.com