When writing tests I always find myself with the same problem: where do I store their input data? It needs to be accessible to everyone on the team while remaining fast and secure. If the files are few and far between and only a couple of kilobytes in size, I normally keep them in the Git repo. When files get large, say 5 GB, you can opt for Git Large File Storage (LFS). It lets you store files right in the repo, though you'll need a paid GitHub account (or your own server) for anything above 2 GB. This is why I often opt for AWS S3.
S3 is extremely cheap, can store gigantic files no questions asked, and has great support across many different languages. To sync test data, all I needed was some simple logic that checks my disk for a file and, if it doesn't exist, pulls it from S3. There are numerous tools that can do this from the terminal, but I wanted to integrate it directly into a test suite I'd written in Rust. I could see myself using this code across a lot of different projects, so I decided to bundle it up into its own crate, eventually ending up with s3-filesystem. It does four simple things:
Automatically gets credentials from the AWS CLI
Traverses directories of S3 buckets
Downloads S3 objects to your local machine and opens them as a File
Saves data to both your local machine and S3, returning a File
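For context, here's roughly the boilerplate the crate replaces. This is a minimal sketch of the check-then-download pattern using the aws-sdk-s3 and tokio crates, not s3-filesystem's actual internals; the function name is made up for illustration.

use std::path::Path;
use aws_sdk_s3::Client;

// Sketch: look for the file on disk first, and only hit S3 on a miss.
async fn fetch_if_missing(
    client: &Client,
    bucket: &str,
    key: &str,
    local: &Path,
) -> Result<(), Box<dyn std::error::Error>> {
    if local.exists() {
        return Ok(()); // Already cached locally, nothing to do.
    }
    let resp = client.get_object().bucket(bucket).key(key).send().await?;
    let bytes = resp.body.collect().await?.into_bytes();
    if let Some(parent) = local.parent() {
        tokio::fs::create_dir_all(parent).await?;
    }
    tokio::fs::write(local, &bytes).await?;
    Ok(())
}

Multiply that by every project that needs test data and it adds up fast.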
With s3-filesystem, that all collapses into a couple of calls. Let's say you wish to open a file from S3. All you need to do is create an OpenOptions struct and call open_s3. This example uses an open-data S3 bucket in eu-west-2, so make sure your CLI is configured for eu-west-2 or use a different bucket in your region; there are lots of open datasets available.
use s3_filesystem::OpenOptions;
use tokio::io::AsyncReadExt;

#[tokio::main]
async fn main() {
    let bucket = "pansurg-curation-workflo-kendraqueryresults50d0eb-open-data".to_string();

    // If a custom S3 client is needed, replace None with Some(client).
    // Default behavior is to use AWS CLI env.
    let open_options = OpenOptions::new(bucket, None)
        .await
        .mount_path("data/test/")
        .force_download(true);

    let mut file = open_options
        .open_s3("redasa1-Q1-20/manifest.txt")
        .await
        .unwrap();

    let mut file_data = String::new();

    // Read the example file and print it.
    file.read_to_string(&mut file_data).await.unwrap();
    println!("{}", file_data);
}
{"source-ref": "s3://pansurg-curation-workflo-kendraqueryresults50d0eb-vhc009b4k983/redasa1-Q1-20/docs/0000_pansurg_6736e2d5-4a84-4e3b-a13d-7596ef32b346_1591730933_4cde05db049748fe258963413f6274d569a02226.txt"}
{"source-ref": "s3://pansurg-curation-workflo-kendraqueryresults50d0eb-vhc009b4k983/redasa1-Q1-20/docs/0001_pansurg_6736e2d5-4a84-4e3b-a13d-7596ef32b346_1591730933_61a1fe57af708b2892c49c8dd5ae2a20c158a5bc.txt"}
...
The code above specifies three things: a mount path (the local directory where files from S3 should be stored), whether or not to force a redownload each time, and the path of the file to download, relative to the bucket. That's all there is to it.
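One side note: since this example sets force_download(true), the manifest is re-fetched on every run. For test data that rarely changes you'd presumably flip it off so later runs reuse the local copy; a sketch:

// Same setup as above, but reuse the cached local copy on later runs
// instead of re-downloading it every time.
let open_options = OpenOptions::new(bucket, None)
    .await
    .mount_path("data/test/")
    .force_download(false);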
But what if I want to download more than one file? Well, walkdir comes in handy here.
use s3_filesystem::OpenOptions;

#[tokio::main]
async fn main() {
    let bucket = "pansurg-curation-workflo-kendraqueryresults50d0eb-open-data".to_string();

    // If a custom S3 client is needed, replace None with Some(client).
    // Default behavior is to use AWS CLI env.
    let open_options = OpenOptions::new(bucket, None)
        .await
        .mount_path("data/test/")
        .force_download(true);

    // Walk everything under the redasa1-Q1-20 prefix and download each file.
    let dir_entries = open_options.walkdir("redasa1-Q1-20").await.unwrap();

    for entry in dir_entries {
        if entry.folder {
            continue;
        }
        let _s3_file = open_options.open_s3(&entry.path).await.unwrap();
        println!("Entry: {:?} downloaded", entry.path);
    }
}
walkdir will find all of the files in a bucket, just like it would on your local filesystem. The walk can be narrowed down by specifying a sub-folder or prefix like I've done above. Each entry found in S3 is returned with a path, a size in bytes, and a flag for whether or not it's a folder. One caveat: S3 folders don't really exist. S3 is a key-value store and completely flat, so unless a dummy object has been created with a name ending in "/", the folder won't be returned by this function.
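Those entry fields also make it easy to be selective about what you pull down. A quick sketch, assuming the fields are named folder, path, and size as described above:

// Only fetch non-folder entries under 1 MB, skipping anything bigger.
let entries = open_options.walkdir("redasa1-Q1-20").await.unwrap();
for entry in entries.iter().filter(|e| !e.folder && e.size < 1_000_000) {
    let _file = open_options.open_s3(&entry.path).await.unwrap();
    println!("Fetched {:?} ({} bytes)", entry.path, entry.size);
}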
Writing a file will first write it to your local filesystem and then upload it to S3. There's no open source bucket I can write to as an example, so make sure you use your own below!
use s3_filesystem::OpenOptions;

#[tokio::main]
async fn main() {
    let bucket = "some_bucket".to_string();

    let open_options = OpenOptions::new(bucket, None)
        .await
        .mount_path("data/test/");

    // Write the bytes under the mount path locally, then upload them to S3.
    let data = "some_test_bytes".as_bytes();
    open_options.write_s3("manifest.txt", data).await.unwrap();
    println!("Data uploaded successfully");
}
If the bucket doesn't exist or anything else fails, the file gets removed from disk. There's currently no caching for writing: it's assumed you want to overwrite the current object on S3.
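If you'd rather recover than panic when an upload fails, you can match on the result instead of unwrapping; a sketch:

// Handle a failed upload instead of unwrapping; by this point the
// crate has already removed the partial file from disk.
if let Err(e) = open_options.write_s3("manifest.txt", data).await {
    eprintln!("Upload failed, local copy removed: {:?}", e);
}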
And that's everything the crate does! Like I said, I made this to solve a common problem of mine: syncing input data for tests. All it does is abstract away some boilerplate I'd otherwise have to write for every project. I hope someone else finds it useful too.