The situation
A genomics research organisation has:
- 3 PB of historical sequencing data on a 6-year-old SAN. Files are 10 MB to 400 GB each, ~28 million files total. Access pattern: 5% actively queried each month, 95% archival. Retained for regulatory reasons (20 years minimum).
- 50 TB/week of new sequencing data generated by a sequencing lab in the datacentre. Written to the SAN continuously, currently rolled off to tape nightly for offsite storage.
- A 1 Gbps internet circuit shared with everything else in the building, generally 60-70% already utilised during business hours.
- A datacentre lease expiring in nine months. Everything has to be somewhere else by then; the business has decided that “somewhere else” is AWS, and the budget is signed off.
- An analyst team that runs ad-hoc read queries against the data using proprietary tools that expect POSIX file semantics. They need read access to continue throughout the migration and after.
Three distinct problems: move the existing 3 PB (one-time, big, bandwidth-constrained), move the ongoing 50 TB/week (continuous, smaller, needs automation), and keep the analysts working (ongoing read access with file-system semantics). Each problem has a different answer.
What actually matters
Before picking services, it’s worth understanding what “moving data to AWS” actually means in each service’s worldview.
The first thing is bandwidth maths, honestly. 3 PB over 1 Gbps, assuming an optimistic 70% sustained throughput (which is fantasy for a shared link), is 3,000,000 GB / (0.125 GB/s × 3600 × 24 × 0.7) ≈ 397 days. That is over a year to push the historical data over the existing circuit, and that’s assuming everyone else stops using the internet. The lease expires in 9 months. Network transfer is not the answer for the backfill.
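The arithmetic above can be sketched as a small helper, so the same estimate can be rerun for other link speeds or utilisation figures. The function name is illustrative, and the units are decimal (1 TB = 10^12 bytes).

```python
def transfer_days(data_tb: float, link_gbps: float, utilisation: float) -> float:
    """Days needed to push data_tb terabytes over a link at the given utilisation."""
    bytes_total = data_tb * 1e12                         # decimal terabytes to bytes
    bytes_per_second = link_gbps * 1e9 / 8 * utilisation  # bits to bytes, scaled by usable share
    return bytes_total / bytes_per_second / 86_400        # seconds to days


backfill_days = transfer_days(3_000, 1.0, 0.7)
lease_days = 9 * 30  # rough length of the nine-month lease window

print(f"3 PB at 70% of 1 Gbps: {backfill_days:.0f} days")
print(f"lease window:          {lease_days} days")
```

Any plan whose transfer time exceeds the lease window can be ruled out before looking at individual services.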
The second thing is the shape of physical-shipping transfer. The Snow Family spans a range of capacities from small portable devices to a literal shipping container on a truck. AWS ships the device, you load data onto it at LAN speeds, you ship it back, AWS copies it into S3. The interesting design question isn’t which device exists; it’s how parallelisable the on-site loading operation is, because the throughput of the migration is determined by how many devices can be loading simultaneously, not the capacity of any one of them.
The third thing is the shape of software-driven network transfer. There’s a managed-agent option that does authenticated, parallelised, incremental transfers from on-prem filesystems to AWS storage, with metadata preservation, verification, bandwidth throttling, and scheduling built in. It’s network-bound, so the bandwidth maths apply to it exactly as they do to a plain copy command: it doesn’t move bytes faster than the pipe. Its value is in everything around the bytes: scheduling, retries, integrity checks, metadata fidelity. Right tool for the ongoing-flow problem; wrong tool for the one-shot bulk problem.
The fourth thing is filesystem-style ongoing access. There’s a class of on-prem appliance that exposes a local filesystem (NFS or SMB) backed by S3, caching recently-accessed files locally and fetching cold files on demand. If the analysts have tools that expect POSIX semantics, this is the way to give them an S3-backed mount without rewriting the tools.
The fifth thing is during-migration access. From the moment the first physical-shipping device leaves the building, the analysts need to be reading from somewhere. While data is still landing in the datacentre and being shipped to AWS, reads can go either to the SAN (what exists) or to S3 (what’s been uploaded). A caching filesystem appliance pointing at the S3 bucket covers the second half: files already uploaded come from its local cache or from S3, while files not yet uploaded are still read from the SAN, which stays mounted next to it.
The sixth thing is what happens after the lease ends. The analysts move to running their tools on EC2 in AWS, with a high-performance file system hydrated from S3 for fast reads, or a POSIX-compliant shared mount. The datacentre appliance goes away; the analysts’ tooling points at either S3 directly or at an AWS-side file-system service.
What we’ll filter on
Filters for each service against each sub-problem:
- Bandwidth-independence: does this work over a constrained internet circuit?
- Handles one-shot bulk: 3 PB over a fixed deadline.
- Handles ongoing incremental: 50 TB/week, scheduled, verified.
- Provides file-system access: NFS/SMB for tools that need it.
- Works alongside ongoing operations: lab keeps writing while migration happens.
- Verification and audit: chain of custody for regulated data.
The transfer landscape
- AWS Snowball Edge Storage Optimized (210 TB). 210 TB of usable storage per device, ~50 lbs, ships in its own ruggedized case. Data loads onto the device over a local 10/40 Gbps network connection; AWS copies it to S3 on return with SHA-256 verification. Data is encrypted in transit and at rest using keys managed in AWS KMS; the keys are never stored on the device. Pricing is a device fee plus a per-day holding fee; data transfer itself is included. Turnaround time (order → arrive → fill → ship back → data in S3) is typically two to three weeks per device when it is filled to capacity, depending on loading speed and shipping.
- AWS Snowmobile. A 45-foot shipping container on a semi-truck, up to 100 PB capacity. AWS brings the truck to your facility; you connect it to your datacentre network over fiber; you load data for weeks; AWS drives it back. Not in every Region; lead time is months. The correct answer at 50+ PB, overkill at 3 PB.
- AWS DataSync. Agent software (a VM on-prem, an EC2 instance when the source is in AWS, or a pre-provisioned Snowcone for small edge sites) that runs transfers. Parallel, incremental, with verification and bandwidth controls. Moves data to S3, EFS, or FSx. Charges per GB transferred. The correct tool for the continuous 50 TB/week.
- AWS Storage Gateway: File Gateway. On-prem VM (or hardware appliance) exposing NFSv3/v4.1 or SMBv2/v3 shares. Writes go to S3 asynchronously; reads are served from a local cache and fall through to S3 on a miss. Cache size tunes the working-set locality. Transparent to tools that speak NFS/SMB.
- AWS Storage Gateway: Tape Gateway. Virtual tape library that looks like a physical one to backup software. Data lands in S3 or Glacier. Not relevant here: the lab isn’t using tape backup as its interface.
- AWS Direct Connect. A private link between the datacentre and AWS, 1-100 Gbps. Useful if bandwidth is the constraint and the migration window allows for ordering and provisioning (weeks to months). Not a transfer service itself, but something that makes transfers faster. Relevant for the ongoing flow, not the 3 PB backfill (DX provisioning would compete with the timeline).
- Plain aws s3 sync over the internet. A CLI tool with parallelism. The correct answer for small datasets; actively wrong for 3 PB over 1 Gbps.
Side by side
| Option | Bandwidth-independent | Bulk one-shot | Ongoing incremental | File-system access | During ops | Verification |
|---|---|---|---|---|---|---|
| Snowball Edge (15 devices) | ✓ | ✓ (3 PB in ~weeks) | — | ✗ | ✓ | SHA-256 |
| Snowmobile | ✓ | ✓✓ (100 PB) | ✗ | ✗ | — | Chain of custody |
| DataSync | ✗ (network-bound) | ✗ at 3 PB / 1 Gbps | ✓ (50 TB/week fits) | ✗ | ✓ | Per-transfer |
| File Gateway | ✗ | ✗ | ✗ (not primarily) | ✓ (NFS/SMB) | ✓ | Cache consistency |
| Tape Gateway | — | — | — | ✗ (VTL only) | — | — |
| Direct Connect | ✗ (still bandwidth-bound) | Only with 10+ Gbps DX | ✓ | — | ✓ | Transport-level |
| aws s3 sync | ✗ | ✗ | Partial | ✗ | ✓ | ETag |
Reading this against the three sub-problems:
- Backfill (3 PB, 9-month deadline): Snowball Edge fleet. Fifteen 210 TB devices in flight, batched so a few are loading on-site at any given time. Realistic end-to-end: 6-10 weeks.
- Ongoing flow (50 TB/week): DataSync. The agent runs on-prem, pushes to S3 on a schedule, verifies each task, handles retries. The 1 Gbps circuit is the problem: with, say, 500 Mbps usable for DataSync during a 10-hour business day and 800 Mbps overnight, the link moves about 7.3 TB/day, so 50 TB takes about 7 days of continuous transfer (over 9 days at a flat 500 Mbps). That leaves no headroom in a week for retries, verification, or a busy day on the shared link. Either scale out (a second DataSync agent on a separate circuit), wait for Direct Connect to be provisioned, or ship a weekly Snowball (or several Snowcones) for the lab’s output until DX is live. Likely a phased plan: Snowball for weekly lab output in months 1-2, DataSync over DX from month 3 onwards.
- File-system access: File Gateway. On-prem during the migration (analysts still on-prem, reads served from the local cache or fetched from S3 on a miss). After the lease ends, move the analyst tooling to EC2 + FSx for Lustre or EFS.
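The ongoing-flow estimate is sensitive to how the shared circuit is split between business hours and overnight, so it is worth running explicitly. A minimal sketch, assuming a 10-hour business day (the source gives the two rates but not the split) and decimal units:

```python
def daily_capacity_tb(day_mbps: float, night_mbps: float, business_hours: float = 10.0) -> float:
    """Terabytes per day over a link with different usable rates by day and night."""
    day_seconds = business_hours * 3600
    night_seconds = (24 - business_hours) * 3600
    bits_per_day = day_mbps * 1e6 * day_seconds + night_mbps * 1e6 * night_seconds
    return bits_per_day / 8 / 1e12


WEEKLY_OUTPUT_TB = 50

mixed = daily_capacity_tb(500, 800)   # 500 Mbps by day, 800 Mbps overnight
flat = daily_capacity_tb(500, 500)    # pessimistic: 500 Mbps around the clock

print(f"mixed rates: {mixed:.2f} TB/day, {WEEKLY_OUTPUT_TB / mixed:.1f} days per week of output")
print(f"flat 500:    {flat:.2f} TB/day, {WEEKLY_OUTPUT_TB / flat:.1f} days per week of output")
```

Anything at or above seven days per week of output means the link cannot keep up, even before retries and verification passes are counted.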
The migration topology
The plan in depth
Phase 1: Snowball fleet for the backfill. Order 15 Snowball Edge Storage Optimized 210 TB devices, with an initial batch of 5 (to match on-site loading capacity: two loaders, 24/7 rotation). Each device absorbs about 30 TB/day in practice (roughly 2.8 Gbps sustained, well under the 10 Gbps link rate, because the SAN read rate is the limit); a fully packed 210 TB device takes ~7 days of on-site loading. The rotation: devices arrive, get racked, load for a week, ship back, and AWS imports to S3 in 2-3 days. End-to-end per device: ~12 days from arrival on site to data-in-S3.
With 5 concurrent devices and a rolling order cadence (new batch arrives as an old batch departs), 3 PB lands in S3 in approximately 8-10 weeks. Buffer for the unknown: start day one of the migration window.
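A rough way to sanity-check the fleet schedule is to count devices, group them into waves of concurrent loaders, and multiply by the per-device cycle time. The sketch below is an idealised lower bound under stated assumptions: 203 TB actually written per device, 30 TB/day per device with the SAN able to feed five devices at once, five days of return shipping plus import, and waves running back to back with no delivery gaps. Real schedules add delivery lead time and SAN contention, which would account for a more conservative estimate.

```python
import math


def fleet_estimate(total_tb, usable_tb_per_device, load_tb_per_day, concurrent, overhead_days):
    """Return (devices, waves, total_days) for a back-to-back rotation of devices."""
    devices = math.ceil(total_tb / usable_tb_per_device)
    cycle_days = usable_tb_per_device / load_tb_per_day + overhead_days
    waves = math.ceil(devices / concurrent)
    return devices, waves, waves * cycle_days


devices, waves, total_days = fleet_estimate(
    total_tb=3_000,
    usable_tb_per_device=203,   # assumption: what fits in practice on a 210 TB device
    load_tb_per_day=30,
    concurrent=5,
    overhead_days=5,            # assumption: return shipping plus import into S3
)
print(f"{devices} devices in {waves} waves, about {total_days / 7:.1f} weeks best case")
```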
Data on the devices keeps the SAN’s directory structure, with S3 prefixes mirroring the SAN paths: s3://genomics-archive/samples/2024/Q1/... mirrors /san/samples/2024/Q1/.... This makes verification easy (walk the tree on both sides, compare counts and checksums) and preserves path-based access patterns for the analysts.
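Because the key layout mirrors the SAN tree, the count-and-size part of verification reduces to a path mapping plus a dictionary comparison. A minimal sketch, assuming both listings have already been collected into path-to-size dictionaries (for example from a filesystem walk and a bucket listing); the function names are illustrative:

```python
from pathlib import PurePosixPath

SAN_ROOT = PurePosixPath("/san")


def san_path_to_key(san_path: str) -> str:
    """Map /san/samples/2024/Q1/x.bam to the S3 key samples/2024/Q1/x.bam."""
    return str(PurePosixPath(san_path).relative_to(SAN_ROOT))


def compare_listings(san_files: dict, s3_objects: dict) -> dict:
    """Compare {san_path: size} against {s3_key: size} and report differences."""
    expected = {san_path_to_key(path): size for path, size in san_files.items()}
    return {
        "missing_in_s3": sorted(k for k in expected if k not in s3_objects),
        "size_mismatch": sorted(
            k for k in expected if k in s3_objects and expected[k] != s3_objects[k]
        ),
        "unexpected_in_s3": sorted(k for k in s3_objects if k not in expected),
    }


report = compare_listings(
    {"/san/samples/2024/Q1/a.bam": 10, "/san/samples/2024/Q1/b.bam": 20},
    {"samples/2024/Q1/a.bam": 10, "samples/2024/Q1/b.bam": 21},
)
print(report)
```

Sizes catch truncated or missing objects cheaply; checksums are still needed to catch corruption that leaves the size unchanged.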
Phase 2: File Gateway for ongoing read access. A File Gateway hardware appliance in the datacentre, exposing an NFSv4.1 share that mirrors the S3 bucket. Cache size: 48 TB SSD (a common hardware appliance spec). The working set of “files accessed in the last month” is ~5% of 3 PB = 150 TB; the cache handles a subset and misses to S3 for the rest. Analysts mount the gateway as /mnt/genomics and their existing tools just work. POSIX semantics, file locking, the usual.
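The cache-sizing trade-off can be put into numbers. The sketch below computes how much of the monthly working set fits in the cache, and how long one cold read takes when it has to cross the shared circuit. The 300 Mbps figure is an assumption, taken as the spare share of a 1 Gbps link that is 60-70% utilised; the 400 GB file is the largest size mentioned in the scenario.

```python
def cache_coverage(total_tb: float, active_fraction: float, cache_tb: float):
    """Return (working_set_tb, fraction of the working set that fits in cache)."""
    working_set_tb = total_tb * active_fraction
    return working_set_tb, min(1.0, cache_tb / working_set_tb)


def miss_fetch_hours(file_gb: float, usable_mbps: float) -> float:
    """Hours to pull one file from S3 through the link on a cache miss."""
    return file_gb * 1e9 * 8 / (usable_mbps * 1e6) / 3600


working_set_tb, coverage = cache_coverage(3_000, 0.05, 48)
largest_file_hours = miss_fetch_hours(400, 300)

print(f"working set {working_set_tb:.0f} TB, cache holds {coverage:.0%} of it")
print(f"cold read of a 400 GB file at 300 Mbps: {largest_file_hours:.1f} hours")
```

A cache that holds about a third of the working set means cold reads of large files will be slow until Direct Connect is in place, which is worth telling the analysts up front.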
During the Snowball upload window, a file may be on the SAN but not yet in S3, or in both places at once. The migration plan: files pending upload stay readable from the SAN (still mounted alongside the gateway); once a Snowball is imported, and the gateway’s cache has been refreshed so that the imported objects show up on the share, the corresponding SAN directory is marked read-only and analysts are pointed at the File Gateway path for those files. A directory-level cutover rather than a file-level one keeps the plan simple.
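The directory-level cutover amounts to a lookup: if a file sits under a directory that has been cut over, read it through the gateway mount, otherwise read it from the SAN. A minimal sketch, assuming the set of cut-over directories is maintained by whoever runs the migration; the helper name is hypothetical:

```python
from pathlib import PurePosixPath

SAN_ROOT = PurePosixPath("/san")
GATEWAY_ROOT = PurePosixPath("/mnt/genomics")


def resolve_read_path(san_path: str, cut_over_dirs: set) -> str:
    """Return the path analysts should read: gateway if the directory is cut over, else SAN."""
    path = PurePosixPath(san_path)
    for candidate in [path, *path.parents]:
        if str(candidate) in cut_over_dirs:
            return str(GATEWAY_ROOT / path.relative_to(SAN_ROOT))
    return san_path


cut_over = {"/san/samples/2018"}
print(resolve_read_path("/san/samples/2018/run1/a.bam", cut_over))
print(resolve_read_path("/san/samples/2019/run7/b.bam", cut_over))
```

Keeping the decision at directory granularity means the cut-over set stays small enough to review by hand after each Snowball import.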
Phase 3: DataSync for continuous weekly transfer. Once Direct Connect is provisioned (parallel track, assume 8-12 weeks to first BGP session), stand up a DataSync agent in the datacentre and route its traffic over a 10 Gbps DX. Schedule: nightly task running at 22:00 UTC, scanning /san/new-samples/, copying to s3://genomics-archive/new/YYYY/MM/DD/ with verification and metadata preservation. Even a full week’s 50 TB in a single night is ~11 hours at 10 Gbps sustained, so the nightly increment of roughly 7 TB fits comfortably.
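The nightly schedule, verification, and throttling described here map onto the task definition DataSync accepts. The sketch below only builds the request as a plain dictionary so it can be reviewed offline; it makes no AWS calls. The function name, task name, and ARNs are placeholders, and the option values are one plausible choice rather than the plan's confirmed settings.

```python
from typing import Optional


def nightly_task_request(source_location_arn: str,
                         destination_location_arn: str,
                         max_mbps: Optional[float] = None) -> dict:
    """Build a DataSync CreateTask-style request for a nightly 22:00 UTC run."""
    options = {
        "VerifyMode": "ONLY_FILES_TRANSFERRED",  # verify what each run copied
        "TransferMode": "CHANGED",               # incremental: only new or changed files
        "Mtime": "PRESERVE",                     # keep modification times
        "Atime": "BEST_EFFORT",                  # paired with Mtime=PRESERVE
        # -1 means no limit; otherwise convert Mbps to bytes per second
        "BytesPerSecond": -1 if max_mbps is None else int(max_mbps * 1e6 / 8),
    }
    return {
        "Name": "nightly-new-samples",           # hypothetical task name
        "SourceLocationArn": source_location_arn,
        "DestinationLocationArn": destination_location_arn,
        "Schedule": {"ScheduleExpression": "cron(0 22 * * ? *)"},  # 22:00 UTC daily
        "Options": options,
    }


request = nightly_task_request("arn:example:source", "arn:example:destination", max_mbps=8_000)
print(request["Schedule"], request["Options"]["BytesPerSecond"])
```

With boto3 installed and credentials configured, a dictionary of this shape would be passed as keyword arguments to the DataSync client's create_task call; the throttle is there so the nightly run leaves part of the DX for analyst reads.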
Before DX is live, the lab’s weekly output ships on a dedicated Snowball Edge 80 TB: a rolling weekly device that handles the 50 TB/week flow until DX and DataSync take over. A single Snowcone (14 TB) is too small for a week’s output; it would take four of them per week.
Phase 4: Post-lease architecture. Analysts move to EC2 in AWS, with FSx for Lustre hydrated from S3. FSx for Lustre can link to an S3 bucket as its backing store, loading file metadata on mount and lazy-hydrating file content on first access. This is the POSIX file system the tools expect; it’s backed by S3 so the archive copy is authoritative; Lustre’s read performance handles the analyst workload. The on-prem File Gateway is decommissioned with the datacentre.
S3 lifecycle policy on the bucket:
- Objects under /samples/ (archival, 95% of data) transition to Intelligent-Tiering immediately; Intelligent-Tiering moves to Deep Archive after 180 days of no access. Infrequent-access tier ($0.0125/GB-month) for objects accessed rarely; Deep Archive tier ($0.00099/GB-month) for the cold majority. 3 PB at Deep Archive rates is ~$3,000/month; at Standard it would be $69,000/month.
- Objects under /new/ stay in Standard for 90 days (active analysis window), then tier.
- Object Lock in Governance mode for regulatory retention (20 years); Legal Holds for specific datasets under litigation if needed.
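The policy above can be written down as the three configuration documents S3 expects, alongside the cost arithmetic. This is a sketch only: the rule IDs are invented, "then tier" for new data is assumed to mean a move into Intelligent-Tiering, and the prices are the per-GB figures quoted in the list rather than a current price sheet. Nothing here calls AWS.

```python
PRICE_PER_GB_MONTH = {"standard": 0.023, "deep_archive_access": 0.00099}


def monthly_cost_usd(total_tb: float, tier: str) -> float:
    """Flat-rate monthly storage cost, ignoring request and monitoring fees."""
    return total_tb * 1_000 * PRICE_PER_GB_MONTH[tier]


lifecycle_configuration = {
    "Rules": [
        {"ID": "samples-to-intelligent-tiering", "Status": "Enabled",
         "Filter": {"Prefix": "samples/"},
         "Transitions": [{"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}]},
        {"ID": "new-to-intelligent-tiering-after-90-days", "Status": "Enabled",
         "Filter": {"Prefix": "new/"},
         "Transitions": [{"Days": 90, "StorageClass": "INTELLIGENT_TIERING"}]},
    ]
}

# The Deep Archive Access tier is opt-in and configured separately from lifecycle rules.
intelligent_tiering_configuration = {
    "Id": "deep-archive-after-180-days",
    "Status": "Enabled",
    "Tierings": [{"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"}],
}

object_lock_configuration = {
    "ObjectLockEnabled": "Enabled",
    "Rule": {"DefaultRetention": {"Mode": "GOVERNANCE", "Years": 20}},
}

print(f"Deep Archive: ${monthly_cost_usd(3_000, 'deep_archive_access'):,.0f}/month")
print(f"Standard:     ${monthly_cost_usd(3_000, 'standard'):,.0f}/month")
```

One thing to check before adopting this: objects that have moved to the Deep Archive Access tier need a restore before they can be read, which matters for analysts reaching them through a file-system front end.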
A worked Snowball cycle
The first Snowball Edge delivery arrives.
# Order via console or CLI (V3_5S is the API name for the 210 TB Storage Optimized device)
$ aws snowball create-job \
--job-type IMPORT \
--resources '{"S3Resources": [{"BucketArn": "arn:aws:s3:::genomics-archive", "KeyRange": {}}]}' \
--shipping-option SECOND_DAY \
--snowball-type V3_5S \
--role-arn arn:aws:iam::111122223333:role/snowball-import-role \
--address-id adid-1234567890abcdef
JobId: JID-a1b2c3d4
Device arrives five days later. The team unboxes it, connects it to the datacentre network (10 Gbps SFP+ into a dedicated port on the SAN backbone), unlocks it with the manifest and unlock code from the console, and starts the Snowball Edge client:
# Unlock the device
$ snowballEdge unlock-device \
--manifest-file manifest.bin \
--unlock-code 12345-67890-abcde-fghij \
--endpoint https://192.0.2.10
# Start the S3 adapter
$ snowballEdge start-service \
--service-id s3 \
--virtual-network-interface-arns arn:aws:snowball-device:::interface/s3-adapter
# Copy data to the device
$ aws s3 cp /san/samples/2018 s3://genomics-archive/samples/2018/ \
--recursive \
--endpoint-url http://192.0.2.10:8080 \
--profile snowball
Loading runs for a week at ~335 MB/s sustained (bounded by the SAN read rate, not the device). We run a verification pass that compares a listing of the SAN tree against aws s3 ls --recursive on the device, plus a spot-check of SHA-256 sums on 100 randomly chosen files. The device’s console reports 203 TB of 210 TB used when loading stops.
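The checksum spot-check can be sketched as follows. It assumes both the SAN original and the copy are reachable as file paths; checking files that live on the device would first need them read back through its S3 endpoint, which is left out here. Function names are illustrative.

```python
import hashlib
import random
from pathlib import Path


def sha256_of(path, chunk_size=8 * 1024 * 1024) -> str:
    """Stream a file through SHA-256 so multi-hundred-GB files never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pick_sample(relative_paths, sample_size=100, seed=None):
    """Choose up to sample_size paths at random; a seed makes the choice repeatable."""
    rng = random.Random(seed)
    candidates = sorted(relative_paths)
    return rng.sample(candidates, min(sample_size, len(candidates)))


def spot_check(source_root, copy_root, relative_paths, sample_size=100, seed=None):
    """Return the sampled relative paths whose checksums differ between the two roots."""
    mismatches = []
    for rel in pick_sample(relative_paths, sample_size, seed):
        if sha256_of(Path(source_root) / rel) != sha256_of(Path(copy_root) / rel):
            mismatches.append(rel)
    return sorted(mismatches)
```

Recording the seed with the audit log lets someone rerun the same sample later, which helps with the chain-of-custody story for regulated data.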
Ship-back:
# Stop the service, lock the device
$ snowballEdge stop-service --service-id s3
$ snowballEdge shutdown-device
# Device shows the return shipping label on its eInk display
# Schedule pickup via the job dashboard
AWS ground-transports the device to the nearest AWS ingest facility, runs the import into s3://genomics-archive/samples/2018/..., and the team gets a notification in the job dashboard. Chain of custody is logged: device ordered, shipped, received, unlocked, data imported, device wiped. SHA-256 checksums are verified on AWS’s side against what the device recorded. The whole cycle is 12 days for 203 TB: an effective ~17 TB/day, or about 1.6 Gbps, which already beats the internet circuit’s 1 Gbps line rate from a single device. With five devices in rotation, the fleet averages roughly 8 Gbps.
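The effective-throughput figure for a courier cycle is worth computing rather than estimating, since it is the number that gets compared against the circuit. A minimal sketch in decimal units, with an illustrative function name:

```python
def effective_gbps(data_tb: float, days: float) -> float:
    """Average throughput, in Gbps, of moving data_tb terabytes in the given number of days."""
    return data_tb * 1e12 * 8 / (days * 86_400) / 1e9


per_device_cycle = effective_gbps(203, 12)   # whole cycle, one device
loading_only = effective_gbps(203, 7)        # on-site loading week only
fleet_of_five = 5 * per_device_cycle         # five devices in rotation

print(f"per device, full cycle: {per_device_cycle:.2f} Gbps")
print(f"during loading:         {loading_only:.2f} Gbps")
print(f"five devices:           {fleet_of_five:.1f} Gbps")
```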
What’s worth remembering
- The bandwidth maths decide the shape. 3 PB over 1 Gbps is more than a year of transfer; over a Snowball fleet it’s weeks. Run the numbers early.
- Snow Family is a courier operation, not a network operation. Devices ship, get loaded at LAN speed, ship back. AWS imports to S3. The on-device data is encrypted with keys held in KMS, and those keys are never stored on the device.
- DataSync is for authenticated, verified, incremental flows. The agent handles scheduling, parallelism, throttling, metadata, and retries. The correct tool for 50 TB/week continuous, not for 3 PB one-shot.
- Storage Gateway gives AWS-backed storage a local filesystem face. File Gateway for NFS/SMB, Volume Gateway for iSCSI, Tape Gateway for VTL. Useful during migration and for hybrid workflows that need local caching.
- File Gateway cache sizing follows the working set. Roughly the active-read subset over the cache-retention period. Oversized cache wastes money; undersized cache means frequent S3 reads over a constrained pipe.
- S3 Intelligent-Tiering and Deep Archive handle long-tail cold data. 95% archival data at Deep Archive rates is ~$1/TB/month; at Standard it’s ~$23/TB/month. Lifecycle policies do the tiering automatically.
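- Object Lock in Governance mode is the regulatory-retention primitive. Combined with a 20-year retention period, it prevents accidental deletion for the regulatory window. Users granted the s3:BypassGovernanceRetention permission can still override it, so keep that permission tightly held; Compliance mode removes the override entirely if the regulator requires it.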
- Post-migration, FSx for Lustre is the POSIX bridge to S3. Linked to a bucket, Lustre hydrates files on access and presents a filesystem to tools that expect one. The file-system layer doesn’t have to stay on-prem.
Moving three petabytes is not one problem; it’s three: the historical bulk, the ongoing flow, and the continuous read access. Snowball, DataSync, and Storage Gateway each solve one cleanly. The migration plan is which pieces fit where, in what order, with the lease as the hard deadline.