Technical Tuesday: Dealing with Data

It’s Technical Tuesday! Today we are diving into the controversial topic of paying for data storage and, since the two go hand-in-hand, backup solutions. There is no need to take any of this as gospel; this is just one of many ways to go about it.

Start with Cloud Storage:

The cheapest solution to data storage is cloud storage, which is free up to a point.

As an example, Google Drive offers 15 GB to any Google account for free. You can set it up to automatically sync with multiple drives on multiple computers, and it will not just back up your data, but even keep track of revisions.

If you are associated with a large company or a university, you may already have more than that: almost everyone with an .edu email address can get unlimited storage on Google Drive.

If you don’t have such a hookup, no worries. You can get Drive for Work, which runs $10 a month for unlimited storage, or you might be eligible for G Suite for Education, which will get you that unlimited storage for free.

If none of these situations fit you, you can always get a personal license. Amazon costs only $11/year for 100 GB, or $60/year for 1 TB.

Google Drive is just an example. Your university or place of work might have unlimited storage on Dropbox, Amazon, Microsoft OneDrive, or a myriad of other services.

A final note: if your individual files are very small or very big compared to 500 MB, consider packing them into roughly 500 MB chunks in an archive format such as ZIP or RAR. Cloud services can have trouble handling tens of thousands of files at a time, or single files on the order of a TB. Archiving also has the benefit of reducing the size of certain files (for instance, my photometry data is reduced by a factor of 2 using ZIP).
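If you would rather script the chunking than do it by hand, here is a minimal sketch of one way to go about it using Python's standard zipfile module. The 500 MB target is measured before compression, the function names and the chunk_0000.zip naming scheme are my own inventions, and a single file larger than the target simply gets an archive to itself (splitting one huge file across archives is not handled).

```python
import zipfile
from pathlib import Path

CHUNK_BYTES = 500 * 1000**2  # ~500 MB target per archive, measured before compression


def plan_chunks(sizes, chunk_bytes=CHUNK_BYTES):
    """Group (name, size) pairs into batches whose total size stays under chunk_bytes.

    A file larger than chunk_bytes gets a batch to itself; splitting a single
    huge file is not handled here.
    """
    batches, current, current_size = [], [], 0
    for name, size in sizes:
        # Close the current batch if adding this file would overshoot the target.
        if current and current_size + size > chunk_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


def zip_in_chunks(src_dir, out_dir, chunk_bytes=CHUNK_BYTES):
    """Write chunk_0000.zip, chunk_0001.zip, ... into out_dir and return their paths."""
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in src.rglob("*") if p.is_file())
    batches = plan_chunks([(p, p.stat().st_size) for p in files], chunk_bytes)
    archives = []
    for i, batch in enumerate(batches):
        archive = out / f"chunk_{i:04d}.zip"
        with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            for path in batch:
                # Store paths relative to src so the folder layout survives.
                zf.write(path, arcname=path.relative_to(src).as_posix())
        archives.append(archive)
    return archives
```

Calling something like zip_in_chunks("raw_data", "to_upload") (both folder names are placeholders) would leave a set of archives in to_upload that you can point your sync client at. Keep the output folder outside the source folder so a second run does not archive the archives.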

When to Augment Your Cloud Storage?

It depends on your upload speed. Here’s a little test to see how fast your internet is.

Annoyingly, upload speeds are reported in megabits per second (Mb/s) instead of megabytes per second (MB/s). 8 Mb/s is 1 MB/s, so divide the number you get here by 8.

Optimistically, your internet when plugged in by Ethernet is between 1-10 MB/s. A link running flat out at 1 MB/s around the clock moves at most about 86 GB a day, and real-world throughput is lower than that. This means that somewhere between 50-500 GB of data a day, you are going to have to find additional storage solutions.
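To plug your own speed-test number into that estimate, here is a small sketch. The two readings in the loop are made-up examples, and the utilization parameter is my own addition for modeling a link that is not busy around the clock.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def mbps_to_mb_per_s(megabits_per_second):
    """Convert a speed-test reading in Mb/s to MB/s (8 bits per byte)."""
    return megabits_per_second / 8


def max_gb_per_day(mb_per_s, utilization=1.0):
    """Upper bound on data uploaded in one day, in GB (1 GB = 1000 MB)."""
    return mb_per_s * SECONDS_PER_DAY * utilization / 1000


for reading in (8, 80):  # hypothetical speed-test results in Mb/s
    speed = mbps_to_mb_per_s(reading)
    print(f"{reading} Mb/s = {speed:g} MB/s -> at most {max_gb_per_day(speed):.1f} GB/day")
```

At full utilization, 1 MB/s and 10 MB/s work out to 86.4 and 864 GB a day. The 50-500 GB range quoted above corresponds to a link that is busy a bit more than half the time, which seems like a fair guess for a connection shared with everything else you do.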

Before You Start: Try IT

If you have such a resource, contact IT about getting a faster internet connection. If you are paying for your own internet, it might be worth considering an upgrade.

If you are at a larger institution, ask IT about research cluster data storage. As an example of what you are looking for, check out this Harvard Medical School page for a cluster they run. Institutional solutions will often have much higher upload/download speeds on their intranet, or private network.

Augment Cloud Storage with a Storage Server

If you find you are limited by your upload speed, consider augmenting your free cloud storage by moving some of your data to a local storage server.

If you don’t have enough confidence to build your own server, you can buy one. Consider purchasing last-gen servers that businesses are getting rid of on eBay or Craigslist. Check out this page, which covers what to look for as well as things to watch out for. If you want to buy something new but provide your own drives, consider 45 Drives.

If you want to build your own server on a budget (~$1000, not including hard drives), here’s a possible outline for you:

  1. Processor: A low-tier Xeon processor with integrated graphics, along the lines of an E3-1225 V5.
  2. Memory: ~8 GB of ECC RAM.
  3. Motherboard: Choose a board that matches your processor socket and supports ECC RAM.
  4. SATA Controller: Often, you can’t physically plug in as many hard drives as you want, so you need an expansion card. You can get a card like this and then use 4-in-1 breakout cables to plug up to 8 HDDs into it.
  5. Internal Power Supply: Just make sure it has enough connectors for as many HDDs as you want to run, or that any additional cables have a place to plug in if necessary.
  6. UPS: To protect the server from sudden power outages, consider an Uninterruptible Power Supply.
  7. Case: You could go for a reasonable dedicated server case, which typically comes in a rack-mount form factor. Or you could buy something enormous.

If you feel like some hardcore DIY, the design for that 600 TB setup has been open-sourced by Backblaze.

In the case of a dedicated server, you want Western Digital Red hard disk drives. Right now, 4 TB ($125) and 8 TB ($260) drives are particularly cheap per GB, and both come with 3-year warranties.

Finally, you’ll need a server operating system (OS). We recommend unRAID. For the ultra-budget-conscious consumer, you can often get a Windows Server license from your local IT department. Or tackle your problem with a free Linux distribution (Mint, Ubuntu, Arch, CentOS, etc.). These take some configuration to set up as a file-sharing server (making a ZFS pool, etc.), but can be fully capable. There are also free alternative NAS operating systems (FreeNAS and Xpenology, for starters).
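To compare the two sizes on equal footing, here is a quick sketch that turns the prices quoted above into dollars per TB and a total for a target capacity. The 40 TB target is just a hypothetical example, and the totals ignore any parity or backup drives.

```python
import math

# Capacity in TB -> price in USD, as quoted in the text above.
WD_RED_PRICES = {4: 125, 8: 260}


def price_per_tb(capacity_tb, price_usd):
    """Dollars per TB for a single drive."""
    return price_usd / capacity_tb


def cost_for_target(target_tb, capacity_tb, price_usd):
    """Number of drives and total price to reach target_tb of raw capacity."""
    count = math.ceil(target_tb / capacity_tb)
    return count, count * price_usd


TARGET_TB = 40  # hypothetical raw capacity, before any parity drives

for capacity, price in WD_RED_PRICES.items():
    count, total = cost_for_target(TARGET_TB, capacity, price)
    print(f"{capacity} TB drives: ${price_per_tb(capacity, price):.2f}/TB, "
          f"{count} drives = ${total} for {TARGET_TB} TB")
```

At these prices the 4 TB drives are slightly cheaper per TB ($31.25 versus $32.50), but the 8 TB drives use half as many bays and SATA ports, which matters once the case starts filling up.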

The benefit of unRAID is that it only requires two backup (parity) drives for your whole server (to a point). Instead of me fumbling through an explanation of how parity backups work, just read this. For those of us who grew up with RAID 1, this is black magic. It also makes it easy to have a speedy set of cache drives that let you upload data quickly; the data is then moved onto the slower storage drives over time. unRAID maxes out at 24 storage drives, which with 8 TB drives means 192 TB. If you hit that limit, you can always build another server.

If you want to access a personal server, be it local or hosted by someone else, with the same ease as Google Drive on all your devices, consider installing a platform such as Nextcloud on your server.
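If the black magic bothers you, here is a toy sketch of the general idea behind parity: XOR the same position on every data drive together, store the result on a parity drive, and any one lost drive can be rebuilt from the survivors. This illustrates single XOR parity in general, not unRAID's actual implementation, and a second parity drive relies on a different calculation that is not shown. The three "drives" and their contents are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data drives, each holding one small block.
data_drives = [b"astro", b"photo", b"metry"]
parity = xor_blocks(data_drives)

# Simulate losing the second drive and rebuilding it from the survivors plus parity.
survivors = [data_drives[0], data_drives[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)  # b'photo'
```

The point is that one parity drive covers any number of data drives, which is why this scales so much better than mirroring every drive as in RAID 1.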

Augment A Storage Server with Personal Drives

Finally, if internal servers and the cloud simply don’t cut it, try putting the really big files on physical drives. A SATA connection is rated at 6 Gb/s (that’s 0.75 GB/s; I know, confusing, right?), though a spinning drive will usually be slower than the interface it plugs into. You can put a drive into a hot-swap port, transfer data onto it, pull it out, and bring it to another computer for backup and analysis.
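To get a feel for why carrying a drive can beat uploading, here is a rough comparison of how long it takes to move one 4 TB drive's worth of data. The 750 MB/s entry is just the 0.75 GB/s interface figure from above, so treat it as a best case; the rates and labels are illustrative assumptions, not measurements.

```python
def transfer_hours(size_tb, rate_mb_per_s):
    """Hours needed to move size_tb terabytes at a sustained rate in MB/s."""
    size_mb = size_tb * 1_000_000  # 1 TB = 1,000,000 MB
    return size_mb / rate_mb_per_s / 3600


SIZE_TB = 4  # one 4 TB drive's worth of data

routes = {
    "upload at 1 MB/s": 1,
    "upload at 10 MB/s": 10,
    "SATA at its 750 MB/s peak": 750,
}
for label, rate in routes.items():
    print(f"{label}: {transfer_hours(SIZE_TB, rate):,.1f} hours")
```

That works out to roughly 1,111 hours, 111 hours, and 1.5 hours respectively. Even if the drive only manages a fraction of the interface speed, it is still far ahead of a typical upload link.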

For a drive that will be moved around a lot, you want WD Blue. A WD Blue 4 TB drive (WD40EZRZ: 5400 RPM class, SATA 6 Gb/s, 64 MB cache, 3.5") comes in at just $100 right now. This includes a 2-year warranty. $100 for 4 TB is $25 per TB, so you’ll pay $50 per TB for the drive plus a backup drive.

To move data around, install a hot-swap SATA bay in your desktops, such as the StarTech.com 5.25in trayless hot-swap mobile rack for 3.5in hard drives. For around $10 each, these will let you easily transfer drives from desktop to desktop.

Buy $20 hard drive docking stations, such as the Cable Matters USB 3.0 SATA HDD/SSD docking station, to transfer files to laptops when necessary.

Or buy a $100 hard drive docking station and drive duplicator to transfer files to laptops and to easily back up your drives.

For robustness, your backup drives should not be used as hot-swap mobile drives. Either mount them directly in a desktop computer normally used for analysis, or duplicate onto them on a weekly basis and then leave them in a safe location.
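A backup is only useful if it still matches the original, so it can be worth checking the copy after each duplication. Here is a minimal sketch that compares two folder trees by SHA-256 checksum using only Python's standard library. The function names are my own, and it only looks for files that are missing from or different in the backup, not for extra files that exist only there.

```python
import hashlib
from pathlib import Path


def file_sha256(path, block_size=1024 * 1024):
    """SHA-256 of a file, read in 1 MiB blocks so large files never sit in memory whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def compare_trees(primary, backup):
    """Return (relative path, problem) pairs for files missing from or differing in the backup."""
    primary, backup = Path(primary), Path(backup)
    problems = []
    for src in sorted(p for p in primary.rglob("*") if p.is_file()):
        rel = src.relative_to(primary)
        dst = backup / rel
        if not dst.is_file():
            problems.append((str(rel), "missing"))
        elif file_sha256(src) != file_sha256(dst):
            problems.append((str(rel), "differs"))
    return problems
```

Point compare_trees at the mount points of the working drive and its backup; an empty list means every file on the working drive has an identical copy on the backup. Reading every byte of both drives is slow, so this is more of a weekly job than something to run after every change.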

Panic

Between these three solutions, you should be able to deal with acquiring upwards of 10-50 TB a year, but if you are acquiring over 200 TB a year, then it’s time to panic.

100 TB only requires ~$5000 worth of personal drives and backups, but you are hitting the point where drive failures are a part of life. Storing the drives and making sure the backups haven’t failed will be a lot of work. Talk to IT again. You might also want to call up some companies, and maybe just stop and think about how you’re going to analyze all that data anyway.
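For the curious, the ~$5000 figure comes straight from the WD Blue numbers quoted earlier. Here is the arithmetic as a small sketch, assuming 4 TB drives at $100 each and one backup copy of everything; the list of data volumes is just a set of example values.

```python
import math

DRIVE_TB = 4        # WD Blue capacity quoted in the text
DRIVE_PRICE = 100   # USD per drive, as quoted in the text


def personal_drive_cost(data_tb, copies=2):
    """Drives and dollars needed to hold data_tb; copies=2 means one backup of everything."""
    drives = math.ceil(data_tb / DRIVE_TB) * copies
    return drives, drives * DRIVE_PRICE


for tb in (10, 50, 100, 200):
    drives, cost = personal_drive_cost(tb)
    print(f"{tb:>3} TB: {drives:>3} drives, ${cost:,}")
```

At 100 TB that is 50 drives to buy, label, store, and keep checking, which is where the real cost of this approach starts to show.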

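I hope this has been helpful, and I’m sure many people will disagree with my suggestions here. To summarize: start with cloud storage, augment it with a storage server once your upload speed becomes the limit, and move the really big files onto personal drives.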

Thanks to Dr. Walter Schwenger, Isaiah Laderman, and Dr. Marc Ridilla for much of the information in this post.