How to handle large amounts of storage on the cloud (or otherwise)?

August 8, 2015

I have written an application which does video encoding. The encoding is a pipelined process: first you fetch the video, then you encode it using ffmpeg, then you split the video into multiple parts, etc.

During the course of this, a 1 GB video balloons into several GB of intermediate data. This service is written so that a different program (via RabbitMQ) can handle each piece of the pipeline. Of course, the process doesn’t have to run this way, which brings me to my question.

I’m looking at storage requirements for making the app “live”. With cloud providers, you pay per GB of storage and per GB of transfer. So far so good.

When I transfer this 1 GB video blob from one cloud VM instance to another, or from the VM to the common storage service, does that count against my bandwidth? (I realize this answer will change depending on the host’s terms of service.)

Would it make more sense to have one VM perform the entire process and then spin up multiple instances of it, as opposed to having each VM perform only a single task in the pipeline? I ask in terms of optimizing for cost: lowest storage cost and lowest cost of spinning up VMs. Because the encoding will happen in batches, I am less concerned about pushing out requests quickly.
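To make the two layouts easier to compare, here is a rough sketch of the transfer side of the cost. The hand-off sizes and the per-GB price are made-up placeholders, not any provider's real rates, and `compare_layouts` is a hypothetical helper written only for this illustration. It assumes that a single VM keeps intermediate data on local disk and only pays for the final upload to the storage service, while one-VM-per-stage pays for every hand-off.

```python
# All sizes and prices here are hypothetical placeholders, not real provider rates.

# Data handed from one stage to the next, in GB (assumed sizes):
# fetch -> encode, encode -> split, split -> storage service
HANDOFFS_GB = [1.0, 3.0, 3.0]


def transfer_cost(handoffs_gb, price_per_gb):
    """Cost of moving each hand-off across the network exactly once."""
    return sum(handoffs_gb) * price_per_gb


def compare_layouts(internal_price_per_gb):
    """Return (one_vm_all_stages, one_vm_per_stage) transfer costs."""
    # One VM runs every stage: only the last hand-off (upload to storage)
    # crosses the network.
    one_vm_all_stages = transfer_cost(HANDOFFS_GB[-1:], internal_price_per_gb)
    # One VM per stage: every hand-off crosses the network.
    one_vm_per_stage = transfer_cost(HANDOFFS_GB, internal_price_per_gb)
    return one_vm_all_stages, one_vm_per_stage


# If the provider does not bill internal traffic, the layouts tie on transfer.
print(compare_layouts(0.0))
# If it does bill internal traffic, the per-stage layout pays for more hand-offs.
print(compare_layouts(0.02))
```

Under these assumptions the choice only matters for transfer cost when internal traffic is billed, which is why the host's terms of service decide the answer.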

This scenario is a little bit unique in that I have huge amounts of binary data which cannot be stored efficiently in, say, a database. Which raises a similar question: for those with experience, when your DB VM sends its results back to your web app, are you charged for that intermediate transfer?

Am I even asking the right questions? Is there a guide that I should read, short of calling hosting providers and asking them about pricing myself?

One Response to “How to handle large amounts of storage on the cloud (or otherwise)?”

  1. The uniqueness of your scenario makes it rather interesting, I’d say!

    As for transferring data between virtual machines in the cloud, that depends on the provider and on where the machines are located. Amazon, for example, does not charge EC2 users for data transferred between its web services in the same location. So you can reduce your transfer costs to just the initial upload and final download of your “big bunch of binary data”.

    Now, can your task be parallelized efficiently? If so, consider spinning up many VMs at the same time to get the job done faster. This is certainly cost-effective if time = money, but I have reservations in your case, because you mention that you are less concerned about pushing out requests quickly. You can still have a main VM that handles requests and coordinates batches, and start up and shut down other VMs to handle part of the workload. You pay for as long as a VM is running, like a utility.

    The good thing about your scenario is that this kind of batch task is ideal for cloud computing, and the pricing model for it is fairly straightforward. Such tasks are resource-intensive (CPU / RAM), so their “greediness” can be satisfied by the virtually unlimited resources a cloud can offer.
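The "pay while it runs" point in the response can be illustrated with a small sketch. It assumes the batch splits perfectly evenly across VMs and that the provider rounds each VM's running time up to a billing increment; the hourly price and the `batch_cost` helper are hypothetical and chosen only for illustration.

```python
import math


def batch_cost(total_vm_hours, n_vms, price_per_hour, billing_increment_hours=1.0):
    """Return (wall_clock_hours, total_cost) for a batch split evenly over n_vms.

    Assumes perfect parallelization and that each VM's running time is
    rounded up to the next billing increment (both are simplifications).
    """
    hours_per_vm = total_vm_hours / n_vms
    increments = math.ceil(hours_per_vm / billing_increment_hours)
    billed_hours_per_vm = increments * billing_increment_hours
    total_cost = n_vms * billed_hours_per_vm * price_per_hour
    return hours_per_vm, total_cost


# Hypothetical batch: 8 VM-hours of encoding at a made-up 0.10 per hour.
for n in (1, 4, 16):
    wall, cost = batch_cost(total_vm_hours=8, n_vms=n, price_per_hour=0.10)
    print(f"{n:>2} VMs: {wall:.2f} h wall clock, cost {cost:.2f}")
```

With these assumed numbers, going from 1 to 4 VMs finishes sooner for the same bill, while 16 VMs costs more because each VM is billed for a full increment it only half uses. For a batch job where turnaround is not critical, that rounding is the main thing to watch when deciding how many VMs to start.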
