A key financial advantage of cloud computing over on-site computing is elasticity: cloud computing power and storage resources can be provisioned on demand at virtually any scale. This ability to allocate resources elastically makes large-scale data processing and analytics jobs cost-effective.

Cloud clusters

To begin, let’s go over AWS options for large computing clusters and massive data storage.

An AWS cluster can be built either from EC2 instances (IaaS) or with Elastic MapReduce (EMR), a Hadoop PaaS. With both options you need to make several decisions to optimize costs. For every cluster node you must choose its instance type and then how to purchase the instance: On-Demand, Reserved, or Spot. Remember that optimizing direct costs is only one part of the issue; the other part is optimizing over SLA levels. These pricing decisions apply to both IaaS and PaaS (EMR supports both Reserved and Spot Instances). BidElastic can help you find cost- and SLA-optimal node configurations for your Hadoop or Apache Spark installation.
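
As a rough illustration of mixing purchasing options, the boto3 sketch below launches a small EMR cluster with an On-Demand master and core group plus a Spot task group. The cluster name, instance types, counts, and bid price are hypothetical placeholders, not recommendations.

```python
# A minimal boto3 sketch; instance types, counts, bid price and names
# are hypothetical placeholders -- adjust them to your own cost/SLA targets.
import boto3

emr = boto3.client('emr', region_name='eu-west-1')

response = emr.run_job_flow(
    Name='analytics-cluster',          # hypothetical cluster name
    ReleaseLabel='emr-5.0.0',
    Applications=[{'Name': 'Hadoop'}, {'Name': 'Spark'}],
    Instances={
        'InstanceGroups': [
            {   # master node, paid On-Demand for stability
                'InstanceRole': 'MASTER',
                'Market': 'ON_DEMAND',
                'InstanceType': 'm4.large',
                'InstanceCount': 1,
            },
            {   # core nodes hold HDFS data, so keep them On-Demand
                'InstanceRole': 'CORE',
                'Market': 'ON_DEMAND',
                'InstanceType': 'm4.xlarge',
                'InstanceCount': 2,
            },
            {   # task nodes are stateless, so cheap Spot capacity is fine
                'InstanceRole': 'TASK',
                'Market': 'SPOT',
                'BidPrice': '0.10',     # hypothetical bid in USD/hour
                'InstanceType': 'm4.xlarge',
                'InstanceCount': 4,
            },
        ],
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
print(response['JobFlowId'])
```

The same idea applies to plain EC2 clusters: keep stateful nodes on stable capacity and push stateless workers onto Spot.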

How about data storage?

AWS data storage options for large-scale analytics are even more complicated than those for computing power. To understand the AWS ecosystem for large data storage, let's begin with the data types commonly found in large-scale computations: transactional data, streamed data, and logs and binary files.

For transactional data AWS offers three choices: relational databases (RDS), a NoSQL database (DynamoDB), and a data warehouse (Redshift).

  • When storing traditional relational transactional data in RDS, you can scale the transactional throughput of a database up by increasing the database instance size. To improve read performance you can scale the database out by adding read replicas (see the sketch after this list). Read-replica data are copied from the master database asynchronously. Up to five read replicas are available for MySQL and PostgreSQL, and up to 15 for Aurora. However, if you are not concerned with replication latency, you can go beyond this limit by creating read replicas of read replicas.
  • As a NoSQL PaaS, AWS offers DynamoDB, which provides a scalable key-value storage model. The key advantage of DynamoDB is its scale-out model and the ability to increase throughput capacity on demand (also shown in the sketch below). DynamoDB's pricing model is based on provisioned throughput and stored data volume.
  • Finally, AWS offers Redshift as a data-warehousing solution with petabyte-scale columnar storage. Redshift is compatible with PostgreSQL database drivers, so you can analyze Redshift data with standard SQL queries.
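
As a rough sketch of the scale-out knobs mentioned above, the boto3 calls below create a read replica for an existing RDS instance and raise the provisioned throughput of a DynamoDB table. All identifiers, instance classes, and capacity figures are hypothetical.

```python
# Minimal boto3 sketch; instance identifiers, table name and capacity
# figures are hypothetical placeholders.
import boto3

rds = boto3.client('rds', region_name='eu-west-1')
dynamodb = boto3.client('dynamodb', region_name='eu-west-1')

# Scale RDS reads out: add an asynchronous read replica of the master.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier='orders-db-replica-1',       # hypothetical replica name
    SourceDBInstanceIdentifier='orders-db-master',    # hypothetical master
    DBInstanceClass='db.m4.large',
)

# Scale DynamoDB throughput on demand: raise provisioned capacity units.
dynamodb.update_table(
    TableName='clickstream-events',                   # hypothetical table
    ProvisionedThroughput={
        'ReadCapacityUnits': 500,
        'WriteCapacityUnits': 200,
    },
)
```

Since Redshift speaks the PostgreSQL wire protocol, the warehouse side of the picture needs no special SDK: any PostgreSQL driver and plain SQL will do.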

To process streamed data in AWS you can use Kinesis. Kinesis supports simultaneous asynchronous ingestion of several data streams and integrates nicely with the rest of the AWS technology stack. Kinesis data can be handed off to various services for processing: for example, you can send each record to an AWS Lambda function or an SQS queue.
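
The snippet below is a minimal sketch of pushing records into a Kinesis stream with boto3; the stream name and record layout are hypothetical, and a Lambda or SQS consumer would be attached separately.

```python
# Minimal boto3 sketch of a Kinesis producer; the stream name and the
# record contents are hypothetical.
import json
import boto3

kinesis = boto3.client('kinesis', region_name='eu-west-1')

events = [
    {'user_id': '42', 'action': 'click', 'page': '/pricing'},
    {'user_id': '43', 'action': 'view', 'page': '/docs'},
]

for event in events:
    kinesis.put_record(
        StreamName='clickstream',                     # hypothetical stream
        Data=json.dumps(event).encode('utf-8'),
        PartitionKey=event['user_id'],                # spreads records across shards
    )
```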

Logs and binary files can be stored in AWS S3. The big advantage of S3 is that a Hadoop cluster can operate directly on S3 data, so there is no need to build HDFS data nodes. If your problem permits it, using S3 greatly increases the cost-efficiency of an analytic job. You need to be cautious here, however: the data transfer rate of a local volume outperforms that of S3. This problem can be mitigated by compressing the data and by composing clusters out of large numbers of smaller nodes. As compression codecs, we recommend LZO for Hadoop 1 and Snappy for Hadoop 2. Data transfer parallelism on S3 is efficient, so when many workers on several nodes request S3 data simultaneously, the combined transfer rate is satisfactory for most applications. As another bonus, both Amazon's Hadoop PaaS (EMR) and standalone Hadoop installations support reading map-reduce input from S3 and writing results back to it. It is also possible to query S3 data with Presto or Hive, and integration of S3 storage with Apache Spark is also straightforward.
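
To make the Spark-on-S3 point concrete, here is a minimal PySpark sketch that reads compressed log files straight from S3 and writes Snappy-compressed Parquet back; the bucket, paths, and log schema are hypothetical.

```python
# Minimal PySpark sketch; the bucket, prefixes and column names are
# hypothetical. Gzip input is decompressed transparently, and the output
# is written back to S3 as Snappy-compressed Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('s3-log-analysis').getOrCreate()

# Read JSON log lines directly from S3 -- no HDFS data nodes needed.
logs = spark.read.json('s3://analytics-bucket/logs/2016/*/*.gz')

# Aggregate and write the result back to S3, compressed with Snappy.
errors_per_day = (logs
                  .filter(logs.status >= 500)
                  .groupBy('date')
                  .count())

errors_per_day.write.parquet(
    's3://analytics-bucket/reports/errors_per_day/',
    mode='overwrite',
    compression='snappy',
)
```

On EMR the s3:// scheme is backed by EMRFS; on a standalone Hadoop installation you would typically point the same code at the s3a:// connector instead.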