Processing Data in the Cloud

cadium828
Jul 19

When it comes to data processing, organizations face a critical decision when building their cloud architecture: batch or streaming? AWS offers a platform for each option: Amazon EMR (Elastic MapReduce) for batch processing and Amazon Kinesis for real-time data streaming.

While both Amazon EMR and Kinesis process data at scale, they approach the problem with different philosophies and mechanisms. In today’s article, we will cover both technologies to help you decide which service best suits your data processing needs, or whether you need both for a comprehensive data strategy.

Amazon EMR

What is it?

Amazon EMR (Elastic MapReduce) is Amazon’s managed big data platform that simplifies running frameworks such as Hadoop, Apache Spark, and Presto at scale. EMR handles the heavy lifting of provisioning, configuring, and tuning clusters of EC2 instances to process and analyze vast amounts of data.

How does it work?

Cluster Creation: You define a cluster of EC2 instances, specifying instance types, sizes, and counts.
Job Submission: Submit batch processing jobs using frameworks like Spark or Hadoop MapReduce.
Processing: The cluster processes data in parallel across all nodes.
Results Storage: Results are typically stored in S3 or other persistent storage.
Cluster Termination: Once processing is complete, the cluster can be terminated to save costs.

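EMR follows a primary/worker architecture in which a primary (master) node coordinates tasks across multiple core and task nodes. Core nodes run tasks and store data, while task nodes only run tasks. The service integrates with S3, which provides cost-effective storage that outlives the cluster.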
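To make the steps above a little more concrete, here is a rough sketch of what a transient cluster definition could look like as a Python dictionary. The key names follow the request shape accepted by boto3's EMR `run_job_flow` call, but the function name, release label, instance type, and S3 paths are assumptions chosen for illustration, not recommendations.

```python
def build_cluster_request(name, core_count, script_s3_uri, log_uri):
    """Build a request dict for a transient EMR cluster that runs one Spark step.

    The release label, instance type, and default role names are assumptions;
    adjust them to whatever your account and region actually use.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release
        "Applications": [{"Name": "Spark"}],
        "LogUri": log_uri,
        "Instances": {
            "InstanceGroups": [
                # One primary node coordinates the cluster.
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                # Core nodes run tasks and hold data.
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": core_count},
            ],
            # False means the cluster shuts down once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "spark-batch-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# Hypothetical bucket and script names, for illustration only.
request = build_cluster_request(
    name="nightly-batch",
    core_count=4,
    script_s3_uri="s3://example-bucket/jobs/transform.py",
    log_uri="s3://example-bucket/emr-logs/",
)
# With boto3 installed and credentials configured, this could be submitted with:
#   boto3.client("emr").run_job_flow(**request)
```

Setting `KeepJobFlowAliveWhenNoSteps` to `False` is what turns this into a run-and-terminate cluster rather than a long-lived one.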

Use Cases

Data Transformation: Converting large datasets from one format to another
Machine Learning: Training models on large historical datasets
Log Analysis: Processing server logs to extract insights
ETL Operations: Extracting, transforming, and loading data into a data warehouse
Genomics: Processing genome sequences

Example

A retail company needs to analyze its five years of historical sales data (over 500TB) to identify seasonal buying patterns. Using EMR, they:

Create a cluster with a master node and 20 core nodes
Submit a Spark job that reads the data from S3 and aggregates it by season, product category, and region
The job runs for approximately 2 hours, analyzing billions of transactions
Results are written back to S3 and later loaded into Amazon Redshift for business intelligence tools
Once the job completes, the EMR cluster terminates automatically, so the company does not keep paying for 21 EC2 instances that would otherwise run until someone shut them down manually.
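The aggregation at the heart of a job like this can be pictured in miniature with plain Python. This is only a toy, single-machine sketch: a real Spark job would express the same group-by and run it in parallel across the cluster. The record fields and the month-to-season mapping (northern-hemisphere seasons) are assumptions made for the example.

```python
from collections import defaultdict

# Assumed mapping from calendar month to season (northern hemisphere).
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def aggregate_sales(transactions):
    """Sum sale amounts per (season, category, region) key."""
    totals = defaultdict(float)
    for tx in transactions:
        month = int(tx["date"][5:7])  # dates are assumed to be "YYYY-MM-DD"
        key = (SEASON_BY_MONTH[month], tx["category"], tx["region"])
        totals[key] += tx["amount"]
    return dict(totals)


# Hypothetical sample records, for illustration only.
sample = [
    {"date": "2023-12-15", "category": "toys", "region": "west", "amount": 100.0},
    {"date": "2024-01-03", "category": "toys", "region": "west", "amount": 50.0},
    {"date": "2024-07-20", "category": "garden", "region": "east", "amount": 25.0},
]
seasonal_totals = aggregate_sales(sample)
```

Because each key can be summed independently, this kind of work splits cleanly across many nodes, which is why it suits a batch cluster.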

The company runs this analysis monthly, paying only for the compute time used during each job.
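A quick back-of-the-envelope calculation shows why the transient cluster matters. The sketch below only counts instance-hours using the figures from the example (21 instances, a 2-hour job, once a month). It deliberately leaves out prices, cluster start-up time, and storage costs, and the 30-day month is an assumption.

```python
def instance_hours(instances, hours_per_run, runs_per_month):
    """Instance-hours consumed by a cluster that only exists while jobs run."""
    return instances * hours_per_run * runs_per_month


INSTANCES = 21        # 1 primary node + 20 core nodes, from the example
HOURS_PER_RUN = 2     # approximate job duration, from the example
DAYS_PER_MONTH = 30   # assumed month length

transient = instance_hours(INSTANCES, HOURS_PER_RUN, runs_per_month=1)
always_on = INSTANCES * 24 * DAYS_PER_MONTH

print(transient)              # 42 instance-hours per month
print(always_on)              # 15120 instance-hours per month
print(always_on / transient)  # 360.0
```

Under these assumptions, leaving the same cluster running all month would consume 360 times as many instance-hours as terminating it after each run.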

Amazon Kinesis

What It Is

Amazon Kinesis is a platform for collecting, processing, and analyzing real-time streaming data. It provides the infrastructure to ingest and process data continuously as it’s generated, enabling real-time analytics and responsive applications.

How It Works

Kinesis consists of several components:

Kinesis Data Streams: Collects and stores data streams for processing. Data is organized into shards, each providing 1MB/sec input and 2MB/sec output capacity.
Kinesis Data Firehose: Automatically loads streaming data into destinations like S3, Redshift, or Amazon OpenSearch Service (formerly Amazon Elasticsearch Service).
Kinesis Data Analytics: Processes streams using SQL or Apache Flink.
Kinesis Video Streams: Captures, processes, and stores video streams.
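The per-shard limits for Kinesis Data Streams make it possible to estimate how many shards a stream needs. The sketch below uses the 1 MB/sec input and 2 MB/sec output figures quoted above, plus a per-shard write limit of 1,000 records per second. It assumes all consumers share each shard's read throughput, and the function name and workload numbers are made up for illustration.

```python
import math

WRITE_MB_PER_SHARD = 1.0        # input capacity per shard
READ_MB_PER_SHARD = 2.0         # output capacity per shard
WRITE_RECORDS_PER_SHARD = 1000  # records per second per shard


def shards_needed(records_per_sec, avg_record_kb, consumers=1):
    """Estimate the shard count from write volume, read volume, and record rate."""
    write_mb = records_per_sec * avg_record_kb / 1024
    read_mb = write_mb * consumers  # every consumer reads the full stream
    return max(
        math.ceil(write_mb / WRITE_MB_PER_SHARD),
        math.ceil(read_mb / READ_MB_PER_SHARD),
        math.ceil(records_per_sec / WRITE_RECORDS_PER_SHARD),
        1,
    )


# Hypothetical workload: 3,000 records/sec at about 2 KB each.
print(shards_needed(3000, 2))               # 6, limited by write volume
print(shards_needed(3000, 2, consumers=3))  # 9, limited by shared read volume
```

Treat the result as a starting estimate only. Uneven partition keys can overload one shard while others sit idle, so real streams often need some headroom.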

Data producers (applications, IoT devices, logs) send records to Kinesis, which keeps them in order within each shard. Consumer applications can then read and process these records in real time.
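Ordering is guaranteed only within a shard, so which shard a record lands in matters. Kinesis picks the shard by taking an MD5 hash of the record's partition key and matching it against each shard's hash key range. The toy router below imitates that idea locally, assuming the 128-bit hash space is split evenly across shards. The function names and sample records are hypothetical.

```python
import hashlib


def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index via its 128-bit MD5 hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = 2**128 // shard_count  # assumes evenly split hash key ranges
    return min(hash_value // range_size, shard_count - 1)


def route(records, shard_count):
    """Append each (key, payload) record to its shard, preserving arrival order."""
    shards = {i: [] for i in range(shard_count)}
    for key, payload in records:
        shards[shard_for_key(key, shard_count)].append((key, payload))
    return shards


# Hypothetical records keyed by card ID.
records = [
    ("card-1", "swipe"), ("card-2", "swipe"),
    ("card-1", "refund"), ("card-3", "swipe"), ("card-1", "swipe-again"),
]
shards = route(records, shard_count=4)
```

Every record for `card-1` ends up in the same shard in the order it was sent, so a consumer sees that card's events in sequence even though the stream as a whole has no global order.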

Use Cases

Real-time Analytics: Processing metrics and events as they occur
Log and Event Data Collection: Centralizing and processing logs from distributed systems
IoT Data Processing: Handling telemetry from connected devices
App Monitoring: Tracking user interactions and system health
Stock Trading Analysis: Processing market data feeds

Example

A financial services company needs to detect potential fraud in credit card transactions as they occur. Using Kinesis:

Their transaction processing system sends each transaction to Kinesis Data Streams
A Kinesis Data Analytics application applies a machine learning model to each transaction in real time
Transactions with high fraud scores trigger alerts to their security team within seconds
All transaction data is simultaneously archived to S3 via Kinesis Data Firehose for later batch analysis
The system processes approximately 5,000 transactions per second with sub-second latency

This real-time processing allows the company to detect and block fraudulent transactions before they are completed, saving millions in potential fraud losses.
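The consumer side of a pipeline like this boils down to a loop that scores each record as it arrives and raises an alert when the score crosses a threshold. The sketch below runs over an in-memory list instead of a live stream, and its scoring rule is a deliberately crude stand-in for a real machine learning model. The field names, weights, and threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    card_id: str
    amount: float
    country: str
    home_country: str


def fraud_score(tx):
    """Stand-in scoring rule; a real system would call a trained model here."""
    score = 0.0
    if tx.amount > 1000:
        score += 0.5
    if tx.country != tx.home_country:
        score += 0.4
    return score


def process_stream(transactions, threshold=0.8):
    """Score records one at a time, collecting alerts and an archive copy."""
    alerts, archive = [], []
    for tx in transactions:
        archive.append(tx)  # every record is kept for later batch analysis
        if fraud_score(tx) >= threshold:
            alerts.append(tx)
    return alerts, archive


# Hypothetical transactions, for illustration only.
incoming = [
    Transaction("card-1", 25.0, "US", "US"),
    Transaction("card-2", 4200.0, "BR", "US"),
    Transaction("card-3", 1500.0, "US", "US"),
]
alerts, archive = process_stream(incoming)
```

Only the large foreign purchase on `card-2` crosses the threshold here, while all three records are still archived.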

Conclusion

As you can see, both of these AWS services process data at scale, but they take completely different approaches. If you need to process large batches of data, I would consider Amazon EMR; if you need to stream, process, and analyze data in near real time, you should look at Amazon Kinesis.

Stay curious and keep tinkering! I’ll be back with more tech insights in our next deep dive. I look forward to seeing you in the next one!
