Amazon EMR
Amazon EMR is the industry-leading cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. Amazon EMR makes it easy to set up, operate, and scale your big data environments by automating time-consuming tasks like provisioning capacity and tuning clusters. With EMR you can run petabyte-scale analysis at less than half the cost of traditional on-premises solutions and over 3x faster than standard Apache Spark. You can run workloads on Amazon EC2 instances, on Amazon Elastic Kubernetes Service (EKS) clusters, or on premises using EMR on AWS Outposts.
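As a concrete illustration of the EC2 deployment option, the sketch below provisions a small transient Spark cluster with boto3's run_job_flow call. The cluster name, release label, instance types, log bucket, and IAM role names are placeholder assumptions, not values taken from this page.

```python
import boto3

# Minimal sketch: launch a transient EMR cluster on EC2 that runs Spark,
# then terminates itself when its work is done. Region, release label,
# instance types, log bucket, and role names are placeholder assumptions.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="example-spark-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate once all steps finish
    },
    LogUri="s3://my-emr-logs/",
    JobFlowRole="EMR_EC2_DefaultRole",  # default EC2 instance profile
    ServiceRole="EMR_DefaultRole",      # default EMR service role
)

print("Cluster ID:", response["JobFlowId"])
```

The same Spark workload could instead target an EKS cluster through the separate `emr-containers` API, which is what the EKS deployment option above refers to.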
Backlinks
- AWS overview
  - Amazon Athena
  - Amazon Elasticsearch Service
  - Amazon EMR
  - Amazon FinSpace
  - Amazon Kinesis
  - Amazon Kinesis Data Firehose
  - Amazon Kinesis Data Analytics
  - Amazon Kinesis Data Streams
  - Amazon Kinesis Video Streams
  - Amazon Redshift
  - Amazon QuickSight
  - AWS Data Exchange
  - AWS Data Pipeline
  - AWS Glue
  - AWS Lake Formation
  - Amazon Managed Streaming for Apache Kafka
- Amazon Athena
- AWS Data Pipeline
  AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals. With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon EMR.
  AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available. You don’t have to worry about ensuring resource availability, managing inter-task dependencies, retrying transient failures or timeouts in individual tasks, or creating a failure notification system. AWS Data Pipeline also allows you to move and process data that was previously locked up in on-premises data silos.
 
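To make the Data Pipeline excerpt above concrete, here is a minimal sketch of the create, define, and activate flow via boto3. The pipeline only runs a placeholder shell command; all names, the log bucket, and the role values are illustrative assumptions rather than anything from this page.

```python
import boto3

# Minimal sketch: define and activate a simple on-demand pipeline.
dp = boto3.client("datapipeline", region_name="us-east-1")

pipeline = dp.create_pipeline(name="example-pipeline", uniqueId="example-pipeline-001")
pipeline_id = pipeline["pipelineId"]

dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {
            "id": "Default",
            "name": "Default",
            "fields": [
                {"key": "scheduleType", "stringValue": "ONDEMAND"},
                {"key": "failureAndRerunMode", "stringValue": "CASCADE"},
                {"key": "role", "stringValue": "DataPipelineDefaultRole"},
                {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
                {"key": "pipelineLogUri", "stringValue": "s3://my-pipeline-logs/"},
            ],
        },
        {
            # A ShellCommandActivity stands in for a real copy or transform step.
            "id": "EchoActivity",
            "name": "EchoActivity",
            "fields": [
                {"key": "type", "stringValue": "ShellCommandActivity"},
                {"key": "command", "stringValue": "echo hello"},
                {"key": "runsOn", "refValue": "Ec2Resource"},
            ],
        },
        {
            "id": "Ec2Resource",
            "name": "Ec2Resource",
            "fields": [
                {"key": "type", "stringValue": "Ec2Resource"},
                {"key": "instanceType", "stringValue": "t3.micro"},
                {"key": "terminateAfter", "stringValue": "30 Minutes"},
            ],
        },
    ],
)

dp.activate_pipeline(pipelineId=pipeline_id)
```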
- AWS Lake Formation
  AWS Lake Formation is a service that makes it easy to set up a secure data lake in days. A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. A data lake enables you to break down data silos and combine different types of analytics to gain insights and guide better business decisions.
  Setting up and managing data lakes today involves a lot of manual, complicated, and time-consuming tasks: loading data from diverse sources, monitoring those data flows, setting up partitions, turning on encryption and managing keys, defining transformation jobs and monitoring their operation, re-organizing data into a columnar format, configuring access control settings, deduplicating redundant data, matching linked records, granting access to data sets, and auditing access over time.
  Creating a data lake with Lake Formation is as simple as defining where your data resides and what data access and security policies you want to apply. Lake Formation then collects and catalogs data from databases and object storage, moves the data into your new Amazon S3 data lake, cleans and classifies data using machine learning algorithms, and secures access to your sensitive data. Your users can then access a centralized catalog of data which describes available data sets and their appropriate usage. Your users then leverage these data sets with their choice of analytics and machine learning services, like Amazon EMR for Apache Spark, Amazon Redshift, Amazon Athena, Amazon SageMaker, and Amazon QuickSight.
 
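The Lake Formation excerpt above describes a register-then-grant flow. The sketch below shows a minimal version of it with boto3: register an S3 location as part of the data lake, then grant a principal SELECT on a Glue Data Catalog table so it can be queried from services such as Athena or EMR Spark. The bucket, database, table, and IAM role ARN are illustrative assumptions.

```python
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# 1. Register the S3 location that will hold the data lake
#    (placeholder bucket name).
lf.register_resource(
    ResourceArn="arn:aws:s3:::my-data-lake-bucket",
    UseServiceLinkedRole=True,
)

# 2. Grant an analyst role SELECT access to one catalog table
#    (placeholder database, table, and role ARN).
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"},
    Resource={
        "Table": {
            "DatabaseName": "sales",
            "Name": "orders",
        }
    },
    Permissions=["SELECT"],
)
```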