A Big Data Hadoop and Spark project for absolute beginners
Contents
The following are the takeaways from the course:
- What is the project all about: it involves storing data in Hadoop (HDFS) and then using Spark to cleanse it (a PySpark sketch follows this list)
- The 5 V's of Big Data:
  - Volume
  - Variety
  - Velocity
  - Veracity
  - Value
- 1024 TB = 1 Petabyte
- 1024 PB = 1 Exabyte
- Types of Data: Structured, Unstructured and Semi-Structured
- Hadoop 1.0 - resource management was part of MapReduce itself, handled by the Job Tracker
- Hadoop 2.0 - YARN takes care of both resource management and job scheduling
- YARN sits on HDFS and Map Reduce works with YARN
- Spark can use YARN on HDFS
- YARN runs two daemon types: the Resource Manager and the Node Managers
- Pig and Hive run on Map Reduce which in turn sits on YARN which in turn sits on HDFS
- Cost of a 3-node cluster: roughly $3 per day, i.e. about $100 per month
- Hive is not a database - it points to data stored in HDFS and only stores the metadata
- Storing data in HDFS
- Spark is a unified analytics engine for large-scale data processing
- Installing Spark on Colab
- You need to install Java and Spark yourself every time you start a new Spark environment (see the Colab setup cell below)
- Learnt about AWS Glue components - Crawlers, Jobs and Triggers (a boto3 sketch follows the list)
- Learnt about AWS Athena
- Spent about 3 hrs replicating the tasks shown in the lecture
- Learnt that there is an R and Python interface for Athena (a Python example is sketched below)
- Managed to get the data munging job done via AWS Glue
- User Defined functions
- Joins
- Using AWS Lambda triggers - how to launch an AWS Glue job from a Lambda function (a handler sketch is included below)
- A simple use case, but I have learnt the entire workflow end to end
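
A minimal PySpark sketch of the cleansing flow described in the notes above; the file paths, column names and the `clean_name` helper are assumptions for illustration, not the course's actual dataset:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("data-cleansing-demo").getOrCreate()

# Read raw data previously stored in HDFS (paths are assumptions).
orders = spark.read.option("header", True).csv("hdfs:///data/raw/orders.csv")
customers = spark.read.option("header", True).csv("hdfs:///data/raw/customers.csv")

# User Defined Function: normalise a free-text column (illustrative only).
@F.udf(returnType=StringType())
def clean_name(value):
    return value.strip().title() if value else None

orders_clean = (
    orders
    .dropDuplicates()                  # basic cleansing
    .na.drop(subset=["customer_id"])   # drop rows missing the join key
    .withColumn("customer_name", clean_name(F.col("customer_name")))
)

# Join the cleansed orders back to the customer table.
result = orders_clean.join(customers, on="customer_id", how="inner")

# Write the cleansed output back to HDFS in a columnar format.
result.write.mode("overwrite").parquet("hdfs:///data/clean/orders")
```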
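
A typical Colab setup cell for the "install Java and Spark yourself every time" point; recent Colab images already ship a JDK, so this sketch assumes only `pyspark` needs installing:

```python
# Run inside a Colab cell. If Java is missing, install it first, e.g.:
#   !apt-get install -y openjdk-11-jdk-headless
!pip install -q pyspark

from pyspark.sql import SparkSession

# A local[*] session is enough for the Colab exercises.
spark = SparkSession.builder.master("local[*]").appName("colab-spark").getOrCreate()
print(spark.version)
```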
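
The Glue components (Crawlers, Jobs, Triggers) can also be driven from Python with boto3. A hedged sketch, assuming a crawler called `raw-data-crawler` and a job called `data-munging-job` already exist in the account:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

# Crawler: scans the source data and writes table metadata to the Glue Data Catalog.
glue.start_crawler(Name="raw-data-crawler")

# Job: the Spark script that does the actual data munging.
run = glue.start_job_run(JobName="data-munging-job")
print(run["JobRunId"])

# Trigger: here an on-demand trigger that points at the job.
glue.create_trigger(
    Name="run-munging-on-demand",
    Type="ON_DEMAND",
    Actions=[{"JobName": "data-munging-job"}],
)
```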
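
For Athena, one commonly used Python interface is `pyathena` (on the R side, packages such as `noctua` serve the same purpose). A minimal query sketch; the staging bucket, region and table name are assumptions:

```python
# pip install pyathena
from pyathena import connect

conn = connect(
    s3_staging_dir="s3://my-athena-query-results/",  # bucket is an assumption
    region_name="us-east-1",
)

cursor = conn.cursor()
cursor.execute(
    "SELECT customer_id, COUNT(*) AS orders "
    "FROM orders_clean GROUP BY customer_id LIMIT 10"
)
for row in cursor.fetchall():
    print(row)
```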
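
Finally, a hedged sketch of the Lambda-to-Glue workflow: an S3-triggered Lambda handler that starts the Glue job. The job name and the `--input_path` argument are assumptions for illustration:

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Triggered by an S3 ObjectCreated event; pull out the new object's location.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # Kick off the Glue job, passing the new file location as a job argument.
    response = glue.start_job_run(
        JobName="data-munging-job",
        Arguments={"--input_path": f"s3://{bucket}/{key}"},
    )
    return {"statusCode": 200, "jobRunId": response["JobRunId"]}
```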