Hadoop Training – Course Content
Overview:
Apache Hadoop is open source data management software that helps organizations analyze huge volumes of structured and unstructured data, and it is a hot topic across the tech industry. Through technical sessions and hands-on labs, students can quickly learn to take advantage of the MapReduce framework.
Hadoop Training Objectives:
This course covers the basic concepts of MapReduce applications developed using Hadoop, including a close look at the framework components, the use of Hadoop for a variety of data analysis tasks, and numerous examples of Hadoop in action. It also examines related technologies such as Hive, Pig, and Apache Accumulo.
Target Students / Prerequisites:
Students should have an IT background and be familiar with Java and Linux concepts.
Introduction: The Motivation for Hadoop
- Problems with traditional large-scale systems
- Requirements for a new approach
Hadoop Basic Concepts
- An Overview of Hadoop
- The Hadoop Distributed File System
- Hands-On Exercise
- How MapReduce Works
- Hands-On Exercise
- Anatomy of a Hadoop Cluster
- Other Hadoop Ecosystem Components
Writing a MapReduce Program
- Examining a Sample MapReduce Program, with Several Examples
- Basic API Concepts
- The Driver Code
- The Mapper
- The Reducer
- Hadoop’s Streaming API
Delving Deeper Into The Hadoop API
- More About ToolRunner
- Testing with MRUnit
- Reducing Intermediate Data With Combiners
- The configure and close methods for Map/Reduce Setup and Teardown
- Writing Partitioners for Better Load Balancing
- Hands-On Exercise
- Directly Accessing HDFS
- Using the Distributed Cache
- Hands-On Exercise
Performing Several Hadoop Jobs
- The configure and close Methods
- Sequence Files
- Record Reader
- Record Writer
- Role of Reporter
- Output Collector
- Processing video files and audio files
- Processing image files
- Processing XML files
- Counters
- Directly Accessing HDFS
- ToolRunner
- Using The Distributed Cache
Common MapReduce Algorithms
- Sorting and Searching
- Indexing
- Classification/Machine Learning
- Term Frequency – Inverse Document Frequency
- Word Co-Occurrence
- Hands-On Exercise: Creating an Inverted Index
- Identity Mapper
- Identity Reducer
- Exploring well known problems using MapReduce applications
Using HBase
- What is HBase?
- HBase API
- Managing large data sets with HBase
- Using HBase in Hadoop applications
- Hands-On Exercise
Using Hive and Pig
- Hive Basics
- Pig Basics
- Hands-On Exercise
Practical Development Tips and Techniques
- Debugging MapReduce Code
- Using LocalJobRunner Mode for Easier Debugging
- Retrieving Job Information with Counters
- Logging
- Splittable File Formats
- Determining the Optimal Number of Reducers
- Map-Only MapReduce Jobs
- Hands-On Exercise
Debugging MapReduce Programs
- Testing with MRUnit
- Logging
- Classification/Machine Learning
Advanced MapReduce Programming
- A Recap of the MapReduce Flow
- The Secondary Sort
- Customized InputFormats and OutputFormats
- Pipelining Jobs With Oozie
- Map-Side Joins
- Reduce-Side Joins
Joining Data Sets in MapReduce
- Map-Side Joins
- The Secondary Sort
- Reduce-Side Joins
Monitoring and Debugging on a Production Cluster
- Counters
- Skipping Bad Records
- Rerunning failed tasks with IsolationRunner
Tuning for Performance in MapReduce
- Reducing network traffic with a combiner
- Partitioners
- Reducing the amount of input data
- Using Compression
- Reusing the JVM
- Running with speculative execution
- Refactoring code and rewriting algorithms
- Parameters affecting performance
- Other Performance Aspects