Introduction to Big Data (Hadoop)

Introduction to Big Data and Hadoop:

Note:

These are study notes I made for myself for career development while learning Big Data.

Overview:

  1. Understand the concepts of Big Data.

  2. Explain Hadoop and how it addresses Big Data challenges.

  3. Describe the components of the Hadoop ecosystem.


Traditional Decision-Making Process:

  1. Decisions are based on the decision-maker's perception of the task at hand.

  2. Experience and intuition play a major role in the traditional decision-making process.

  3. Decisions are made based on past experiences and personal instincts.

  4. Decisions are made based on preconceived guidelines rather than facts.


Challenges of Traditional Decision-Making Process:

  1. Takes a long time to arrive at a decision, therefore losing the competitive advantage.

  2. Requires human intervention at various stages.

  3. Lacks systematic linkage among strategy, planning, execution, and reporting.

  4. Provides only a limited scope of data analytics, that is, only a bird's-eye view.

  5. Obstructs the company’s ability to make a fully informed decision.


Big Data Analytics - Solution for Challenges in Traditional Decision-Making Process:

  1. Decision-making is based on facts derived from data through analytics.

  2. It provides a comprehensive view of the overall picture by analyzing data from various sources.

  3. It provides streamlined decision-making from top to bottom.

  4. Big data analytics helps in analyzing unstructured data.

  5. It enables faster decision-making, thus improving competitive advantage and saving time and energy.


Case Study: Google’s Self-Driving Car:

  1. Technical Data (Inside Out) → Data that comes from the car's sensors, used to learn to avoid obstacles such as cones, cyclists, etc.

  2. Community Data (Outside In) → Crowd-sourced data such as traffic, driving conditions, etc.

  3. Personal Data → Drivers' personal preferences regarding driving locations, in-car entertainment, etc.


Big Data:

  • Data that contains greater variety, arriving in increasing volumes and with more velocity.

  • Put simply, Big Data is larger, more complex data sets, especially from new data sources. These data sets are so voluminous that traditional data processing software just can’t manage them.

  • But these data sets can be used to address business problems that couldn’t be tackled before.


Five V’s of Big Data:

  1. Volume:

    • In Big Data, high volumes of low-density, unstructured data have to be processed.

    • This can be data of unknown value, such as Twitter data feeds, clickstreams on a web page or a mobile app.

    • These can be tens of terabytes of data. In some other cases, these can be hundreds of petabytes.

  2. Value:

    • Value of Big Data usually comes from insight discovery and pattern recognition that lead to more effective operations, stronger customer relationships and other clear and quantifiable business benefits.

  3. Variety:

    • Variety refers to many types of data that are available.

    • In Big Data, data comes in new unstructured data types.

    • Unstructured and semi-structured data types, such as text, audio, and video, require additional processing to derive meaning and support metadata.

  4. Velocity:

    • The fast rate at which data is received and acted on.

    • Normally, the highest-velocity data is streamed directly into memory rather than being written to disk.

  5. Veracity:

    • The truth or accuracy of data and information assets often determines executive-level confidence.

Big Data Analytics Pipeline:

  1. Data Ingestion Layer

    • The first step is to ingest or capture the data coming from various sources.

    • Data is prioritized and categorized here, which makes it flow more easily through the subsequent layers.

  2. Data Collector Layer:

    • The focus is on the transportation of data from the ingestion layer to the rest of the data pipeline.

    • In this layer, components are decoupled, so that analytic capabilities may begin.

  3. Data Processing Layer:

    • The focus is to build the data pipeline processing system.

    • It routes data to different destinations, classifies the data flow, and is the first point where analytics may take place.

  4. Data Storage Layer:

    • Focuses on where to store large datasets efficiently.

  5. Data Query Layer:

    • Active analytical processing takes place.

    • The focus is on extracting value from the data so that it is more useful for the next layer.

  6. Data Visualization Layer:

    • Focuses on visualizing the processed data to make it more understandable.

Types of Data:

  1. Structured Data:

    • Data that has a defined data model, format, and structure, such as data in database tables.

  2. Semi-Structured Data:

    • Textual data files with an apparent pattern that enables analysis, such as spreadsheets, JSON, and XML files.

  3. Quasi-Structured Data:

    • Textual data with erratic formats, such as clickstream data, which can be formatted with effort and the right software tools.

  4. Unstructured Data:

    • Data that has no inherent structure and is usually stored as different types of files, such as PDFs, images, weblogs, etc.

Distributed System:

  1. It is a model in which components located on networked computers communicate and coordinate their actions by passing messages.

  2. It helps to process big data in a massively parallel way in much less time.


Hadoop:

  1. Hadoop is a framework that allows distributed processing of large datasets across clusters of commodity computers using simple programming models.

  2. Characteristics of Hadoop:

    • Economical → Can use ordinary computers for data processing.

    • Reliable → Stores copies of data on different machines and is resistant to hardware failure.

    • Scalable → Can scale both horizontally and vertically.

    • Flexible → Can store huge amounts of data (both structured and unstructured) and decide how to use it later.

  3. Difference: Traditional Storage Systems Vs Hadoop:

  4. Hadoop Core Components:

    • The core components are HDFS (storage), YARN (resource management), and MapReduce (processing); they are closely coupled.

  5. Components of Big Data in terms of Hadoop and Data Visualization:

  6. Explanation of the Hadoop Ecosystem Components:

    1. Hive, Pig, Sqoop, and Oozie all need the MapReduce engine to work, so they are called abstractions of MapReduce. If the MapReduce process is killed, all of these processes also stop working. (A sketch of what a raw MapReduce job submission looks like follows this list.)

  7. Five Daemons of Hadoop and Configuration Files:

    1. NameNode → core-site.xml

    2. DataNode → workers file

    3. NodeManager → workers file and yarn-site.xml

    4. ResourceManager → yarn-site.xml

    5. Secondary NameNode

  8. Commercial Hadoop Distributions:

    • The Apache Hadoop project does not provide support if anything goes wrong with its distribution of Hadoop, so many companies offer Hadoop as a service and provide commercial support.
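
As an illustration of what these abstraction layers produce under the hood, below is a minimal sketch of submitting a raw MapReduce job via Hadoop Streaming. It assumes a running cluster such as the single-node setup described in the next section, a Hadoop 3.3.1 install at the path used there, and a purely illustrative HDFS input directory (/user/$USER/books):

     cd ~/Hadoop/hadoop-3.3.1

     # Submit a MapReduce job whose mapper and reducer are ordinary executables.
     # Higher-level tools such as Hive, Pig, and Sqoop ultimately submit
     # MapReduce jobs to this same engine instead of you writing them by hand.
     bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-3.3.1.jar \
         -input /user/$USER/books \
         -output /user/$USER/streaming-out \
         -mapper /bin/cat \
         -reducer /usr/bin/wc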


Hadoop Single Node Installation:

  1. Download Java 8 from Oracle and Hadoop from the Apache Hadoop website. [Note: Download tar.gz files]

  2. Extract both archives and place them under a Hadoop folder in the home directory (matching the paths used below).

  3. Open the ~/.bashrc file and add the environment variables below.

     export JAVA_HOME=/home/<your_username>/Hadoop/jdk1.8.0_311
     export HADOOP_HOME=/home/<your_username>/Hadoop/hadoop-3.3.1
     export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH   # this line should always come last
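
     After saving, reload the shell and confirm both tools are on the PATH (a quick check, assuming the versions and paths above):

     source ~/.bashrc
     java -version
     hadoop version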
    
  4. Modify the configuration files below (add these snippets between the <configuration> tags):

    • etc/hadoop/core-site.xml

        <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:50000</value>
        </property>
      
    • etc/hadoop/yarn-site.xml

        <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
        </property>
        <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
        <description>The hostname of the RM.</description>
        <name>yarn.resourcemanager.hostname</name>
        <value>localhost</value>
        </property>
        <property>
        <description>The address of the applications manager interface in the RM.</description>
        <name>yarn.resourcemanager.address</name>
        <value>localhost:8032</value>
        </property>
      
    • etc/hadoop/hdfs-site.xml

        <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/<your_username>/Hadoop/hadoop2-dir/namenode-dir</value>
        </property>
        <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/<your_username>/Hadoop/hadoop2-dir/datanode-dir</value>
        </property>
      
    • etc/hadoop/mapred-site.xml

        <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
        </property>
      
    • etc/hadoop/workers

        localhost
      
    • etc/hadoop/hadoop-env.sh

        export JAVA_HOME=/home/<your_username>/Hadoop/jdk1.8.0_311
      
    • etc/hadoop/mapred-env.sh

        export JAVA_HOME=/home/<your_username>/Hadoop/jdk1.8.0_311
      
    • etc/hadoop/yarn-env.sh

        export JAVA_HOME=/home/<your_username>/Hadoop/jdk1.8.0_311
      
  5. Install SSH and configure it for passwordless login: [Note: Never install pdsh]

     sudo apt-get install openssh-server
    
     ssh-keygen -t rsa
    
     cd .ssh
    
     cat id_rsa.pub >> authorized_keys
    
  6. Test SSH by logging into localhost:

     ssh localhost
    
  7. Format Namenode: [After formatting, namenode-dir will be created]

     cd ~/Hadoop/hadoop-3.3.1
    
     bin/hdfs namenode -format
    
  8. Start the services: [All five daemons will start. datanode-dir will be created]

     cd ~/Hadoop/hadoop-3.3.1/sbin
    
     ./start-all.sh
    
  9. Check if all the services are running:

     jps
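
     If everything started correctly, jps should report the five daemons (plus Jps itself); the exact process IDs will differ:

     NameNode
     DataNode
     SecondaryNameNode
     ResourceManager
     NodeManager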
    
  10. Check the Web UIs in a browser. With Hadoop 3.x defaults, the NameNode UI is at http://localhost:9870 and the YARN ResourceManager UI is at http://localhost:8088. (A command-line smoke test is sketched after step 11.)

  11. Stop the services: [All five daemons will stop]

    cd ~/Hadoop/hadoop-3.3.1/sbin
    
    ./stop-all.sh
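
Optional smoke test: run this while the daemons from step 8 are still up (i.e., before step 11). This is a minimal sketch assuming the single-node setup above; the example jar path matches a stock Hadoop 3.3.1 tarball, and the HDFS paths are purely illustrative:

    cd ~/Hadoop/hadoop-3.3.1

    # create a home directory in HDFS and upload a small local file
    bin/hdfs dfs -mkdir -p /user/$USER
    bin/hdfs dfs -put etc/hadoop/core-site.xml /user/$USER/

    # list the uploaded file to confirm HDFS is working
    bin/hdfs dfs -ls /user/$USER

    # run the bundled word-count example on it and print the result
    bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar \
        wordcount /user/$USER/core-site.xml /user/$USER/wc-out
    bin/hdfs dfs -cat /user/$USER/wc-out/part-r-00000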