How to Set up Apache Hadoop on Rocky Linux

Apache Hadoop is an open-source distributed storage and processing framework that can handle large data sets across clusters of computers. It has become a popular choice for organizations looking to process and analyze massive amounts of data. In this guide, we’ll show you how to set up Apache Hadoop on Rocky Linux, a community-supported enterprise operating system.

Prerequisites

Before we begin, make sure you have the following:

A Rocky Linux server
Root or sudo access

Update Your System

First, update your system to the latest available packages:

sudo dnf update -y

Install Java Development Kit (JDK)

Hadoop requires Java to function properly, so we’ll install JDK using the following command:

sudo dnf install java-11-openjdk-devel -y

Verify the installation by checking the Java version:

java -version

Create a Hadoop User on Rocky Linux

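Create a new user for Hadoop. On Rocky Linux, useradd also creates a matching hadoop group by default, so the group does not need to be passed with -G (doing so fails when the group does not exist yet):

sudo useradd -m -s /bin/bash hadoop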

Set a password for the Hadoop user:

sudo passwd hadoop

Install Apache Hadoop on Rocky Linux

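Download Hadoop from the official Apache website. This guide uses version 3.3.1; if a newer release is available, adjust the version number in the commands below. Releases that are no longer current are served from archive.apache.org rather than downloads.apache.org:

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz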
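
Before extracting, it can be worth checking the archive against its published checksum. Apache publishes a .sha512 file alongside each release tarball. The sketch below assumes you have copied the expected digest from that file by hand; the helper name verify_sha512 is hypothetical and not part of Hadoop:

```shell
# verify_sha512 FILE EXPECTED_HEX
# Succeeds only if FILE's SHA-512 digest equals EXPECTED_HEX (case-insensitive).
verify_sha512() {
  file=$1
  # Normalize the expected digest to lowercase, which is what sha512sum prints.
  expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
  actual=$(sha512sum "$file" | awk '{print $1}')
  [ "$actual" = "$expected" ]
}
```

For example, verify_sha512 hadoop-3.3.1.tar.gz "<digest from the .sha512 file>" && echo "checksum OK" prints the message only when the two digests match.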

Extract the downloaded archive:

tar xzf hadoop-3.3.1.tar.gz

Move the extracted files to the /opt/hadoop directory:

sudo mv hadoop-3.3.1 /opt/hadoop

Change the ownership of the /opt/hadoop directory to the Hadoop user:

sudo chown -R hadoop:hadoop /opt/hadoop

Configure Hadoop Environment

Switch to the Hadoop user:

su - hadoop

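Add the following lines to the ~/.bashrc file. Hadoop's scripts expect JAVA_HOME to be set, so the first line derives it from the installed java binary:

export JAVA_HOME=$(dirname "$(dirname "$(readlink -f "$(command -v java)")")")
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

If the daemons later report that JAVA_HOME is not set, add the same JAVA_HOME export to $HADOOP_CONF_DIR/hadoop-env.sh as well.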

Load the new environment variables:

source .bashrc
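
The start-dfs.sh and start-yarn.sh scripts used later in this guide connect to each node over SSH, including localhost, so the hadoop user usually needs key-based login to its own account. A minimal sketch, assuming OpenSSH is installed and sshd is running; the helper name ensure_local_ssh_key and the choice of an Ed25519 key are illustrative assumptions:

```shell
# ensure_local_ssh_key [SSH_DIR]
# Creates a passphrase-less key pair in SSH_DIR (default: ~/.ssh) if none
# exists, and authorizes its public key for logins to this same account.
ensure_local_ssh_key() {
  ssh_dir=${1:-"$HOME/.ssh"}
  mkdir -p "$ssh_dir" || return 1
  chmod 700 "$ssh_dir" || return 1
  if [ ! -f "$ssh_dir/id_ed25519" ]; then
    ssh-keygen -q -t ed25519 -N '' -f "$ssh_dir/id_ed25519" || return 1
  fi
  touch "$ssh_dir/authorized_keys"
  # Append the public key only once, so reruns stay idempotent.
  grep -qxF "$(cat "$ssh_dir/id_ed25519.pub")" "$ssh_dir/authorized_keys" ||
    cat "$ssh_dir/id_ed25519.pub" >> "$ssh_dir/authorized_keys"
  chmod 600 "$ssh_dir/authorized_keys"
}
```

Run ensure_local_ssh_key as the hadoop user, then confirm with ssh localhost that you are logged in without a password prompt.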

Configure Hadoop on Rocky Linux

Edit the core-site.xml file in the $HADOOP_CONF_DIR directory:

vi $HADOOP_CONF_DIR/core-site.xml

Add the following configuration:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
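
If you prefer to script this step instead of editing in vi, a here-document can write the same file. This is only a sketch: the function name write_core_site is hypothetical, and it overwrites any existing core-site.xml in the target directory:

```shell
# write_core_site CONF_DIR
# Writes a single-node core-site.xml that points fs.defaultFS at localhost:9000.
write_core_site() {
  conf_dir=$1
  # The quoted 'EOF' keeps the shell from expanding anything inside the XML.
  cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
}
```

As the hadoop user, call it with write_core_site "$HADOOP_CONF_DIR".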

Edit the hdfs-site.xml file:

vi $HADOOP_CONF_DIR/hdfs-site.xml

Add the following configuration:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/var/lib/hadoop/hdfs/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/var/lib/hadoop/hdfs/datanode</value>
    </property>
</configuration>
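
To double-check what was saved, a small helper can print the value stored for a given property. This is a simplified sketch, not a real XML parser: get_hadoop_prop is a hypothetical name, and it assumes each name and value sit on their own lines, as in the files in this guide:

```shell
# get_hadoop_prop FILE NAME
# Prints the <value> that follows <name>NAME</name> in a Hadoop XML file.
get_hadoop_prop() {
  awk -v want="$2" '
    index($0, "<name>" want "</name>") { found = 1; next }
    found && match($0, /<value>.*<\/value>/) {
      # Skip the 7-character <value> tag and drop the 8-character </value> tag.
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$1"
}
```

For example, get_hadoop_prop "$HADOOP_CONF_DIR/hdfs-site.xml" dfs.replication prints 1 for the file above. Once the Hadoop commands are on your PATH, hdfs getconf -confKey dfs.replication reports the value Hadoop itself resolves.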

Save and close the file. These settings define the replication factor, name node directory, and data node directory for HDFS. Next, configure YARN by editing the yarn-site.xml file:

vi $HADOOP_CONF_DIR/yarn-site.xml

Add the following configuration:

<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>localhost</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

Save and close the file. This configuration sets the resource manager hostname and enables the MapReduce shuffle service on the node manager. Now, set up the MapReduce framework by editing the mapred-site.xml file. Hadoop 3.x ships this file directly, so there is no mapred-site.xml.template to copy as there was in Hadoop 2.x:

vi $HADOOP_CONF_DIR/mapred-site.xml

Add the following configuration:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

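Save and close the file. This configuration sets the MapReduce framework to use YARN. After configuring Hadoop, you need to create the HDFS directories defined earlier. These commands require sudo, which the hadoop user does not have, so type exit to return to your sudo-enabled account before running them: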

sudo mkdir -p /var/lib/hadoop/hdfs/namenode
sudo mkdir -p /var/lib/hadoop/hdfs/datanode
sudo chown -R hadoop:hadoop /var/lib/hadoop

Now, format the Hadoop distributed file system (HDFS). Run this and the remaining commands as the hadoop user (switch back with su - hadoop if you left that session), so that the PATH and HADOOP_HOME settings from .bashrc apply:

hdfs namenode -format

This command initializes the HDFS name node. With everything set up, start the Hadoop daemons:

start-dfs.sh
start-yarn.sh

To verify that Hadoop is running correctly, use the jps command:

jps

You should see output similar to the following, indicating that the Hadoop daemons are running:

12345 NameNode
23456 SecondaryNameNode
34567 DataNode
45678 ResourceManager
56789 NodeManager
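
Rather than comparing the list by eye, the check can be scripted. The sketch below reads jps output on standard input and names any of the five daemons listed above that are absent; missing_daemons is a hypothetical helper, not a Hadoop command:

```shell
# missing_daemons
# Reads jps output on stdin and prints each expected daemon that is absent.
# Returns non-zero if any daemon is missing.
missing_daemons() {
  awk '
    BEGIN {
      n = split("NameNode SecondaryNameNode DataNode ResourceManager NodeManager", want, " ")
    }
    { seen[$2] = 1 }   # each jps line is "<pid> <class name>"
    END {
      for (i = 1; i <= n; i++)
        if (!(want[i] in seen)) { print want[i]; missing = 1 }
      exit missing + 0
    }
  '
}
```

As the hadoop user, run jps | missing_daemons && echo "all Hadoop daemons are running". If a daemon is listed as missing, its log file under $HADOOP_HOME/logs is the usual place to look first.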

Congratulations! You have successfully set up Apache Hadoop on your Rocky Linux system. You can now start using Hadoop for your big data processing tasks. For more information on how to use Hadoop, refer to the official Hadoop documentation.

In this article, we’ve covered the installation and configuration of Apache Hadoop on Rocky Linux. We also discussed how to set up HDFS, YARN, and the MapReduce framework for big data processing. If you are interested in related topics, see How to Install and Configure Kibana on Rocky Linux and How to Install and Configure Puppet on Rocky Linux.