Apache Hadoop is an open-source distributed storage and processing framework that can handle large data sets across clusters of computers. It has become a popular choice for organizations looking to process and analyze massive amounts of data. In this guide, we’ll show you how to set up Apache Hadoop on Rocky Linux, a community-supported enterprise operating system.
Prerequisites
Before we begin, make sure you have the following:
- A Rocky Linux server
- Root or sudo access
How to Set up Apache Hadoop on Rocky Linux
Update Your System
First, update your system to the latest available packages:
sudo dnf update -y
Install Java Development Kit (JDK)
Hadoop requires Java to function properly, so we’ll install JDK using the following command:
sudo dnf install java-11-openjdk-devel -y
Verify the installation by checking the Java version:
java -version
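Hadoop's scripts also look for a JAVA_HOME variable, which the version check above does not confirm. A minimal sketch for deriving it from the java binary on the PATH follows; java_home_from_bin is a hypothetical helper name, and the resolved directory depends on the installed OpenJDK build.

```shell
# Hypothetical helper: strip the trailing /bin/java from a resolved java path.
java_home_from_bin() {
    dirname "$(dirname "$1")"
}

# /usr/bin/java is usually a symlink chain, so resolve it before trimming.
if command -v java >/dev/null 2>&1; then
    JAVA_HOME="$(java_home_from_bin "$(readlink -f "$(command -v java)")")"
    export JAVA_HOME
    echo "JAVA_HOME=$JAVA_HOME"
fi
```

Once Hadoop is installed, the resulting value is typically exported in the hadoop-env.sh file under etc/hadoop so that the daemons pick it up.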
Create a Hadoop User on Rocky Linux
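Create a dedicated hadoop user. On Rocky Linux, useradd creates a matching hadoop group by default, so the group does not need to be passed with -G (doing so fails if the group does not exist yet):
sudo useradd -m -s /bin/bash hadoop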
Set a password for the Hadoop user:
sudo passwd hadoop
Install Apache Hadoop on Rocky Linux
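Download Hadoop from the official Apache website. This guide uses version 3.3.1, the latest at the time of writing. Apache moves superseded releases from downloads.apache.org to archive.apache.org, so the archive URL is used here:
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz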
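Apache publishes a SHA-512 checksum file alongside each release archive, and it is worth comparing before extracting. The sketch below is one way to do that; sha512_matches is a hypothetical helper, and EXPECTED_SHA512 is an assumed variable that you fill with the digest copied from the published checksum file.

```shell
# Hypothetical helper: succeed only if FILE's SHA-512 digest equals EXPECTED.
sha512_matches() {
    file="$1"
    # Normalise to lowercase, since published digests are sometimes uppercase.
    expected="$(printf '%s' "$2" | tr 'A-F' 'a-f')"
    actual="$(sha512sum "$file" | awk '{print $1}')"
    [ "$actual" = "$expected" ]
}

# Assumption: EXPECTED_SHA512 holds the digest published by Apache for this archive.
ARCHIVE="hadoop-3.3.1.tar.gz"
if [ -f "$ARCHIVE" ] && [ -n "${EXPECTED_SHA512:-}" ]; then
    if sha512_matches "$ARCHIVE" "$EXPECTED_SHA512"; then
        echo "checksum OK"
    else
        echo "checksum MISMATCH: do not extract $ARCHIVE" >&2
    fi
fi
```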
Extract the downloaded archive:
tar xzf hadoop-3.3.1.tar.gz
Move the extracted files to the /opt/hadoop directory:
sudo mv hadoop-3.3.1 /opt/hadoop
Change the ownership of the /opt/hadoop directory to the Hadoop user:
sudo chown -R hadoop:hadoop /opt/hadoop
Configure the Hadoop Environment
Switch to the Hadoop user:
su - hadoop
Add the following lines to the end of the .bashrc file in the hadoop user's home directory:
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Load the new environment variables:
source .bashrc
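The start-dfs.sh and start-yarn.sh scripts used later launch daemons over SSH, even on a single machine, so the hadoop user needs to reach localhost without a password prompt. A minimal sketch of that setup follows; setup_local_ssh_key is a hypothetical helper name, and the example assumes the OpenSSH client tools are installed and sshd is running.

```shell
# Hypothetical helper: create a passphrase-less RSA key in DIR and authorize it.
setup_local_ssh_key() {
    dir="$1"
    mkdir -p "$dir"
    chmod 700 "$dir"
    # Generate a key only if none exists yet; -N '' sets an empty passphrase.
    [ -f "$dir/id_rsa" ] || ssh-keygen -q -t rsa -N '' -f "$dir/id_rsa"
    # Append the public key once, even if the helper is run repeatedly.
    touch "$dir/authorized_keys"
    grep -qxF "$(cat "$dir/id_rsa.pub")" "$dir/authorized_keys" ||
        cat "$dir/id_rsa.pub" >> "$dir/authorized_keys"
    chmod 600 "$dir/authorized_keys"
}

# Usage (as the hadoop user): setup_local_ssh_key "$HOME/.ssh"
```

After running the helper as the hadoop user, ssh localhost should log in without asking for a password.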
Configure Hadoop on Rocky Linux
Edit the core-site.xml file in the $HADOOP_CONF_DIR directory:
vi $HADOOP_CONF_DIR/core-site.xml
The file ships with an empty <configuration> element. Replace it with the following configuration:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
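For scripted or repeatable installs, the same file can be written without opening an editor. A minimal sketch using a shell here-document follows; write_core_site is a hypothetical helper name, and the call overwrites any existing core-site.xml, so it suits a fresh install only.

```shell
# Hypothetical helper: write a single-property core-site.xml into DIR.
write_core_site() {
    dir="$1"
    uri="$2"
    cat > "$dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>$uri</value>
  </property>
</configuration>
EOF
}

# Only touch the real configuration when the Hadoop environment is loaded.
if [ -n "${HADOOP_CONF_DIR:-}" ]; then
    write_core_site "$HADOOP_CONF_DIR" "hdfs://localhost:9000"
fi
```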
Edit the hdfs-site.xml file:
vi $HADOOP_CONF_DIR/hdfs-site.xml
Add the following configuration:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/var/lib/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/var/lib/hadoop/hdfs/datanode</value>
</property>
</configuration>
Save and close the file. These settings define the replication factor, name node directory, and data node directory for HDFS. Next, configure YARN by editing the yarn-site.xml file:
vi $HADOOP_CONF_DIR/yarn-site.xml
Add the following configuration:
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Save and close the file. This configuration sets the resource manager hostname and enables the MapReduce shuffle service on the node manager. Now, set up the MapReduce framework by editing the mapred-site.xml file. Hadoop 3.x already ships this file, so there is no template to copy:
vi $HADOOP_CONF_DIR/mapred-site.xml
Add the following configuration:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
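Save and close the file. This configuration sets the MapReduce framework to use YARN. After configuring Hadoop, create the HDFS directories defined earlier. These commands need sudo, so run them from an account with sudo rights (type exit first if you are still in the hadoop session):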
sudo mkdir -p /var/lib/hadoop/hdfs/namenode
sudo mkdir -p /var/lib/hadoop/hdfs/datanode
sudo chown -R hadoop:hadoop /var/lib/hadoop
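Before formatting, it can help to confirm that the storage directories exist and are writable by the account that will run the daemons. The following is a small sketch meant to be run as the hadoop user; unwritable_dirs is a hypothetical helper name, and the paths are the ones set in hdfs-site.xml above.

```shell
# Hypothetical helper: print every listed directory that is missing or not writable.
unwritable_dirs() {
    for d in "$@"; do
        [ -d "$d" ] && [ -w "$d" ] || echo "$d"
    done
}

bad="$(unwritable_dirs /var/lib/hadoop/hdfs/namenode /var/lib/hadoop/hdfs/datanode)"
if [ -n "$bad" ]; then
    echo "missing or not writable: $bad" >&2
fi

# If the hdfs command is available, show the directory Hadoop actually resolved.
if command -v hdfs >/dev/null 2>&1; then
    hdfs getconf -confKey dfs.namenode.name.dir
fi
```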
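Now, format the Hadoop distributed file system (HDFS). Run the command as the hadoop user in a login session (su - hadoop), because sudo -u hadoop does not load that user's .bashrc and the hdfs command would not be found on the PATH:
hdfs namenode -format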
This command initializes the HDFS name node. With everything set up, start the Hadoop daemons as the hadoop user. Both scripts are on the PATH through $HADOOP_HOME/sbin:
start-dfs.sh
start-yarn.sh
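The daemons can take several seconds to open their ports. As a rough sketch, the snippet below polls the NameNode web UI (port 9870 by default in Hadoop 3.x) and the ResourceManager web UI (port 8088 by default) with curl. The retry helper is a hypothetical name, and the check assumes curl is installed and the default ports are unchanged.

```shell
# Hypothetical helper: run a command up to TRIES times, one second apart.
retry() {
    tries="$1"
    shift
    n=1
    until "$@"; do
        [ "$n" -ge "$tries" ] && return 1
        n=$((n + 1))
        sleep 1
    done
}

# Only poll when curl exists and the Hadoop environment is loaded.
if command -v curl >/dev/null 2>&1 && [ -n "${HADOOP_HOME:-}" ]; then
    retry 30 curl -sf -o /dev/null http://localhost:9870/ && echo "NameNode UI is up"
    retry 30 curl -sf -o /dev/null http://localhost:8088/ && echo "ResourceManager UI is up"
fi
```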
To verify that Hadoop is running correctly, use the jps command as the hadoop user:
jps
You should see output similar to the following, indicating that the Hadoop daemons are running:
12345 NameNode
23456 SecondaryNameNode
34567 DataNode
45678 ResourceManager
56789 NodeManager
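Reading the list by eye works for a single node, but a script can flag a missing daemon directly. A minimal sketch follows; missing_daemons is a hypothetical helper that reads jps output on standard input and prints any of the five expected daemon names that are absent.

```shell
# Hypothetical helper: read jps output on stdin, print expected daemons that are absent.
missing_daemons() {
    out="$(cat)"
    for d in NameNode SecondaryNameNode DataNode ResourceManager NodeManager; do
        # Match the second column exactly so NameNode does not match SecondaryNameNode.
        printf '%s\n' "$out" |
            awk -v name="$d" '$2 == name { found = 1 } END { exit !found }' ||
            echo "$d"
    done
}

if command -v jps >/dev/null 2>&1; then
    jps | missing_daemons
fi
```

No output means all five daemons were found. Any name printed points to a daemon whose log under $HADOOP_HOME/logs is worth checking.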
Congratulations! You have successfully set up Apache Hadoop on your Rocky Linux system. You can now start using Hadoop for your big data processing tasks. For more information on how to use Hadoop, refer to the official Hadoop documentation.
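As a first task, a short HDFS round trip confirms that the file system accepts writes and reads. This sketch assumes it runs as the hadoop user with the daemons started; the /user/hadoop directory and the hello.txt file name are illustrative choices, not requirements.

```shell
# Create a small local file to upload.
SMOKE_FILE="$(mktemp)"
echo "hello hadoop" > "$SMOKE_FILE"

if command -v hdfs >/dev/null 2>&1; then
    # Create a home directory in HDFS, upload the file, and read it back.
    hdfs dfs -mkdir -p /user/hadoop
    hdfs dfs -put -f "$SMOKE_FILE" /user/hadoop/hello.txt
    hdfs dfs -cat /user/hadoop/hello.txt
fi
```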
In this article, we’ve covered the installation and configuration of Apache Hadoop on Rocky Linux. We also discussed how to set up HDFS, YARN, and the MapReduce framework for big data processing. If you are interested in related topics, see How to Install and Configure Kibana on Rocky Linux and How to Install and Configure Puppet on Rocky Linux.