Apache Spark is a powerful open-source distributed computing system that provides high-level APIs in Java, Scala, Python, and R. It is designed for fast, large-scale data processing, making it ideal for big data and machine learning workloads. In this tutorial, we’ll walk you through setting up Apache Spark on Rocky Linux.
Before we begin, you should have some knowledge of Linux administration and basic familiarity with the command line. If you’re new to Rocky Linux, check out our guide on how to install Rocky Linux.
Table of Contents
- Prerequisites
- Installing Java
- Downloading and Installing Apache Spark
- Configuring Apache Spark
- Starting and Stopping Apache Spark
- Running a Spark Application
- Accessing the Spark Web User Interface
- Integrating Spark with Hadoop
How to Set up Apache Spark on Rocky Linux
Prerequisites
Before we start, ensure that your system is up to date:
sudo dnf update -y
Installing Java on Rocky Linux
Apache Spark requires Java, so let’s install OpenJDK 11:
sudo dnf install java-11-openjdk -y
Verify the installation by checking the Java version:
java -version
You should see output similar to this:
openjdk version "11.0.13" 2021-10-19 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
Downloading and Installing Apache Spark on Rocky Linux
Download Apache Spark from the official website. At the time of writing, the latest version is 3.2.0. You can use wget to download the package:
wget https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
Extract the downloaded package:
tar xvf spark-3.2.0-bin-hadoop3.2.tgz
Move the extracted files to /opt/spark:
sudo mv spark-3.2.0-bin-hadoop3.2 /opt/spark
Configuring Apache Spark on Rocky Linux
Set up environment variables for Apache Spark by creating a new file called spark.sh in the /etc/profile.d/ directory:
sudo touch /etc/profile.d/spark.sh
Open the file using a text editor:
sudo nano /etc/profile.d/spark.sh
Add the following content to the file:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Save and exit the file. To apply the changes, run:
source /etc/profile.d/spark.sh
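To confirm that the variables are set and that the Spark binaries are now on your PATH, you can run:
echo $SPARK_HOME
spark-submit --version
The first command should print /opt/spark, and the second should print the Spark version banner (3.2.0 in this case).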
Starting and Stopping Apache Spark on Rocky Linux
Now that Apache Spark is installed and configured, you can start the Spark master and worker services.
To start the Spark master service, run:
start-master.sh
To start the Spark worker service and connect it to the master, run:
start-worker.sh spark://localhost:7077
To stop the Spark master and worker services, use the following commands:
stop-master.sh
stop-worker.sh
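Both scripts write their logs under /opt/spark/logs, which is the first place to look if a service fails to start. Assuming Python 3 is installed, a quick way to confirm that the worker has registered with the master is to open an interactive PySpark shell against the cluster:
pyspark --master spark://localhost:7077
If the shell starts without errors, the standalone cluster is up and ready to accept applications. Exit it with exit() or Ctrl+D.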
Now that you know how to start and stop the Spark master and worker services, let’s move on to accessing the Spark web user interface. The web UI is a useful tool for monitoring the cluster and analyzing the resource usage.
Accessing the Spark Web User Interface
- Open your web browser and navigate to the following URL:
http://<your-master-ip-address>:4040
Replace <your-master-ip-address> with the IP address of your Spark master node. You should see the Spark Application UI, where you can monitor the progress of your Spark applications and access their logs. Note that this UI on port 4040 is only available while a Spark application (or interactive shell) is running.
- Similarly, you can access the Spark Master Web UI by navigating to:
http://<your-master-ip-address>:8080
Here, you can monitor the cluster and manage the Spark workers.
- To access the Spark Worker Web UI, navigate to:
http://<your-worker-ip-address>:8081
Replace <your-worker-ip-address> with the IP address of your Spark worker node. This interface allows you to monitor the worker’s resource usage and logs.
Running Spark Applications on Rocky Linux
To run a Spark application, use the spark-submit command followed by the path to your application’s JAR or Python file. For example, to run a Python application:
spark-submit /path/to/your/application.py
Remember to configure the spark.master property in your application to point to your Spark master node using its URL, which should be in the following format: spark://<your-master-ip-address>:7077.
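As a minimal, self-contained illustration, you could save a PySpark script like the following as word_count.py (the file name, sample data, and placeholder master address are just examples) and launch it with spark-submit /path/to/word_count.py:
from pyspark.sql import SparkSession

# Point the application at the standalone master; replace the placeholder
# with the IP address of your master node.
spark = (SparkSession.builder
         .appName("word-count-example")
         .master("spark://<your-master-ip-address>:7077")
         .getOrCreate())

# Count words in a small in-memory dataset so the example has no external dependencies.
lines = spark.sparkContext.parallelize([
    "apache spark on rocky linux",
    "spark makes big data processing fast",
])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, count in counts.collect():
    print(word, count)

spark.stop()
Alternatively, you can leave the .master(...) call out of the code and pass the master URL on the command line instead, for example: spark-submit --master spark://<your-master-ip-address>:7077 /path/to/word_count.py.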
Integrating Spark with Hadoop on Rocky Linux
To integrate Apache Spark with Apache Hadoop, you need to configure Spark to use Hadoop’s HDFS for data storage. You can do this by updating the spark-defaults.conf file:
- Open the spark-defaults.conf file with your preferred text editor (Spark ships only a spark-defaults.conf.template, so you may be creating this file for the first time):
sudo nano /opt/spark/conf/spark-defaults.conf
- Add the following lines at the end of the file:
spark.hadoop.fs.defaultFS hdfs://<your-hadoop-namenode-ip-address>:9000
spark.eventLog.enabled true
spark.eventLog.dir hdfs://<your-hadoop-namenode-ip-address>:9000/spark-logs
Replace <your-hadoop-namenode-ip-address> with the IP address of your Hadoop NameNode.
- Save the file and exit the text editor.
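With fs.defaultFS pointing at your NameNode, unqualified paths in your Spark applications now resolve to HDFS rather than the local filesystem. Note that the event log directory must exist before event logging will work, so create it once with hdfs dfs -mkdir -p /spark-logs. As a rough sketch of what the integration enables (the CSV and output paths below are purely examples and assume the input file already exists in HDFS):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-example").getOrCreate()

# Because spark.hadoop.fs.defaultFS is set in spark-defaults.conf,
# this path is read from HDFS rather than the local filesystem.
df = spark.read.option("header", "true").csv("/data/events.csv")
df.printSchema()

# Write the result back to HDFS as Parquet (the output path is also just an example).
df.write.mode("overwrite").parquet("/data/events_parquet")

spark.stop()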
That’s it! You have successfully set up Apache Spark on Rocky Linux and integrated it with Hadoop for distributed data storage and processing. You can now build and run large-scale data processing applications on your Spark cluster.
Conclusion
In this tutorial, we covered the process of installing and configuring Apache Spark on Rocky Linux. We discussed how to set up a Spark cluster, start and stop the Spark services, access the web UIs, and run Spark applications. We also demonstrated how to integrate Spark with Hadoop for distributed data storage.
As you continue to work with Apache Spark, you may find it useful to explore additional tools and frameworks that can help you optimize your data processing tasks. Some popular options include Apache Flink, Apache Kafka, and Apache Cassandra.