Apache Spark is a powerful open-source distributed computing system that provides high-level APIs in Java, Scala, Python, and R. It is designed for fast, large-scale data processing, making it ideal for big data and machine learning workloads. In this tutorial, we’ll walk you through setting up Apache Spark on Rocky Linux.
Before we begin, you should have some knowledge of Linux administration and basic familiarity with the command line. If you’re new to Rocky Linux, check out our guide on how to install Rocky Linux.
Table of Contents
- Prerequisites
- Installing Java
- Downloading and Installing Apache Spark
- Configuring Apache Spark
- Starting and Stopping Apache Spark
- Running a Spark Application
- Accessing the Spark Web User Interface
- Integrating Spark with Hadoop
How to Set up Apache Spark on Rocky Linux
Prerequisites
Before we start, ensure that your system is up to date:
sudo dnf update -y
Installing Java on Rocky Linux
Apache Spark requires Java, so let’s install OpenJDK 11:
sudo dnf install java-11-openjdk -y
Verify the installation by checking the Java version:
java -version
You should see output similar to this:
openjdk version "11.0.13" 2021-10-19 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
Downloading and Installing Apache Spark on Rocky Linux
Download Apache Spark from the official website. At the time of writing, the latest version is 3.2.0. You can use wget to download the package:
wget https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
Extract the downloaded package:
tar xvf spark-3.2.0-bin-hadoop3.2.tgz
Move the extracted files to /opt/spark:
sudo mv spark-3.2.0-bin-hadoop3.2 /opt/spark
Configuring Apache Spark on Rocky Linux
Set up environment variables for Apache Spark by creating a new file called spark.sh in the /etc/profile.d/ directory:
sudo touch /etc/profile.d/spark.sh
Open the file using a text editor:
sudo nano /etc/profile.d/spark.sh
Add the following content to the file:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Save and exit the file. To apply the changes, run:
source /etc/profile.d/spark.sh
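To confirm that the variables are set and that the Spark binaries are now on your PATH, you can run:
echo $SPARK_HOME
spark-submit --version
The first command should print /opt/spark, and the second should print the Spark version banner (3.2.0 in this case).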
Starting and Stopping Apache Spark on Rocky Linux
Now that Apache Spark is installed and configured, you can start the Spark master and worker services.
To start the Spark master service, run:
start-master.sh
To start the Spark worker service and connect it to the master, run:
start-worker.sh spark://localhost:7077
To stop the Spark master and worker services, use the following commands:
stop-master.sh
stop-worker.sh
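Both scripts write their logs under /opt/spark/logs, which is the first place to look if a service fails to start. Assuming Python 3 is installed, a quick way to confirm that the worker has registered with the master is to open an interactive PySpark shell against the cluster:
pyspark --master spark://localhost:7077
If the shell starts without errors, the standalone cluster is up and ready to accept applications. Exit it with exit() or Ctrl+D.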
Now that you know how to start and stop the Spark master and worker services, let’s move on to accessing the Spark web user interface. The web UI is a useful tool for monitoring the cluster and analyzing the resource usage.
Accessing the Spark Web User Interface
- Open your web browser and navigate to the following URL:
http://<your-master-ip-address>:4040
Replace <your-master-ip-address> with the IP address of your Spark master node. You should see the Spark Application UI, where you can monitor the progress of your Spark applications and access their logs. Note that this UI on port 4040 is only available while a Spark application (or interactive shell) is running.
- Similarly, you can access the Spark Master Web UI by navigating to:
http://<your-master-ip-address>:8080
Here, you can monitor the cluster and manage the Spark workers.
- To access the Spark Worker Web UI, navigate to:
http://<your-worker-ip-address>:8081
Replace <your-worker-ip-address> with the IP address of your Spark worker node. This interface allows you to monitor the worker’s resource usage and logs.
Running Spark Applications on Rocky Linux
To run a Spark application, use the spark-submit command followed by the path to your application’s JAR or Python file. For example, to run a Python application:
spark-submit /path/to/your/application.py
Remember to configure the spark.master property in your application to point to your Spark master node using its URL, which should be in the following format: spark://<your-master-ip-address>:7077.
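As a minimal, self-contained illustration, you could save a PySpark script like the following as word_count.py (the file name, sample data, and placeholder master address are just examples) and launch it with spark-submit /path/to/word_count.py:
from pyspark.sql import SparkSession

# Point the application at the standalone master; replace the placeholder
# with the IP address of your master node.
spark = (SparkSession.builder
         .appName("word-count-example")
         .master("spark://<your-master-ip-address>:7077")
         .getOrCreate())

# Count words in a small in-memory dataset so the example has no external dependencies.
lines = spark.sparkContext.parallelize([
    "apache spark on rocky linux",
    "spark makes big data processing fast",
])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, count in counts.collect():
    print(word, count)

spark.stop()
Alternatively, you can leave the .master(...) call out of the code and pass the master URL on the command line instead, for example: spark-submit --master spark://<your-master-ip-address>:7077 /path/to/word_count.py.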
Integrating Spark with Hadoop on Rocky Linux
To integrate Apache Spark with Apache Hadoop, you need to configure Spark to use Hadoop’s HDFS for data storage. You can do this by updating the spark-defaults.conf file:
- Open the spark-defaults.conf file with your preferred text editor (Spark ships only a spark-defaults.conf.template, so you may be creating this file for the first time):
sudo nano /opt/spark/conf/spark-defaults.conf
- Add the following lines at the end of the file:
spark.hadoop.fs.defaultFS hdfs://<your-hadoop-namenode-ip-address>:9000
spark.eventLog.enabled true
spark.eventLog.dir hdfs://<your-hadoop-namenode-ip-address>:9000/spark-logs
Replace <your-hadoop-namenode-ip-address> with the IP address of your Hadoop NameNode.
- Save the file and exit the text editor.
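With fs.defaultFS pointing at your NameNode, unqualified paths in your Spark applications now resolve to HDFS rather than the local filesystem. Note that the event log directory must exist before event logging will work, so create it once with hdfs dfs -mkdir -p /spark-logs. As a rough sketch of what the integration enables (the CSV and output paths below are purely examples and assume the input file already exists in HDFS):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-example").getOrCreate()

# Because spark.hadoop.fs.defaultFS is set in spark-defaults.conf,
# this path is read from HDFS rather than the local filesystem.
df = spark.read.option("header", "true").csv("/data/events.csv")
df.printSchema()

# Write the result back to HDFS as Parquet (the output path is also just an example).
df.write.mode("overwrite").parquet("/data/events_parquet")

spark.stop()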
That’s it! You have successfully set up Apache Spark on Rocky Linux and integrated it with Hadoop for distributed data storage and processing. You can now build and run large-scale data processing applications on your Spark cluster.
Conclusion
In this tutorial, we covered the process of installing and configuring Apache Spark on Rocky Linux. We discussed how to set up a Spark cluster, start and stop the Spark services, access the web UIs, and run Spark applications. We also demonstrated how to integrate Spark with Hadoop for distributed data storage.
As you continue to work with Apache Spark, you may find it useful to explore additional tools and frameworks that can help you optimize your data processing tasks. Some popular options include Apache Flink, Apache Kafka, and Apache Cassandra.