Introduction: Setting Up the Apache Spark Environment
Apache Spark is one of the world’s most popular engines for large-scale data processing — but getting it installed and running your first program can feel intimidating without a clear guide. That ends today.
In this step-by-step tutorial, you will learn exactly how to set up the Apache Spark environment on your machine, configure Python (PySpark) or Scala, verify the installation, and execute your very first Spark program — complete with real output you can expect to see.
| ✅ What You Will Build By the end of this guide you will have a fully working Spark environment locally, and you will have successfully run a word count program — the classic ‘Hello World’ of big data — using PySpark. |
Who Is This Guide For?
- Beginners who have never used Apache Spark before
- Python developers wanting to get started with PySpark
- Data engineers setting up a local Spark development environment
- Students learning big data concepts hands-on
What You Will Learn
- Install Java (prerequisite for Spark)
- Download and configure Apache Spark
- Set up environment variables (PATH, SPARK_HOME, JAVA_HOME)
- Install PySpark via pip for Python users
- Launch the Spark shell and verify the installation
- Write and run your first Spark program (Word Count)
- Explore SparkUI — Spark’s built-in monitoring dashboard
Prerequisites & System Requirements
Before installing Apache Spark, make sure your system meets the following requirements. Spark runs on the JVM, so Java is its only hard dependency.
| Component | Minimum Version | Recommended | Notes |
| --- | --- | --- | --- |
| Java (JDK) | 8 | 11 or 17 LTS | Required — Spark runs on the JVM |
| Python | 3.7 | 3.10 or 3.11 | For PySpark; optional for Scala users |
| RAM | 4 GB | 8 GB or more | 16 GB recommended for production-like testing |
| Disk Space | 2 GB | 5 GB | Includes Spark binaries and working data |
| OS | Win 10 / macOS 11 | Any modern 64-bit OS | Linux preferred for production |
| ⚠️ Windows Users Windows requires an additional WinUtils binary to run Spark correctly. This guide covers that step in detail under the Windows installation section. |
STEP 1 Install Java (JDK)
Apache Spark requires Java Development Kit (JDK) version 8 or higher. We recommend JDK 11 LTS as it offers the best compatibility with Spark 3.x.
🍎 macOS
The easiest way to install Java on macOS is via Homebrew. Open Terminal and run:
```bash
# Install Homebrew if not already installed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install OpenJDK 11
brew install openjdk@11

# Add to PATH (add this line to ~/.zshrc or ~/.bash_profile)
export PATH="/opt/homebrew/opt/openjdk@11/bin:$PATH"

# Verify installation
java -version
```
🐧 Ubuntu / Debian Linux
```bash
# Update package index
sudo apt update

# Install OpenJDK 11
sudo apt install -y openjdk-11-jdk

# Verify installation
java -version
# Output should look like:
# openjdk version "11.0.x" 2024-xx-xx
```
🪟 Windows
Download the OpenJDK 11 installer from https://adoptium.net (Eclipse Temurin is recommended). Run the .msi installer and check the option to set JAVA_HOME automatically.
```
# Open Command Prompt or PowerShell and verify:
java -version
# Expected output:
# openjdk version "11.0.x" ...
# OpenJDK Runtime Environment Temurin-11.x.x ...
```
| 📝 Setting JAVA_HOME Manually If “java -version” fails, set JAVA_HOME manually. On Linux/macOS add this to your shell profile: export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java)))). On Windows, set it in System Properties > Environment Variables. |
STEP 2 Download Apache Spark
Head to the official Apache Spark downloads page at spark.apache.org/downloads.html. As of 2026, the stable release is Spark 3.5.x. Select the following options:
- Spark release: 3.5.x (latest stable)
- Package type: Pre-built for Apache Hadoop 3.3 and later
```bash
# Option A: Download via terminal (Linux/macOS)
cd ~/Downloads
wget https://downloads.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz

# Extract the archive
tar -xzf spark-3.5.1-bin-hadoop3.tgz

# Move to a clean location
sudo mv spark-3.5.1-bin-hadoop3 /opt/spark
```
On Windows, extract the downloaded .tgz file using 7-Zip or WinRAR into a path without spaces, such as C:\spark.
| ⚠️ Avoid Spaces in Path Never install Spark in a path that contains spaces (e.g. “C:\Program Files\spark”). Spark’s scripts use shell commands that break on paths with spaces. Use C:\spark or C:\tools\spark instead. |
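As a quick illustration of that rule, a one-line check (a hypothetical helper, not part of Spark) can vet a candidate install path before you point SPARK_HOME at it:

```python
# Reject install paths containing spaces, the failure mode described above.
def is_safe_install_path(path: str) -> bool:
    """Spark's launch scripts break on paths with spaces."""
    return " " not in path

print(is_safe_install_path(r"C:\spark"))                # True
print(is_safe_install_path(r"C:\Program Files\spark"))  # False
```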
STEP 3 Configure Environment Variables
Environment variables tell your system where to find Spark, Java, and Python. This is the most common source of setup errors — follow these steps carefully.
🐧 🍎 Linux / macOS
Open your shell profile file (~/.bashrc for bash, ~/.zshrc for zsh) and add the following lines:
```bash
# Point SPARK_HOME at the Spark installation
export SPARK_HOME=/opt/spark

# Point JAVA_HOME at the JDK (path may differ on your system)
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))

# Add Spark's binaries to PATH
export PATH="$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH"

# Reload your profile so the changes take effect
source ~/.zshrc   # or: source ~/.bashrc
```
🪟 Windows
Open System Properties > Advanced > Environment Variables. Add the following User Variables:
| Variable | Value |
| --- | --- |
| SPARK_HOME | C:\spark |
| HADOOP_HOME | C:\spark (needed for WinUtils, see Step 3b) |
| JAVA_HOME | your JDK install path (set automatically if you checked the option in Step 1) |

Then edit the Path variable and add a new entry: %SPARK_HOME%\bin
Step 3b — WinUtils (Windows Only)
Spark on Windows requires a small Hadoop helper binary called winutils.exe. Without it you will see errors like ‘Could not locate executable null\bin\winutils.exe’.
```
# 1. Download winutils.exe for Hadoop 3.x from:
#    https://github.com/cdarlint/winutils
# 2. Create folder and copy the file
mkdir C:\spark\bin
copy winutils.exe C:\spark\bin\winutils.exe
# 3. Set environment variable (already done above if you set HADOOP_HOME=C:\spark)
```
| 📝 Verify Your Variables After setting variables, open a new terminal window and run: echo $SPARK_HOME (Linux/macOS) or echo %SPARK_HOME% (Windows). You should see the Spark installation path printed. If not, the variable was not saved correctly. |
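If you prefer a single cross-platform check, this small Python sketch (a helper written for this guide, not part of Spark) reports which of the variables from this step are visible to new processes:

```python
# Check that the environment variables from Step 3 are actually set.
import os

def check_env(env, names=("JAVA_HOME", "SPARK_HOME")):
    """Return (name, status) pairs: 'OK' if set and non-empty, else 'MISSING'."""
    return [(name, "OK" if env.get(name) else "MISSING") for name in names]

for name, status in check_env(os.environ):
    print(f"{name:<12} {status}")
```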
STEP 4 Verify the Spark Installation
Open a fresh terminal window and run the Spark shell to confirm Spark is correctly installed and configured.
```bash
# Launch Spark shell (Scala interface)
spark-shell

# You should see output similar to:
# Setting default log level to "WARN".
# ...
# Welcome to
#       ____              __
#      / __/__  ___ _____/ /__
#     _\ \/ _ \/ _ `/ __/ '_/
#    /___/ .__/\_,_/_/ /_/\_\   version 3.5.1
#       /_/
#
# Using Scala version 2.12.x
# SparkContext available as 'sc'
# SparkSession available as 'spark'
# scala>
```
To exit the Spark shell, type :quit or press Ctrl+C.
```bash
# Alternatively, launch PySpark (Python interface)
pyspark

# You should see:
# Python 3.x.x
# ...
# Welcome to
#       ____              __
#      / __/__  ___ _____/ /__
# ...
# SparkSession available as 'spark'
# >>>
```
| ✅ Ignore WARN Messages The terminal will print several WARN lines during startup — this is completely normal. Focus on seeing the Spark ASCII art logo and the “SparkSession available as spark” line. Those confirm a successful installation. |
STEP 5 Install PySpark via pip (Optional but Recommended)
For a cleaner Python development experience — especially in Jupyter notebooks or VS Code — install PySpark as a Python package using pip. This approach bundles Spark within your Python environment.
```bash
# Create a virtual environment (best practice)
python3 -m venv spark-env
source spark-env/bin/activate      # Linux/macOS
# spark-env\Scripts\activate.bat   # Windows

# Install PySpark
pip install pyspark==3.5.1

# Optional but useful: Jupyter notebook support
pip install jupyter findspark

# Verify installation
python -c "import pyspark; print(pyspark.__version__)"
# Expected output: 3.5.1
```
| 💡 PySpark pip vs Manual Installation Installing via pip is great for local development and notebooks. For production clusters, use the manual binary installation (Steps 2-3) so you control the exact Spark configuration and can tune executor memory, cores, and cluster settings. |
STEP 6 Write and Run Your First Spark Program
Now for the exciting part — running your first Spark program! We will build a classic Word Count application: it reads a text file, splits it into words, counts how many times each word appears, and prints the top 10 results.
6a — Create a Sample Text File
```bash
# Create a sample file to process
echo "Apache Spark is a fast and general engine for large-scale data processing. Spark provides high-level APIs in Java Scala Python and R. Spark also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames MLlib for machine learning GraphX for graph processing and Spark Streaming." > ~/sample_text.txt
```
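If the shell one-liner is awkward on your platform (Windows quoting in particular), here is an equivalent Python sketch that writes the same corpus to the same location the Word Count program reads from:

```python
# Write the sample corpus from Python instead of the shell.
from pathlib import Path

SAMPLE_TEXT = (
    "Apache Spark is a fast and general engine for large-scale data processing. "
    "Spark provides high-level APIs in Java Scala Python and R. "
    "Spark also supports a rich set of higher-level tools including Spark SQL "
    "for SQL and DataFrames MLlib for machine learning GraphX for graph "
    "processing and Spark Streaming."
)

def write_sample(path):
    """Write the sample corpus to `path`; returns the character count written."""
    path = Path(path)
    path.write_text(SAMPLE_TEXT, encoding="utf-8")
    return len(SAMPLE_TEXT)

if __name__ == "__main__":
    # Same location the shell command uses
    print(write_sample(Path.home() / "sample_text.txt"))
```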
6b — The PySpark Word Count Program
Create a new file called word_count.py and paste the following code:
```python
# word_count.py — Your First Apache Spark Program
# ─────────────────────────────────────────────────
import os
from pyspark.sql import SparkSession

# Step 1: Create a SparkSession — the entry point to Spark
spark = SparkSession.builder \
    .appName("WordCount") \
    .master("local[*]") \
    .getOrCreate()

# Reduce noisy log output
spark.sparkContext.setLogLevel("ERROR")

# Step 2: Read the text file into an RDD
# Note: Spark does not expand "~", so resolve the home directory first
text_rdd = spark.sparkContext.textFile(os.path.expanduser("~/sample_text.txt"))

# Step 3: Apply transformations
# - flatMap splits each line into individual words
# - map creates (word, 1) pairs
# - reduceByKey sums the counts for each word
word_counts = (
    text_rdd
    .flatMap(lambda line: line.lower().split(" "))
    .filter(lambda word: word != "")          # remove empty strings
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)          # sum counts
    .sortBy(lambda x: x[1], ascending=False)  # sort by count desc
)

# Step 4: Trigger execution with an action — collect top 10
top_10 = word_counts.take(10)

print("\n===== TOP 10 WORD COUNTS =====")
for word, count in top_10:
    print(f"  {word:<20} : {count}")
print("=" * 30)

# Step 5: Stop the SparkSession
spark.stop()
```
6c — Run the Program
```bash
# Run using spark-submit (recommended for .py files)
spark-submit word_count.py

# Or run directly with Python (if PySpark installed via pip)
python word_count.py
```
6d — Expected Output
```
===== TOP 10 WORD COUNTS =====
  spark                : 4
  and                  : 4
  for                  : 3
  sql                  : 2
  apache               : 1
  is                   : 1
  a                    : 1
  fast                 : 1
  general              : 1
  engine               : 1
==============================
```
| ✅ Congratulations! You just ran your first Apache Spark program! The lazy evaluation model means the actual computation only happened when you called .take(10). Every flatMap, filter, map, and reduceByKey before that was just building a plan. |
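Lazy evaluation can feel abstract at first. A plain-Python generator (not Spark, just an analogy for this guide) shows the same idea: building the pipeline does no work, and only consuming it, the equivalent of an action, triggers computation:

```python
# Generators mimic Spark's lazy transformations: nothing runs until consumed.
log = []

def words(lines):
    for line in lines:
        log.append("read")         # side effect so we can see when work happens
        yield from line.split(" ")

pipeline = words(["spark is fast", "python too"])   # nothing has run yet
assert log == []                                    # still lazy, like an RDD plan

first_three = [next(pipeline) for _ in range(3)]    # this is the "action"
print(first_three)   # ['spark', 'is', 'fast']
print(log)           # ['read']: only one line was read to produce three words
```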
STEP 7 Word Count Using the DataFrame API
The RDD API is powerful but verbose. Modern Spark development uses the DataFrame API, which is more expressive, easier to read, and automatically optimized by Spark’s Catalyst engine. Here is the same Word Count using DataFrames:
```python
# word_count_df.py — DataFrame API version
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, lower, col, desc

spark = SparkSession.builder \
    .appName("WordCountDF") \
    .master("local[*]") \
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

# Read text file as a DataFrame (Spark does not expand "~")
df = spark.read.text(os.path.expanduser("~/sample_text.txt"))

# Split lines into words, explode into rows, count
word_counts = (
    df
    .select(explode(split(lower(col("value")), r"\s+")).alias("word"))
    .filter(col("word") != "")
    .groupBy("word")
    .count()
    .orderBy(desc("count"))
)

# Show top 10
print("\n===== WORD COUNT (DataFrame API) =====")
word_counts.show(10, truncate=False)
```
```
# Expected output (abbreviated — show(10) prints up to 10 rows):
+---------------------+-----+
|word                 |count|
+---------------------+-----+
|spark                |4    |
|and                  |4    |
|for                  |3    |
|sql                  |2    |
|apache               |1    |
+---------------------+-----+
```
| 💡 RDD vs DataFrame Use the DataFrame API for new projects — it gets optimized by Catalyst automatically. Use RDDs only when you need fine-grained control over partitioning or when working with unstructured data that does not fit a schema. |
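To see what the DataFrame pipeline computes, independent of how Spark executes it, here is a plain-Python model of the same split/explode/filter/groupBy/orderBy steps (this is the semantics only, not Spark code):

```python
# Plain-Python model of the DataFrame word-count pipeline's semantics.
import re
from collections import Counter

lines = ["Spark is fast", "spark SQL and Spark Streaming"]

# split + explode + filter: one lowercase word per element, empties removed
words = [w for line in lines for w in re.split(r"\s+", line.lower()) if w]

# groupBy("word").count()
counts = Counter(words)

# orderBy(desc("count")) + show(3)
top = counts.most_common(3)
print(top)   # [('spark', 3), ('is', 1), ('fast', 1)]
```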
STEP 8 Explore the Spark Web UI
Every time a Spark application runs, it launches a built-in web interface called the Spark UI (also called SparkContext UI). This dashboard lets you monitor jobs, stages, tasks, storage, and executors in real time.
How to Access the Spark UI
- While a Spark application is running, open a browser and navigate to:
```
# Default address for local mode
http://localhost:4040

# If port 4040 is in use, Spark will try 4041, 4042, etc.
# The exact URL is printed in the console output when Spark starts:
# INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://localhost:4040
```
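Conceptually, the port fallback behaves like this sketch (illustrative plain Python, not Spark's actual implementation): take the first port at or above 4040 that nothing is already using.

```python
# Sketch of "try 4040, else 4041, 4042, ..." port fallback.
def next_free_port(used_ports, start=4040):
    """Return the first port >= start that is not in used_ports."""
    port = start
    while port in used_ports:
        port += 1
    return port

print(next_free_port(set()))          # 4040: the default port is free
print(next_free_port({4040, 4041}))   # 4042: falls forward past busy ports
```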
Key Tabs in the Spark UI
- Jobs — shows all completed and running jobs; click any job to see its DAG visualization
- Stages — breaks jobs into stages; shows task counts, shuffle read/write bytes, and duration
- Storage — displays cached RDDs and DataFrames with memory and disk usage
- Environment — lists Spark configuration settings, Java version, and classpath
- Executors — shows per-executor memory usage, GC time, and task counts
| 📝 History Server By default, the Spark UI disappears when an application finishes. To browse past job histories, configure the Spark History Server by setting spark.eventLog.enabled=true and running the start-history-server.sh script. This is essential for debugging production jobs. |
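A minimal sketch of the relevant settings in $SPARK_HOME/conf/spark-defaults.conf; the event-log directory below is an example (any directory that exists and is writable works):

```
spark.eventLog.enabled           true
spark.eventLog.dir               file:///tmp/spark-events
spark.history.fs.logDirectory    file:///tmp/spark-events
```

Create the directory first (mkdir /tmp/spark-events), then start the server with $SPARK_HOME/sbin/start-history-server.sh and browse to http://localhost:18080.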
Common Errors & Troubleshooting
These are the most frequently encountered issues when setting up Spark for the first time, along with their solutions:
Error: ‘JAVA_HOME is not set’
```bash
# Check if Java is installed and where it lives
which java
java -version

# Set JAVA_HOME (Linux example for OpenJDK 11)
export JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
echo "export JAVA_HOME=$JAVA_HOME" >> ~/.bashrc
source ~/.bashrc
```
Error: ‘Could not locate winutils.exe’ (Windows)
Download winutils.exe from github.com/cdarlint/winutils (match your Hadoop version), place it in %HADOOP_HOME%\bin, and ensure HADOOP_HOME is set correctly in your Environment Variables.
Error: ‘Address already in use :4040’
```bash
# Port 4040 is held by another Spark app (or some other process).
# Spark automatically falls back to 4041, 4042, ... so this is often harmless.
# To pick a specific UI port yourself:
spark-submit --conf spark.ui.port=4050 word_count.py

# Or see what is holding port 4040 (Linux/macOS):
lsof -i :4040
```
Error: ‘Python worker failed to connect back’
This usually means PYSPARK_PYTHON is pointing to the wrong Python binary. Verify by running:
```bash
# Check which Python Spark will use
echo $PYSPARK_PYTHON

# Point it to the correct Python executable
export PYSPARK_PYTHON="$(which python3)"
```
Error: OutOfMemoryError or GC Overhead
```bash
# Increase driver memory when launching spark-submit
spark-submit --driver-memory 4g word_count.py
```

```python
# Or configure in SparkSession (before it is created)
spark = SparkSession.builder \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()
```
| ✅ Pro Tip The fastest way to solidify your Spark skills is to find a dataset you care about — a CSV from Kaggle, server logs, or financial data — and explore it with PySpark. Real problems force you to encounter and overcome real errors. |
Conclusion
Setting up Apache Spark is genuinely straightforward once you understand the moving parts: Java as the runtime, SPARK_HOME and PATH as the signposts, and PySpark as your Python gateway into Spark’s power.
In this guide you walked through every step from a blank machine to a running Spark job — installing Java, downloading and configuring Spark, setting environment variables, handling Windows-specific quirks, and writing your first Word Count program both in the RDD API and the modern DataFrame API.
The Spark ecosystem is vast, but it all builds on the foundation you just set up. Every distributed join, ML pipeline, or streaming job starts exactly the same way — with a SparkSession and a plan.
Happy Sparking! 🔥