How to Set Up Apache Spark on Windows, Mac & Linux

Apache Spark is the world’s most popular large-scale data processing engine — but getting it installed and running your first program can feel intimidating without a clear guide. That ends today.

In this step-by-step tutorial, you will learn exactly how to set up the Apache Spark environment on your machine, configure Python (PySpark) or Scala, verify the installation, and execute your very first Spark program — complete with real output you can expect to see.

✅  What You Will Build: By the end of this guide you will have a fully working local Spark environment, and you will have successfully run a word count program — the classic ‘Hello World’ of big data — using PySpark.

Who Is This Guide For?

  • Beginners who have never used Apache Spark before
  • Python developers wanting to get started with PySpark
  • Data engineers setting up a local Spark development environment
  • Students learning big data concepts hands-on

What You Will Learn

  1. Install Java (prerequisite for Spark)
  2. Download and configure Apache Spark
  3. Set up environment variables (PATH, SPARK_HOME, JAVA_HOME)
  4. Install PySpark via pip for Python users
  5. Launch the Spark shell and verify the installation
  6. Write and run your first Spark program (Word Count)
  7. Explore SparkUI — Spark’s built-in monitoring dashboard

Before installing Apache Spark, make sure your system meets the following requirements. Spark runs on the JVM, so Java is its only hard dependency.

| Component  | Minimum Version   | Recommended          | Notes                                         |
|------------|-------------------|----------------------|-----------------------------------------------|
| Java (JDK) | 8                 | 11 or 17 LTS         | Required — Spark runs on the JVM              |
| Python     | 3.7               | 3.10 or 3.11         | For PySpark; optional for Scala users         |
| RAM        | 4 GB              | 8 GB or more         | 16 GB recommended for production-like testing |
| Disk Space | 2 GB              | 5 GB                 | Includes Spark binaries and working data      |
| OS         | Win 10 / macOS 11 | Any modern 64-bit OS | Linux preferred for production                |
⚠️  Windows Users: Windows requires an additional winutils.exe binary to run Spark correctly. This guide covers that step in detail in the Windows installation section.

Apache Spark requires Java Development Kit (JDK) version 8 or higher. We recommend JDK 11 LTS as it offers the best compatibility with Spark 3.x.

🍎  macOS

The easiest way to install Java on macOS is via Homebrew. Open Terminal and run:

# Install Homebrew if not already installed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install OpenJDK 11
brew install openjdk@11
# Add to PATH (add this line to ~/.zshrc or ~/.bash_profile)
export PATH="/opt/homebrew/opt/openjdk@11/bin:$PATH"
# Verify installation
java -version

🐧  Ubuntu / Debian Linux

# Update package index
sudo apt update
# Install OpenJDK 11
sudo apt install -y openjdk-11-jdk
# Verify installation
java -version
# Output should look like:
# openjdk version "11.0.x" 2024-xx-xx

🪟  Windows

Download the OpenJDK 11 installer from https://adoptium.net (Eclipse Temurin is recommended). Run the .msi installer and check the option to set JAVA_HOME automatically.

# Open Command Prompt or PowerShell and verify:
java -version
# Expected output:
# openjdk version "11.0.x" ...
# OpenJDK Runtime Environment Temurin-11.x.x ...
📝  Setting JAVA_HOME Manually: If "java -version" fails, set JAVA_HOME manually. On Linux/macOS add this to your shell profile: export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java)))). On Windows, set it in System Properties > Environment Variables.

Head to the official Apache Spark downloads page at spark.apache.org/downloads.html. At the time of writing, the stable release is Spark 3.5.x. Select the following options:

  • Spark release: 3.5.x (latest stable)
  • Package type: Pre-built for Apache Hadoop 3.3 and later
# Option A: Download via terminal (Linux/macOS)
cd ~/Downloads
wget https://downloads.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
# Extract the archive
tar -xzf spark-3.5.1-bin-hadoop3.tgz
# Move to a clean location
sudo mv spark-3.5.1-bin-hadoop3 /opt/spark

On Windows, extract the downloaded .tgz file using 7-Zip or WinRAR into a path without spaces, such as C:\spark.

⚠️  Avoid Spaces in Path: Never install Spark in a path that contains spaces (e.g. "C:\Program Files\spark"). Spark's scripts use shell commands that break on paths with spaces. Use C:\spark or C:\tools\spark instead.

Environment variables tell your system where to find Spark, Java, and Python. This is the most common source of setup errors — follow these steps carefully.

🐧 🍎  Linux / macOS

Open your shell profile file (~/.bashrc for bash, ~/.zshrc for zsh) and add the following lines:

# Spark installation directory (adjust if you extracted somewhere else)
export SPARK_HOME=/opt/spark
# Put Spark's launch scripts on your PATH
export PATH="$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH"
# Tell PySpark which Python interpreter to use
export PYSPARK_PYTHON=python3
# Reload the profile and verify
source ~/.bashrc    # or: source ~/.zshrc
echo $SPARK_HOME

🪟  Windows

Open System Properties > Advanced > Environment Variables. Add the following User Variables:

# Add these User Variables (Variable = Value):
#   SPARK_HOME  = C:\spark
#   HADOOP_HOME = C:\spark
# Then edit the "Path" variable and append:
#   %SPARK_HOME%\bin
# Open a NEW Command Prompt and verify:
echo %SPARK_HOME%

Spark on Windows requires a small Hadoop helper binary called winutils.exe. Without it you will see errors like ‘Could not locate executable null\bin\winutils.exe’.

# 1. Download winutils.exe for Hadoop 3.x from:
#    https://github.com/cdarlint/winutils
# 2. Copy the file into Spark's bin folder (create it if it does not exist)
copy winutils.exe C:\spark\bin\winutils.exe
# 3. Set environment variable (already done above if you set HADOOP_HOME=C:\spark)
📝  Verify Your Variables: After setting variables, open a new terminal window and run echo $SPARK_HOME (Linux/macOS) or echo %SPARK_HOME% (Windows). You should see the Spark installation path printed. If not, the variable was not saved correctly.
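If you would rather script this check than eyeball it, here is a small plain-Python sketch (check_spark_env is a made-up helper for illustration, not part of Spark):

```python
import os

def check_spark_env(env=None):
    """Return the names of required variables that are missing or empty."""
    env = os.environ if env is None else env
    required = ["JAVA_HOME", "SPARK_HOME"]
    return [name for name in required if not env.get(name)]

# Check the real environment of the current shell:
missing = check_spark_env()
if missing:
    print("Missing variables:", ", ".join(missing))
else:
    print("JAVA_HOME and SPARK_HOME are both set.")
```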

Open a fresh terminal window and run the Spark shell to confirm Spark is correctly installed and configured.

# Launch Spark shell (Scala interface)
spark-shell
# You should see output similar to:
# Setting default log level to "WARN".
# ...
# Welcome to
#       ____              __
#      / __/__  ___ _____/ /__
#     _\ \/ _ \/ _ `/ __/ '_/
#    /___/ .__/\_,_/_/ /_/\_\   version 3.5.1
#       /_/
#
# Using Scala version 2.12.x
# SparkContext available as 'sc'
# SparkSession available as 'spark'
# scala>

To exit the Spark shell, type :quit or press Ctrl+D.

# Alternatively, launch PySpark (Python interface)
pyspark
# You should see:
# Python 3.x.x
# ...
# Welcome to
#       ____              __
#      / __/__  ___ _____/ /__
#     ...
# SparkSession available as 'spark'
# >>>
✅  Ignore WARN Messages: The terminal will print several WARN lines during startup — this is completely normal. Focus on seeing the Spark ASCII art logo and the "SparkSession available as 'spark'" line. Those confirm a successful installation.

For a cleaner Python development experience — especially in Jupyter notebooks or VS Code — install PySpark as a Python package using pip. This approach bundles Spark within your Python environment.

# Create a virtual environment (best practice)
python3 -m venv spark-env
source spark-env/bin/activate # Linux/macOS
# spark-env\Scripts\activate.bat # Windows
# Install PySpark
pip install pyspark==3.5.1
# Optional but useful: Jupyter notebook support
pip install jupyter findspark
# Verify installation
python -c "import pyspark; print(pyspark.__version__)"
# Expected output: 3.5.1
💡  PySpark pip vs Manual Installation: Installing via pip is great for local development and notebooks. For production clusters, use the manual binary installation (Steps 2-3) so you control the exact Spark configuration and can tune executor memory, cores, and cluster settings.

Now for the exciting part — running your first Spark program! We will build a classic Word Count application: it reads a text file, splits it into words, counts how many times each word appears, and prints the top 10 results.

# Create a sample file to process
echo "Apache Spark is a fast and general engine for large-scale data processing. Spark provides high-level APIs in Java Scala Python and R. Spark also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames MLlib for machine learning GraphX for graph processing and Spark Streaming." > ~/sample_text.txt

Create a new file called word_count.py and paste the following code:

# word_count.py — Your First Apache Spark Program
# ─────────────────────────────────────────────────
import os

from pyspark.sql import SparkSession

# Step 1: Create a SparkSession — the entry point to Spark
spark = SparkSession.builder \
    .appName("WordCount") \
    .master("local[*]") \
    .getOrCreate()

# Reduce noisy log output
spark.sparkContext.setLogLevel("ERROR")

# Step 2: Read the text file into an RDD
# (Spark does not expand "~", so resolve the home directory explicitly)
text_rdd = spark.sparkContext.textFile(os.path.expanduser("~/sample_text.txt"))

# Step 3: Apply transformations
# - flatMap splits each line into individual words
# - map creates (word, 1) pairs
# - reduceByKey sums the counts for each word
word_counts = (
    text_rdd
    .flatMap(lambda line: line.lower().split(" "))
    .filter(lambda word: word != "")           # remove empty strings
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)           # sum counts
    .sortBy(lambda x: x[1], ascending=False)   # sort by count desc
)

# Step 4: Trigger execution with an action — collect the top 10
top_10 = word_counts.take(10)

print("\n===== TOP 10 WORD COUNTS =====")
for word, count in top_10:
    print(f"  {word:<20} : {count}")
print("=" * 30)

# Step 5: Stop the SparkSession
spark.stop()
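If the transformation chain above feels abstract, the same pattern can be traced in plain Python with a tiny input. No Spark is involved here; this is only an analogy for what flatMap, map, and reduceByKey compute:

```python
from collections import defaultdict

line = "spark counts words spark counts spark"

# flatMap + map: emit one (word, 1) pair per word
pairs = [(word, 1) for word in line.lower().split(" ") if word != ""]

# reduceByKey: sum the 1s for each distinct word
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

# sortBy count, descending
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(top)  # [('spark', 3), ('counts', 2), ('words', 1)]
```

In Spark, the same reduction runs inside each partition first and the partial sums are then merged across the cluster, which is why reduceByKey scales where a single dict would not.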

# Run using spark-submit (recommended for .py files)
spark-submit word_count.py
# Or run directly with Python (if PySpark installed via pip)
python word_count.py

===== TOP 10 WORD COUNTS =====
  spark                : 5
  and                  : 4
  for                  : 4
  sql                  : 2
  a                    : 2
  apache               : 1
  is                   : 1
  fast                 : 1
  general              : 1
  engine               : 1
==============================
✅  Congratulations! You just ran your first Apache Spark program! The lazy evaluation model means the actual computation only happened when you called .take(10). Every flatMap, filter, map, and reduceByKey before that was just building a plan.
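Spark's laziness has a close analogue in ordinary Python generators, which also build a recipe and only do work when consumed. A plain-Python illustration (tracked_square is invented for this demo):

```python
calls = []

def tracked_square(x):
    calls.append(x)        # record that real work happened
    return x * x

# "Transformation": a generator expression builds a plan, computes nothing
squares = (tracked_square(x) for x in range(5))
print(calls)               # [] (no work yet)

# "Action": consuming the generator triggers the computation
result = list(squares)
print(calls)               # [0, 1, 2, 3, 4]
print(result)              # [0, 1, 4, 9, 16]
```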

The RDD API is powerful but verbose. Modern Spark development uses the DataFrame API, which is more expressive, easier to read, and automatically optimized by Spark’s Catalyst engine. Here is the same Word Count using DataFrames:

# word_count_df.py — DataFrame API version
import os

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, lower, col, desc

spark = SparkSession.builder \
    .appName("WordCountDF") \
    .master("local[*]") \
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

# Read the text file as a DataFrame (Spark does not expand "~", so resolve it)
df = spark.read.text(os.path.expanduser("~/sample_text.txt"))

# Split lines into words, explode into rows, count
word_counts = (
    df
    .select(explode(split(lower(col("value")), r"\s+")).alias("word"))
    .filter(col("word") != "")
    .groupBy("word")
    .count()
    .orderBy(desc("count"))
)

# Show top 10
print("\n===== WORD COUNT (DataFrame API) =====")
word_counts.show(10, truncate=False)
# Expected output (rows with equal counts may appear in any order):
+-------+-----+
|word   |count|
+-------+-----+
|spark  |5    |
|and    |4    |
|for    |4    |
|sql    |2    |
|a      |2    |
|apache |1    |
|is     |1    |
|fast   |1    |
|general|1    |
|engine |1    |
+-------+-----+
only showing top 10 rows
💡  RDD vs DataFrame: Use the DataFrame API for new projects — it gets optimized by Catalyst automatically. Use RDDs only when you need fine-grained control over partitioning or when working with unstructured data that does not fit a schema.

Every time a Spark application runs, it launches a built-in web interface called the Spark UI (also called SparkContext UI). This dashboard lets you monitor jobs, stages, tasks, storage, and executors in real time.

How to Access the Spark UI

  • While a Spark application is running, open a browser and navigate to:
# Default address for local mode
http://localhost:4040
# If port 4040 is in use, Spark will try 4041, 4042, etc.
# The exact URL is printed in the console output when Spark starts:
# INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://localhost:4040

Key Tabs in the Spark UI

  • Jobs — shows all completed and running jobs; click any job to see its DAG visualization
  • Stages — breaks jobs into stages; shows task counts, shuffle read/write bytes, and duration
  • Storage — displays cached RDDs and DataFrames with memory and disk usage
  • Environment — lists Spark configuration settings, Java version, and classpath
  • Executors — shows per-executor memory usage, GC time, and task counts
📝  History Server: By default, the Spark UI disappears when an application finishes. To browse past job histories, configure the Spark History Server by setting spark.eventLog.enabled=true and running the start-history-server.sh script. This is essential for debugging production jobs.
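As a sketch, the relevant settings live in $SPARK_HOME/conf/spark-defaults.conf; the /tmp/spark-events directory below is an example path (create it before running, and point both properties at a durable location for real work):

```
spark.eventLog.enabled           true
spark.eventLog.dir               file:///tmp/spark-events
spark.history.fs.logDirectory    file:///tmp/spark-events
```

With those set, start the server with $SPARK_HOME/sbin/start-history-server.sh and browse past applications at http://localhost:18080.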

Common Errors & Troubleshooting

These are the most frequently encountered issues when setting up Spark for the first time, along with their solutions:

Error: ‘JAVA_HOME is not set’

# Check if Java is installed and where it lives
which java
java -version
# Set JAVA_HOME (Linux example for OpenJDK 11)
export JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
echo "export JAVA_HOME=$JAVA_HOME" >> ~/.bashrc
source ~/.bashrc

Error: ‘Could not locate winutils.exe’ (Windows)

Download winutils.exe from github.com/cdarlint/winutils (match your Hadoop version), place it in %HADOOP_HOME%\bin, and ensure HADOOP_HOME is set correctly in your Environment Variables.

Error: ‘Address already in use :4040’

This is usually harmless: if port 4040 is taken, Spark automatically falls back to 4041, 4042, and so on. To pin the UI to a specific port instead:

# Choose an explicit Spark UI port
spark-submit --conf spark.ui.port=4050 word_count.py

Error: ‘Python worker failed to connect back’

This usually means PYSPARK_PYTHON is pointing to the wrong Python binary. Verify by running:

# Check which Python Spark will use
echo $PYSPARK_PYTHON
# Point it to the correct Python executable
export PYSPARK_PYTHON="$(which python3)"

Error: OutOfMemoryError or GC Overhead

# Increase driver memory when launching spark-submit
spark-submit --driver-memory 4g word_count.py
# Or configure in SparkSession
spark = SparkSession.builder \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()
✅  Pro Tip: The fastest way to solidify your Spark skills is to find a dataset you care about — a CSV from Kaggle, server logs, or financial data — and explore it with PySpark. Real problems force you to encounter and overcome real errors.

Setting up Apache Spark is genuinely straightforward once you understand the moving parts: Java as the runtime, SPARK_HOME and PATH as the signposts, and PySpark as your Python gateway into Spark’s power.

In this guide you walked through every step from a blank machine to a running Spark job — installing Java, downloading and configuring Spark, setting environment variables, handling Windows-specific quirks, and writing your first Word Count program both in the RDD API and the modern DataFrame API.

The Spark ecosystem is vast, but it all builds on the foundation you just set up. Every distributed join, ML pipeline, or streaming job starts exactly the same way — with a SparkSession and a plan.

