Running Spark in GitHub Actions

This is a quick and easy guide to running Apache Spark in GitHub Actions for testing purposes.

I start with an example PySpark script, app.py, that reads data from a JSON file and runs a few basic queries:

import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: spark-submit app.py <input>")
        sys.exit(1)

    input_file_path = sys.argv[1]

    spark = SparkSession.builder.appName("Spark").getOrCreate()
    df = spark.read.json(input_file_path)

    df.createOrReplaceTempView("df")

    spark.sql("SELECT * FROM df LIMIT 10").show()
    spark.sql("SELECT COUNT(*) FROM df").show()
    spark.sql("SELECT cn, COUNT(*) FROM df GROUP BY cn ORDER BY 2 DESC").show()
    spark.sql("SELECT cn, MAX(temp) AS max_temp FROM df GROUP BY cn ORDER BY 2 DESC").show()

Locally, we can submit it with:

spark-submit app.py <input>
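
To try the command without a real dataset, a small input file can be generated first. The sketch below is only an illustration: write_sample_data is a hypothetical helper and the values are made up. Only the column names cn and temp come from the queries in app.py. It writes one JSON object per line, which is the layout spark.read.json expects by default.

```python
import json
import random


def write_sample_data(path, rows=100, seed=42):
    """Write `rows` JSON objects, one per line, with the two fields app.py queries."""
    rng = random.Random(seed)
    # Hypothetical values for the cn column; the real data may differ.
    cn_values = ["Vietnam", "Japan", "Norway"]
    with open(path, "w", encoding="utf-8") as f:
        for _ in range(rows):
            record = {"cn": rng.choice(cn_values), "temp": rng.randint(-10, 40)}
            f.write(json.dumps(record) + "\n")
```

Calling write_sample_data("data.json") produces a file that can be passed as the input argument.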

To run this PySpark script in GitHub Actions, I've created a workflow file named spark-submit.yaml in the .github/workflows/ directory. It defines the steps that GitHub Actions takes to run the script with the spark-submit command.

File: .github/workflows/spark-submit.yaml

name: Spark Submit

on:
  push:
    branches:
      - 'master'

jobs:
  spark-submit:
    runs-on: ubuntu-latest

    strategy:
      matrix:
        python:
          - '3.10'
          - '3.11'
        spark:
          - 3.3.2
          - 3.4.0

    steps:
      - uses: actions/checkout@v3

      - uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python }}

      - uses: actions/setup-java@v3
        with:
          java-version: '17'
          distribution: 'temurin'

      - uses: vemonet/setup-spark@v1
        with:
          spark-version: ${{ matrix.spark }}
          hadoop-version: '3'

      - run: spark-submit app.py data.json

GitHub Actions shows the result details for each run.

The spark-submit.yaml file also specifies a matrix strategy to test the PySpark script on multiple versions of Python and Spark. In this example, the script is tested on four combinations:

  • Python 3.10, Spark 3.3.2.
  • Python 3.10, Spark 3.4.0.
  • Python 3.11, Spark 3.3.2.
  • Python 3.11, Spark 3.4.0.
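
The four combinations are the Cartesian product of the two lists in the matrix. GitHub Actions does this expansion itself; the short Python sketch below only illustrates the idea, and expand_matrix is a hypothetical name, not part of any GitHub tooling.

```python
from itertools import product

# Same values as the matrix in spark-submit.yaml.
matrix = {
    "python": ["3.10", "3.11"],
    "spark": ["3.3.2", "3.4.0"],
}


def expand_matrix(matrix):
    """Return one dict per combination of the matrix values."""
    keys = list(matrix)
    return [dict(zip(keys, values)) for values in product(*(matrix[k] for k in keys))]


jobs = expand_matrix(matrix)
for job in jobs:
    print(f"Python {job['python']}, Spark {job['spark']}")
```

Adding a third value to either list would add two more jobs, so the matrix grows quickly.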

This is useful for ensuring that your PySpark script works correctly on different versions of Python and Spark. Additionally, you can add Spark unit tests and run them via GitHub Actions to ensure that all tests pass before merging PRs.
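
As a rough sketch of what such a unit test could look like, the example below uses Python's built-in unittest with a local SparkSession and checks the MAX(temp) query from app.py against three made-up rows. The class name, the test data, and the local[1] master are illustrative assumptions, not code from the repo. It also assumes pyspark is importable by the Python interpreter (for example, installed with pip); if it is not, the test is skipped instead of failing.

```python
import unittest

try:
    from pyspark.sql import SparkSession
except ImportError:  # pyspark is not importable in this interpreter
    SparkSession = None


@unittest.skipIf(SparkSession is None, "pyspark is not installed")
class MaxTempQueryTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # A single-threaded local session is enough for tiny test data.
        cls.spark = (
            SparkSession.builder.master("local[1]")
            .appName("unit-test")
            .getOrCreate()
        )

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_max_temp_per_cn(self):
        # Made-up rows using the same column names as app.py.
        rows = [("VN", 30), ("VN", 35), ("US", 20)]
        df = self.spark.createDataFrame(rows, ["cn", "temp"])
        df.createOrReplaceTempView("df")

        result = self.spark.sql(
            "SELECT cn, MAX(temp) AS max_temp FROM df GROUP BY cn ORDER BY 2 DESC"
        ).collect()

        self.assertEqual(
            [(r.cn, r.max_temp) for r in result],
            [("VN", 35), ("US", 20)],
        )
```

A workflow step such as python -m unittest, placed after the setup steps, could then run it for every matrix combination.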

That's it. You can find all the code in my repo: https://github.com/duyet/spark-in-github-actions
