Spark – Getting Started

Creating a Spark application in Scala

Creating a Spark application in Scala in stand-alone mode on localhost is fairly straightforward. Spark provides APIs in Scala, Python, Java, and R; since Spark itself is written in Scala, the Scala API has some advantages. The challenge, however, is in setting up a Scala development environment. These notes work as the next step after setting up a Scala environment as detailed in Scala – Getting Started.



Download Spark.
Choose a Spark release: 2.4.5 (Feb 05 2020)
Choose a package type: Pre-built for Apache Hadoop 2.7

~/spark-examples$ mkdir externals/spark
~/spark-examples$ tar -xzf ~/Downloads/spark-2.4.5-bin-hadoop2.7.tgz -C externals/spark

On Windows, extract to d:\spark-examples\externals\spark.


On Windows, due to bug SPARK-2356, Spark produces an error:

ERROR Shell: Failed to locate the winutils binary in the hadoop binary path Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

To fix it, download winutils.exe from Hortonworks or GitHub; it is also included in the example code. Since it is used specifically for Spark, I would put it under the Spark directory, for example %SPARK_HOME%\hadoop\bin\winutils.exe.
Then set the environment variable HADOOP_HOME to the parent directory of bin\winutils.exe.
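On Windows this can be set once from cmd; the path below is an example for the layout above, not a canonical location:

```bat
:: Example only – adjust the path to wherever Spark was extracted.
:: HADOOP_HOME must point at the directory that CONTAINS bin\winutils.exe.
setx HADOOP_HOME "d:\spark-examples\externals\spark\spark-2.4.5-bin-hadoop2.7\hadoop"
```

setx persists the variable in the registry, so it takes effect in newly opened shells rather than the current one.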


SPARK-8333 causes an exception while deleting the Spark temp directory on Windows, so deploying Spark on Windows is not recommended. The root cause is a conflict between Hadoop and the Windows filesystem. The exception can be suppressed by modifying the log4j properties file.

# SPARK-8333: Exception while deleting Spark temp dir
log4j.logger.org.apache.spark.util.ShutdownHookManager=OFF
log4j.logger.org.apache.spark.SparkEnv=ERROR

Spark configuration files are in the $SPARK_HOME/conf directory. Since Spark produces a lot of log information, you may choose to reduce the verbosity by changing the logging level.

spark-examples$ cd externals/spark/spark-2.4.5-bin-hadoop2.7/conf
conf$ cp log4j.properties.template log4j.properties

To show only errors, set log4j.rootCategory=ERROR,console in the properties file.
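For reference, the resulting line in log4j.properties (the template ships with INFO as the root level):

```properties
# conf/log4j.properties – only errors reach the console
log4j.rootCategory=ERROR, console
```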

Check the setup by running the Spark shell, which is a slightly modified Scala REPL. Use :q to exit the shell.

/spark-examples$ spark-shell
20/04/09 02:09:25 WARN Utils: Your hostname, ubuntu resolves to a loopback address:; using instead (on interface ens33)
20/04/09 02:09:25 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/04/09 02:09:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at
Spark context available as 'sc' (master = local[*], app id = local-1586423376133).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_241)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :q
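As a quick smoke test inside the shell, the pre-created SparkContext can run a small job before you exit (a sketch; any small computation will do):

```scala
scala> sc.parallelize(1 to 100).sum()
res0: Double = 5050.0
```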


The example code can be accessed on GitHub at

+-- apps
|   +-- sparkApp
|       +-- build.sbt
|       +-- project
|       |   +--
|       +-- src
|           +-- main
|           |   +-- java
|           |   +-- resources
|           |   +-- scala
|           |       +-- cw
|           |           +-- WordCounter.scala
|           +-- test
|               +-- java
|               +-- resources
|               +-- scala
+-- externals
|   +-- java
|   |   +-- jdk1.8.0_241
|   +-- sbt
|   |   +-- sbt-0.13.18
|   +-- scala
|       +-- scala-2.11.12
|   +-- spark
|       +-- spark-2.4.5-bin-hadoop2.7
+-- env-vars.bat

The code in WordCounter.scala has to be modified to specify your home directory.

val textFile = sc.textFile("file:///home/cw/spark-examples/externals/spark/spark-2.4.5-bin-hadoop2.7/")
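The full WordCounter ships with the example code; the transformation it applies can be sketched on plain Scala collections first. countWords is a hypothetical helper name used here for illustration:

```scala
// Plain-Scala sketch of the word-count transformation.
// On an RDD the equivalent pipeline is typically:
//   sc.textFile(path).flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
def countWords(lines: Seq[String]): Map[String, Int] =
  lines
    .flatMap(_.split("\\s+"))                    // tokenize on whitespace
    .filter(_.nonEmpty)                          // drop empty tokens
    .groupBy(identity)                           // bucket identical words
    .map { case (word, ws) => word -> ws.size }  // word -> occurrence count

println(countWords(Seq("to be or", "not to be")))
```

The RDD version distributes the same steps across partitions, with reduceByKey doing the final per-word aggregation.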


Create a jar file to be run on Spark.
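For reference, a minimal build.sbt for this setup might look like the following; the name, version, and Scala version match the jar produced below, while the spark-core coordinates and provided scope are assumptions:

```scala
// Hypothetical minimal build.sbt for the sparkApp project
name := "sparkApp"
version := "0.1.0-SNAPSHOT"
scalaVersion := "2.11.12"
// "provided": spark-submit supplies the Spark classes at runtime,
// so they are needed only at compile time and stay out of the jar
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.5" % "provided"
```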

~/spark-examples/apps/sparkApp$ sbt package
[info] Loading global plugins from /home/cw/.sbt/1.0/plugins
[info] Loading project definition from /home/cw/spark-examples/apps/sparkApp/project
[info] Set current project to sparkapp (in build file:/home/cw/spark-examples/apps/sparkApp/)
[info] Compiling 1 Scala source to /home/cw/spark-examples/apps/sparkApp/target/scala-2.11/classes...
[info] Done packaging.
[success] Total time: 5 s, completed Apr 9, 2020 7:02:59 AM

Earlier versions of sbt would print the jar name and path, which was convenient, but that is no longer the case:
Packaging /home/cw/spark-examples/apps/sparkApp/target/scala-2.11/sparkapp_2.11-0.1.0-SNAPSHOT.jar


spark-examples/apps/sparkApp$ spark-submit --class cw.WordCounter ~/spark-examples/apps/sparkApp/target/scala-2.11/sparkapp_2.11-0.1.0-SNAPSHOT.jar
+-- WordCounter
    +-- part-00000
    +-- _SUCCESS


Specify --driver-java-options to debug the Spark driver.

spark-examples/apps/sparkApp$  spark-submit \
--driver-java-options -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005 \
--class cw.WordCounter ~/spark-examples/apps/sparkApp/target/scala-2.11/sparkapp_2.11-0.1.0-SNAPSHOT.jar

Set up IntelliJ for remote debugging.

Run -> Debug -> Edit Configurations -> + button (top left hand corner) -> Remote
Name: Debug SparkApp
Debugger mode: Attach to remote JVM
Host: localhost
Port: 5005
Command line arguments for remote JVM: -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005


It’s nice to be able to build Spark applications locally in stand-alone mode; it makes getting started so much easier.