Spark – Getting Started

Creating a Spark application in Scala

Creating a Spark application in Scala in stand-alone mode on localhost is pretty straightforward. Spark provides APIs in Scala, Python, Java and R; since Spark itself is written in Scala, the Scala API has some advantages. The challenge, however, is in setting up a Scala development environment. These notes work as the next step after setting up a Scala environment as detailed in Scala – Getting Started.

Setup
Source
Build
Run
Debug
Conclusion

Setup

Download Spark.
Choose a Spark release: 2.4.5 (Feb 05 2020)
Choose a package type : Pre-built for Apache Hadoop 2.7

Linux
~/spark-examples$ mkdir externals/spark
~/spark-examples$ tar -xzf ~/Downloads/spark-2.4.5-bin-hadoop2.7.tgz -C externals/spark

SPARK_HOME=~/spark-examples/externals/spark/spark-2.4.5-bin-hadoop2.7
PATH=$SPARK_HOME/bin:$PATH 
Windows
Extract to d:\spark-examples\externals\spark.

SPARK_HOME=d:\spark-examples\externals\spark\spark-2.4.5-bin-hadoop2.7
PATH=%SPARK_HOME%\bin;%PATH%

On Windows, due to bug SPARK-2356, Spark produces the following error.

ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

To fix it, download winutils.exe from Hortonworks or GitHub. It is also included in the example code. Since it is used specifically for Spark, I would put it under the Spark directory, for example, %SPARK_HOME%\hadoop\bin\winutils.exe.
Set the environment variable HADOOP_HOME to the parent directory of bin\winutils.exe.

set HADOOP_HOME=%SPARK_HOME%\hadoop

SPARK-8333 causes an exception while deleting the Spark temp directory on Windows, which is one reason deploying Spark on Windows is not recommended. It is a conflict between Hadoop and the Windows filesystem. The error can be suppressed by modifying the log4j.properties file.

# SPARK-8333: Exception while deleting Spark temp dir
log4j.logger.org.apache.spark.util.ShutdownHookManager=OFF
log4j.logger.org.apache.spark.SparkEnv=ERROR

Spark configuration files are in the $SPARK_HOME/conf directory. Since Spark produces a lot of log information, you may choose to reduce the verbosity by changing the logging level.

spark-examples$ cd externals/spark/spark-2.4.5-bin-hadoop2.7/conf
conf$ cp log4j.properties.template log4j.properties

To show only errors, set log4j.rootCategory=ERROR, console in the properties file.
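
For reference, the relevant line in log4j.properties after the change would look something like this (the template that ships with Spark sets it to INFO):

# Show only errors on the console
log4j.rootCategory=ERROR, console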

Check the setup by running the Spark shell, which is essentially the Scala REPL with some Spark-specific additions. Use :q to exit the shell.

/spark-examples$ spark-shell
20/04/09 02:09:25 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.192.128 instead (on interface ens33)
20/04/09 02:09:25 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/04/09 02:09:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.192.128:4040
Spark context available as 'sc' (master = local[*], app id = local-1586423376133).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_241)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
scala> :q
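
Once the shell is up, a quick sanity check is to run a trivial job using the SparkContext exposed as sc. This is just an illustrative snippet; any small computation will do.

// Distribute a small range across the local cores and sum it
val rdd = sc.parallelize(1 to 100)
println(rdd.sum())   // prints 5050.0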

Source

The example code can be accessed on GitHub at https://github.com/cognitivewaves/spark-examples.

spark-examples
+-- apps
|   +-- sparkApp
|       +-- build.sbt
|       +-- project
|       |   +-- build.properties
|       +-- src
|           +-- main
|           |   +-- java
|           |   +-- resources
|           |   +-- scala
|           |       +-- cw
|           |           +-- WordCounter.scala
|           +-- test
|               +-- java
|               +-- resources
|               +-- scala
+-- externals
|   +-- java
|   |   +-- jdk1.8.0_241
|   +-- sbt
|   |   +-- sbt-0.13.18
|   +-- scala
|   |   +-- scala-2.11.12
|   +-- spark
|       +-- spark-2.4.5-bin-hadoop2.7
+-- env-vars.bat
+-- env-vars.sh

The code in WordCounter.scala has to be modified to specify your home directory.

val textFile = sc.textFile("file:///home/cw/spark-examples/externals/spark/spark-2.4.5-bin-hadoop2.7/README.md")
sortedCounts.saveAsTextFile("file:///home/cw/spark-examples/tmp/WordCounter")
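
For context, WordCounter.scala is a classic word count: it reads the Spark README, splits lines into words, counts each word, and writes the sorted counts to a text file. The following is a simplified sketch of that flow, not necessarily identical to the file in the repository; refer to the actual source on GitHub.

package cw

import org.apache.spark.{SparkConf, SparkContext}

object WordCounter {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCounter"))

    // Adjust the path to match your home directory
    val textFile = sc.textFile("file:///home/cw/spark-examples/externals/spark/spark-2.4.5-bin-hadoop2.7/README.md")

    // Split into words, count occurrences, sort by count (descending)
    val sortedCounts = textFile
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)

    sortedCounts.saveAsTextFile("file:///home/cw/spark-examples/tmp/WordCounter")
    sc.stop()
  }
}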

Build

Create a jar file to be run on Spark.

~/spark-examples/apps/sparkApp$ sbt package
[info] Loading global plugins from /home/cw/.sbt/1.0/plugins
[info] Loading project definition from /home/cw/spark-examples/apps/sparkApp/project
[info] Set current project to sparkapp (in build file:/home/cw/spark-examples/apps/sparkApp/)
[info] Compiling 1 Scala source to /home/cw/spark-examples/apps/sparkApp/target/scala-2.11/classes...
[info] Done packaging.
[success] Total time: 5 s, completed Apr 9, 2020 7:02:59 AM

Earlier versions of sbt would print the jar name and path, which was convenient, but that is no longer the case. The output used to include a line like:
Packaging /home/cw/spark-examples/apps/sparkApp/target/scala-2.11/sparkapp_2.11-0.1.0-SNAPSHOT.jar
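
For reference, a minimal build.sbt for a project like this (assuming Scala 2.11.12 and Spark 2.4.5; the file in the repository may differ) would look something like:

name := "sparkapp"
version := "0.1.0-SNAPSHOT"
scalaVersion := "2.11.12"

// "provided" because spark-submit supplies the Spark jars at run time
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.5" % "provided"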

Run

spark-examples/apps/sparkApp$ spark-submit --class cw.WordCounter ~/spark-examples/apps/sparkApp/target/scala-2.11/sparkapp_2.11-0.1.0-SNAPSHOT.jar
The word counts are written to the output directory specified in saveAsTextFile.

tmp
+-- WordCounter
    +-- part-00000
    +-- _SUCCESS
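
The part-00000 file (one part file per output partition) is plain text, so the counts can be inspected directly, for example:

~/spark-examples$ head tmp/WordCounter/part-00000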

Debug

Specify --driver-java-options to debug the Spark driver. With suspend=y, the JVM pauses at startup and waits for a debugger to attach on port 5005 before running the application.

spark-examples/apps/sparkApp$  spark-submit \
--driver-java-options -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005 \
--class cw.WordCounter ~/spark-examples/apps/sparkApp/target/scala-2.11/sparkapp_2.11-0.1.0-SNAPSHOT.jar

Set up IntelliJ for remote debugging.

Run -> Debug -> Edit Configurations -> + button (top left hand corner) -> Remote
Name: Debug SparkApp
Debugger mode: Attach to remote JVM
Host: localhost
Port: 5005
Command line arguments for remote JVM: -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005

Conclusion

It’s nice to be able to build Spark applications in stand-alone mode locally. It makes getting started so much easier.