Creating a Spark application in Scala
Creating a Spark application in Scala in stand-alone mode on localhost is pretty straightforward. Spark supports APIs in Scala, Python, Java and R. Spark itself is written in Scala, so the Scala API has some advantages. The challenge, however, is in setting up a Scala development environment. These notes work as the next step after setting up a Scala environment, detailed in Scala – Getting Started.
Download a Spark binary from the Apache Spark downloads page:

Choose a Spark release: 2.4.5 (Feb 05 2020)
Choose a package type: Pre-built for Apache Hadoop 2.7
~/spark-examples$ mkdir -p externals/spark
~/spark-examples$ tar -xzf ~/Downloads/spark-2.4.5-bin-hadoop2.7.tgz -C externals/spark

export SPARK_HOME=~/spark-examples/externals/spark/spark-2.4.5-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
On Windows, due to the bug SPARK-2356, Spark produces an error:

ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

To fix it, download winutils.exe and set the HADOOP_HOME environment variable to the directory that contains it under a bin subdirectory.
SPARK-8333 causes an exception while deleting a Spark temp directory on Windows, so deploying on Windows is not recommended. It is a conflict between Hadoop and the Windows filesystem. The error can be suppressed by modifying the log4j.properties file:

# SPARK-8333: Exception while deleting Spark temp dir
log4j.logger.org.apache.spark.util.ShutdownHookManager=OFF
log4j.logger.org.apache.spark.SparkEnv=ERROR
Spark configuration files are in the $SPARK_HOME/conf directory. Since Spark produces a lot of log information, you may choose to reduce the verbosity by changing the logging level.
spark-examples$ cd externals/spark/spark-2.4.5-bin-hadoop2.7/conf
conf$ cp log4j.properties.template log4j.properties
To show only errors, set log4j.rootCategory=ERROR,console in the properties file.
Check the setup by running the Spark shell, which is a slightly modified Scala REPL. Use :q to exit the shell.
/spark-examples$ spark-shell
20/04/09 02:09:25 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.192.128 instead (on interface ens33)
20/04/09 02:09:25 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/04/09 02:09:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.192.128:4040
Spark context available as 'sc' (master = local[*], app id = local-1586423376133).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_241)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
scala> :q
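As a quick sanity check inside the shell, the pre-created sc SparkContext (announced in the transcript above) can be exercised directly. This snippet is a hypothetical example, not from the original notes:

```scala
// Run inside spark-shell: 'sc' is the SparkContext the shell creates for you.
// Distribute the numbers 1..100 across local cores and sum them.
val nums = sc.parallelize(1 to 100)
println(nums.sum())  // prints 5050.0 (RDD sum returns a Double)
```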
The example code can be accessed on GitHub at https://github.com/cognitivewaves/spark-examples.
spark-examples
+-- apps
|   +-- sparkApp
|       +-- build.sbt
|       +-- project
|       |   +-- build.properties
|       +-- src
|           +-- main
|           |   +-- java
|           |   +-- resources
|           |   +-- scala
|           |       +-- cw
|           |           +-- WordCounter.scala
|           +-- test
|               +-- java
|               +-- resources
|               +-- scala
+-- externals
|   +-- java
|   |   +-- jdk1.8.0_241
|   +-- sbt
|   |   +-- sbt-0.13.18
|   +-- scala
|   |   +-- scala-2.11.12
|   +-- spark
|       +-- spark-2.4.5-bin-hadoop2.7
+-- env-vars.bat
+-- env-vars.sh
The code in WordCounter.scala has to be modified to specify your home directory.
val textFile = sc.textFile("file:///home/cw/spark-examples/externals/spark/spark-2.4.5-bin-hadoop2.7/README.md")
sortedCounts.saveAsTextFile("file:///home/cw/spark-examples/tmp/WordCounter")
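For context, a complete WordCounter.scala might look like the sketch below. Only the two lines above appear in the original notes; the intermediate transformations (the classic RDD word-count pattern, sorted by descending count) are assumptions:

```scala
package cw

import org.apache.spark.{SparkConf, SparkContext}

object WordCounter {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCounter")
    val sc = new SparkContext(conf)

    // Read the README bundled with the Spark distribution.
    val textFile = sc.textFile("file:///home/cw/spark-examples/externals/spark/spark-2.4.5-bin-hadoop2.7/README.md")

    // Split into words, count occurrences, sort by descending count.
    val sortedCounts = textFile
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)

    sortedCounts.saveAsTextFile("file:///home/cw/spark-examples/tmp/WordCounter")
    sc.stop()
  }
}
```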
Run sbt package to build the jar file to be run on Spark.
~/spark-examples/apps/sparkApp$ sbt package
[info] Loading global plugins from /home/cw/.sbt/1.0/plugins
[info] Loading project definition from /home/cw/spark-examples/apps/sparkApp/project
[info] Set current project to sparkapp (in build file:/home/cw/spark-examples/apps/sparkApp/)
[info] Compiling 1 Scala source to /home/cw/spark-examples/apps/sparkApp/target/scala-2.11/classes...
[info] Done packaging.
[success] Total time: 5 s, completed Apr 9, 2020 7:02:59 AM
Earlier versions of sbt would output the jar name and path, which was convenient, but that is no longer the case.
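For reference, a minimal build.sbt consistent with the jar name produced below (sparkapp_2.11-0.1.0-SNAPSHOT.jar) might look like this sketch; the exact contents of the project's build files are an assumption:

```scala
// build.sbt -- a minimal sketch
name := "sparkApp"
version := "0.1.0-SNAPSHOT"
scalaVersion := "2.11.12"

// "provided": spark-submit supplies the Spark classes at runtime,
// so they are only needed at compile time.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.5" % "provided"
```

project/build.properties would pin the sbt version (the directory tree above shows sbt-0.13.18, i.e. sbt.version=0.13.18).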
spark-examples/apps/sparkApp$ spark-submit --class cw.WordCounter ~/spark-examples/apps/sparkApp/target/scala-2.11/sparkapp_2.11-0.1.0-SNAPSHOT.jar
tmp
+-- WordCounter
    +-- part-00000
    +-- _SUCCESS
Use --driver-java-options for debugging the Spark driver.
spark-examples/apps/sparkApp$ spark-submit \
  --driver-java-options -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005 \
  --class cw.WordCounter ~/spark-examples/apps/sparkApp/target/scala-2.11/sparkapp_2.11-0.1.0-SNAPSHOT.jar
Set up IntelliJ for remote debugging.
Run -> Debug -> Edit Configurations -> + button (top left hand corner) -> Remote
  Name: Debug SparkApp
  Debugger mode: Attach to remote JVM
  Host: localhost
  Port: 5005
  Command line arguments for remote JVM: -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005
It’s nice to be able to build Spark applications in stand-alone mode locally. It makes getting started so much easier.