Creating a Spark application in Scala
Creating a Spark application in Scala in stand-alone mode on localhost is pretty straightforward. Spark supports APIs in Scala, Python, Java and R. Spark itself is written in Scala, so the Scala API has some advantages. The challenge, however, is in setting up a Scala development environment. These notes work as the next step after setting up a Scala environment, detailed in Scala – Getting Started.
Download a Spark binary from the Apache Spark downloads page:

Choose a Spark release: 2.4.5 (Feb 05 2020)
Choose a package type: Pre-built for Apache Hadoop 2.7
~/spark-examples$ mkdir -p externals/spark
~/spark-examples$ tar -xzf ~/Downloads/spark-2.4.5-bin-hadoop2.7.tgz -C externals/spark

export SPARK_HOME=~/spark-examples/externals/spark/spark-2.4.5-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
On Windows, due to the bug SPARK-2356, Spark produces an error:

ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

To fix it, download winutils.exe and set the HADOOP_HOME environment variable to the directory that contains it under a bin subdirectory.
SPARK-8333 causes an exception while deleting a Spark temp directory on Windows, so deploying on Windows is not recommended. It is a conflict between Hadoop and the Windows filesystem. The error can be suppressed by modifying the log4j.properties file:

# SPARK-8333: Exception while deleting Spark temp dir
log4j.logger.org.apache.spark.util.ShutdownHookManager=OFF
log4j.logger.org.apache.spark.SparkEnv=ERROR
Spark configuration files are in the $SPARK_HOME/conf directory. Since Spark produces a lot of log information, you may choose to reduce the verbosity by changing the logging level.
spark-examples$ cd externals/spark/spark-2.4.5-bin-hadoop2.7/conf
conf$ cp log4j.properties.template log4j.properties
To show only errors, set log4j.rootCategory=ERROR,console in the properties file.
Check the setup by running the Spark shell, which is a slightly modified Scala REPL. Use :q to exit the shell.
/spark-examples$ spark-shell
20/04/09 02:09:25 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.192.128 instead (on interface ens33)
20/04/09 02:09:25 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/04/09 02:09:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.192.128:4040
Spark context available as 'sc' (master = local[*], app id = local-1586423376133).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_241)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
scala> :q
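As a quick sanity check inside the shell, the pre-created sc SparkContext (announced in the transcript above) can be exercised directly. This snippet is a hypothetical example, not from the original notes:

```scala
// Run inside spark-shell: 'sc' is the SparkContext the shell creates for you.
// Distribute the numbers 1..100 across local cores and sum them.
val nums = sc.parallelize(1 to 100)
println(nums.sum())  // prints 5050.0 (RDD sum returns a Double)
```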
The example code can be accessed on GitHub at https://github.com/cognitivewaves/spark-examples.
spark-examples
+-- apps
|   +-- sparkApp
|       +-- build.sbt
|       +-- project
|       |   +-- build.properties
|       +-- src
|           +-- main
|           |   +-- java
|           |   +-- resources
|           |   +-- scala
|           |       +-- cw
|           |           +-- WordCounter.scala
|           +-- test
|               +-- java
|               +-- resources
|               +-- scala
+-- externals
|   +-- java
|   |   +-- jdk1.8.0_241
|   +-- sbt
|   |   +-- sbt-0.13.18
|   +-- scala
|   |   +-- scala-2.11.12
|   +-- spark
|       +-- spark-2.4.5-bin-hadoop2.7
+-- env-vars.bat
+-- env-vars.sh
The code in WordCounter.scala has to be modified to specify your home directory.
val textFile = sc.textFile("file:///home/cw/spark-examples/externals/spark/spark-2.4.5-bin-hadoop2.7/README.md")
sortedCounts.saveAsTextFile("file:///home/cw/spark-examples/tmp/WordCounter")
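For context, a complete WordCounter.scala might look like the sketch below. Only the two lines above appear in the original notes; the intermediate transformations (the classic RDD word-count pattern, sorted by descending count) are assumptions:

```scala
package cw

import org.apache.spark.{SparkConf, SparkContext}

object WordCounter {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCounter")
    val sc = new SparkContext(conf)

    // Read the README bundled with the Spark distribution.
    val textFile = sc.textFile("file:///home/cw/spark-examples/externals/spark/spark-2.4.5-bin-hadoop2.7/README.md")

    // Split into words, count occurrences, sort by descending count.
    val sortedCounts = textFile
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)

    sortedCounts.saveAsTextFile("file:///home/cw/spark-examples/tmp/WordCounter")
    sc.stop()
  }
}
```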
Run sbt package to build the jar file to be run on Spark.
~/spark-examples/apps/sparkApp$ sbt package
[info] Loading global plugins from /home/cw/.sbt/1.0/plugins
[info] Loading project definition from /home/cw/spark-examples/apps/sparkApp/project
[info] Set current project to sparkapp (in build file:/home/cw/spark-examples/apps/sparkApp/)
[info] Compiling 1 Scala source to /home/cw/spark-examples/apps/sparkApp/target/scala-2.11/classes...
[info] Done packaging.
[success] Total time: 5 s, completed Apr 9, 2020 7:02:59 AM
Earlier versions of sbt would output the jar name and path, which was convenient, but that is no longer the case.
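For reference, a minimal build.sbt consistent with the jar name produced below (sparkapp_2.11-0.1.0-SNAPSHOT.jar) might look like this sketch; the exact contents of the project's build files are an assumption:

```scala
// build.sbt -- a minimal sketch
name := "sparkApp"
version := "0.1.0-SNAPSHOT"
scalaVersion := "2.11.12"

// "provided": spark-submit supplies the Spark classes at runtime,
// so they are only needed at compile time.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.5" % "provided"
```

project/build.properties would pin the sbt version (the directory tree above shows sbt-0.13.18, i.e. sbt.version=0.13.18).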
spark-examples/apps/sparkApp$ spark-submit --class cw.WordCounter ~/spark-examples/apps/sparkApp/target/scala-2.11/sparkapp_2.11-0.1.0-SNAPSHOT.jar
tmp
+-- WordCounter
    +-- part-00000
    +-- _SUCCESS
Use --driver-java-options for debugging the Spark driver.
spark-examples/apps/sparkApp$ spark-submit \
  --driver-java-options -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005 \
  --class cw.WordCounter ~/spark-examples/apps/sparkApp/target/scala-2.11/sparkapp_2.11-0.1.0-SNAPSHOT.jar
Set up IntelliJ for remote debugging.
Run -> Debug -> Edit Configurations -> + button (top left hand corner) -> Remote
  Name: Debug SparkApp
  Debugger mode: Attach to remote JVM
  Host: localhost
  Port: 5005
  Command line arguments for remote JVM: -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005
It’s nice to be able to build Spark applications in stand-alone mode locally. It makes getting started so much easier.