This post provides a template for starting a Spark project with Maven, with code examples in both Java and Scala. The template covers Spark Core, Spark SQL, and Spark Streaming, and is a good starting point for any new Spark project.
The project template can be found in this GitHub repository.
Project structure
The project is structured as a parent pom with a child module.
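In outline, the layout looks roughly like this (the directory and module names here are illustrative, not taken from the repository):

```
spark-template/              <- parent pom (packaging: pom)
├── pom.xml
└── spark-module/            <- child module with the actual sources
    ├── pom.xml
    └── src/
        ├── main/java/com/tikalk/
        ├── main/scala/com/tikalk/
        └── test/scala/com/tikalk/
```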
pom.xml explained
All dependencies except for the Spark libraries are defined in the <dependencyManagement> section of the parent pom.
The Spark libraries are defined in two profiles: spark-prod and spark-dev.
The ‘spark-prod’ profile declares all Spark libraries with scope provided. The build then produces a fat jar that does not bundle the Spark libraries, so it can be deployed to a cluster with spark-submit.
The ‘spark-dev’ profile declares the same libraries with scope compile, which allows running and debugging the project in IntelliJ or Eclipse in local mode.
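One common way to wire this up is to let each profile set a property that the dependency declarations reuse. This is only a sketch, not the template's exact pom; the spark.version property and the Scala 2.11 artifact suffix are assumptions:

```xml
<profiles>
  <!-- Production: Spark is provided by the cluster, keep it out of the fat jar -->
  <profile>
    <id>spark-prod</id>
    <properties>
      <spark.scope>provided</spark.scope>
    </properties>
  </profile>
  <!-- Development: bundle Spark so the job can run locally in the IDE -->
  <profile>
    <id>spark-dev</id>
    <properties>
      <spark.scope>compile</spark.scope>
    </properties>
  </profile>
</profiles>

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>${spark.version}</version>
    <scope>${spark.scope}</scope>
  </dependency>
  <!-- spark-sql and spark-streaming follow the same pattern -->
</dependencies>
```

An equivalent alternative is to repeat the full dependency list inside each profile; the property-based variant just avoids the duplication.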
To debug the project, activate the ‘spark-dev’ profile, either by passing -Pspark-dev to Maven or by selecting it in the IDE. In IntelliJ, for example, you can tick spark-dev under the Profiles node of the Maven tool window.
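From the command line, the profiles are activated like any other Maven profile (the jar path in the spark-submit line is a placeholder):

```
# run and debug locally, with Spark bundled in compile scope
mvn clean package -Pspark-dev

# build the fat jar for the cluster, with Spark in provided scope
mvn clean package -Pspark-prod
spark-submit --class com.tikalk.HelloWorldScala path/to/your-fat-jar.jar
```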
Source templates
There is one Java source template for Spark Core called HelloWorldJava.java.
There are three Scala source templates, for Spark Core, Spark SQL, and Spark Streaming respectively:
- HelloWorldScala.scala
- HelloWorldSqlScala.scala
- HelloWorldStreaming.scala
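To give a feel for what these templates contain, here is a minimal Spark Core entry point in Scala. It is a sketch in the spirit of HelloWorldScala.scala rather than its exact contents (only the com.tikalk package is taken from the project):

```scala
package com.tikalk

import org.apache.spark.{SparkConf, SparkContext}

object HelloWorldScala {
  def main(args: Array[String]): Unit = {
    // local[*] lets the job run inside the IDE under the spark-dev profile;
    // when deploying with spark-submit, the master is normally passed on the command line
    val conf = new SparkConf()
      .setAppName("HelloWorldScala")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Trivial word count, just to exercise the SparkContext
    val counts = sc.parallelize(Seq("hello", "world", "hello"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach(println)
    sc.stop()
  }
}
```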
There is one test class, Hello_Test.scala, which uses the ScalaTest testing framework.
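A ScalaTest suite along these lines (illustrative, not the actual contents of Hello_Test.scala) could look like:

```scala
package com.tikalk

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class Hello_Test extends FunSuite with BeforeAndAfterAll {

  private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    // a local SparkContext is all a unit test needs
    sc = new SparkContext(new SparkConf().setAppName("Hello_Test").setMaster("local[2]"))
  }

  override def afterAll(): Unit = sc.stop()

  test("word count returns the expected totals") {
    val counts = sc.parallelize(Seq("hello", "world", "hello"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collectAsMap()
    assert(counts("hello") === 2)
    assert(counts("world") === 1)
  }
}
```

Creating the context once in beforeAll and stopping it in afterAll keeps the suite fast, since starting a SparkContext is relatively expensive.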
Troubleshooting
If you get the following error while trying to run any of the classes:
```
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkConf
    at com.tikalk.HelloWorldScala$.main(HelloWorldScala.scala:13)
    at com.tikalk.HelloWorldScala.main(HelloWorldScala.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
```
It is caused by forgetting to activate the ‘spark-dev’ profile: the Spark libraries stay in provided scope and are therefore missing from the runtime classpath. Activate ‘spark-dev’ as described in the section "pom.xml explained" above.
Once again, the full project template can be found in this GitHub repository.