Development Setup
Follow these steps to set up your machine for local Spark/Scala development.
Windows Setup:
Let's assume you are on an x64 machine. Make sure you have a good amount of RAM and enough cores to have a decent experience developing and running Spark apps.
Recommended RAM: 8 GB+
Recommended cores: 4+
Windows 10 or Windows Server 2016
Prereqs
- Java SDK 1.8.x +
- ENVIRONMENT VAR: JAVA_HOME = java installation directory
- Scala 2.12.x +
- ENVIRONMENT VAR: SCALA_HOME = scala installation directory
- IntelliJ IDEA (Community Edition)
- Install the Scala plugin: File -> Settings -> Plugins -> search for Scala -> install the JetBrains version of the Scala plugin
- Optional setup: install Spark and Hadoop binaries if you want to work with spark-shell, etc. without using IntelliJ IDEA
- Spark binaries
- Unzip them to a well-known location, say C:\Spark\
- ENVIRONMENT VARS:
- SPARK_HOME: C:\Spark
- PATH: %PATH%;%SPARK_HOME%\bin
- Hadoop binaries
- You don't need the entire Hadoop distribution, just winutils.exe on your HADOOP_HOME\bin path
- Set up a well-known location, say C:\Hadoop\bin
- Download winutils.exe to the above location
- ENVIRONMENT VARS:
- HADOOP_HOME: C:\Hadoop
- PATH: %PATH%;%HADOOP_HOME%\bin
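As a quick sanity check (a minimal sketch, assuming spark-shell or a plain Scala REPL is available), you can confirm the JVM sees the versions and variables above:

// Paste into spark-shell or the Scala REPL
println(System.getProperty("java.version"))  // expect 1.8.x
println(util.Properties.versionString)       // the Scala version on the PATH
Seq("JAVA_HOME", "SCALA_HOME", "SPARK_HOME", "HADOOP_HOME")
  .foreach(v => println(s"$v = ${sys.env.getOrElse(v, "<not set>")}"))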
Creating a new Spark project in IntelliJ IDEA from a Maven archetype
- Start IntelliJ -> New -> Project
- Select the Maven option -> check Create from archetype -> select org.scala-tools.archetypes:scala-archetype-simple -> Next
- Enter a groupId (e.g. com.myproject.myscenario, org.myorg.myproject, etc.)
- Enter an artifactId (this is the name your jar will be created with) -> Next -> Next until you hit the project location step
- Enter the project name and location and click Finish
- Ensure you have set the project to "Auto import" dependencies. This prompt usually pops up while the project is loading.
- Let the project finish loading
- Go to pom.xml in the root of the project and make the following changes
- Change scala.version to a 2.10.x release (to match the _2.10 Spark artifacts below)
- Add the following block under <dependencies>
<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector_2.10</artifactId>
    <version>1.6.2</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.2</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>1.6.2</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.6.2</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.10</artifactId>
    <version>1.6.2</version>
</dependency>
- Let the project finish importing these new dependencies
- Delete the MySpec.scala file from the $root\src\test\scala\<your-group-id> folder. It is unnecessary and causes a bunch of compilation errors
- Open your main App.scala file under $root\src\main\scala\<your-group-id> and modify it as follows
object App {
  def main(args: Array[String]): Unit = {
    println("Hello World!")
  }
}
- Right-click the App object in the project window, or right-click in the editor with your cursor inside main, and select "Run App" (Ctrl+Shift+F10)
- You should see compilation succeed and the output "Hello World!" in the Run window at the bottom of the IDE.
You are now all set to start doing some Spark stuff!
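With the dependencies above in place, the following imports should now resolve in App.scala (a quick sketch to confirm the core, SQL, streaming, Kafka, and Cassandra connector artifacts were pulled in; package names are those published with the 1.6.2 artifacts):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._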
First Spark App
We will now do our development inside the main function we wrote above.
Let's do some basic stuff.
Set up the Spark context:
// At the top of App.scala:
import org.apache.spark.{SparkConf, SparkContext}

// Inside main:
val conf = new SparkConf()
  .setAppName("MyTestApp")
  .setMaster("local[*]") // run locally, using all available cores
val sc = SparkContext.getOrCreate(conf)
println(sc.getConf.toDebugString)
/* OUTPUT of the APP will resemble:
spark.app.id=local-1493072120762
spark.app.name=MyTestApp
spark.driver.host=<your IP address>
spark.driver.port=50881
spark.executor.id=driver
spark.externalBlockStore.folderName=spark-43843c91-6646-4d5e-8294-e6bba77aa287
spark.master=local[*]
*/
/*
In the log garbage, you will also see the Spark web UI endpoint:
17/04/24 22:15:21 INFO SparkUI: Stopped Spark web UI at http://10.89.1.15:4040
*/
NOTE: The Spark UI is created when the SparkContext starts and is torn down when the SparkContext stops,
so don't go looking for it once the job is complete.
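For example, stopping the context explicitly (a small sketch) takes the UI down with it:

sc.stop() // tears down the SparkContext and its web UI; port 4040 stops responding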
Now let's write a basic Spark app that does some parallel computing.
Spark app to find the sum of the numbers from 1 to 1M
val sc = SparkContext.getOrCreate(conf)
val numbers = sc.parallelize(1 to 1000000) // RDD[Int], split across the local cores
val sum = numbers.sum()                    // sum() returns a Double
println(sum) // output: 5.000005E11 (i.e. 500000500000)
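The same total can be computed with an explicit reduce, which makes the parallel combine step more visible (a small sketch; note the map to Long, since the total overflows Int):

val sumViaReduce = numbers.map(_.toLong).reduce(_ + _) // partitions are summed independently, then combined
println(sumViaReduce) // 500000500000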
Looking at the Spark UI for the above app: since the Spark UI is destroyed as soon as the SparkContext is closed, we need a way to keep it alive until we have had a peek at the UI. To do this, let's add a readLine() at the end of our main function.
  ...
  println("press Enter to exit...")
  readLine() // halts the program until we press Enter in the Run window in IntelliJ IDEA
  println("Exiting.")
}
Now, when we run the program and it stops at "press Enter to exit...", point your browser at http://<ip>:4040 and you should see the majestic Spark UI in all its glory.
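Putting the pieces together, a complete App.scala might look like this (a sketch under the same assumptions as above: the _2.10 Spark 1.6.x artifacts, running with master local[*]):

import org.apache.spark.{SparkConf, SparkContext}

object App {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("MyTestApp")
      .setMaster("local[*]")
    val sc = SparkContext.getOrCreate(conf)

    // Sum of 1..1,000,000 computed in parallel
    val numbers = sc.parallelize(1 to 1000000)
    println(numbers.sum()) // 5.000005E11

    // Keep the SparkContext (and its web UI on port 4040) alive until Enter is pressed
    println("press Enter to exit...")
    readLine()

    sc.stop()
    println("Exiting.")
  }
}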