How to build Spark source code and run its built-in example in IntelliJ

One way to dig into and debug Spark's codebase is to build the Spark source code and run its built-in examples (e.g., LogQuery and GroupByTest) in IntelliJ. This gives you full control over the Spark source code: for example, you can add your own code on top of Spark, set breakpoints, and observe how the code behaves. However, setting up the Spark source code and running its examples in IntelliJ is a little tricky. This tutorial shows you how to build the Spark source code and run its LogQuery example in local mode in IntelliJ.

Requirements:

  1. IntelliJ with the Scala plugin and the ANTLR plugin installed

Steps:

  1. Clone the Spark repo (the Spark release version at the time of writing is 2.2.0)
    git clone git@github.com:apache/spark.git
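    Optionally, if you want to build exactly the 2.2.0 release rather than the current master branch, check out the release tag first (the tag name below assumes Spark's usual v<version> tag naming):

    cd spark
    git checkout v2.2.0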
  2. Open IntelliJ and import the downloaded Spark project folder
    • Select Maven when asked to Import project from external model

    • Select Import Maven projects automatically

 

    • For the rest of the import configuration, accept the defaults by simply clicking Next
  3. Go to the terminal and run the following commands to compile Spark
    cd spark_root_folder
    
    ./build/mvn -DskipTests clean package 
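    Once the build finishes, you can optionally sanity-check it from the command line before wiring up IntelliJ, for example by launching the same example through the bundled run-example script:

    ./bin/run-example LogQuery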
  4. Go back to IntelliJ and navigate to File -> Project Structure. Then, select the spark-streaming-flume-sink module and set scala-2.11 -> src_managed -> * (in the target folder) to Sources while excluding the others in this path on the right panel, as shown below
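    If the src_managed folder does not show up in the Project Structure dialog, you can confirm from the terminal that the Maven build actually produced the generated sources (the module path below assumes the default Spark source layout):

    ls external/flume-sink/target/scala-2.11/src_managed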
  5. In IntelliJ, go to View -> Tool Windows -> Maven Projects and click Generate Sources and Update Folders For All Projects
  6. Again, navigate back to File -> Project Structure. Then, select the spark-catalyst module and set generated-sources -> antlr4 -> * (in the target -> generated-sources folder) to Sources while excluding the others in this path on the right panel, as shown below
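    Similarly, you can verify that the ANTLR-generated parser sources exist before marking them (the path assumes the default antlr4-maven-plugin output location):

    ls sql/catalyst/target/generated-sources/antlr4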
  7. Go to the pom.xml file under the examples module and comment out the provided-scope properties under <properties> and the <scope>provided</scope> entries under the dependencies, as follows
    <properties>
        <sbt.project.name>examples</sbt.project.name>
        <build.testJarPhase>none</build.testJarPhase>
        <build.copyDependenciesPhase>package</build.copyDependenciesPhase>
        <!--<flume.deps.scope>provided</flume.deps.scope>-->
        <!--<hadoop.deps.scope>provided</hadoop.deps.scope>-->
        <!--<hive.deps.scope>provided</hive.deps.scope>-->
        <!--<parquet.deps.scope>provided</parquet.deps.scope>-->
    </properties>
    
    <dependency>
        <groupId>org.spark-project.spark</groupId>
        <artifactId>unused</artifactId>
        <version>1.0.0</version>
        <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${project.version}</version>
        <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_${scala.binary.version}</artifactId>
        <version>${project.version}</version>
        <!--<scope>provided</scope>-->
    </dependency>
    ...
  8. Navigate back to File -> Project Structure and select the spark-examples module. On the right panel, click the Dependencies tab and change the scope from Provided to Compile for all dependencies.
  9. In IntelliJ, go to Run -> Edit Configurations and create a new Application configuration with the following settings (note that the VM option -Dspark.master=local is required)
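    For reference, a typical configuration for the LogQuery example looks like the following; the module name is taken from the default Maven import and may differ in your setup:

    Name:                     LogQuery (local)
    Main class:               org.apache.spark.examples.LogQuery
    VM options:               -Dspark.master=local
    Working directory:        <spark_root_folder>
    Use classpath of module:  spark-examples_2.11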
  10. The final step is to run the created application.
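
If everything is wired up correctly, LogQuery runs entirely inside the IDE in local mode. For reference, the sketch below (illustrative only, not the actual LogQuery source) shows the general shape of such a job and how the -Dspark.master=local VM option takes effect:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative sketch only -- not the actual LogQuery source.
    object LocalModeSketch {
      def main(args: Array[String]): Unit = {
        // No setMaster() call here: SparkConf picks up the -Dspark.master=local
        // system property supplied via the run configuration's VM options.
        val conf = new SparkConf().setAppName("LocalModeSketch")
        val sc = new SparkContext(conf)

        // A tiny word-count-style job, just to confirm the local setup works.
        val counts = sc.parallelize(Seq("a", "b", "a"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.collect().foreach(println)

        sc.stop()
      }
    }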

 

Possible Exceptions:

  1. If you encounter Exception in thread "main" java.lang.NoClassDefFoundError when you run the application, double-check that no dependency is still in provided scope, as described in steps 7 and 8 above.
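    A quick way to spot a leftover provided scope is to search the examples pom.xml for any entries you may have missed:

    grep -n "provided" examples/pom.xml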
