Creating a Scalding project with Maven

In this post I am going to show how to start a Scalding project with Maven.

Set up Scala in your IDE

The first step is to make sure your IDE (IntelliJ or Eclipse) supports Scala.

Create the Scalding project

  • Create a new Maven Scala project.
  • Use this pom.xml (its Scala compiler plugin section is commented out, so the IDE is what compiles the Scala sources):

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
     <modelVersion>4.0.0</modelVersion>
    
     <groupId>com.mysite</groupId>
     <artifactId>scalding-test</artifactId>
     <packaging>jar</packaging>
     <version>1.1.1-SNAPSHOT</version>
    <properties>
     <scala.version>2.10.4</scala.version>
     <main>scalding.examples.JobRunner</main>
     <targetJdk>1.6</targetJdk>
     <slf4j.log4j12.version>1.6.2</slf4j.log4j12.version>
     <junit.version>4.8.2</junit.version>
     <maven-jar-plugin-version>2.3.1</maven-jar-plugin-version>
     <cascading.version>2.5.2</cascading.version>
     <hadoop.version>2.3.0-cdh5.0.3</hadoop.version>
     </properties>
    <repositories>
     <repository>
     <id>conjars.org</id>
     <url>http://conjars.org/repo</url>
     </repository>
     <repository>
     <id>cloudera-releases</id>
     <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
     <releases>
     <enabled>true</enabled>
     </releases>
     <snapshots>
     <enabled>false</enabled>
     </snapshots>
     </repository>
    </repositories>
    
    <dependencies>
     <dependency>
     <groupId>org.scala-lang</groupId>
     <artifactId>scala-library</artifactId>
     <version>${scala.version}</version>
     </dependency>
     <dependency>
     <groupId>com.twitter</groupId>
     <artifactId>scalding-core_2.10</artifactId>
     <version>0.12.0</version>
     </dependency>
     
     <dependency>
     <groupId>org.apache.hadoop</groupId>
     <artifactId>hadoop-client</artifactId>
     <version>${hadoop.version}</version>
     </dependency>
     <dependency>
     <groupId>org.apache.hadoop</groupId>
     <artifactId>hadoop-common</artifactId>
     <version>${hadoop.version}</version>
     </dependency>
     <dependency>
     <groupId>cascading</groupId>
     <artifactId>cascading-hadoop</artifactId>
     <version>${cascading.version}</version>
     </dependency>
     <dependency>
     <groupId>cascading</groupId>
     <artifactId>cascading-core</artifactId>
     <version>${cascading.version}</version>
     </dependency>
     <dependency>
     <groupId>cascading</groupId>
     <artifactId>cascading-local</artifactId>
     <version>${cascading.version}</version>
     </dependency>
     <dependency>
     <groupId>org.slf4j</groupId>
     <artifactId>slf4j-api</artifactId>
     <version>${slf4j.log4j12.version}</version>
     </dependency>
     <dependency>
     <groupId>org.slf4j</groupId>
     <artifactId>slf4j-log4j12</artifactId>
     <version>${slf4j.log4j12.version}</version>
     </dependency>
    </dependencies>
    <build>
     <resources>
     <resource>
     <directory>src/main/resources</directory>
     <filtering>true</filtering>
     </resource>
     </resources>
     <plugins>
     <!--<plugin>-->
     <!--<artifactId>maven-compiler-plugin</artifactId>-->
     <!--<configuration>-->
     <!--<source>1.8</source>-->
     <!--<target>1.8</target>-->
     <!--<showDeprecation>false</showDeprecation>-->
     <!--<showWarnings>false</showWarnings>-->
     <!--</configuration>-->
     <!--</plugin>-->
     <!--<plugin>-->
     <!--<groupId>org.scala-tools</groupId>-->
     <!--<artifactId>maven-scala-plugin</artifactId>-->
     <!--<executions>-->
     <!--<execution>-->
     <!--<id>scala-compile-first</id>-->
     <!--<phase>process-resources</phase>-->
     <!--<goals>-->
     <!--<goal>add-source</goal>-->
     <!--<goal>compile</goal>-->
     <!--</goals>-->
     <!--<configuration>-->
     <!--<args>-->
     <!--<arg>-make:transitive</arg>-->
     <!--<arg>-dependencyfile</arg>-->
     <!--<arg>${project.build.directory}/.scala_dependencies</arg>-->
     <!--</args>-->
     <!--</configuration>-->
     <!--</execution>-->
     <!--<execution>-->
     <!--<id>scala-test-compile</id>-->
     <!--<phase>process-test-resources</phase>-->
     <!--<goals>-->
     <!--<goal>testCompile</goal>-->
     <!--</goals>-->
     <!--</execution>-->
     <!--</executions>-->
     <!--</plugin>-->
     <plugin>
     <groupId>org.apache.maven.plugins</groupId>
     <artifactId>maven-shade-plugin</artifactId>
     <version>2.2</version>
     <executions>
     <execution>
     <phase>package</phase>
     <goals>
     <goal>shade</goal>
     </goals>
     <configuration>
     <finalName>scalding-test-jar-with-libs-${project.version}</finalName>
     <shadedArtifactAttached>true</shadedArtifactAttached>
     <transformers>
     <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
     <resource>META-INF/spring.handlers</resource>
     </transformer>
     <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
     <resource>META-INF/spring.schemas</resource>
     </transformer>
     <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
     <mainClass>${main}</mainClass>
     </transformer>
     </transformers>
     <filters>
     <filter>
     <artifact>*:*</artifact>
     <excludes>
     <exclude>META-INF/*.SF</exclude>
     <exclude>META-INF/*.DSA</exclude>
     <exclude>META-INF/*.RSA</exclude>
     </excludes>
     </filter>
     </filters>
     </configuration>
     </execution>
     </executions>
     </plugin>
     <plugin>
     <groupId>org.apache.maven.plugins</groupId>
     <artifactId>maven-assembly-plugin</artifactId>
     <version>2.4</version>
     <executions>
     <execution>
     <id>assembly-zip</id>
     <goals>
     <goal>single</goal>
     </goals>
     <phase>verify</phase>
     <configuration>
     <finalName>scalding-test-${project.version}</finalName>
     <attach>false</attach>
     <appendAssemblyId>false</appendAssemblyId>
     <descriptor>src/main/assembly/assembly-descriptor.xml</descriptor>
     </configuration>
     </execution>
     </executions>
     </plugin>
     <plugin>
     <groupId>org.codehaus.mojo</groupId>
     <artifactId>build-helper-maven-plugin</artifactId>
     <version>1.5</version>
     <executions>
     <execution>
     <id>attach-artifacts</id>
     <phase>package</phase>
     <goals>
     <goal>attach-artifact</goal>
     </goals>
     <configuration>
     <artifacts>
     <artifact>
     <file>${project.build.directory}/scalding-test-jar-with-libs-${project.version}.jar</file>
     <type>jar</type>
     <classifier>dist</classifier>
     </artifact>
     <artifact>
     <file>${project.build.directory}/scalding-test-${project.version}.zip</file>
     <type>zip</type>
     <classifier>dist-zip</classifier>
     </artifact>
     </artifacts>
     </configuration>
     </execution>
     </executions>
     </plugin>
     </plugins>
    </build>
    </project>

Use this Scala code:

    package scalding.examples

    import com.twitter.scalding._
    import org.apache.hadoop

    // Entry point: hands the command line to Scalding's Tool, which
    // runs the job class named in the first program argument.
    object JobRunner {
      def main(args: Array[String]): Unit = {
        hadoop.util.ToolRunner.run(new hadoop.conf.Configuration, new Tool, args)
      }
    }

    // Counts how often each word occurs in the input text.
    class WordCountJob(args: Args) extends Job(args) {
      TypedPipe.from(TextLine(args("input")))
        .flatMap { line => line.split("""\s+""") } // split on runs of whitespace
        .groupBy { word => word }
        .size
        .write(TypedTsv(args("output")))
    }
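
To see what the pipeline computes without a cluster, the same steps can be mirrored on an in-memory collection. This is a minimal sketch in plain Scala, not Scalding code: WordCountSketch and wordCount are hypothetical names, and the sample lines are made up for illustration.

```scala
object WordCountSketch {
  // Mirrors the flatMap / groupBy / size steps of WordCountJob
  // on an ordinary Seq instead of a TypedPipe.
  def wordCount(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(line => line.split("""\s+""")) // split each line on runs of whitespace
      .groupBy(word => word)                  // collect identical words together
      .map { case (word, occurrences) => word -> occurrences.size.toLong }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("hello world", "hello scalding"))
    // Print one "word<TAB>count" line per word, similar to a TSV output.
    counts.toSeq.sortBy(_._1).foreach { case (word, n) => println(s"$word\t$n") }
  }
}
```

For the two sample lines, "hello" is counted twice and "world" and "scalding" once each. The real job applies the same logic, but reads its lines from the --input path and writes the counts to the --output path.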

Run the Scalding project

  • Create a run configuration in your IDE.
  • Set the Main class to: scalding.examples.JobRunner
  • Add the following program arguments: scalding.examples.WordCountJob --hdfs --input hdfs://<your-hadoop-server>:<hdfs-port (e.g., 8020)>/user/myuser/input_path --output hdfs://<your-hadoop-server>:<hdfs-port (e.g., 8020)>/user/myuser/output_path/someOutputFile.tsv
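
The argument list above can be easier to get right when it is assembled from its parts. The sketch below is only an illustration: RunArgs, hdfsPath, and forJob are hypothetical names that are not part of Scalding, and namenode.example.com is a placeholder host.

```scala
object RunArgs {
  // Hypothetical helper: builds an hdfs:// URI from the server, port, and absolute path.
  def hdfsPath(host: String, port: Int, path: String): String =
    s"hdfs://$host:$port$path"

  // Hypothetical helper: the job class comes first, then the mode flag,
  // then the --input and --output options that WordCountJob reads via args(...).
  def forJob(jobClass: String, input: String, output: String): Array[String] =
    Array(jobClass, "--hdfs", "--input", input, "--output", output)

  def main(args: Array[String]): Unit = {
    val input  = hdfsPath("namenode.example.com", 8020, "/user/myuser/input_path")
    val output = hdfsPath("namenode.example.com", 8020, "/user/myuser/output_path/someOutputFile.tsv")
    // Print the arguments on one line, ready to paste into the run configuration.
    println(forJob("scalding.examples.WordCountJob", input, output).mkString(" "))
  }
}
```

Note that each option starts with two plain hyphens. The same array could also be passed to JobRunner.main directly instead of going through the IDE.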

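Run the program and check the output in HDFS under /user/myuser/output_path/someOutputFile.tsv. In Hadoop mode this path is typically written as a directory of part files rather than as a single file.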

Summary

In this post I showed how to build your first Scalding project as a plain Maven project, with no need for sbt.