In this post I am going to show how to start a Scalding project with Maven.
Set up Scala in your IDE
The first step is to make sure your IDE (IntelliJ or Eclipse) supports Scala.
- Install the Scala plugin and the Scala SDK for your IDE (Scala version 2.10.4) (see: IntelliJ or Scala IDE for Eclipse)
Create the Scalding project
- Create a new Maven Scala project.
- Use this pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.mysite</groupId>
  <artifactId>scalding-test</artifactId>
  <packaging>jar</packaging>
  <version>1.1.1-SNAPSHOT</version>

  <properties>
    <scala.version>2.10.4</scala.version>
    <main>scalding.examples.WordCountJob</main>
    <targetJdk>1.6</targetJdk>
    <slf4j.log4j12.version>1.6.2</slf4j.log4j12.version>
    <junit.version>4.8.2</junit.version>
    <maven-jar-plugin-version>2.3.1</maven-jar-plugin-version>
    <cascading.version>2.5.2</cascading.version>
    <hadoop.version>2.3.0-cdh5.0.3</hadoop.version>
  </properties>

  <repositories>
    <repository>
      <id>conjars.org</id>
      <url>http://conjars.org/repo</url>
    </repository>
    <repository>
      <id>cloudera-releases</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
  </repositories>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>com.twitter</groupId>
      <artifactId>scalding-core_2.10</artifactId>
      <version>0.12.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>cascading</groupId>
      <artifactId>cascading-hadoop</artifactId>
      <version>${cascading.version}</version>
    </dependency>
    <dependency>
      <groupId>cascading</groupId>
      <artifactId>cascading-core</artifactId>
      <version>${cascading.version}</version>
    </dependency>
    <dependency>
      <groupId>cascading</groupId>
      <artifactId>cascading-local</artifactId>
      <version>${cascading.version}</version>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-api</artifactId>
      <version>${slf4j.log4j12.version}</version>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
      <version>${slf4j.log4j12.version}</version>
    </dependency>
  </dependencies>

  <build>
    <resources>
      <resource>
        <directory>src/main/resources</directory>
        <filtering>true</filtering>
      </resource>
    </resources>
    <plugins>
      <!--<plugin>-->
      <!--<artifactId>maven-compiler-plugin</artifactId>-->
      <!--<configuration>-->
      <!--<source>1.8</source>-->
      <!--<target>1.8</target>-->
      <!--<showDeprecation>false</showDeprecation>-->
      <!--<showWarnings>false</showWarnings>-->
      <!--</configuration>-->
      <!--</plugin>-->
      <!--<plugin>-->
      <!--<groupId>org.scala-tools</groupId>-->
      <!--<artifactId>maven-scala-plugin</artifactId>-->
      <!--<executions>-->
      <!--<execution>-->
      <!--<id>scala-compile-first</id>-->
      <!--<phase>process-resources</phase>-->
      <!--<goals>-->
      <!--<goal>add-source</goal>-->
      <!--<goal>compile</goal>-->
      <!--</goals>-->
      <!--<configuration>-->
      <!--<args>-->
      <!--<arg>-make:transitive</arg>-->
      <!--<arg>-dependencyfile</arg>-->
      <!--<arg>${project.build.directory}/.scala_dependencies</arg>-->
      <!--</args>-->
      <!--</configuration>-->
      <!--</execution>-->
      <!--<execution>-->
      <!--<id>scala-test-compile</id>-->
      <!--<phase>process-test-resources</phase>-->
      <!--<goals>-->
      <!--<goal>testCompile</goal>-->
      <!--</goals>-->
      <!--</execution>-->
      <!--</executions>-->
      <!--</plugin>-->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.2</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <finalName>scalding-test-jar-with-libs-${project.version}</finalName>
              <shadedArtifactAttached>true</shadedArtifactAttached>
              <transformers>
                <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                  <resource>META-INF/spring.handlers</resource>
                </transformer>
                <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                  <resource>META-INF/spring.schemas</resource>
                </transformer>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                  <mainClass>${main}</mainClass>
                </transformer>
              </transformers>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
            </configuration>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.4</version>
        <executions>
          <execution>
            <id>assembly-zip</id>
            <goals>
              <goal>single</goal>
            </goals>
            <phase>verify</phase>
            <configuration>
              <finalName>scalding-test-${project.version}</finalName>
              <attach>false</attach>
              <appendAssemblyId>false</appendAssemblyId>
              <descriptor>src/main/assembly/assembly-descriptor.xml</descriptor>
            </configuration>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>build-helper-maven-plugin</artifactId>
        <version>1.5</version>
        <executions>
          <execution>
            <id>attach-artifacts</id>
            <phase>package</phase>
            <goals>
              <goal>attach-artifact</goal>
            </goals>
            <configuration>
              <artifacts>
                <artifact>
                  <file>${project.build.directory}/scalding-test-jar-with-libs-${project.version}.jar</file>
                  <type>jar</type>
                  <classifier>dist</classifier>
                </artifact>
                <artifact>
                  <file>${project.build.directory}/scalding-test-${project.version}.zip</file>
                  <type>zip</type>
                  <classifier>dist-zip</classifier>
                </artifact>
              </artifacts>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
- Use this Scala source file:
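package scalding.examples

import com.twitter.scalding._
import org.apache.hadoop

// Entry point: passes the job class name and its arguments to Scalding's Tool.
object JobRunner {
  def main(args: Array[String]): Unit = {
    hadoop.util.ToolRunner.run(new hadoop.conf.Configuration, new Tool, args)
  }
}

// Counts how often each word occurs in the input and writes (word, count) pairs as TSV.
class WordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap { line => line.split("""\s+""") } // split each line on whitespace
    .groupBy { word => word }                  // one group per distinct word
    .size                                      // number of occurrences in each group
    .write(TypedTsv(args("output")))
}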
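To get a feel for what the WordCountJob pipeline computes before involving a cluster, the same three steps (split, group, count) can be tried on an in-memory collection. This is only a sketch with plain Scala collections, not Scalding code; the object name `WordCountSketch` and the sample lines are made up for illustration.

```scala
object WordCountSketch {
  // Mirrors the job's steps on an in-memory Seq instead of a TypedPipe:
  // split each line on whitespace, group equal words, count each group.
  def countWords(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(line => line.split("""\s+"""))
      .groupBy(word => word)
      .map { case (word, occurrences) => word -> occurrences.size.toLong }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample input, standing in for the lines of the HDFS input file.
    val counts = countWords(Seq("hello world", "hello scalding"))
    // Print one "word<TAB>count" line per word, similar in shape to a TSV output.
    counts.toSeq.sortBy(_._1).foreach { case (word, n) => println(s"$word\t$n") }
  }
}
```

In the real job, Scalding plans the same logic as Hadoop jobs, and the order of the output rows is not necessarily sorted.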
Run the Scalding project
- Create a run configuration in your IDE
- Set the Main class to: scalding.examples.JobRunner
- Add the following program arguments (note the double hyphens): scalding.examples.WordCountJob --hdfs --input hdfs://<your-hadoop-server>:<hdfs-port (e.g., 8020)>/user/myuser/input_path --output hdfs://<your-hadoop-server>:<hdfs-port (e.g., 8020)>/user/myuser/output_path/someOutputFile.tsv
Run the program and check the file in HDFS: /user/myuser/output_path/someOutputFile.tsv
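The program arguments are easier to read once you see how they break down: the first token names the job class, `--hdfs` is a flag without a value, and `--input` and `--output` each carry one value, which the job reads as args("input") and args("output"). The sketch below is a hand-written illustration of that grouping and is not Scalding's own parser (Scalding's `Args` class does the real work). The object `ArgsSketch`, the empty-string key for leading tokens, and the host name `namenode` are assumptions made for this example.

```scala
object ArgsSketch {
  // Groups a flat argument list into key -> values.
  // Tokens before the first --key are stored under the empty key ""
  // (a convention chosen for this sketch).
  def group(args: Seq[String]): Map[String, List[String]] = {
    val start = ("", Map.empty[String, List[String]])
    val (_, grouped) = args.foldLeft(start) {
      case ((_, acc), token) if token.startsWith("--") =>
        // A new key starts; register it even if no value follows (a flag).
        val key = token.drop(2)
        (key, acc.updated(key, acc.getOrElse(key, Nil)))
      case ((key, acc), value) =>
        // A plain token is a value for the most recent key.
        (key, acc.updated(key, acc.getOrElse(key, Nil) :+ value))
    }
    grouped
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical host and port; substitute your own HDFS address.
    val parsed = group(Seq(
      "scalding.examples.WordCountJob", "--hdfs",
      "--input", "hdfs://namenode:8020/user/myuser/input_path",
      "--output", "hdfs://namenode:8020/user/myuser/output_path/someOutputFile.tsv"))
    parsed.foreach { case (key, values) => println(s"'$key' -> $values") }
  }
}
```

If the arguments are pasted with a typographic dash (–) instead of two hyphens, no token starts with `--`, so every token ends up under the empty key in this sketch; a key written that way will not be recognised by the job either.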
Summary
In this post I showed how to build your first Scalding project as a simple Maven project, with no need for sbt.