Talend MapR and Spark Tutorial

This post documents my experience using Talend 6.2 and its Big Data Batch job with the MapR distribution of Spark 1.6.

Basic HDFS operations with Talend and MapR's Spark

Right click on Big Data Batch and select Create Big Data Batch job.

matthew_moisen_talend_spark_create_job.jpg

Change the framework to "Spark" and click Finish.

matthew_moisen_talend_spark_framework.jpg

I added a tHDFSConfiguration, and a tFileInputDelimited connected to a tLogRow. Unlike a standard job, which uses a tHDFSInput component, a Big Data Batch tFileInputDelimited actually points to HDFS.

matthew_moisen_talend_spark_image.jpg

Note how, unlike in a standard job, the tHDFSConfiguration doesn't need an onSubjobOk to connect to the tFileInputDelimited.

Open the tHDFSConfiguration component tab, enter your MapR username in both the username and group fields, and change the Distribution and Version to match your cluster.

matthew_moisen_talend_spark_thdfsconfiguration.jpg

In tFileInputDelimited, check "Define a storage configuration component" under Storage and select the tHDFSConfiguration component. Enter the path of a file in the "Folder/File" text box.

matthew_moisen_talend_spark_tfileinputdelimited.jpg

Now, let's run the job. Click on the Run tab, and then change to Spark Configuration. I had some issues getting this to work, but after trial and error I found the right settings.

matthew_moisen_talend_spark_spark_configuration.jpg

I changed Spark Mode to YARN Client.

I needed to set my resource manager, scheduler, and jobhistory address. I found these values in two files:

yarn-site.xml: resource manager and scheduler

mapred-site.xml: jobhistory

For me, these files are located in /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/

In yarn-site.xml, I had multiple resource managers to choose from. I copied the values for each one into a notepad and tried each in turn. Here is an example configuration:

 
  
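  <property>
    <name>yarn.resourcemanager.scheduler.address.rm1</name>
    <value>host:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
    <value>host:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address.rm1</name>
    <value>host:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address.rm1</name>
    <value>host:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm1</name>
    <value>host:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.https.address.rm1</name>
    <value>host:8090</value>
  </property>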

The Resource Manager value should be the yarn.resourcemanager.address.rm1 value.

The Scheduler should be the yarn.resourcemanager.scheduler.address.rm1 value.
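Copying these values by hand gets tedious when there are several resource managers. Here is a minimal Python sketch that pulls the two yarn-site.xml values named above out of the file for a given resource manager id. It assumes the usual Hadoop layout of property elements with name and value children. The function names and the sample string are my own illustration, and "host" is a placeholder, not a real address.

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text):
    """Return a {name: value} dict from the text of a Hadoop *-site.xml file."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props


def talend_yarn_settings(props, rm_id):
    """Pick the Resource Manager and Scheduler values for one resource manager id."""
    return {
        "resource_manager": props["yarn.resourcemanager.address." + rm_id],
        "scheduler": props["yarn.resourcemanager.scheduler.address." + rm_id],
    }


# Hypothetical sample mirroring the example configuration; "host" is a placeholder.
SAMPLE_YARN_SITE = """<configuration>
  <property>
    <name>yarn.resourcemanager.scheduler.address.rm1</name>
    <value>host:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address.rm1</name>
    <value>host:8032</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    props = read_hadoop_properties(SAMPLE_YARN_SITE)
    print(talend_yarn_settings(props, "rm1"))
```

To use it on a real cluster, read the contents of yarn-site.xml into a string and pass it to read_hadoop_properties in place of the sample.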

For jobhistory, go into mapred-site.xml and pick this value:

  
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>host:10020</value>
  </property>

Initially, I didn't fill in the authentication information in the Spark Configuration tab and received an error:

matthew_moisen_talend_spark_authentication.jpg

The relevant lines from the log:

[ERROR]: test3.testspark_0_1.TestSpark - TalendJob: 'TestSpark' - Failed with exit code: 1.
java.lang.RuntimeException: The HDFS and the Spark configurations must have the same user name.

To fix this, I added my account's name under the Authentication tab:

matthew_moisen_talend_spark_authentication_username.jpg

While running, the job appeared to hang and I thought something was misconfigured, but it eventually started.

Later on, I played around with some of the settings, and it appears that I don't actually need the "Force MapR Ticket authentication" checkbox or its settings. I don't quite understand this, as my cluster requires MapR ticket authentication. Perhaps everything worked because I already export MAPR_TICKETFILE_LOCATION in my .bashrc.

