This post documents my experience using Talend 6.2 and its Big Data Batch job with the MapR distribution of Spark 1.6.
Right click on Big Data Batch and select Create Big Data Batch job.
Change the framework to "Spark" and click Finish.
I added a tHDFSConfiguration and a tFileInputDelimited connected to a tLogRow. Unlike a standard job, which uses a tHDFSInput component, a Big Data Batch tFileInputDelimited actually points to HDFS.
Note how, unlike in a standard job, the tHDFSConfiguration doesn't need an onSubjobOk to connect to the tFileInputDelimited.
Open the tHDFSConfiguration component tab, enter your MapR username in both the username and group fields, and change the Distribution and Version.
In tFileInputDelimited, check "Define a storage configuration component" under Storage and select the tHDFSConfiguration component. Add a file to the "Folder/File" text box.
Now, let's run the job. Click on the Run tab, and then change to Spark Configuration. I had some issues getting this to work, but after trial and error I found the right settings.
I changed Spark Mode to YARN Client.
I needed to set my resource manager, scheduler, and jobhistory address. I found these values in two files:
yarn-site.xml: resource manager, scheduler
mapred-site.xml: jobhistory
For me, these files are located in
In yarn-site.xml, I had multiple resource managers to choose from. I copied the values for each one into a notepad and tested them one at a time. Here is an example configuration:
yarn.resourcemanager.scheduler.address.rm1 host:8030
yarn.resourcemanager.resource-tracker.address.rm1 host:8031
yarn.resourcemanager.address.rm1 host:8032
yarn.resourcemanager.admin.address.rm1 host:8033
yarn.resourcemanager.webapp.address.rm1 host:8088
yarn.resourcemanager.webapp.https.address.rm1 host:8090
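Copying these values by hand gets tedious when there are several resource managers, so here is a minimal sketch of how they could be pulled out of the file with a script instead. It is not part of the original setup. It assumes the file uses the usual Hadoop layout of property elements with a name and a value inside a configuration element. The sample content and the read_properties helper are made up for illustration, and "host" is a placeholder just like in the listing above.

```python
import xml.etree.ElementTree as ET

# Sample content in the usual Hadoop *-site.xml layout (assumed here).
# "host" is a placeholder, not a real machine name.
SAMPLE_YARN_SITE = """<configuration>
  <property>
    <name>yarn.resourcemanager.scheduler.address.rm1</name>
    <value>host:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address.rm1</name>
    <value>host:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm1</name>
    <value>host:8088</value>
  </property>
</configuration>"""


def read_properties(xml_text, prefix):
    """Return {name: value} for every property whose name starts with prefix."""
    root = ET.fromstring(xml_text)
    found = {}
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="").strip()
        if name.startswith(prefix):
            found[name] = value
    return found


# Hypothetical usage: list every resource manager setting in the sample.
rm_properties = read_properties(SAMPLE_YARN_SITE, "yarn.resourcemanager.")
for name in sorted(rm_properties):
    print(name, rm_properties[name])
```

To run this against a real file, read its contents into a string and pass that to read_properties in place of SAMPLE_YARN_SITE.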
The Resource Manager value should be the
The Scheduler should be the
For jobhistory, go into mapred-site.xml and pick this value:
Initially, I didn't include this information in the Spark Configuration tab and received an error:
[ERROR]: test3.testspark_0_1.TestSpark - TalendJob: 'TestSpark' - Failed with exit code: 1. java.lang.RuntimeException: The HDFS and the Spark configurations must have the same user name.
To fix this, I added my account's name under the Authentication tab:
When I ran the job, it appeared to hang and I thought something was misconfigured, but it eventually started.
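If a job seems stuck like this, one thing that could help tell a slow start from a wrong address is checking whether the configured host:port accepts TCP connections at all. The following is only a sketch under that assumption, not something Talend provides. The helper names split_address and is_reachable are hypothetical, and a successful connection only shows that something is listening on that port, not that it is the right service.

```python
import socket


def split_address(address):
    """Split a "host:port" string into (host, port)."""
    host, _, port = address.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError("expected host:port, got %r" % address)
    return host, int(port)


def is_reachable(address, timeout=3.0):
    """Return True if a TCP connection to address succeeds within timeout."""
    host, port = split_address(address)
    try:
        # create_connection resolves the host name and opens the socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False
```

For example, calling is_reachable with each candidate resource manager address in turn would show which ones respond from the machine running the job.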
Later on, I played around with some of the settings, and it appears that I don't actually need the "Force MapR Ticket authentication" checkbox or its settings. I don't quite understand this, as my cluster requires MapR ticket authentication. Perhaps everything was fine because I already have
export MAPR_TICKETFILE_LOCATION set in