This post documents my experience using Talend 6.2 and its Big Data Batch jobs with Spark 1.6 on the MapR distribution.
Right-click on Big Data Batch and select Create Big Data Batch job.
Change the framework to "Spark" and click Finish.
I added a tHDFSConfiguration, and a tFileInputDelimited connected to a tLogRow. Unlike a standard job, which uses a tHDFSInput component, a Big Data Batch tFileInputDelimited actually points to HDFS.
Note how, unlike in a standard job, the tHDFSConfiguration doesn't need an OnSubjobOk trigger to connect to the tFileInputDelimited.
Open the tHDFSConfiguration component tab, enter your MapR username in both the Username and Group fields, and change the Distribution and Version.
In tFileInputDelimited, check "Define a storage configuration component" under Storage and select the tHDFSConfiguration component. Add a file path to the "Folder/File" text box.
Now, let's run the job. Click on the Run tab, and then change to Spark Configuration. I had some issues getting this to work, but after trial and error I found the right settings.
I changed Spark Mode to YARN Client.
I needed to set my resource manager, scheduler, and jobhistory addresses. I found these values in two files:

* yarn-site.xml: resource manager, scheduler
* mapred-site.xml: jobhistory
For me, these files are located in /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/
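If you'd rather not scan the XML by hand, you can grep both files for the relevant property names. This is just a convenience sketch; it assumes the usual Hadoop layout with the `<name>` and `<value>` elements on adjacent lines, and the path from my cluster above:

```bash
cd /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/

# Resource manager and scheduler addresses live in yarn-site.xml
grep -A1 "yarn.resourcemanager.address" yarn-site.xml
grep -A1 "yarn.resourcemanager.scheduler.address" yarn-site.xml

# The jobhistory address lives in mapred-site.xml
grep -A1 "mapreduce.jobhistory.address" mapred-site.xml
```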
In yarn-site.xml, I had multiple resource managers to choose from. I copied all the values for each into a notepad and tested the job against each one. Here is an example configuration:
* yarn.resourcemanager.scheduler.address.rm1: host:8030
* yarn.resourcemanager.resource-tracker.address.rm1: host:8031
* yarn.resourcemanager.address.rm1: host:8032
* yarn.resourcemanager.admin.address.rm1: host:8033
* yarn.resourcemanager.webapp.address.rm1: host:8088
* yarn.resourcemanager.webapp.https.address.rm1: host:8090
The Resource Manager value should be the yarn.resourcemanager.address.rm1 value. The Scheduler should be the yarn.resourcemanager.scheduler.address.rm1 value.
For jobhistory, go into mapred-site.xml and pick this value:

* mapreduce.jobhistory.address: host:10020
Initially, I didn't set the user name in the Spark Configuration tab and received an error like this:
[ERROR]: test3.testspark_0_1.TestSpark - TalendJob: 'TestSpark' - Failed with exit code: 1. java.lang.RuntimeException: The HDFS and the Spark configurations must have the same user name.
To fix this, I added my account's name under the Authentication tab:
While the job was running, it appeared to hang and I thought something was wrong, but it eventually got going.
Later on, I played around with some of the settings, and it appears that I don't actually need the "Force MapR Ticket authentication" checkbox or its settings. I don't quite understand this, as my cluster requires MapR ticket authentication. Perhaps everything was fine because I already have export MAPR_TICKETFILE_LOCATION set in my .bashrc.
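For reference, the export in my .bashrc looks roughly like the sketch below. The ticket path is an example only; /tmp/maprticket_&lt;uid&gt; is the default location maprlogin writes to, so adjust it if your ticket lives elsewhere:

```bash
# Example only: point Spark/Hadoop clients at the MapR ticket file.
# maprlogin writes tickets to /tmp/maprticket_<uid> by default; adjust if yours differs.
export MAPR_TICKETFILE_LOCATION=/tmp/maprticket_$(id -u)
```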
Name: sandi
Creation Date: 2017-08-03
The problem I am facing is that I cannot use a tMap for a left outer join when I have more than two input links (one source and more than one lookup). It gives me an error in the tMap saying all the lookups should have the same context. Can you help?
Name: sandi
Creation Date: 2017-08-03
The exact error is: "all lookup tables must have same expressions, this expressions does not exist in other lookup".