Running mrjob with the proper environment under Cloudera Manager

This article details how to run Python mrjob scripts in the odd environment created by Cloudera Manager parcels, and how to properly set the $HADOOP_HOME environment variable.

If you’ve installed Cloudera Manager 4.x and used it to install the hadoop client on a gateway, you’ve probably run into the following error when trying to run an mrjob Python script:

Traceback (most recent call last):
  File "mrjobtest.py", line 12, in <module>
    MRWordCounter.run()
  File "/usr/lib/python2.7/dist-packages/mrjob/job.py", line 516, in run
    mr_job.execute()
  File "/usr/lib/python2.7/dist-packages/mrjob/job.py", line 532, in execute
    self.run_job()
  File "/usr/lib/python2.7/dist-packages/mrjob/job.py", line 581, in run_job
    with self.make_runner() as runner:
  File "/usr/lib/python2.7/dist-packages/mrjob/job.py", line 559, in make_runner
    return HadoopJobRunner(**self.hadoop_job_runner_kwargs())
  File "/usr/lib/python2.7/dist-packages/mrjob/hadoop.py", line 141, in __init__
    'you must set $HADOOP_HOME, or pass in hadoop_home explicitly')
Exception: you must set $HADOOP_HOME, or pass in hadoop_home explicitly

If you try setting HADOOP_HOME to “/opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop/” you’ll then get this error:

no configs found; falling back on auto-configuration
Traceback (most recent call last):
  File "mrjobtest.py", line 12, in <module>
    MRWordCounter.run()
  File "/usr/lib/python2.7/dist-packages/mrjob/job.py", line 516, in run
    mr_job.execute()
  File "/usr/lib/python2.7/dist-packages/mrjob/job.py", line 532, in execute
    self.run_job()
  File "/usr/lib/python2.7/dist-packages/mrjob/job.py", line 581, in run_job
    with self.make_runner() as runner:
  File "/usr/lib/python2.7/dist-packages/mrjob/job.py", line 559, in make_runner
    return HadoopJobRunner(**self.hadoop_job_runner_kwargs())
  File "/usr/lib/python2.7/dist-packages/mrjob/hadoop.py", line 159, in __init__
    self._opts['hadoop_home'])
Exception: Couldn't find streaming jar in /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop, bailing out

You could use mrjob’s --hadoop-streaming-jar argument to set the path to the streaming jar, but there’s a better way.
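If you do want to go the --hadoop-streaming-jar route, you first have to find the jar inside the parcel. The following is a minimal sketch of one way to search for it. The helper name find_streaming_jars and the file name pattern hadoop-streaming*.jar are assumptions made for illustration, so check the result against what is actually installed on your gateway.

```python
import fnmatch
import os


def find_streaming_jars(root, pattern="hadoop-streaming*.jar"):
    """Return the sorted paths under `root` whose file name matches `pattern`.

    Both the helper and the default pattern are illustrative assumptions,
    not part of mrjob or Cloudera Manager.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    # Parcel root taken from the path in the error message; adjust as needed.
    for jar in find_streaming_jars("/opt/cloudera/parcels"):
        print(jar)
```

Any path it prints is a candidate value for --hadoop-streaming-jar. If nothing is printed, the jar is either named differently or lives outside the directory you searched.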

mrjob requires $HADOOP_HOME to be set before it ever calls hadoop, so the environment that the hadoop command sets up for itself never gets a chance to help.

However, you can set $HADOOP_HOME to a placeholder value (here “/”) so that hadoop will replace it, then use mrjob’s --hadoop-bin option to point at the hadoop command and let it set up the proper environment.

This way the environment will be set up correctly, and you don’t have to specify the location of the streaming jar:

export HADOOP_HOME=/
python mrjobtest.py -r hadoop --hadoop-bin /usr/bin/hadoop hdfs:///wordcounttest -o hdfs:///mrjobtestoutput
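
If you launch the job from another Python script (a cron wrapper, for example), the same two steps can be expressed with subprocess. This is only a sketch: build_mrjob_call and run_job are hypothetical helper names, and the flags and paths are simply the ones from the command above.

```python
import os
import subprocess
import sys


def build_mrjob_call(script, input_uri, output_uri,
                     hadoop_bin="/usr/bin/hadoop"):
    """Return (argv, env) mirroring the two shell commands above.

    Hypothetical helper for illustration; it only assembles the call.
    """
    env = dict(os.environ)
    env["HADOOP_HOME"] = "/"  # placeholder value, same as the export above
    argv = [
        sys.executable, script,
        "-r", "hadoop",
        "--hadoop-bin", hadoop_bin,
        input_uri,
        "-o", output_uri,
    ]
    return argv, env


def run_job(script, input_uri, output_uri):
    """Run the mrjob script and return its exit status."""
    argv, env = build_mrjob_call(script, input_uri, output_uri)
    return subprocess.call(argv, env=env)
```

Calling run_job("mrjobtest.py", "hdfs:///wordcounttest", "hdfs:///mrjobtestoutput") should then behave like the shell version, with the placeholder HADOOP_HOME confined to the child process rather than exported into your login shell.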