How to deploy a Nutch job on Google Dataproc?
I have been trying to deploy a Nutch job (with custom plugins) on a Google Dataproc Hadoop cluster, and I have been encountering many errors (some of them, I suspect, quite basic).
I need an explicit, step-by-step guide on how to do this. The guide should include how to set permissions and how to access files, both in the GCS bucket and on the local file system (Windows 7).
I have tried this configuration, with no success:
Region: global
Cluster: first-cluster
Job type: Hadoop
Jar files: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-snapshot.job
Main class or jar: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/bin/nutch
Arguments: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/seed.txt, -depth 4
I have also tried:
Region: global
Cluster: first-cluster
Job type: Hadoop
Jar files: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-snapshot.job
Main class or jar: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/bin/crawl
Arguments: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/seed.txt, -depth 4
And:

Region: global
Cluster: first-cluster
Job type: Hadoop
Jar files: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-snapshot.job
Main class or jar: org.apache.nutch.crawl.Crawl
Arguments: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/seed.txt, -depth 4
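For reference, the same submission expressed with the gcloud CLI would look roughly like the following. This is only a sketch built from the values above; note that the "main class or jar" must be a Java class on the job jar's classpath (for example org.apache.nutch.crawl.Injector), not a gs:// path to the bin/nutch or bin/crawl shell scripts, which can only run on a machine with a local Nutch install:

```shell
# Sketch: submit one Nutch step (the Injector) as a Dataproc Hadoop job.
# Bucket and cluster names are taken from the attempts above; the
# Injector's argument order (<crawldb> <url_dir>) is from Nutch 1.x usage.
BUCKET="gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia"
gcloud dataproc jobs submit hadoop \
  --cluster=first-cluster \
  --class=org.apache.nutch.crawl.Injector \
  --jars="${BUCKET}/deploy/apache-nutch-1.12-snapshot.job" \
  -- "${BUCKET}/deploy/crawldb" "${BUCKET}/deploy/urls"
```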
Follow-up: I have made some progress, but am now getting this error:
17/07/28 18:59:11 INFO crawl.Injector: Injector: starting at 2017-07-28 18:59:11
17/07/28 18:59:11 INFO crawl.Injector: Injector: crawlDb: https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls
17/07/28 18:59:11 INFO crawl.Injector: Injector: urlDir: https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/crawldb
17/07/28 18:59:11 INFO Configuration.deprecation: mapred.temp.dir is deprecated. Instead, use mapreduce.cluster.temp.dir
17/07/28 18:59:11 INFO crawl.Injector: Injector: Converting injected urls to crawl db entries.
17/07/28 18:59:11 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.1-hadoop2
17/07/28 18:59:11 ERROR crawl.Injector: Injector: java.lang.IllegalArgumentException: Wrong FS: https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls, expected: hdfs://first-cluster-m
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:648)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194)
    at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:298)
    at org.apache.nutch.crawl.Injector.run(Injector.java:379)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:369)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:19)
I know this has something to do with the filesystem. How do I make the Hadoop job access the GCS filesystem instead of HDFS?
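One concrete thing the trace shows: the crawlDb and urlDir arguments are Cloud Console browser URLs (https://console.cloud.google.com/storage/browser/...), not filesystem URIs. Hadoop has no filesystem registered for that https path, so it checks it against the cluster's default HDFS filesystem and throws "Wrong FS". The GCS connector on Dataproc registers the gs:// scheme, so the arguments should be gs:// URIs. The rewrite from console URL to gs:// URI is mechanical; the helper below is hypothetical, just to show the mapping:

```shell
# Hypothetical helper: turn a Cloud Console "storage/browser" URL into
# the gs:// URI that Hadoop's GCS connector expects.
console_url_to_gs_uri() {
  printf 'gs://%s\n' "${1#https://console.cloud.google.com/storage/browser/}"
}

# Example: the urlDir argument from the failing job above.
console_url_to_gs_uri "https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls"
# → gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls
```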
Follow-up: I have made more progress with this job config:
{
  "reference": {
    "projectId": "ageless-valor-174413",
    "jobId": "108a7d43-671a-4f61-8ba8-b87010a8a823"
  },
  "placement": {
    "clusterName": "first-cluster",
    "clusterUuid": "f3795563-bd44-4896-bec7-0eb81a3f685a"
  },
  "status": {
    "state": "ERROR",
    "details": "Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found in 'gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/google-cloud-dataproc-metainfo/f3795563-bd44-4896-bec7-0eb81a3f685a/jobs/108a7d43-671a-4f61-8ba8-b87010a8a823/driveroutput'.",
    "stateStartTime": "2017-07-28T18:59:13.518Z"
  },
  "statusHistory": [
    { "state": "PENDING", "stateStartTime": "2017-07-28T18:58:57.660Z" },
    { "state": "SETUP_DONE", "stateStartTime": "2017-07-28T18:59:00.811Z" },
    { "state": "RUNNING", "stateStartTime": "2017-07-28T18:59:02.347Z" }
  ],
  "driverOutputResourceUri": "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/google-cloud-dataproc-metainfo/f3795563-bd44-4896-bec7-0eb81a3f685a/jobs/108a7d43-671a-4f61-8ba8-b87010a8a823/driveroutput",
  "driverControlFilesUri": "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/google-cloud-dataproc-metainfo/f3795563-bd44-4896-bec7-0eb81a3f685a/jobs/108a7d43-671a-4f61-8ba8-b87010a8a823/",
  "hadoopJob": {
    "mainClass": "org.apache.nutch.crawl.Injector",
    "args": [
      "https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls/",
      "https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/crawldb/"
    ],
    "jarFileUris": [
      "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-snapshot.job"
    ],
    "loggingConfig": {}
  }
}
But I am getting exactly the same "Wrong FS: ..., expected: hdfs://first-cluster-m" error and stack trace as shown above.
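Assuming the "Wrong FS" diagnosis above is right, the fix in this config is to pass gs:// URIs in args instead of Cloud Console URLs. (The driver log also printed crawlDb: .../urls, which suggests the two arguments were additionally swapped; the Injector expects the crawldb first, then the URL directory.) A sketch of the corrected hadoopJob fragment:

```json
"hadoopJob": {
  "mainClass": "org.apache.nutch.crawl.Injector",
  "args": [
    "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/crawldb/",
    "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls/"
  ],
  "jarFileUris": [
    "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-snapshot.job"
  ]
}
```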