How to deploy a Nutch job on Google Dataproc?
I have been trying to deploy a Nutch job (with custom plugins) on a Google Dataproc Hadoop cluster, and I have been encountering many errors (some of them, I suspect, quite basic).
I need an explicit, step-by-step guide on how to do this. The guide should include how to set permissions and how to access files, both in the GCS bucket and on the local file system (Windows 7).
I have tried this configuration, with no success:
Region: global
Cluster: first-cluster
Job type: Hadoop
Jar files: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-snapshot.job
Main class or jar: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/bin/nutch
Arguments: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/seed.txt, -depth 4
I have also tried:
Region: global
Cluster: first-cluster
Job type: Hadoop
Jar files: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-snapshot.job
Main class or jar: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/bin/crawl
Arguments: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/seed.txt, -depth 4
And:

Region: global
Cluster: first-cluster
Job type: Hadoop
Jar files: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-snapshot.job
Main class or jar: org.apache.nutch.crawl.Crawl
Arguments: gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/seed.txt, -depth 4
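For reference, the same submission expressed with the gcloud CLI would look roughly like the following. This is only a sketch built from the values above; note that the "main class or jar" must be a Java class on the job jar's classpath (for example org.apache.nutch.crawl.Injector), not a gs:// path to the bin/nutch or bin/crawl shell scripts, which can only run on a machine with a local Nutch install:

```shell
# Sketch: submit one Nutch step (the Injector) as a Dataproc Hadoop job.
# Bucket and cluster names are taken from the attempts above; the
# Injector's argument order (<crawldb> <url_dir>) is from Nutch 1.x usage.
BUCKET="gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia"
gcloud dataproc jobs submit hadoop \
  --cluster=first-cluster \
  --class=org.apache.nutch.crawl.Injector \
  --jars="${BUCKET}/deploy/apache-nutch-1.12-snapshot.job" \
  -- "${BUCKET}/deploy/crawldb" "${BUCKET}/deploy/urls"
```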
Follow-up: I have made some progress, but am now getting this error:
17/07/28 18:59:11 INFO crawl.Injector: Injector: starting at 2017-07-28 18:59:11
17/07/28 18:59:11 INFO crawl.Injector: Injector: crawlDb: https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls
17/07/28 18:59:11 INFO crawl.Injector: Injector: urlDir: https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/crawldb
17/07/28 18:59:11 INFO Configuration.deprecation: mapred.temp.dir is deprecated. Instead, use mapreduce.cluster.temp.dir
17/07/28 18:59:11 INFO crawl.Injector: Injector: Converting injected urls to crawl db entries.
17/07/28 18:59:11 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.1-hadoop2
17/07/28 18:59:11 ERROR crawl.Injector: Injector: java.lang.IllegalArgumentException: Wrong FS: https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls, expected: hdfs://first-cluster-m
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:648)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194)
    at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:298)
    at org.apache.nutch.crawl.Injector.run(Injector.java:379)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:369)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:19)
I know this has something to do with the filesystem. How do I make the Hadoop job access the GCS filesystem instead of HDFS?
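One concrete thing the trace shows: the crawlDb and urlDir arguments are Cloud Console browser URLs (https://console.cloud.google.com/storage/browser/...), not filesystem URIs. Hadoop has no filesystem registered for that https path, so it checks it against the cluster's default HDFS filesystem and throws "Wrong FS". The GCS connector on Dataproc registers the gs:// scheme, so the arguments should be gs:// URIs. The rewrite from console URL to gs:// URI is mechanical; the helper below is hypothetical, just to show the mapping:

```shell
# Hypothetical helper: turn a Cloud Console "storage/browser" URL into
# the gs:// URI that Hadoop's GCS connector expects.
console_url_to_gs_uri() {
  printf 'gs://%s\n' "${1#https://console.cloud.google.com/storage/browser/}"
}

# Example: the urlDir argument from the failing job above.
console_url_to_gs_uri "https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls"
# → gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls
```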
Follow-up: I have made more progress with this job config:
{
  "reference": {
    "projectId": "ageless-valor-174413",
    "jobId": "108a7d43-671a-4f61-8ba8-b87010a8a823"
  },
  "placement": {
    "clusterName": "first-cluster",
    "clusterUuid": "f3795563-bd44-4896-bec7-0eb81a3f685a"
  },
  "status": {
    "state": "ERROR",
    "details": "Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found in 'gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/google-cloud-dataproc-metainfo/f3795563-bd44-4896-bec7-0eb81a3f685a/jobs/108a7d43-671a-4f61-8ba8-b87010a8a823/driveroutput'.",
    "stateStartTime": "2017-07-28T18:59:13.518Z"
  },
  "statusHistory": [
    { "state": "PENDING", "stateStartTime": "2017-07-28T18:58:57.660Z" },
    { "state": "SETUP_DONE", "stateStartTime": "2017-07-28T18:59:00.811Z" },
    { "state": "RUNNING", "stateStartTime": "2017-07-28T18:59:02.347Z" }
  ],
  "driverOutputResourceUri": "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/google-cloud-dataproc-metainfo/f3795563-bd44-4896-bec7-0eb81a3f685a/jobs/108a7d43-671a-4f61-8ba8-b87010a8a823/driveroutput",
  "driverControlFilesUri": "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/google-cloud-dataproc-metainfo/f3795563-bd44-4896-bec7-0eb81a3f685a/jobs/108a7d43-671a-4f61-8ba8-b87010a8a823/",
  "hadoopJob": {
    "mainClass": "org.apache.nutch.crawl.Injector",
    "args": [
      "https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls/",
      "https://console.cloud.google.com/storage/browser/dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/crawldb/"
    ],
    "jarFileUris": [
      "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-snapshot.job"
    ],
    "loggingConfig": {}
  }
}
But I am getting exactly the same "Wrong FS: ..., expected: hdfs://first-cluster-m" error and stack trace as shown above.
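Assuming the "Wrong FS" diagnosis above is right, the fix in this config is to pass gs:// URIs in args instead of Cloud Console URLs. (The driver log also printed crawlDb: .../urls, which suggests the two arguments were additionally swapped; the Injector expects the crawldb first, then the URL directory.) A sketch of the corrected hadoopJob fragment:

```json
"hadoopJob": {
  "mainClass": "org.apache.nutch.crawl.Injector",
  "args": [
    "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/crawldb/",
    "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/urls/"
  ],
  "jarFileUris": [
    "gs://dataproc-60f583ce-a087-42b1-a62e-4319584631d3-asia/deploy/apache-nutch-1.12-snapshot.job"
  ]
}
```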