java - PropertyTypeException in Apache Tika Metadata


I'm using crawler4j to extract pages and PDF files. I checked the byte array I get back: it is valid, and I can write it out as a PDF file.
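For reference, a check of this kind can be sketched as follows. This is a minimal sketch, not the code from the crawler: the class name `PdfBytesCheck` and its methods are hypothetical, and the sample bytes stand in for what `page.getContentData()` would return. It only tests for the `%PDF-` header and writes the bytes to a temporary file, so it does not prove that the document is well-formed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of crawler4j or Tika.
class PdfBytesCheck {

    // A PDF file is expected to begin with the ASCII marker "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the data begins with the PDF header marker. */
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Writes the bytes to a file so the result can be opened in a PDF viewer. */
    static Path dump(byte[] data, Path target) throws IOException {
        return Files.write(target, data);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for page.getContentData(); a real crawl would supply the bytes.
        byte[] contentData = "%PDF-1.4\n...".getBytes(StandardCharsets.US_ASCII);
        System.out.println("looks like PDF: " + looksLikePdf(contentData));
        Path out = dump(contentData, Files.createTempFile("crawled", ".pdf"));
        System.out.println("written to " + out);
    }
}
```

Passing this check only rules out the crawler delivering something other than a PDF (an HTML error page, for example); a file can pass it and still contain metadata or fonts that a parser rejects.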

With the byte array, I do the following:

    // Tika specific types
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    InputStream inputStream;
    ParseContext pcontext = new ParseContext();
    PDFParser pdfParser = new PDFParser();

    ...

    byte[] contentData = null;
    contentData = page.getContentData(); // crawler4j content, delivers a valid PDF
    //Path path = Paths.get("c:\\test\\local.pdf"); // use this line to read a local PDF

    // default fields:
    String title = "pdf title";
    String content = "";
    String suggestions = "";
    //
    try {
        ////contentData = Files.readAllBytes(path); // use this line to read a local PDF
        inputStream = new ByteArrayInputStream(contentData);
        pdfParser.parse(inputStream, handler, metadata, pcontext); // this line crashes
        content = "pdf suggestions";
        suggestions = handler.toString();
    } catch (Exception e) {
        logger.warn("error parsing tika.", e);
    }

I marked the crashing line. The resulting exception is the following:

    WARN 2017-07-26 11:17:51,302 [Thread-5] de.searchadapter.crawler.solrparser.parser.file.pdffileparser - error parsing tika.
    org.apache.tika.metadata.PropertyTypeException: xmpMM:DocumentID : SIMPLE
        at org.apache.tika.metadata.Metadata.add(Metadata.java:305)
        at org.apache.tika.parser.image.xmp.JempboxExtractor.addMetadata(JempboxExtractor.java:209)
        at org.apache.tika.parser.image.xmp.JempboxExtractor.extractXMPMM(JempboxExtractor.java:150)
        at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:239)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:154)
        at de.searchadapter.crawler.solrparser.parser.file.pdffileparser.parse(pdffileparser.java:82)
        at de.searchadapter.crawler.solrparser.solrparser.parse(solrparser.java:36)
        at de.searchadapter.crawler.solrjadapter.indexdocs(solrjadapter.java:58)
        at de.searchadapter.crawler.webcrawler.onbeforeexit(webcrawler.java:63)
        at edu.uci.ics.crawler4j.crawler.CrawlController$1.run(CrawlController.java:309)
        at java.lang.Thread.run(Thread.java:745)

The code above is from pdffileparser. I'm not setting any property myself, so I'm puzzled where the error comes from.

Additional info: the PDF file seems to use an unknown font, as the following warning comes up:

    11:17:50.963 [Thread-5] WARN o.a.pdfbox.pdmodel.font.PDSimpleFont - No Unicode mapping for f_i (30) in font ggoloe+thesansc5-plain

Edit: I edited the code so that it can read local PDF files. I tried another PDF file and didn't get the error, so it seems to be a result of the failing font.

