java - PropertyTypeException in Apache Tika Metadata
I'm using crawler4j to extract pages and PDF files. I checked the byte array I get back: it is valid, and I can write it out as a PDF file.
With the byte array, I do the following:
    // Tika-specific types
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    InputStream inputStream;
    ParseContext pContext = new ParseContext();
    PDFParser pdfParser = new PDFParser();
    ...
    byte[] contentData = null;
    contentData = page.getContentData(); // crawler4j content, delivers a valid PDF
    // Path path = Paths.get("c:\\test\\local.pdf"); // use this line to read a local PDF

    // Default fields:
    String title = "pdf title";
    String content = "";
    String suggestions = "";

    try {
        // contentData = Files.readAllBytes(path); // use this line to read a local PDF
        inputStream = new ByteArrayInputStream(contentData);
        pdfParser.parse(inputStream, handler, metadata, pContext); // this line crashes
        content = "pdf suggestions";
        suggestions = handler.toString();
    } catch (Exception e) {
        logger.warn("error parsing tika.", e);
    }

I marked the crashing line. The resulting exception is the following:
    WARN 2017-07-26 11:17:51,302 [Thread-5] de.searchadapter.crawler.solrparser.parser.file.pdffileparser - error parsing tika.
    org.apache.tika.metadata.PropertyTypeException: xmpMM:DocumentID : SIMPLE
        at org.apache.tika.metadata.Metadata.add(Metadata.java:305)
        at org.apache.tika.parser.image.xmp.JempboxExtractor.addMetadata(JempboxExtractor.java:209)
        at org.apache.tika.parser.image.xmp.JempboxExtractor.extractXMPMM(JempboxExtractor.java:150)
        at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:239)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:154)
        at de.searchadapter.crawler.solrparser.parser.file.pdffileparser.parse(pdffileparser.java:82)
        at de.searchadapter.crawler.solrparser.solrparser.parse(solrparser.java:36)
        at de.searchadapter.crawler.solrjadapter.indexdocs(solrjadapter.java:58)
        at de.searchadapter.crawler.webcrawler.onbeforeexit(webcrawler.java:63)
        at edu.uci.ics.crawler4j.crawler.CrawlController$1.run(CrawlController.java:309)
        at java.lang.Thread.run(Thread.java:745)
The code above is from pdffileparser. I'm not setting any property myself, so I'm puzzled about where the error comes from.
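One possible reading of the stack trace, offered as an assumption rather than a confirmed diagnosis: the exception is raised inside Metadata.add while the parser copies XMP metadata out of the PDF, and the message has the form "property name : property type". That pattern would fit a second value being added to a property that accepts only a single value, which would happen inside the parser and not in the calling code. The sketch below is a toy model of that behavior in plain Java. ToyMetadata, addTwice, and the uuid values are hypothetical names made up for illustration; none of this is Tika's API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Toy model only: hypothetical stand-ins, not Tika classes.
public class SimplePropertyDemo {

    /** Minimal metadata store that distinguishes single- and multi-valued keys. */
    static class ToyMetadata {
        private final Map<String, String[]> values = new HashMap<>();

        /** Adds a value; a second value for a single-valued key is rejected. */
        void add(String name, boolean multiValued, String value) {
            String[] existing = values.get(name);
            if (existing == null) {
                // First value for this key: always accepted.
                values.put(name, new String[] {value});
            } else if (multiValued) {
                // Multi-valued key: append to the existing values.
                String[] grown = Arrays.copyOf(existing, existing.length + 1);
                grown[existing.length] = value;
                values.put(name, grown);
            } else {
                // Single-valued key already set: fail with a "name : SIMPLE" message.
                throw new IllegalStateException(name + " : SIMPLE");
            }
        }

        /** Returns the stored values, or an empty array if the key is unset. */
        String[] get(String name) {
            return values.getOrDefault(name, new String[0]);
        }
    }

    /** Adds two values to the same key; returns the error message, or null on success. */
    static String addTwice(boolean multiValued) {
        ToyMetadata metadata = new ToyMetadata();
        metadata.add("xmpMM:DocumentID", multiValued, "uuid:first");
        try {
            metadata.add("xmpMM:DocumentID", multiValued, "uuid:second");
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("single-valued: " + addTwice(false));
        System.out.println("multi-valued:  " + addTwice(true));
    }
}
```

If this reading is right, the trigger would be the metadata embedded in the particular PDF, so the same code could work on one file and fail on another.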
Additional info: the PDF file seems to use an unknown font, because the following warning comes up:
    11:17:50.963 [Thread-5] WARN o.a.pdfbox.pdmodel.font.PDSimpleFont - No Unicode mapping for f_i (30) in font ggoloe+thesansc5-plain
Edit: I edited the code so that it can also read local PDF files. I tried a different PDF file and didn't get the error, so it seems to be a result of the failing font.
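To check whether one specific file triggers the error, it can help to save the exact crawled bytes and feed that same file through the local-file path. The following is a minimal sketch using only the JDK, assuming the bytes come from something like page.getContentData(). PdfDumpDemo, looksLikePdf, and dumpAndReread are hypothetical helper names, and the sample bytes in main are placeholders, not a real PDF.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical debugging helpers; not part of crawler4j or Tika.
public class PdfDumpDemo {

    /** Returns true if the data starts with the "%PDF-" header marker. */
    static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < magic.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    /** Writes the bytes to a temp file, reads them back, and deletes the file. */
    static byte[] dumpAndReread(byte[] contentData) {
        try {
            Path path = Files.createTempFile("crawled-", ".pdf");
            try {
                Files.write(path, contentData);
                return Files.readAllBytes(path);
            } finally {
                Files.deleteIfExists(path);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for the crawled content.
        byte[] contentData = "%PDF-1.4\nplaceholder".getBytes(StandardCharsets.US_ASCII);
        System.out.println("has PDF header: " + looksLikePdf(contentData));
        byte[] reread = dumpAndReread(contentData);
        System.out.println("round trip identical: " + Arrays.equals(contentData, reread));
    }
}
```

In a real run the temp file would be kept instead of deleted, so that the failing document can be parsed again outside the crawler.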