pdf - "Document contains at least one immense term" - Solr indexing error


I am facing an issue where one of my PDF files presumably fails to be indexed by Solr due to its large file size. I have seen replies online advising changing the field type of 'content' to 'text_general', which I have already been using for a while, yet this particular PDF still cannot be indexed.

The error produced:

  Exception writing document id abc.com/files/hugepdf.pdf to the index; possible analysis error: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[66, 65, 82, 73, 78, 71, 32, 71, 76, 79, 66, 65, 76, 32, 79, 80, 80, 79, 82, 84, 85, 78, 73, 84, 73, 69, 83, 32, 85, 77]...', original message: bytes can be at most 32766 in length; got 110482. Perhaps the document has an indexed string field (solr.StrField) which is too large

Current schema for 'text_general':

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishMinimalStemFilterFactory"/>
      <filter class="solr.TruncateTokenFilterFactory" prefixLength="100"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishMinimalStemFilterFactory"/>
    </analyzer>
    <analyzer type="multiterm">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Do note that adding the 'TruncateTokenFilterFactory' filter helped solve the issue for other large PDF files, but this PDF still throws the exception.

Questions:

  1. What is the best way to make it possible to index such PDFs?
  2. On indexing failure, none of the indexes are added to Solr, which wastes effort and takes a long time (a couple of hours) just because one PDF file exceeds the max size. Is there a way around this, so that successful indexes are added while only the specific failing ones are rejected?
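One common workaround for question 2 (not from the original post, just a general pattern) is to send each document as its own update request and catch failures individually, so a single oversized PDF does not abort the whole batch. A minimal sketch, with the actual Solr call abstracted behind a function you supply (e.g. a pysolr call or a plain HTTP POST):

```python
def index_all(docs, index_one):
    """Index each document individually; collect failures instead of aborting.

    `docs` is an iterable of documents; `index_one` is any callable that
    sends a single document to Solr and raises an exception on failure.
    Returns (indexed, failed) so the bad documents can be inspected later.
    """
    indexed, failed = [], []
    for doc in docs:
        try:
            index_one(doc)          # one update request per document
            indexed.append(doc)
        except Exception as err:    # e.g. the "immense term" analysis error
            failed.append((doc, str(err)))
    return indexed, failed
```

If per-document requests are too slow, batching in small groups and retrying a failed group one document at a time is a middle ground.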

Indexing PDF content is a known 'nightmare'. There is never 100% correct text extraction. I suspect the issue here is that extraction is not working for this particular PDF, and it is returning a huge pile of garbage. Truncating is not the best approach; ignoring such tokens is better. Using 'text_general' is not relevant here at all.
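If you want the analyzer to drop oversized tokens instead of truncating them, Solr ships a LengthFilterFactory that discards any token outside a given character range. A sketch of what the index analyzer could look like (the max value of 255 is an illustrative choice, not from the original post; pick a limit that suits your data, well under Lucene's 32766-byte cap):

  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <!-- drop (rather than truncate) any token longer than 255 characters -->
    <filter class="solr.LengthFilterFactory" min="1" max="255"/>
  </analyzer>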

Some general guidelines would be:

  1. Do the text extraction outside of Solr. Yes, it is handy to use Solr Cell, but with real-world PDFs and real volumes, in the worst case the process will hang (which is worse than dying). Done outside of Solr, in multiple threads, it will be faster and will make Solr more reliable (less stress on it).
  2. Use a fallback library. You are probably using PDFBox (if you are using Solr Cell). If it fails to extract a file, use a second library (there are several).
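The fallback idea in point 2 can be sketched as a small driver that tries extractors in order. The extractor functions here are placeholders: in practice the first might wrap PDFBox or Apache Tika and the second something like pdftotext or pdfminer; none of those libraries are shown, only the fallback logic:

```python
def extract_with_fallback(path, extractors):
    """Try each extractor in turn; return the first non-empty result.

    `extractors` is a list of callables taking a file path and returning
    extracted text; a callable signals failure by raising or by returning
    an empty/whitespace-only string.
    """
    errors = []
    for extract in extractors:
        try:
            text = extract(path)
            if text and text.strip():
                return text
            errors.append(f"{extract.__name__}: empty output")
        except Exception as err:
            errors.append(f"{extract.__name__}: {err}")
    # All extractors failed: surface every error so the file can be triaged
    raise RuntimeError(f"all extractors failed for {path}: {'; '.join(errors)}")
```

Running this per file from a thread pool (point 1) keeps the slow or hanging extractions out of Solr's process entirely.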
