pdf - Document contains at least one immense term - Solr indexing error
I am facing an issue where one of my PDF files presumably fails to be indexed in Solr due to its large file size. I have seen replies online advising changing the field type of 'content' to 'text_general', which I have been using for a while, but this particular PDF still cannot be indexed.
The error produced:
Exception writing document id abc.com/files/hugepdf.pdf to the index; possible analysis error: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[66, 65, 82, 73, 78, 71, 32, 71, 76, 79, 66, 65, 76, 32, 79, 80, 80, 79, 82, 84, 85, 78, 73, 84, 73, 69, 83, 32, 85, 77]...', original message: bytes can be at most 32766 in length; got 110482. Perhaps the document has an indexed string field (solr.StrField) which is too large
My current schema for 'text_general':
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.TruncateTokenFilterFactory" prefixLength="100"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
  <analyzer type="multiterm">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Do note that I added the 'TruncateTokenFilterFactory' filter, which helped solve issues with other large PDF files, but this PDF still throws the exception.
Questions
- What is the best way to make it possible to index such PDFs?
- On indexing failure, none of the documents are added to Solr, which wastes effort and takes a long time (a couple of hours), all because of one PDF file that exceeds the max size. Is there a way around this to add the successful documents while rejecting only the specific failing ones?
Indexing PDF content is known to be a 'nightmare'. Text extraction is never 100% correct. I suspect the issue here is that extraction is not working for this PDF and is returning a huge pile of garbage. Truncating is not the best approach; ignoring the oversized terms is better. Using 'text_general' does not help at all here.
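As a sketch of the 'ignore' approach: a LengthFilterFactory in the index analyzer drops any token outside the given bounds instead of failing the whole document. The min/max values below are illustrative assumptions; anything well under Lucene's 32766-byte term limit works:

```xml
<!-- Illustrative only: silently discard immense garbage tokens
     (anything over 1000 chars) rather than failing the document -->
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LengthFilterFactory" min="1" max="1000"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
</analyzer>
```

Legitimate terms are essentially never that long, so the filter only ever removes extraction garbage.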
Some general guidelines would be:
- Do the text extraction outside of Solr. Yes, it is handy to use Solr Cell, but with real-world PDFs and real volumes, in the worst case the extraction process can hang (which is worse than dying). Doing it outside of Solr, in multiple threads, speeds it up and makes Solr more reliable (less stress on it).
- Use a fallback library. You are probably using PDFBox (if you are using Cell). If it fails to extract a file, use a second library (there are several).
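The first guideline can be sketched as a small driver that runs extraction in parallel, outside Solr, so one bad PDF can neither hang nor abort the whole batch. The `extract_text` stub is a placeholder assumption for whatever real extractor you call (PDFBox or Tika via a subprocess, for example):

```python
from concurrent.futures import ThreadPoolExecutor

def extract_text(path):
    # Placeholder: invoke your real extractor here (e.g. PDFBox/Tika
    # in a subprocess); stubbed so the skeleton is runnable.
    return "text of " + path

def extract_all(paths, workers=4, timeout=60):
    """Extract every file independently so a single failing or
    hanging PDF cannot stall or abort the whole batch."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(extract_text, p): p for p in paths}
        for fut, path in futures.items():
            try:
                # The timeout guards against a hung extraction.
                results[path] = fut.result(timeout=timeout)
            except Exception:
                results[path] = None  # record the failure, keep going
    return results
```

Successfully extracted text can then be posted to Solr as plain fields, which also answers the second question: a failed file simply ends up with `None` and is skipped, while every other document still gets indexed.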
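The fallback-library guideline amounts to a simple chain: try extractors in order and take the first non-empty result. The callables passed in are assumptions for your actual primary and secondary libraries:

```python
def extract_with_fallback(path, extractors):
    """Try each extractor in order; return the first non-empty text.
    'extractors' is a list of callables, e.g. a PDFBox-based one
    first and a second library (pdfminer, Tika, ...) as fallback."""
    for extract in extractors:
        try:
            text = extract(path)
            if text and text.strip():
                return text
        except Exception:
            pass  # this extractor failed on this file; try the next
    return None  # every extractor failed; log and skip this file
```

Treating an empty or whitespace-only result as a failure matters in practice: a broken extraction often returns garbage or nothing rather than raising.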