pdf - Document contains at least one immense term - Solr indexing error
I am facing an issue where one of my PDF files presumably fails to be indexed in Solr due to its large file size. I have seen replies online advising changing the field type of 'content' to 'text_general', which I have been using for a while, but this particular PDF still cannot be indexed.
The error produced:
Exception writing document id abc.com/files/hugepdf.pdf to the index; possible analysis error: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[66, 65, 82, 73, 78, 71, 32, 71, 76, 79, 66, 65, 76, 32, 79, 80, 80, 79, 82, 84, 85, 78, 73, 84, 73, 69, 83, 32, 85, 77]...', original message: bytes can be at most 32766 in length; got 110482. Perhaps the document has an indexed string field (solr.StrField) which is too large
My current schema for 'text_general':
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.TruncateTokenFilterFactory" prefixLength="100"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
  <analyzer type="multiterm">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Do note that I added the 'TruncateTokenFilterFactory' filter, which helped solve issues with other large PDF files, but this PDF still throws the exception.
Questions
- What is the best way to make it possible to index such PDFs?
- On indexing failure, none of the documents are added to Solr, which wastes effort and takes a long time (a couple of hours), all because of one PDF file that exceeds the max size. Is there a way around this to add the successful documents while rejecting only the specific failing ones?
Indexing PDF content is known to be a 'nightmare'. Text extraction is never 100% correct. I suspect the issue here is that extraction is not working for this PDF and is returning a huge pile of garbage. Truncating is not the best approach; ignoring the oversized terms is better. Using 'text_general' does not help at all here.
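As a sketch of the 'ignore' approach: a LengthFilterFactory in the index analyzer drops any token outside the given bounds instead of failing the whole document. The min/max values below are illustrative assumptions; anything well under Lucene's 32766-byte term limit works:

```xml
<!-- Illustrative only: silently discard immense garbage tokens
     (anything over 1000 chars) rather than failing the document -->
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LengthFilterFactory" min="1" max="1000"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
</analyzer>
```

Legitimate terms are essentially never that long, so the filter only ever removes extraction garbage.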
Some general guidelines would be:
- Do the text extraction outside of Solr. Yes, it is handy to use Solr Cell, but with real-world PDFs and real volumes, in the worst case the extraction process can hang (which is worse than dying). Doing it outside of Solr, in multiple threads, speeds it up and makes Solr more reliable (less stress on it).
- Use a fallback library. You are probably using PDFBox (if you are using Cell). If it fails to extract a file, use a second library (there are several).
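The first guideline can be sketched as a small driver that runs extraction in parallel, outside Solr, so one bad PDF can neither hang nor abort the whole batch. The `extract_text` stub is a placeholder assumption for whatever real extractor you call (PDFBox or Tika via a subprocess, for example):

```python
from concurrent.futures import ThreadPoolExecutor

def extract_text(path):
    # Placeholder: invoke your real extractor here (e.g. PDFBox/Tika
    # in a subprocess); stubbed so the skeleton is runnable.
    return "text of " + path

def extract_all(paths, workers=4, timeout=60):
    """Extract every file independently so a single failing or
    hanging PDF cannot stall or abort the whole batch."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(extract_text, p): p for p in paths}
        for fut, path in futures.items():
            try:
                # The timeout guards against a hung extraction.
                results[path] = fut.result(timeout=timeout)
            except Exception:
                results[path] = None  # record the failure, keep going
    return results
```

Successfully extracted text can then be posted to Solr as plain fields, which also answers the second question: a failed file simply ends up with `None` and is skipped, while every other document still gets indexed.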
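The fallback-library guideline amounts to a simple chain: try extractors in order and take the first non-empty result. The callables passed in are assumptions for your actual primary and secondary libraries:

```python
def extract_with_fallback(path, extractors):
    """Try each extractor in order; return the first non-empty text.
    'extractors' is a list of callables, e.g. a PDFBox-based one
    first and a second library (pdfminer, Tika, ...) as fallback."""
    for extract in extractors:
        try:
            text = extract(path)
            if text and text.strip():
                return text
        except Exception:
            pass  # this extractor failed on this file; try the next
    return None  # every extractor failed; log and skip this file
```

Treating an empty or whitespace-only result as a failure matters in practice: a broken extraction often returns garbage or nothing rather than raising.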