Java - Custom email filter in Solr doesn't work
I have created a Solr filter that retrieves the email address from a given text and returns only the email.
This is the code:
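import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class NormalizeAffliationFilter extends TokenFilter {

    private CharTermAttribute charTermAttr;

    protected NormalizeAffliationFilter(TokenStream ts) {
        super(ts);
        this.charTermAttr = addAttribute(CharTermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String token = charTermAttr.toString();
        Pattern pattern = Pattern.compile("([a-z0-9_.-]+)@([a-z0-9_.-]+[a-z])");
        Matcher matcher = pattern.matcher(token);
        // Collect every email-like match found in the current token.
        StringBuilder sb = new StringBuilder();
        while (matcher.find()) {
            sb.append(matcher.group());
        }
        sb.append(" ");
        String email = sb.toString();
        // Replace the token text with the extracted email.
        charTermAttr.setEmpty();
        charTermAttr.copyBuffer(email.toCharArray(), 0, email.length());
        return true;
    }
}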
I've added the field type and the field to schema.xml:
<fieldType name="emailnormalized" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="ir.pandapp.NormalizeAffliationFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="ir.pandapp.NormalizeAffliationFilterFactory"/>
    <filter class="ir.pandapp.NormalizeAffliationFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="mods.affiliation" type="emailnormalized" indexed="true" stored="true" multiValued="true"/>
I've added System.out logging to the code, and it works: the filter receives the token and returns the email token.
I've also tested it on the Analysis page in Solr:
[Screenshot: Solr Analysis page]
But when I search in Solr, it doesn't work!
For example, if the field value is "aaaaemail:something@something.com" and I search for "aaaa", the document is returned!
It should only be returned when I search for "something@something.com". I have checked the Schema Browser, and it has indexed the emails in the correct form. I have no idea what to check next. Does anyone know what I'm missing?
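The string handling inside incrementToken() can be tried outside Solr. The following is a minimal sketch that assumes the same regular expression and StringBuilder logic as the filter shown above; the class name EmailExtractSketch and the method rewrite are made up for illustration, and the brackets in the output only make whitespace visible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailExtractSketch {
    // Same pattern as in the filter above.
    private static final Pattern EMAIL =
        Pattern.compile("([a-z0-9_.-]+)@([a-z0-9_.-]+[a-z])");

    // Mirrors the body of incrementToken(): concatenate all matches,
    // then append a single space (hypothetical helper, not Lucene API).
    static String rewrite(String token) {
        Matcher matcher = EMAIL.matcher(token);
        StringBuilder sb = new StringBuilder();
        while (matcher.find()) {
            sb.append(matcher.group());
        }
        sb.append(" ");
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "[something@something.com ]"
        System.out.println("[" + rewrite("something@something.com") + "]");
        // prints "[ ]"
        System.out.println("[" + rewrite("aaaaemail") + "]");
    }
}
```

With this logic every rewritten term ends with a space, and a token that contains no address becomes a term consisting of a single space.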
No custom code is required. You need the inverse of what is described in the question about removing email addresses from Solr indexing.
As such, make use of the UAX29URLEmailTokenizer, which adds type metadata to the tokens of a text, and use the TypeTokenFilter to let only the types of your liking pass. In your case, that is <EMAIL>.
Alter the field type emailnormalized in your schema.xml as follows:
<fieldType name="emailnormalized" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.TypeTokenFilterFactory" types="email_type.txt" useWhitelist="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Create a file named email_type.txt in your conf folder; it should be in the same place where your schema.xml resides. The file needs one line of content:
<EMAIL>
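The effect of whitelisting by token type can be illustrated without Solr. This is only a plain-Java sketch of the idea, not Lucene's implementation: the Token class, the keepTypes helper, and the hand-assigned type labels are assumptions for illustration, and the real tokenizer may split a given text differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

class TypeWhitelistSketch {
    /** Hypothetical stand-in for a token: its text plus a type label. */
    static final class Token {
        final String term;
        final String type;

        Token(String term, String type) {
            this.term = term;
            this.type = type;
        }
    }

    // Keep only the terms whose type is listed in allowedTypes,
    // which is what a type whitelist amounts to.
    static List<String> keepTypes(List<Token> tokens, Set<String> allowedTypes) {
        List<String> kept = new ArrayList<>();
        for (Token t : tokens) {
            if (allowedTypes.contains(t.type)) {
                kept.add(t.term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hand-written tokens; the type labels are assigned by hand here.
        List<Token> tokens = Arrays.asList(
            new Token("aaaaemail", "<ALPHANUM>"),
            new Token("something@something.com", "<EMAIL>"));
        Set<String> allowed = Collections.singleton("<EMAIL>");
        // prints "[something@something.com]"
        System.out.println(keepTypes(tokens, allowed));
    }
}
```

Only the term typed <EMAIL> survives, so a search for the surrounding text no longer has a term to match in this field.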
Should you have trouble with the delimiters used by the tokenizer, you can tweak that using a PatternReplaceCharFilter. CharFilters go before the tokenizer. The following will work for the sample text you have in your image, replacing colons with blanks.
<fieldType name="emailnormalized" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern=":" replacement=" "/>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.TypeTokenFilterFactory" types="email_type.txt" useWhitelist="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
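To see what the char filter does to the raw text before tokenization, the same regular expression and replacement can be applied in plain Java. This is a minimal sketch, assuming the sample value from the question; the class name ColonReplaceSketch and the method replaceColons are made up for illustration and are not part of Solr.

```java
import java.util.regex.Pattern;

class ColonReplaceSketch {
    // Same pattern and replacement as in the charFilter configuration.
    private static final Pattern COLON = Pattern.compile(":");

    static String replaceColons(String text) {
        return COLON.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // prints "aaaaemail something@something.com"
        System.out.println(replaceColons("aaaaemail:something@something.com"));
    }
}
```

After the replacement, the address is separated from the preceding text by whitespace before the tokenizer runs.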