Named Entity Recognition (NER) in Solr

Named Entity Recognition, or NER for short, is a powerful technique for recognizing entities within text; typically these entities are places, organizations, or people. For example, given the phrase “Jon works at Searchbox”, a good NER system would report that Jon is a person and Searchbox is an organization. Why is this powerful, especially in Solr? Using this information we can not only propose better suggestions for users searching for things, but, using Solr's faceting capability, we can facet directly on organizations (or people) without having to manually identify them in all of the documents.

In this blog post, building on our two previous slideshares on how to develop search components and request handlers, we'll show you how to embed Stanford's NER library directly into a production-ready plugin which provides all of the benefits mentioned above. We of course provide the full source code packaging here.

I prefer to start with developing the request handler because it allows for much easier testing: by simply passing a query via a URL parameter you can generate many different results (or errors), which greatly speeds up debugging. Compare this with testing a processor factory, where each test requires creating a document and pushing it (via curl?) to the endpoint. Not terribly difficult, but definitely not the easiest approach : )

We start by creating a Maven-based project in NetBeans and make a class called NerHandler which extends RequestHandlerBase. Only one function needs to be overridden to have a working request handler:

@Override
public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {

 

But this is a bit too bare-minimum for production, and especially so for our case. After all, loading the model takes about 10 seconds on my laptop, and we don't want to do that for every single request; we'd rather load it once at startup and serve every request thereafter directly from memory. For this purpose we also override the init function. To figure out what we should be doing in init, I looked at the NERDemo.java file from the Stanford downloadable zip, which provides the most basic usage of the library. It's also important to note, from reading the project's website, that the library itself is thread-safe, meaning we can call it in parallel from different requests and these requests won't crash in the backend (crashes which usually result from global state shared inside a single class).

So, first things first, we can see the loading of the NER model from a compressed archive:
 

String serializedClassifier = "classifiers/english.all.3class.distsim.crf.ser.gz";
AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifierNoExceptions(serializedClassifier);

 
If we look in the classifiers directory, we see that a few different models are provided (and on the website there are models for German and Chinese), so we may want to be able to choose the model at initialization time; given that file name, we then need to load it into memory. Back in our Java class, we override the init function and include this functionality:
 
    @Override
    public void init(NamedList params) {
        try {
            String classifierName = (String) params.get("classifier");
            if (classifierName != null) {
                serializedClassifier = classifierName;
            }
            nerprocessor = new NerSingleton(serializedClassifier);
        } catch (Exception ex) {
            LOGGER.error(ex.toString());
        }
        super.init(params);
    }
 
This relies on our two class-level private variables:
 
private String serializedClassifier = "classifiers/english.all.3class.distsim.crf.ser.gz";
NerSingleton nerprocessor;
 
We can see that we fall back to a default classifier if one is not specified; otherwise we use the one given to us in the solrconfig file. The NerSingleton class is a very simple class which holds a single classifier instance in memory that all threads can share:
 
public class NerSingleton {
    private static AbstractSequenceClassifier classifier; // loaded at initialization time

    public NerSingleton(String loadfrom) {
        classifier = CRFClassifier.getClassifierNoExceptions(loadfrom);
    }

    public static AbstractSequenceClassifier getInstance() {
        return classifier;
    }
}
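
A quick note on the design: strictly speaking this is a singleton by convention only, since every constructor call overwrites the shared static field. That's fine here because init runs once per plugin, but if you wanted to guard against loading the model twice (say, when both the request handler and the processor factory below are configured), a slightly more defensive variant might look like this (an alternative sketch, not the code shipped with this post):

public class NerSingleton {
    private static volatile AbstractSequenceClassifier classifier;

    // Load the model exactly once, no matter how many plugins call init.
    public static synchronized void init(String loadfrom) {
        if (classifier == null) {
            classifier = CRFClassifier.getClassifierNoExceptions(loadfrom);
        }
    }

    public static AbstractSequenceClassifier getInstance() {
        return classifier;
    }
}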
 

We can see that the constructor loads our model from the given path, which is exactly the same line we saw in NERDemo.java. All good so far!

Next we need to handle the actual invocation of the request handler, i.e. get the query, process it, and return the appropriate answer. We'll step through it a chunk at a time:

 

@Override
public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
    numRequests++;
    long startTime = System.currentTimeMillis();

    HashMap<String, HashSet<String>> output = new HashMap<String, HashSet<String>>();

    SolrParams params = req.getParams();
    String q = params.get(CommonParams.Q);

 

The first two lines update some basic statistics which we keep (as discussed below). We then create a data structure for our output: a HashMap from String to HashSet<String>. Each token will be assigned a class (the String key), and we want to group all tokens of a class together without duplicates, hence the HashSet. Finally we read the query parameter, q, on which we will do the actual computation.

 

List<List<CoreLabel>> out = NerSingleton.getInstance().classify(q);
for (List<CoreLabel> sentence : out) {
    for (CoreLabel word : sentence) {
        String tokenclass = word.get(CoreAnnotations.AnswerAnnotation.class);
        String token = word.word();
        if (output.get(tokenclass) == null) {
            output.put(tokenclass, new HashSet<String>());
        }
        output.get(tokenclass).add(token);
    }
}

 

Here we use the NerSingleton instance that we loaded during init (so there is no loading delay) and run the classification on q. We then iterate through each detected sentence, and within it each word, and look up the assigned class. If that class doesn't yet exist as a key in our output map we create it, and finally we store the token. While this may look a bit involved, it closely parallels the code we see in NERDemo.java:

 

String fileContents = IOUtils.slurpFile(args[1]);
List<List<CoreLabel>> out = classifier.classify(fileContents);
for (List<CoreLabel> sentence : out) {
    for (CoreLabel word : sentence) {
        System.out.print(word.word() + '/' + word.get(CoreAnnotations.AnswerAnnotation.class) + ' ');
    }
    System.out.println();
}
 

At this point our output variable contains all of our results, and we just need to pack them into a Solr response so that the core knows how to render an XML (or JSON, CSV, etc.) response:

 

NamedList<ArrayList<String>> results = new NamedList<ArrayList<String>>();
for (String key : output.keySet()) {
    results.add(key, new ArrayList<String>(output.get(key)));
}
rsp.add("results", results);

 

Here we iterate through each token class we saw, create a group for it, and add the entire HashSet as an ArrayList, a type which Solr's backend knows how to serialize. Lastly we add our results to the response and send it away. Of course there is a bit of bookkeeping in case of errors; the whole processing block above is wrapped in a try, which is closed here:

 

       } catch (Exception e) {
            numErrors++;
            LOGGER.error(e.getMessage());
        } finally {
            totalTime += System.currentTimeMillis() - startTime;
        }
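
Putting the chunks together, the method body has roughly this shape (a structural sketch only; the chunks above fill in the elided middle):

@Override
public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
    numRequests++;
    long startTime = System.currentTimeMillis();
    try {
        // ...read q, classify, fill output, build the NamedList, rsp.add(...)
    } catch (Exception e) {
        numErrors++;
        LOGGER.error(e.getMessage());
    } finally {
        totalTime += System.currentTimeMillis() - startTime;
    }
}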

 

For those of us that are interested in some statistics, we override a few additional methods to get a really nice plugin:

 

@Override
public String getDescription() {
    return "Searchbox NER";
}

@Override
public String getVersion() {
    return "1.0";
}

@Override
public String getSource() {
    return "http://www.searchbox.com";
}

@Override
public NamedList<Object> getStatistics() {
    NamedList<Object> all = new SimpleOrderedMap<Object>();
    all.add("requests", "" + numRequests);
    all.add("errors", "" + numErrors);
    all.add("totalTime(ms)", "" + totalTime);
    return all;
}

 

This allows us to set the description, version and source information which will appear in the admin front end of Solr. Lastly we have the statistics function which returns all of the fancy information we’ve been keeping track of.
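
One detail the chunks above gloss over: numRequests, numErrors, and totalTime need to be declared as fields of the handler. The post doesn't show those declarations, so here is a minimal sketch (under heavy concurrent load, AtomicLong would be the safer choice, since ++ on a volatile field isn't atomic):

// Simple bookkeeping counters, read back by getStatistics().
private volatile long numRequests = 0;
private volatile long numErrors = 0;
private volatile long totalTime = 0;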

Now to set it up: since we're planning on using the default classifier for the time being, we simply need to add the following line to our solrconfig.xml:

 

<requestHandler name="/ner" class="com.searchbox.ner.NerHandler" />

 

Add the compiled jar to the classpath so that Solr can load our super cool NER plugin, put the classifier file where Solr expects it (in this case solr-4.2.1/example/classifiers), and away we go! During Solr's bootup, the log confirms that our model was happily loaded:

Loading classifier from /salsasvn/solr-4.2.1/example/classifiers/english.all.3class.distsim.crf.ser.gz … done [6.1 sec].

But the question is: does it work? For example, let's send some text about Solr, taken from its Wikipedia page:

In 2004, Solr was created by Yonik Seeley at CNET Networks as an in-house project to add search capability for the company website. Yonik Seeley along with Grant Ingersoll and Erik Hatcher went on to launch LucidWorks (was Lucid Imagination), a company providing commercial support, consulting and training for Apache Solr search technologies.

And look at the results:

http://192.168.56.101:8983/solr/ner/ner?q=In%202004,%20Solr%20was%20create…..

 

<response>
	<lst name="responseHeader">
		<int name="status">0</int>
		<int name="QTime">95</int>
	</lst>
	<lst name="results">
		<arr name="ORGANIZATION">
			<str>Networks</str>
			<str>CNET</str>
		</arr>
		<arr name="PERSON">
			<str>Hatcher</str>
			<str>Grant</str>
			<str>Yonik</str>
			<str>Seeley</str>
			<str>Erik</str>
			<str>Ingersoll</str>
		</arr>
		<arr name="O">
			<str>to</str>
			<str>In</str>
			<str>2004</str>
			<str>for</str>
...

 

We can see that CNET Networks was correctly identified as an organization (albeit tokenized into “CNET” and “Networks”), and the people involved are properly identified too: Yonik, Erik, and Grant. Total time: 95 ms! Great stuff!

Now the real power isn't in request handlers, but in being able to modify our documents, adding fields with this information so that we can do some faceting. To do this we make a processor factory, which looks extremely similar to our request handler. We create a new class called NerProcessorFactory and have it extend UpdateRequestProcessorFactory. From there we again override the init function:

 

@Override
public void init(NamedList params) {
    String classifierName = (String) params.get("classifier");
    if (classifierName != null) {
        serializedClassifier = classifierName;
    }

    nerprocessor = new NerSingleton(serializedClassifier);

    NamedList queryFieldsTop = (NamedList) params.get("queryFields");

    if (queryFieldsTop == null) {
        throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
                "queryField must be set in the configuration of NerProcessorFactory");
    }
    queryfields = queryFieldsTop.getAll("queryField");
    super.init(params);
}

 

The only major difference is that now we need to specify in solrconfig.xml which fields we would like to analyze to produce the output; we discuss the formatting later. Next we need to have the processor factory create an instance:

 

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest sqr, SolrQueryResponse sqr1, UpdateRequestProcessor next) {
        return new NerProcessor(next, queryfields);
    }

 

The actual work is handled by an internal subclass, which receives the query fields in its constructor:

 

public NerProcessor(UpdateRequestProcessor next, List<String> queryfields) {
    super(next);
    this.queryfields = queryfields;
}
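
For orientation, the constructor lives inside a small subclass of UpdateRequestProcessor; roughly like this (a sketch showing only the pieces this post uses):

class NerProcessor extends UpdateRequestProcessor {
    // names of the solrconfig.xml fields whose text we run NER over
    private final List<String> queryfields;

    public NerProcessor(UpdateRequestProcessor next, List<String> queryfields) {
        super(next);
        this.queryfields = queryfields;
    }

    // processAdd(cmd), shown next, does the actual work
}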
 

Lastly, we do the actual work in the overridden processAdd method:

 

@Override
public void processAdd(AddUpdateCommand cmd) throws IOException {
    HashMap<String, HashSet<String>> output = new HashMap<String, HashSet<String>>();
    SolrInputDocument doc = cmd.getSolrInputDocument();

    StringBuilder sb = new StringBuilder();
    for (String field : queryfields) {
        Collection<Object> mv = doc.getFieldValues(field);
        if (mv != null) {
            for (Object v : mv) {
                sb.append(v.toString()).append(". ");
            }
        }
    }

    String q = sb.toString();

 

First we combine all of the requested query fields into a single string for ease of computation, using a StringBuilder to build our final q string (quite similar to the request handler!). Then we handle the processing the same way as before, filling our output variable:

 

try {
    List<List<CoreLabel>> out = NerSingleton.getInstance().classify(q);
    for (List<CoreLabel> sentence : out) {
        for (CoreLabel word : sentence) {
            String tokenclass = word.get(CoreAnnotations.AnswerAnnotation.class);
            String token = word.word();
            if (output.get(tokenclass) == null) {
                output.put(tokenclass, new HashSet<String>());
            }
            output.get(tokenclass).add(token);
        }
    }

 

The only difference is that this time we add the fields to the document instead of returning them directly as an XML value; this is again surprisingly similar to the request handler:

 

for (String key : output.keySet()) {
    doc.addField(key, new ArrayList<String>(output.get(key)));
}

 
And then, lastly, we need to call super.processAdd(cmd), otherwise our document will never reach the index! Roughly, the end of the method looks like this (a sketch, assuming the same catch-and-log error handling as in the request handler):
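
    } catch (Exception e) {
        LOGGER.error(e.getMessage());
    }
    // Hand the (now enriched) document on to the rest of the update
    // chain; without this call the document is never indexed.
    super.processAdd(cmd);
}

To install this is a bit trickier, as we need to add the processor to an update chain, but you can accomplish it with a copy-and-paste of this configuration: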

 

<updateRequestProcessorChain name="mychain" >
   <processor class="com.searchbox.ner.NerProcessorFactory" >
     <lst name="queryFields">
       <str name="queryField">content</str>
     </lst>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>

 

Here we see that we’re using the field content to determine the language, though we could specify as many fields as we wished. Next we just need to add the chain the update request handler like so:

 

<requestHandler name="/update" class="solr.UpdateRequestHandler">
       <lst name="defaults">
         <str name="update.chain">mychain</str>
       </lst>
  </requestHandler>

 

And we’re good to go! After indexing some documents (via curl, dataimporthandlers, etc), we can do a query and see if we have any results:

 

http://192.168.56.101:8983/solr/ner/select?q=*%3A*&fl=ORGANIZATION%2CPERSON&wt=xml&indent=true&facet=true&facet.field=ORGANIZATION

 

<?xml version="1.0" encoding="UTF-8"?>
<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
        <lst name="params">
            <str name="facet">true</str>
            <str name="fl">ORGANIZATION,PERSON</str>
            <str name="indent">true</str>
            <str name="q">*:*</str>
            <str name="facet.field">ORGANIZATION</str>
            <str name="wt">xml</str>
        </lst>
    </lst>
    <result name="response" numFound="390" start="0">
        <doc>
            <arr name="PERSON">
                <str>Sauyet</str>
                <str>Dave</str>
                <str>Scott</str>
                <str>Fuller</str>
            </arr>
        </doc>
        <doc />
        <doc>
            <arr name="ORGANIZATION">
                <str>BCCI</str>
            </arr>
            <arr name="PERSON">
                <str>Gregg</str>
                <str>Jaeger</str>
                <str>Jon</str>
                <str>Livesey</str>
            </arr>
        </doc>
        <doc>
            <arr name="PERSON">
                <str>Russell</str>
                <str>Hemingway</str>
                <str>Gregg</str>
                <str>James</str>
                <str>Jim</str>
                <str>Allah</str>
                <str>Hoban</str>
                <str>Hogan</str>
            </arr>
        </doc>
        <doc>
            <arr name="ORGANIZATION">
                <str>State</str>
                <str>Iowa</str>
                <str>University</str>
            </arr>
            <arr name="PERSON">
                <str>Warren</str>
                <str>Bruce</str>
                <str>Cobb</str>
                <str>Kurt</str>
                <str>Salem</str>
                <str>Mike</str>
            </arr>
        </doc>
        <doc />
        <doc>
            <arr name="PERSON">
                <str>David</str>
                <str>Einstien</str>
                <str>McAloon</str>
                <str>Einstein</str>
            </arr>
        </doc>
        <doc>
            <arr name="PERSON">
                <str>Bill</str>
            </arr>
        </doc>
        <doc>
            <arr name="PERSON">
                <str>Bill</str>
                <str>Hausmann</str>
                <str>Maddi</str>
            </arr>
        </doc>
        <doc>
            <arr name="PERSON">
                <str>Mozumder</str>
                <str>Bill</str>
                <str>Bobby</str>
                <str>Conner</str>
            </arr>
        </doc>
    </result>
</response>

 

There it is! Our documents now have PERSON and ORGANIZATION fields, correctly populated from the indexed data. Now, can we use this information to help our end users find things better and more easily? The answer is of course a resounding yes. By faceting on this field:

 

http://192.168.56.101:8983/solr/ner/select?q=*%3A*&fl=id&wt=xml&indent=true&facet=true&facet.field=ORGANIZATION

 

We can get some rather cool facets which previously would have been very difficult to extract from the text:

 

<lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
        <lst name="ORGANIZATION">
            <int name="University">26</int>
            <int name="of">22</int>
            <int name="Inc.">15</int>
            <int name="Claire">10</int>
            <int name="Clinic">10</int>
            <int name="Eau">10</int>
            <int name="Islam">10</int>
            <int name="Midelfort">10</int>
            <int name="Department">9</int>

 

This allows users to dig down more rapidly into exactly what they're looking for! Well, that's all for this blog post. Good luck and happy entity recognizing! (P.S. Don't forget that there are other classifiers included in the zip file, so picking the right one should prove a good experiment to ensure you understand how the systems described above work.)

As we mentioned above, the full source code for this tutorial is available here. There are, of course, a few limitations in this version: entities are split (“Stephane Gamard” becomes “Stephane” and “Gamard”), and we load two classifiers. So we also provide production-level code which is more optimized and elegant, should you need it, here.
