Classifying Content with Apache SOLR

8 March, 2016

In my previous blog post, Applying Machine Learning for Smart Content Solutions, we emphasised the importance of applying machine learning techniques to organise content. One drawback of that approach, which used Apache Mahout classification, is that you need a set of already classified content to train the initial model. Once the model is trained on a given data set, it becomes outdated as new content is added, so it requires scheduled re-training.

In typical content management projects, you start off with an empty repository, and you want the system to learn as your users create content and classify it manually. The ideal user experience for content classification within content and document management systems would be:

1. User creates or uploads a document
2. System suggests a classification for the content
3. User either accepts it or changes the classification
4. Content is saved and the system learns from any change or confirmation by the user
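
The workflow above can be sketched in a few lines of Python. This is only a toy illustration, assuming an in-memory store and a naive "most shared terms" similarity; the class and method names are hypothetical and have nothing to do with Solr's API.

```python
class SuggestingRepository:
    """Toy in-memory repository that suggests a category and learns from the user's decision."""

    def __init__(self):
        # Each entry is (set of terms in the document, category confirmed by the user).
        self.documents = []

    def suggest(self, text):
        """Suggest the category of the stored document sharing the most terms with the text."""
        terms = set(text.lower().split())
        best_category, best_overlap = None, 0
        for doc_terms, category in self.documents:
            overlap = len(terms & doc_terms)
            if overlap > best_overlap:
                best_category, best_overlap = category, overlap
        return best_category  # None while nothing similar has been stored yet

    def save(self, text, suggested, user_choice=None):
        """Store the document with the user's correction, or the accepted suggestion."""
        final = user_choice if user_choice is not None else suggested
        self.documents.append((set(text.lower().split()), final))
        return final


repo = SuggestingRepository()
repo.save("cricket match report", repo.suggest("cricket match report"), user_choice="sport")
print(repo.suggest("cricket world cup"))
```

Because every saved document is immediately part of the store, the next suggestion already reflects the user's latest confirmation or correction, with no separate re-training step.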

We built a content classification solution using Apache Solr to achieve this. Apache Solr is a widely used open source search engine and is available out of the box for content and document management systems such as Alfresco and Drupal.

In this post, we discuss the approach as well as its pros and cons.

How to organise content with text classification?

Apache Solr provides document management features such as faceted search, metadata management and recommendations for effective information retrieval.

However, as the volume of content grows rapidly, it becomes difficult to organise it using “only” manually created metadata such as labels or categories for the documents.

This is where machine learning techniques such as text classification come into play.

Using text categorisation, categories can be assigned automatically to new documents based on their similarity in content to existing documents. Those categories can then be used in various content organisation and retrieval applications such as faceted search, content recommendation and document taxonomy.

Lucene classification

Apache Solr is a popular, scalable and fault-tolerant open source enterprise search platform built on Apache Lucene. Enterprise content management systems such as Alfresco and Drupal use Apache Solr to provide search capabilities to the end user.

Lucene’s classification module provides two classification algorithms, namely K-Nearest Neighbour (KNN) and Naive Bayes, to enable text classification using the content and associated metadata.

The K-nearest neighbour algorithm uses the Apache Solr More Like This (MLT) feature to classify new text documents based on the categories of existing documents with similar content and metadata. The Naive Bayes algorithm uses the exact term frequency information already available in the index to classify new documents with a probabilistic approach (calculating conditional probabilities, as in Bayesian statistics).

There is also a proposal to implement a MaxEnt-based classifier, which takes word correlation into account, as a future enhancement to Lucene classification.
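
To make the KNN idea concrete, here is a minimal Python sketch. It is not Lucene's implementation: Lucene retrieves the neighbours with a More Like This query, whereas this sketch assumes a plain cosine similarity over raw term counts as a stand-in. The function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def knn_classify(text, labelled_docs, k=3):
    """Return the majority category among the k most similar labelled documents."""
    query = term_vector(text)
    scored = sorted(
        ((cosine(query, term_vector(doc)), category) for doc, category in labelled_docs),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Only neighbours that share at least one term get a vote.
    votes = Counter(category for score, category in scored[:k] if score > 0)
    return votes.most_common(1)[0][0] if votes else None


docs = [
    ("cricket bat ball wicket", "sport"),
    ("football goal match ball", "sport"),
    ("election vote parliament", "politics"),
    ("budget tax parliament vote", "politics"),
]
print(knn_classify("ball match wicket", docs, k=3))
```

Note that there is no training step: the labelled documents are consulted directly at classification time, which is why newly added documents influence the result immediately.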

Pros and cons

The advantage of Lucene document classification is that all the available documents, including the most recent, are considered when assigning a category to a new document. The simplicity of the implementation is another benefit.

Moreover, this approach can improve the semantic consistency with which documents are organised. For example, if there are articles about the sport cricket, users might use different terms as categories for those documents (e.g., sport, Sports, games, game). This approach suggests categories that are already in use for new documents, which reduces this inconsistency.

However, because there is no separate training phase, most of the processing over existing documents takes place at inference time, so the computation time for classifying a new document can be significantly higher.

If the Apache Solr classification module classifies a document incorrectly and a user doesn’t correct it, that error will affect the accuracy of the categorisation of new documents, because both human-generated and machine-generated categories are taken into consideration when classifying new documents. To overcome this disadvantage, we can present the machine-generated category to the end user as a suggestion rather than as a set category. This is a way of incorporating user feedback into a machine learning application: if the end user feels that the category is incorrect, they can reject the suggested category for the document being added.

We tested Lucene classification using the 20 Newsgroups dataset, which consists of approximately 20,000 newsgroup documents partitioned (nearly) evenly across 20 different newsgroups.

During the evaluation, we found that the KNN algorithm is much faster than the Naive Bayes algorithm when inferring categories for new documents. As the input size increases, the computation time of the Naive Bayes algorithm also increases considerably (by approximately 2X). In comparison, the computation time of KNN increases only slightly with input size.

KNN was also the more accurate of the two. The accuracy of the KNN algorithm is around 80% on the Newsgroups dataset, whereas the accuracy of the Naive Bayes algorithm is around 70%. For KNN, cross-validation can also be performed for different values of K (the number of neighbours) to find the value that gives the best accuracy. As a rule of thumb, K is often set to the square root of the number of training samples.

Accordingly, we decided that the KNN algorithm is more suitable for our solution than the Naive Bayes algorithm.
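
The search for a good K can be sketched as follows. This is an illustrative Python example, not the evaluation code we used: it assumes a simplified KNN that ranks neighbours by the number of shared terms, and all function names are hypothetical.

```python
import math
from collections import Counter


def knn_predict(text, training, k):
    """Majority category among the k training documents sharing the most terms with text."""
    terms = set(text.lower().split())
    ranked = sorted(
        training,
        key=lambda item: len(terms & set(item[0].lower().split())),
        reverse=True,
    )
    votes = Counter(category for _, category in ranked[:k])
    return votes.most_common(1)[0][0]


def cross_validated_accuracy(samples, k, folds=5):
    """Accuracy of knn_predict when each fold in turn is held out as the test set."""
    correct = 0
    for fold in range(folds):
        test = samples[fold::folds]
        train = [s for i, s in enumerate(samples) if i % folds != fold]
        correct += sum(knn_predict(text, train, k) == category for text, category in test)
    return correct / len(samples)


def choose_k(samples, candidates, folds=5):
    """Pick the candidate K with the highest cross-validated accuracy."""
    return max(candidates, key=lambda k: cross_validated_accuracy(samples, k, folds))


def sqrt_rule_k(n_samples):
    """Rule-of-thumb starting point: K close to the square root of the training set size."""
    return max(1, round(math.sqrt(n_samples)))
```

The square-root rule gives a reasonable starting value (roughly 141 for 20,000 samples), and cross-validation over a handful of candidates around it can then refine the choice.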

Implementation

Each document added to Apache Solr goes through a chain of update request processors, so we implemented a custom update request processor (a Solr plug-in) to classify each document.

Once the document is received by the custom update request processor, it is classified by the Lucene classifier using either the Naive Bayes or the KNN algorithm, and the suggested category value is written to a new Apache Solr field for classification.

Some further improvements to the Apache Solr/Lucene classification module, on using selected fields for classification, are available here.
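
The flow through the processor can be modelled conceptually as below. In Solr itself such a plug-in is written in Java; this Python sketch only illustrates the chain-of-processors idea, and the class names, method names and field names (content, suggested_category) are assumptions made for the example, not Solr's API or our actual schema.

```python
class ClassifyingProcessor:
    """Conceptual stand-in for an update processor: adds a suggested category, then passes the document on."""

    def __init__(self, classify, input_field="content",
                 class_field="suggested_category", next_processor=None):
        self.classify = classify              # any callable: text -> category or None
        self.input_field = input_field
        self.class_field = class_field
        self.next_processor = next_processor

    def process_add(self, doc):
        text = doc.get(self.input_field)
        # Never overwrite a value that is already present on the document.
        if text and self.class_field not in doc:
            category = self.classify(text)
            if category is not None:
                doc[self.class_field] = category
        if self.next_processor is not None:
            return self.next_processor.process_add(doc)
        return doc


class CollectingProcessor:
    """End of the chain: stands in for the step that writes the document to the index."""

    def __init__(self):
        self.indexed = []

    def process_add(self, doc):
        self.indexed.append(doc)
        return doc


sink = CollectingProcessor()
chain = ClassifyingProcessor(lambda text: "sport" if "cricket" in text else None,
                             next_processor=sink)
chain.process_add({"id": "1", "content": "cricket scores from the weekend"})
print(sink.indexed[0]["suggested_category"])
```

Writing the result to a separate suggestion field, rather than to the category field itself, keeps the machine's output distinguishable from categories that a user has confirmed.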

More information on Lucene classification can be found in this presentation.
