Lemmatizer / Stemmer for Russian: how to use in your code

This post walks you through using the Lemmatizer library for the Russian language, which can be ordered by sending a request to [email protected].

First off, what is a lemmatizer? When you have lots of data in a morphologically rich language (i.e. a natural language with a lot of variation per word, expressed through word endings / prefixes), you usually want to find the base form of a word, also called the lemma (hence, lemmatizer). Along with that, you try to resolve the part of speech (POS), i.e. whether the word is a noun, verb, adjective or something else. Once you have found both the base form and the POS tag, you can store them in your database for further processing.

Let’s say your system is a search engine over texts in Russian. To increase the recall of your search engine, you would like to maximize the document coverage of a user query, no matter which word forms the query was formulated in. Let’s imagine the user query is:

рестораны Москвы

(restaurants of Moscow)

The first word is the plural of ресторан (restaurant) and the second word is the genitive of Москва (Moscow). Let’s run both words through the lemmatizer API:

    
import info.semanticanalyzer.morph.ru.MorphAnalyzer;
import info.semanticanalyzer.morph.ru.MorphAnalyzerConfig;
import info.semanticanalyzer.morph.ru.MorphAnalyzerLoader;
import info.semanticanalyzer.morph.ru.PartOfSpeech;
import info.semanticanalyzer.tok.GenericFlexTokenizer;
import info.semanticanalyzer.tok.Token;
import info.semanticanalyzer.tok.Tokenizer;
import info.semanticanalyzer.util.Charsets;
import info.semanticanalyzer.util.IOUtils;
import info.semanticanalyzer.morph.ru.MorphDesc;

import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.util.List;
import java.util.Properties;

public class LemmatizerRuTest {

    public void testBlogPostExample() {
        String phrase = "рестораны Москвы";

        try {
            // Load the lemmatizer configuration and build the analyzer.
            // Properties.load() throws IOException, so it belongs inside the try block.
            File propsFile = new File("conf/lemmatizer-ru.properties");
            Properties properties = new Properties();
            properties.load(new StringReader(IOUtils.readFile(propsFile, Charsets.UTF_8)));
            MorphAnalyzer analyzer = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));

            // Tokenize the lowercased query.
            Tokenizer tokenizer = new GenericFlexTokenizer(new StringReader(phrase.toLowerCase()), true);
            Token reusableToken = Token.newReusableToken();
            while ((reusableToken = tokenizer.getNextToken(reusableToken)) != null) {
                String token = reusableToken.getText();
                // Ask for the most frequent analysis of this token.
                MorphDesc morphDescription = analyzer.analyzeBest(token);
                if (morphDescription != null && morphDescription.getLemma() != null) {
                    info("Most frequent lemma of '" + token + "' is " + morphDescription.getLemma());
                    info("Its POS tag: " + morphDescription.getPos());
                }
            }
        } catch (IOException e) {
            // Keep the original exception as the cause.
            throw new RuntimeException("testBlogPostExample failed: " + e.getMessage(), e);
        }
    }

    private void info(String msg) {
        System.out.println("INFO " + msg);
    }
}

The code above takes the original user query and tokenizes it using the GenericFlexTokenizer, which suits generic Russian texts and is part of the lemmatizer package. If you are more into social media processing, there is a TwitterTokenizer at your service. In the while loop each token is then analyzed, and its most frequent lemma and POS tag are extracted and printed to standard output (the console). The frequency is based on the lemma’s weight, which is encoded in the lemmatizer’s dictionary. If, however, you don’t want only the most frequent lemma, you can list all lemma candidates by calling the method analyzer.analyze(). Because the query is lowercased before tokenization, the tokens and lemmas are printed in lowercase. The code produces the following output:

INFO Most frequent lemma of 'рестораны' is ресторан
INFO Its POS tag: NOUN
INFO Most frequent lemma of 'москвы' is москва
INFO Its POS tag: NOUN
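
To make the idea of a weight-based "most frequent lemma" more concrete, here is a minimal sketch of how such a choice could work. It does not use the lemmatizer library: the Candidate class, the best() method and the weights are all hypothetical and made up for illustration, and the library’s real dictionary format and MorphDesc type may look quite different.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class BestLemmaSketch {

    // Hypothetical stand-in for one analysis candidate; NOT the library's MorphDesc.
    static final class Candidate {
        final String lemma;
        final String pos;
        final double weight;

        Candidate(String lemma, String pos, double weight) {
            this.lemma = lemma;
            this.pos = pos;
            this.weight = weight;
        }
    }

    // Returns the candidate with the highest weight, or empty if the list is empty.
    static Optional<Candidate> best(List<Candidate> candidates) {
        return candidates.stream().max(Comparator.comparingDouble(c -> c.weight));
    }

    public static void main(String[] args) {
        // Made-up weights for the ambiguous word form "стали": it can be
        // a form of the verb "стать" or of the noun "сталь".
        List<Candidate> candidates = Arrays.asList(
                new Candidate("стать", "VERB", 0.8),
                new Candidate("сталь", "NOUN", 0.2));

        best(candidates).ifPresent(c ->
                System.out.println("Best lemma: " + c.lemma + ", POS: " + c.pos));
    }
}
```

With these illustrative weights the verb reading wins. Listing all candidates instead of only the best one is what lets an application apply its own ranking, for example by preferring nouns in a search query.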

Now, having both base forms “ресторан” and “москва”, you can search over your documents and find hits like: лучший ресторан в Москве (best restaurant in Moscow), самый уютный ресторан Москвы (the coziest restaurant of Moscow). You could also expand the original words into synonyms and match documents using one more condition: the POS tag. This would give you more hits, but constrained to the parts of speech of the original user query, which should increase the precision of your search.
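
The lemma-based matching described above can be sketched in a few lines. This is only an illustration under stated assumptions: the small LEMMAS table is hypothetical hand-written data standing in for the real lemmatizer, the class and method names are invented for this example, and a production search engine would use an inverted index rather than scanning every document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class LemmaSearchSketch {

    // Toy word form -> lemma table standing in for a real lemmatizer (hypothetical data).
    static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("рестораны", "ресторан");
        LEMMAS.put("ресторан", "ресторан");
        LEMMAS.put("москвы", "москва");
        LEMMAS.put("москве", "москва");
    }

    // Lowercases the text, splits it on non-letter characters and maps each
    // token to its lemma; tokens missing from the table are kept unchanged.
    static Set<String> lemmas(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                result.add(LEMMAS.getOrDefault(token, token));
            }
        }
        return result;
    }

    // Returns the documents that contain every lemma of the query.
    static List<String> search(String query, List<String> documents) {
        Set<String> queryLemmas = lemmas(query);
        List<String> hits = new ArrayList<>();
        for (String document : documents) {
            if (lemmas(document).containsAll(queryLemmas)) {
                hits.add(document);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList(
                "лучший ресторан в Москве",
                "самый уютный ресторан Москвы",
                "кафе в Петербурге");

        for (String hit : search("рестораны Москвы", documents)) {
            System.out.println(hit);
        }
    }
}
```

Here the query and the documents are reduced to lemma sets before comparison, so the first two documents match even though neither contains the exact word form рестораны. Matching on raw word forms would miss both of them.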