
Annotating sentiment with RussianSentimentAnalyzer API in Java

Hello!

In this post we will show how easy it is to start using the RussianSentimentAnalyzer API on Mashape from your Java code.

package com.semanticanalyzer;

import com.mashape.unirest.http.HttpResponse;
import com.mashape.unirest.http.JsonNode;
import com.mashape.unirest.http.Unirest;
import com.mashape.unirest.http.exceptions.UnirestException;

public class RussianSentimentAnalyzerMashapeClient {

    private final static String mashapeKey = "[PUT_YOUR_MASHAPE_KEY_HERE]";

    public static void main(String[] args) throws UnirestException {

        String textToAnnotate = "'ВТБ кстати неплохой банк)'";
        String targetObject = "'ВТБ'";

        // These code snippets use an open-source library. http://unirest.io/java
        HttpResponse<JsonNode> response = Unirest.post("https://russiansentimentanalyzer.p.mashape.com/rsa/sentiment/polarity/json/")
                .header("X-Mashape-Key", mashapeKey)
                .header("Content-Type", "application/json")
                .header("Accept", "application/json")
                .body("{'text':" + textToAnnotate + ",'object_keywords':" + targetObject + ",'output_format':'json'}")
                .asJson();

        System.out.println("Input text = " + textToAnnotate + "\n" + "Target object:" + targetObject);
        System.out.println("RussianSentimentAnalyzer response:" + response.getBody().toString());
    }
}

In the code snippet above we’ve used Mashape’s Unirest library, which makes HTTP requests in Java super easy.
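
One detail worth noting: the request body above is assembled by string concatenation, with single quotes baked into the variable values. If you prefer to build standard, properly escaped JSON, you can use org.json’s JSONObject, which Unirest already depends on. Here is a minimal sketch (the RequestBodyBuilder class is our own illustration; whether the endpoint accepts the double-quoted form as well as the single-quoted one above is worth checking against the API docs):

import org.json.JSONObject;

public class RequestBodyBuilder {

    /** Builds the request body as standard JSON, escaping quotes and special characters for us. */
    static String buildBody(String text, String objectKeywords) {
        return new JSONObject()
                .put("text", text)
                .put("object_keywords", objectKeywords)
                .put("output_format", "json")
                .toString();
    }

    public static void main(String[] args) {
        // Note: no manual quoting around the values is needed here.
        System.out.println(buildBody("ВТБ кстати неплохой банк)", "ВТБ"));
    }
}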

All you really need to do is register at mashape.com, sign up for the RussianSentimentAnalyzer API, and insert your unique Mashape key into the code in place of “PUT_YOUR_MASHAPE_KEY_HERE”, as the value of the mashapeKey variable.

If everything has been set up right, run the code and you should see the following output:

Input text = 'ВТБ кстати неплохой банк)'
Target object:'ВТБ'
RussianSentimentAnalyzer response:{"sentiment":"POSITIVE","synonyms":"[ВТБ]"}

Now you can easily hook the API up to your cool Java app and annotate texts in Russian for sentiment!
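
To branch on the polarity programmatically rather than just printing the raw JSON, you can read the field straight out of Unirest’s JsonNode. Below is a minimal sketch assuming the response shape shown above; the PolarityExtractor class and its extractPolarity helper are our own illustration, not part of the API:

import com.mashape.unirest.http.JsonNode;

public class PolarityExtractor {

    /** Pulls the "sentiment" field (e.g. POSITIVE) out of an RSA JSON response body. */
    static String extractPolarity(JsonNode responseBody) {
        return responseBody.getObject().getString("sentiment");
    }
}

With that in place, if ("POSITIVE".equals(PolarityExtractor.extractPolarity(response.getBody()))) { ... } lets you react to positive mentions of your target object.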

You’ll find the code on our GitHub here: https://github.com/semanticanalyzer/nlproc_sdk_sample_code

Keep calm and use an API

RussianSentimentAnalyzer: the API on Mashape

sentiment analysis

In the previous post we announced the sentiment analysis API for the Russian language. The API is currently in invitation-only beta. To get an invitation, register at https://www.mashape.com and send your user id to info[at]semanticanalyzer.info.

The API ships with documentation, as well as integration examples in Java, Node, PHP, Python, Objective-C, Ruby and .NET.

RussianSentimentAnalyzer API

Lemmatizer / Stemmer for Russian: how to use in your code

This post will guide you through the usage of the Lemmatizer library for the Russian language, which can be ordered by sending a request to [email protected]

First off, what is a lemmatizer? When you have lots of data in morphologically rich languages (i.e. natural languages with a lot of variation per word, expressed through word endings and prefixes), you usually want to find the base form of a word, also called the lemma (hence, lemmatizer). Along with that, you try to resolve the part of speech (POS), i.e. whether the word is a noun, verb, adjective or something else. Once you have found both the base form and a POS tag, you can store them in your databases for further processing.

Let’s say your system is a search engine over texts in Russian. To increase the recall of your search engine, you would like to maximize the document coverage of a user query, no matter in what word forms the query has been formulated. Let’s imagine the user query is:

рестораны Москвы

(restaurants of Moscow)

The first word is the plural of ресторан (restaurant) and the second word is the genitive of Москва (Moscow). Let’s run both words through the lemmatizer API:

import info.semanticanalyzer.morph.ru.MorphAnalyzer;
import info.semanticanalyzer.morph.ru.MorphAnalyzerConfig;
import info.semanticanalyzer.morph.ru.MorphAnalyzerLoader;
import info.semanticanalyzer.morph.ru.MorphDesc;
import info.semanticanalyzer.tok.GenericFlexTokenizer;
import info.semanticanalyzer.tok.Token;
import info.semanticanalyzer.tok.Tokenizer;
import info.semanticanalyzer.util.Charsets;
import info.semanticanalyzer.util.IOUtils;

import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class LemmatizerRuTest {

    public void testBlogPostExample() {
        try {
            // Load the analyzer configuration shipped with the SDK.
            File propsFile = new File("conf/lemmatizer-ru.properties");
            Properties properties = new Properties();
            properties.load(new StringReader(IOUtils.readFile(propsFile, Charsets.UTF_8)));
            MorphAnalyzer analyzer = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));

            String phrase = "рестораны Москвы";

            // Tokenize the lower-cased phrase and analyze each token in turn.
            Tokenizer tokenizer = new GenericFlexTokenizer(new StringReader(phrase.toLowerCase()), true);
            Token reusableToken = Token.newReusableToken();
            while ((reusableToken = tokenizer.getNextToken(reusableToken)) != null) {
                String token = reusableToken.getText();
                MorphDesc morphDescription = analyzer.analyzeBest(token);
                if (morphDescription != null && morphDescription.getLemma() != null) {
                    info("Most frequent lemma of '" + token + "' is " + morphDescription.getLemma());
                    info("Its POS tag: " + morphDescription.getPos());
                }
            }
        } catch (IOException e) {
            throw new RuntimeException("testBlogPostExample failed: " + e.getMessage(), e);
        }
    }

    private void info(String msg) {
        System.out.println("INFO " + msg);
    }
}

The code above takes the original user query and tokenizes it with the GenericFlexTokenizer, which suits generic Russian texts and is part of the lemmatizer package. If you are more into mass media processing, there is a TwitterTokenizer at your service. Then, in the while loop, each token is analyzed, and the most frequent lemma and its POS tag are extracted and printed to standard output (the console). The frequency is based on the lemma’s weight, which is encoded in the lemmatizer’s dictionary. If, however, you don’t want just the most frequent lemma, you can list all lemma candidates by calling analyzer.analyze(). The code produces the following output:

INFO Most frequent lemma of 'рестораны' is ресторан
INFO Its POS tag: NOUN
INFO Most frequent lemma of 'москвы' is москва
INFO Its POS tag: NOUN
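
As mentioned above, analyzer.analyze() gives you the full list of lemma candidates instead of just the top reading. We have not shown its exact signature here, so treat the following as a hedged sketch: we assume it returns a java.util.List of MorphDesc objects (check the SDK javadoc for the actual contract). It would slot into the while loop in place of the analyzeBest() call:

// Assumption on our part: analyze() returns a List<MorphDesc> with all candidate readings.
List<MorphDesc> candidates = analyzer.analyze(token);
for (MorphDesc candidate : candidates) {
    info("Candidate lemma: " + candidate.getLemma() + ", POS tag: " + candidate.getPos());
}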

Now, having both base forms, “ресторан” and “москва”, you can search over your documents and find hits like лучший ресторан в Москве (best restaurant in Moscow) or самый уютный ресторан Москвы (the coziest restaurant of Moscow). You could also expand the original words into synonyms and match documents using another condition: the POS tag. This would bring you more hits, constrained by the part of speech of the original user query, which should increase the precision of your search.
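
To make the recall gain concrete, here is a minimal sketch of lemma-level matching, built only from the calls already shown in the example above. The LemmaMatcher class itself is our illustration, not part of the SDK:

import info.semanticanalyzer.morph.ru.MorphAnalyzer;
import info.semanticanalyzer.morph.ru.MorphDesc;
import info.semanticanalyzer.tok.GenericFlexTokenizer;
import info.semanticanalyzer.tok.Token;
import info.semanticanalyzer.tok.Tokenizer;

import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

public class LemmaMatcher {

    private final MorphAnalyzer analyzer;

    public LemmaMatcher(MorphAnalyzer analyzer) {
        this.analyzer = analyzer;
    }

    /** Reduces a text to the set of its most frequent lemmas. */
    Set<String> lemmaSet(String text) throws IOException {
        Set<String> lemmas = new HashSet<>();
        Tokenizer tokenizer = new GenericFlexTokenizer(new StringReader(text.toLowerCase()), true);
        Token token = Token.newReusableToken();
        while ((token = tokenizer.getNextToken(token)) != null) {
            MorphDesc desc = analyzer.analyzeBest(token.getText());
            if (desc != null && desc.getLemma() != null) {
                lemmas.add(desc.getLemma());
            }
        }
        return lemmas;
    }

    /** A document matches if it contains every lemma of the query. */
    boolean matches(String query, String document) throws IOException {
        return lemmaSet(document).containsAll(lemmaSet(query));
    }
}

With this in place, new LemmaMatcher(analyzer).matches("рестораны Москвы", "самый уютный ресторан Москвы") returns true, even though the surface form “рестораны” never appears in the document.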