Tag Archives: java

Sample code for grouping articles into themes

In this post we share Java code snippets for loading data into our Topic API. The Topic API groups your articles, posts, tweets or other documents into topical themes and also lets you search their content.

Search is one of the most common ways to navigate oceans of textual data and extract useful structure from your content lakes. But once you have found thousands and thousands of matches, you still face data overload. Topical grouping can help.

All the code in this post is available in our public GitHub repository. We assume that your article content is stored in a MySQL database. Using MyBatis, we load the articles with the mapper at https://github.com/semanticanalyzer/nlproc_sdk_sample_code/blob/master/src/main/resources/mappers/ArticleEntryMapper.xml.

The TopicLoader class takes care of it all: loading articles from the DB, forming a JSON request to the Topic API and uploading the relevant fields.

This main class takes a single command-line argument, resource_id, which corresponds to a field of the articles DB table referenced in https://github.com/semanticanalyzer/nlproc_sdk_sample_code/blob/master/src/main/resources/mappers/ArticleEntryMapper.xml.

The uploadDBEntries method loads the article entries from the DB and uploads them to the Topic API one by one. Note that if your content items are small (tweet-sized), you can upload several posts in a single request (up to 50 at a time).
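The batching mentioned above could be sketched roughly as follows. This is only an illustration: the ArticleBatcher class and its method names are hypothetical and not part of the sample repository, and the only fact taken from the post is the limit of 50 posts per request.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the sample repository.
class ArticleBatcher {

    // Upper bound on posts per request, as stated in the post.
    static final int MAX_BATCH_SIZE = 50;

    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the sublist so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    /** Joins already-serialized JSON objects into a single JSON array body. */
    static String toJsonArray(List<String> jsonObjects) {
        return "[" + String.join(",", jsonObjects) + "]";
    }
}
```

Each batch would then be serialized element by element (for example with gson.toJson) and the resulting strings joined with toJsonArray to form one request body.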

To upload an article to the Topic API we use the following code:

    private void uploadArticleToTopicAPI(ArticleEntry x) {
        GsonBuilder builder = new GsonBuilder();
        Gson gson = builder.create();

        try {
            // Wrap the single serialized article in a JSON array.
            String body = "[" + gson.toJson(x) + "]";

            System.out.println("Sending body for id: " + x.getId());

            HttpResponse<JsonNode> response = Unirest.post("https://dmitrykey-insiderapi-v1.p.mashape.com/articles/uploadJson")
                    .header("X-Mashape-Key", mashapeKey)
                    .header("Content-Type", "application/json")
                    .header("Accept", "application/json")
                    .body(body)
                    .asJson();

            System.out.println("TopicAPI response:" + response.getBody().toString());
        } catch (UnirestException e) {
            // Pass the exception as the last argument so the logger records the stack trace.
            log.error("Error uploading article", e);
            System.err.println("Error: " + e.getMessage());
        }
    }

Remember that you need to obtain the mashapeKey by subscribing to the API. Once you are subscribed, the documentation shows the key already pre-inserted: https://market.mashape.com/dmitrykey/topicapi.

After the upload is complete, the articles end up in a search engine on the backend of the Topic API. You can then start sending search requests and getting back themes. In the example below I uploaded about 10,000 Russian texts and got the following topics:

Что Говорят # What they say
Большие деньги # Big money
Беларуский Бренд # Belarusian brand
Для Ребенка # For a kid
Как Выглядит работа # What a job looks like
Как Заработать # How to earn money
В Минском Масс-маркете # In a grocery store of Minsk
Женщины # Women

Annotating sentiment with RussianSentimentAnalyzer API in Java


In this post we show how easy it is to start using the RussianSentimentAnalyzer API on Mashape from your Java code.

package com.semanticanalyzer;

import com.mashape.unirest.http.HttpResponse;
import com.mashape.unirest.http.JsonNode;
import com.mashape.unirest.http.Unirest;
import com.mashape.unirest.http.exceptions.UnirestException;

public class RussianSentimentAnalyzerMashapeClient {

    private final static String mashapeKey = "[PUT_YOUR_MASHAPE_KEY_HERE]";

    public static void main(String[] args) throws UnirestException {

        String textToAnnotate = "'ВТБ кстати неплохой банк)'";
        String targetObject = "'ВТБ'";

        // These code snippets use an open-source library. http://unirest.io/java
        HttpResponse response = Unirest.post("https://russiansentimentanalyzer.p.mashape.com/rsa/sentiment/polarity/json/")
                .header("X-Mashape-Key", mashapeKey)
                .header("Content-Type", "application/json")
                .header("Accept", "application/json")
                .body("{'text':" + textToAnnotate + ",'object_keywords':" + targetObject + ",'output_format':'json'}")

        System.out.println("Input text = " + textToAnnotate + "\n" + "Target object:" + targetObject);
        System.out.println("RussianSentimentAnalyzer response:" + response.getBody().toString());

In the code snippet above we used Mashape’s Unirest library, which makes HTTP requests in Java very easy.
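The snippet above builds the request body by plain string concatenation, which would break as soon as the text itself contains a quote character. A minimal sketch of a safer builder follows. SentimentRequestBody is a hypothetical helper, and it emits standard double-quoted JSON rather than the single-quoted form used above; that the API accepts this form is an assumption.

```java
// Hypothetical helper; assumes the API accepts standard double-quoted JSON.
class SentimentRequestBody {

    /** Escapes the characters that must not appear raw inside a JSON string. */
    static String escapeJson(String value) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters are written as \u00XX.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Builds the request body with the three fields used in the post. */
    static String build(String text, String objectKeywords) {
        return "{\"text\":\"" + escapeJson(text) + "\","
                + "\"object_keywords\":\"" + escapeJson(objectKeywords) + "\","
                + "\"output_format\":\"json\"}";
    }
}
```

With such a helper, the text and the target object would be passed without the surrounding single quotes, and the result would go straight into the .body(...) call.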

All you really need to do is register at mashape.com, subscribe to the RussianSentimentAnalyzer API and insert your unique Mashape key into the code in place of “[PUT_YOUR_MASHAPE_KEY_HERE]”, as the value of the mashapeKey variable.
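If you would rather not hard-code the key, one option is to read it from the environment and fail fast when it is missing. The sketch below is an assumption-laden illustration: the MashapeKeyResolver class and the MASHAPE_KEY variable name are made up for this example.

```java
// Hypothetical helper; the MASHAPE_KEY variable name is an assumption.
class MashapeKeyResolver {

    // The placeholder used in the sample code above.
    static final String PLACEHOLDER = "[PUT_YOUR_MASHAPE_KEY_HERE]";

    /** Prefers envValue over fallback; rejects blank values and the untouched placeholder. */
    static String resolve(String envValue, String fallback) {
        String key = (envValue != null && !envValue.trim().isEmpty()) ? envValue.trim() : fallback;
        if (key == null || key.trim().isEmpty() || PLACEHOLDER.equals(key)) {
            throw new IllegalStateException("Mashape key is not configured");
        }
        return key;
    }

    /** Reads the key from the (assumed) MASHAPE_KEY environment variable. */
    static String fromEnvironment() {
        return resolve(System.getenv("MASHAPE_KEY"), PLACEHOLDER);
    }
}
```

This way a forgotten placeholder shows up as a clear error before any request is sent.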

If everything is set up correctly, run the code and you should see the following output:

Input text = 'ВТБ кстати неплохой банк)'
Target object:'ВТБ'
RussianSentimentAnalyzer response:{"sentiment":"POSITIVE","synonyms":"[ВТБ]"}
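To act on the result rather than just print it, you need the value of the "sentiment" field. With Unirest this should be available via response.getBody().getObject().getString("sentiment"). The self-contained sketch below does the same for a raw response string using only the standard library; SentimentResponse is a hypothetical helper, and the regular expression is only adequate for a flat response like the one shown above, not a replacement for a JSON parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for flat responses such as the one shown above.
class SentimentResponse {

    // Matches "sentiment":"<value>" and captures the value.
    private static final Pattern SENTIMENT =
            Pattern.compile("\"sentiment\"\\s*:\\s*\"([^\"]*)\"");

    /** Returns the value of the "sentiment" field, or null if it is absent. */
    static String extractSentiment(String json) {
        Matcher m = SENTIMENT.matcher(json);
        return m.find() ? m.group(1) : null;
    }
}
```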

Now you can easily hook the API up into your cool Java app and annotate texts in Russian for sentiment!

You’ll find the code on our GitHub here: https://github.com/semanticanalyzer/nlproc_sdk_sample_code

Keep calm and use an API

RussianSentimentAnalyzer: API on Mashape

sentiment analysis

In the previous post we announced the API of the sentiment analyzer for the Russian language. The API is in an invitation-only testing phase. To get an invitation, register at https://www.mashape.com and send your user id by email to info[at]semanticanalyzer.info.

The API comes with documentation as well as integration examples in Java, Node, PHP, Python, Objective-C, Ruby and .NET.

RussianSentimentAnalyzer API

Lemmatizer / Stemmer for Russian: how to use in your code

This post guides you through using the Lemmatizer library for the Russian language, which can be ordered by sending a request to [email protected]

First off, what is a lemmatizer? When you have lots of data in a morphologically rich language (that is, a natural language with a lot of variation per word, expressed through word endings and prefixes), you usually want to find the base form of a word, also called the lemma (hence, lemmatizer). Along with that, you try to resolve the part of speech (POS), i.e. whether the word is a noun, verb, adjective or something else. Once you know both the base form and the POS tag, you can store them in your databases for further processing.

Let’s say your system is a search engine over texts in Russian. To increase the recall of your search engine, you would like to maximize the document coverage of a user query, no matter which word forms the query was formulated in. Let’s imagine the user query is:

рестораны Москвы

(restaurants of Moscow)

The first word is the plural of ресторан (restaurant) and the second word is the genitive of Москва (Moscow). Let’s run both words through the lemmatizer API:

import info.semanticanalyzer.morph.ru.MorphAnalyzer;
import info.semanticanalyzer.morph.ru.MorphAnalyzerConfig;
import info.semanticanalyzer.morph.ru.MorphAnalyzerLoader;
import info.semanticanalyzer.morph.ru.PartOfSpeech;
import info.semanticanalyzer.tok.GenericFlexTokenizer;
import info.semanticanalyzer.tok.Token;
import info.semanticanalyzer.tok.Tokenizer;
import info.semanticanalyzer.util.Charsets;
import info.semanticanalyzer.util.IOUtils;
import info.semanticanalyzer.morph.ru.MorphDesc;

import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.util.List;
import java.util.Properties;

public class LemmatizerRuTest {

    public void testBlogPostExample() {
        String phrase = "рестораны Москвы";

        try {
            // Load the lemmatizer configuration and build the analyzer.
            File propFile = new File("conf/lemmatizer-ru.properties");
            Properties properties = new Properties();
            properties.load(new StringReader(IOUtils.readFile(propFile, Charsets.UTF_8)));
            MorphAnalyzer analyzer = MorphAnalyzerLoader.load(new MorphAnalyzerConfig(properties));

            Tokenizer tokenizer = new GenericFlexTokenizer(new StringReader(phrase.toLowerCase()), true);
            Token reusableToken = Token.newReusableToken();
            while ((reusableToken = tokenizer.getNextToken(reusableToken)) != null) {
                String token = reusableToken.getText();
                // Ask for the most frequent analysis of the token.
                MorphDesc morphDescription = analyzer.analyzeBest(token);
                if (morphDescription != null && morphDescription.getLemma() != null) {
                    info("Most frequent lemma of '" + token + "' is " + morphDescription.getLemma());
                    info("Its POS tag: " + morphDescription.getPos());
                }
            }
        } catch (IOException e) {
            // Keep the original exception as the cause.
            throw new RuntimeException("testBlogPostExample failed: " + e.getMessage(), e);
        }
    }

    private void info(String msg) {
        System.out.println("INFO " + msg);
    }
}

The code above takes the original user query and tokenizes it using the GenericFlexTokenizer, which suits generic Russian texts and is part of the lemmatizer package. If you are more into mass media processing, there is a TwitterTokenizer at your service. Then, in the while loop, each token is analyzed, and its most frequent lemma and POS tag are extracted and printed to standard output (the console). The frequency is based on the lemma’s weight, which is encoded in the lemmatizer’s dictionary. If, however, you want more than the most frequent lemma, you can list all lemma candidates by calling the analyzer.analyze() method. The code produces the following output (POS tag lines omitted):

INFO Most frequent lemma of 'рестораны' is ресторан
INFO Most frequent lemma of 'москвы' is москва

Now, having both base forms, “ресторан” and “москва”, you can search over your documents and find hits like: лучший ресторан в Москве (best restaurant in Moscow), самый уютный ресторан Москвы (the coziest restaurant of Moscow). You could also expand the original words into synonyms and match documents using an additional condition: the POS tag. This would bring you more hits, but constrained to the parts of speech of the original user query, which should increase the precision of your search.
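To make the recall argument concrete, here is a toy sketch of lemma-based matching. It does not use the lemmatizer library: the LemmaSearchSketch class and its tiny hand-written table of word forms are hypothetical stand-ins, just enough to show that the query and the documents meet on their base forms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy illustration; the lemma table is a hand-written stand-in for the lemmatizer.
class LemmaSearchSketch {

    private static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("рестораны", "ресторан");
        LEMMAS.put("ресторан", "ресторан");
        LEMMAS.put("москвы", "москва");
        LEMMAS.put("москве", "москва");
    }

    /** Lowercases, splits on whitespace and maps each token to its lemma (or itself if unknown). */
    static Set<String> lemmatize(String text) {
        Set<String> lemmas = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            String lemma = LEMMAS.get(token);
            lemmas.add(lemma != null ? lemma : token);
        }
        return lemmas;
    }

    /** Returns the documents whose lemma set contains every lemma of the query. */
    static List<String> search(String query, List<String> documents) {
        Set<String> queryLemmas = lemmatize(query);
        List<String> hits = new ArrayList<>();
        for (String doc : documents) {
            if (lemmatize(doc).containsAll(queryLemmas)) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

Searching for рестораны Москвы over the two example documents above matches both of them, although neither contains the exact word form рестораны. In a real system the table lookup would be replaced by calls to the lemmatizer.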