Category Archives: Information retrieval

Sample code for grouping articles into themes

In this post we would like to share Java code snippets that load data into our Topic API. The idea of the Topic API is that it groups your articles / posts / tweets / documents into topical themes and also lets you search the content.

Search is one of the most common ways to navigate oceans of textual data and extract useful structure from your content lakes. But once you have found thousands and thousands of matches, you still face the problem of data overload. Topical grouping can help.

You can find all code from this post in our public GitHub repository. We further assume that your article content is stored in a MySQL database. Using MyBatis, we load the articles with https://github.com/semanticanalyzer/nlproc_sdk_sample_code/blob/master/src/main/resources/mappers/ArticleEntryMapper.xml.

The TopicLoader class takes care of it all: loading articles from the DB, forming a JSON request to the Topic API and uploading the relevant fields.

This main class takes a single command-line argument, resource_id, which maps onto a field of the articles DB table in https://github.com/semanticanalyzer/nlproc_sdk_sample_code/blob/master/src/main/resources/mappers/ArticleEntryMapper.xml.
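The argument handling can be sketched as follows. TopicLoaderCli and parseResourceId are illustrative names for this post, not classes from the sample repository:

```java
public class TopicLoaderCli {
    // Parse the single resource_id argument; it matches the resource_id
    // column of the articles table (see ArticleEntryMapper.xml).
    static int parseResourceId(String[] args) {
        if (args.length != 1) {
            throw new IllegalArgumentException("Usage: TopicLoader <resource_id>");
        }
        return Integer.parseInt(args[0]);
    }

    public static void main(String[] args) {
        int resourceId = parseResourceId(args);
        // A loader would then fetch and upload the matching articles,
        // e.g. new TopicLoader(resourceId).uploadDBEntries();
        System.out.println("resource_id=" + resourceId);
    }
}
```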

The method uploadDBEntries loads the article entries from the DB and uploads them to the Topic API one by one. Note that if your content is small in size (tweet-sized), you can upload several posts in a single request (up to 50 at a time).
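Batching the uploads can be sketched like this; the helper below assumes each article has already been serialized to a JSON object string, and BatchUploader / toBatchedBodies are names invented for this sketch:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchUploader {
    // The Topic API accepts up to 50 texts per request
    static final int MAX_BATCH = 50;

    // Split pre-serialized article JSON objects into batches of at most
    // MAX_BATCH, joining each batch into one JSON array body ready to POST.
    static List<String> toBatchedBodies(List<String> articleJsons) {
        List<String> bodies = new ArrayList<>();
        for (int i = 0; i < articleJsons.size(); i += MAX_BATCH) {
            List<String> batch =
                    articleJsons.subList(i, Math.min(i + MAX_BATCH, articleJsons.size()));
            bodies.add("[" + String.join(",", batch) + "]");
        }
        return bodies;
    }
}
```

Each returned body can then be sent with the same Unirest call shown below.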

To upload an article to the Topic API we use the following code:

 
    private void uploadArticleToTopicAPI(ArticleEntry x) {
        GsonBuilder builder = new GsonBuilder();
        Gson gson = builder.create();

        try {
            // The Topic API expects a JSON array, so wrap the single article
            String body = "[" + gson.toJson(x) + "]";

            System.out.println("Sending body for id: " + x.getId());

            HttpResponse<JsonNode> response = Unirest.post("https://dmitrykey-insiderapi-v1.p.mashape.com/articles/uploadJson")
                    .header("X-Mashape-Key", mashapeKey)
                    .header("Content-Type", "application/json")
                    .header("Accept", "application/json")
                    .body(body)
                    .asJson();

            System.out.println("TopicAPI response: " + response.getBody().toString());
        } catch (UnirestException e) {
            // Pass the exception as the last argument so the stack trace is logged
            log.error("Error uploading article id " + x.getId(), e);
            System.err.println("Error: " + e.getMessage());
        }
    }

Remember that you need to obtain the mashapeKey by subscribing to the API; in the documentation you will find the key already pre-inserted: https://market.mashape.com/dmitrykey/topicapi.

After the upload is complete, the articles end up in the search engine behind the Topic API. You can start issuing search requests and getting back nice themes. In the example below I have uploaded about 10,000 Russian texts and obtained these topics:

Что Говорят # What they say
Большие деньги # Big money
Беларуский Бренд # Belarusian brand
Для Ребенка # For a kid
Как Выглядит работа # What a job looks like
Как Заработать # How to earn money
В Минском Масс-маркете # In the Minsk mass market
Женщины # Women
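A search request against the uploaded articles could look like the sketch below, built with the JDK's own java.net.http client. Note that the /articles/search path and the q parameter are assumptions for illustration; check the Topic API documentation for the actual search endpoint:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class TopicSearch {
    // Build a GET request for a keyword search. The endpoint path and
    // query parameter name are assumptions, not taken from the API docs.
    static HttpRequest buildSearchRequest(String mashapeKey, String query) {
        String url = "https://dmitrykey-insiderapi-v1.p.mashape.com/articles/search?q="
                + URLEncoder.encode(query, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(url))
                .header("X-Mashape-Key", mashapeKey)
                .header("Accept", "application/json")
                .GET()
                .build();
    }
}
```

The request can then be sent with HttpClient.newHttpClient().send(...), or with Unirest as in the upload example.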

Text analytics APIs: simplified pricing

We focus a lot on unifying access to our text analytics APIs. One such area is pricing. We obviously want more users to have access to our systems at meaningful prices.

Over the course of the last month we have unified and decreased the prices for all our APIs. Here are the changes:

RSA API (entity level sentiment detection for Russian):
The overage fee for the Basic plan is USD $0.02 (was: USD $0.05). This matches the overage fee on all other plans.
The PRO plan is now USD $99 instead of $299.
The ULTRA plan is now USD $199 instead of $350.

FUXI API (sentiment detection for Chinese):
We changed our Basic subscription plan to allow for 15,000 texts a month for just $10, instead of 500 texts a day.
The PRO plan allows you to process 100,000 texts a month for $99.

Topic API (searchable topics for texts in Russian):
The Basic plan allows for sending 1,000 messages for $19. Remember that one message can contain up to 50 texts, so if you were only uploading texts you could upload 50,000 of them.

The following APIs continue to be FREE:

ConnectedWords (find semantically similar English words to the ones given)
SemanticCloud (frequency word clouds for Russian along with lemmas)

Our team is always listening to you, our users. Let us know what APIs you would like us to add, what features you would like in the existing APIs, and what volumes you would like to handle.

Enjoy the journey of extracting signal from your textual lakes!

NEW API: ConnectedWords

Hello and Happy New Year!

New Year, new API. We have launched a new API called ConnectedWords. We trained a neural network using the word2vec approach on a corpus of English texts. As input you supply an array of keywords, and you get back a list of connected, or related, words.


Here is an example:

For the word "launch" the API produces the following connected words:

[
"launched 0.5948931514907372",
"ariane 0.5640206606244647",
"icbm 0.532163213444619",
"canaveral 0.5222400316699805",
"rocket 0.5168188279637889",
"launcher 0.5066764146199603",
"suborbital 0.4987842348018603",
"landing 0.49743730683360354",
"expendable 0.49456818497947097",
"agena 0.49325088465809586",
"orbiter 0.4930563861239534",
"shuttle 0.48127536803463045",
"unmanned 0.47977178154360445",
"launches 0.47013505662020805",
"sputnik 0.4690193780888272",
"bomarc 0.46608954818339043",
"mission 0.4622460565342408",
"redstone 0.4509777243147255",
"gliders 0.4493604525398496",
"missile 0.4388378398880377",
"abort 0.4322835796211848",
"rockets 0.4255249811253634",
"lgm 0.42401975940492775",
"launching 0.42055305756491634",
"spacecraft 0.42044358977136653",
"warhead 0.4203600640856848",
"manned 0.4196165464952628",
"skylab 0.417352627778655",
"spaceflight 0.41261142646271765",
"payloads 0.41167406251520333",
"operational 0.41030200304930986",
"refueling 0.41015588246409607",
"orbit 0.4054650313323691",
"extravehicular 0.4040691414909361",
"icbms 0.4037563327101452",
"hotol 0.4027989227897706",
"sts 0.400049473907643",
"saturn 0.399919637824496",
"payload 0.398525218766963",
"bm 0.3965859062493564"
]
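Each entry in the response array is a plain "word score" string, where the score is the similarity to the input word. A small helper (an illustrative sketch, not part of any official client) can split the entries and keep only the strong matches:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ConnectedWordsParser {
    // Split each "word score" entry into the word and its similarity,
    // keeping only entries at or above minScore, in response order.
    static Map<String, Double> parse(String[] entries, double minScore) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (String entry : entries) {
            int sep = entry.lastIndexOf(' ');
            String word = entry.substring(0, sep);
            double score = Double.parseDouble(entry.substring(sep + 1));
            if (score >= minScore) {
                result.put(word, score);
            }
        }
        return result;
    }
}
```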

How can one use the API?

1. Make your search engine smarter: expand the result set to documents containing related words. This helps you solve the problem of zero-hit searches.

2. Spice up your writing. Are you a journalist, blogger or student who would like to add some flavour to your text? Send in a few words and get back a set of words that might help make your texts more interesting and engaging.
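The first use case above can be sketched as simple query expansion: OR the related words into the original query, down-weighted so exact matches still rank first. The Lucene-style boost syntax here is one possible target; adapt it to your search engine:

```java
import java.util.List;

public class QueryExpander {
    // Append each related word as an OR clause with a boost factor,
    // producing e.g. "launch OR rocket^0.5 OR shuttle^0.5".
    static String expand(String original, List<String> relatedWords, double boost) {
        StringBuilder sb = new StringBuilder(original);
        for (String w : relatedWords) {
            sb.append(" OR ").append(w).append("^").append(boost);
        }
        return sb.toString();
    }
}
```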

In the future we would like to add support for other languages and to train on different types of texts, such as social media, news and blogs. If you have more ideas on how to make the system more useful for your needs, get in touch!


Happy and Prosperous New Year 2017!

Insider wishes our users and fans a very Happy and Prosperous New Year 2017!

And remember, Insider is there to help you with your limitless natural language processing needs with our text analytics APIs!

Like us on Facebook to stay informed about the API landscape and our offerings!

Happy New Year! 

Research project on traditional and social media

Last month Insider contributed to a joint research project with two other companies: ContextMedia (with 20+ years of traditional media analytics) and YouControl (with access to government data). The goal of the research was to build a biographical and semantic portrait of the Ukrainian politician Dmytro Svyatash in light of the law on car imports in Ukraine. The interactive research results can be found here (in Russian).

Insider used two of its own tools for unstructured text analytics: the Insider API for real-time semantic topic creation (screenshots and a description of the system are here) and the RSA API for entity-level sentiment analysis.

The resulting system, prototyped in under a week, allowed for:

  1. Navigating through years of data, from 2002 to the current moment, using keyword searches.
  2. Understanding the sentiment distribution in the found corpora and for a given search.
  3. Researching quantitative search trends using a visual trend chart.
  4. Sifting through the produced semantic topics, which group related news items together in the search results.
  5. Getting the heartbeat of Twitter.

[Screenshot: Insider UI]

In the process we relied on the best open-source tools, including Apache Tika, which allowed us to swiftly convert HTML news articles into JSON format, preserving the important attributes of a news item: title and contents. We additionally crafted and applied our own NER for extracting the publication date, to place each article properly on the time scale.

Want to do a similar research on your own data? Get in touch: [email protected].