Category Archives: Talks / Presentations

Industry Day: connecting with language technology students

This month we had the pleasure of participating in Industry Day, a yearly event organized for language technology students at the University of Helsinki.

The purpose of the event is to share an industry perspective on how NLP and machine learning are used in text applications. Other companies presenting included Kielikone, Silo.AI, AlphaSense, Utopia Analytics, Semantix, Sanoma and Lingsoft.

Our CEO Dmitry Kan talked about three major research and development areas at Insider:

  • Multi-lingual entity-level sentiment analysis for Russian, Chinese, English
  • Recommendation engines for social media and public events with thousands of participants
  • Detecting review noise: the peculiar task of automatically discovering negative reviews that carry a full 5-star rating

Array of texts and quality improvements

Hello and Happy New Year!

We are happy to announce three major changes to the RSA API for entity-level sentiment analysis of Russian texts:

You can now send in an array of up to 10 texts. Use the new endpoint: https://russiansentimentanalyzer.p.mashape.com/rsa/sentiment/polarity/jsons/

Example of input with three texts:
 
[
  {
    "text": "Гиперответственный классный исполнитель :)\nОтдельный респект за подхваченное в 22-00 задание!",
    "article_id": 1,
    "include_strength": true
  },
  {
    "text": "быстро доставил,но претензии остались",
    "article_id": 2,
    "include_strength": true
  },
  {
    "text": "погода отличная"
  }
]

The response from the API carries polarity labels tagged with the original article_id values where provided; results for texts without an article_id follow the input order:

  
[
    {
        "sentiment": "POSITIVE",
        "strength": 1,
        "article_id": "1"
    },
    {
        "sentiment": "NEUTRAL",
        "strength": 0,
        "article_id": "2"
    },
    {
        "sentiment": "POSITIVE",
        "strength": 1
    }
]

The other two changes:

  1. We have tuned the quality for both positive and negative sentiment.
  2. We are back to the freemium model, which lets you send 100 texts a month for free.
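
The batch request and response formats shown above could be handled from Ruby roughly as follows. This is a minimal sketch rather than official client code: the helper names `build_batch_payload` and `match_results` are hypothetical, assigning `article_id` from the 1-based position of each text is our own choice, and the sample response is hand-written for illustration. The HTTP call itself is left out.

```ruby
require 'json'

# Limit stated in the announcement above.
MAX_BATCH_SIZE = 10

# Build the JSON body for the batch endpoint from an array of strings.
# article_id is assigned from the 1-based position (an assumption of this
# sketch) so that results can be matched back even if they arrive reordered.
def build_batch_payload(texts, include_strength: true)
  if texts.length > MAX_BATCH_SIZE
    raise ArgumentError, "at most #{MAX_BATCH_SIZE} texts per request"
  end

  items = texts.each_with_index.map do |text, index|
    { 'text' => text, 'article_id' => index + 1, 'include_strength' => include_strength }
  end
  JSON.generate(items)
end

# Pair every result with its input text: by article_id when the result
# carries one (it comes back as a string), otherwise by position.
def match_results(texts, results)
  results.each_with_index.map do |result, position|
    index = result.key?('article_id') ? result['article_id'].to_i - 1 : position
    { 'text' => texts[index], 'sentiment' => result['sentiment'], 'strength' => result['strength'] }
  end
end

# Hand-written sample response in the documented shape, deliberately reordered.
sample_response = '[{"sentiment":"NEUTRAL","strength":0,"article_id":"2"},' \
                  '{"sentiment":"POSITIVE","strength":1,"article_id":"1"}]'
texts = ['погода отличная', 'быстро доставил,но претензии остались']

match_results(texts, JSON.parse(sample_response)).each do |row|
  puts "#{row['sentiment']}: #{row['text']}"
end
```

The string returned by `build_batch_payload` would be sent as the POST body to the endpoint above, with your Mashape key in the request headers.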

Enjoy!

Insider team

AI education: what does the market require?

When you start looking at the field of AI (Artificial Intelligence) as a business leader or software developer, it is easy to get lost at first.

In this online seminar with Carine Simon of MIT (Boston, USA), Borys Pratsyuk of Ciklum, Valeria Zabolotna of UNIT.City (Kiev, Ukraine) and Dmitry Kan of Insider (Helsinki, Finland), you will learn:

  1. What formal AI education programs exist at MIT
  2. What the industry expects of hires for AI roles
  3. How to get started with AI as a practitioner — frameworks, hardware, communities

Seminar host: Misha Feldman

We hope you enjoy the video. Do let us know if it was helpful for you!

Finding sentiment in Ruby

Dialogue is the largest conference on computational linguistics in Russia. Historically, it has been supported by ABBYY, Yandex, Moscow State University, the Higher School of Economics and the Moscow Institute of Physics and Technology. This year the conference includes a sentiment analysis track. In this post we show the training and test format of the tweets and illustrate how they can be analyzed with our RSA API in Ruby.

The code in this post uses a Mashape key token, which you can obtain by registering a user account at http://market.mashape.com/. After registering, sign up for the freemium plan of the RSA API. You will then have a token that uniquely identifies your access to this API under this subscription plan.

 

The training and test data provided by the sentiment track organizers has the following format, illustrated here with a single tweet: “Отказ от повышения налогов сохранит и даже ускорит рост ВВП РФ – Sberbank CIB.”

  
      <table name="bank_train_2016">
            <column name="id">70</column>
            <column name="twitid">492546512652500000</column>
            <column name="date">1406267214</column>
            <column name="text">Отказ от повышения налогов сохранит и даже ускорит рост ВВП РФ - Sberbank CIB</column>
            <column name="sberbank">1</column>
            <column name="vtb">NULL</column>
            <column name="gazprom">NULL</column>
            <column name="alfabank">NULL</column>
            <column name="bankmoskvy">NULL</column>
            <column name="raiffeisen">NULL</column>
            <column name="uralsib">NULL</column>
            <column name="rshb">NULL</column>
        </table>

The task is to analyze the sentiment of each entry in the XML element named “text” and to update the column of the bank the tweet mentions (the one whose value is not NULL) with -1 (NEGATIVE), 0 (NEUTRAL) or 1 (POSITIVE).

The following code reads the XML file named by the first command-line parameter and the entity type (banks or telecom) from the second, then updates the input file with sentiment values calculated by the RSA API.

    
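require 'rubygems'
require 'json'      # provides Hash#to_json, used to build the request body
require 'nokogiri'
require 'unirest'

# Both arguments are mandatory: the XML file and the entity type.
if ARGV.length < 2
    puts "Need xml file as input and type of entities: banks or telecom"
    exit 1
end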


supported_entities = ['banks', 'telecom']
supported_entities_telecom = ['beeline', 'mts', 'megafon', 'tele2', 'rostelecom', 'komstar', 'skylink']
supported_entities_banks   = ['sberbank', 'vtb', 'gazprom', 'alfabank', 'bankmoskvy', 'raiffeisen', 'uralsib', 'rshb']

entities_type = ARGV[1]
unless supported_entities.include?(entities_type)
  puts "Unsupported entities type requested. Supported ones are: " + supported_entities.to_s
  exit 1
end

if entities_type == 'banks'
  target_entities = supported_entities_banks
elsif entities_type == 'telecom'
  target_entities = supported_entities_telecom
else
  puts "FATAL ERROR: request unsupported entities type: " + entities_type
  exit
end


# Sends one text to the RSA API and maps the returned label to the
# track's numeric flags: 1 (POSITIVE), -1 (NEGATIVE), 0 (anything else).
def get_sentiment(text)
  # Unirest is an open-source HTTP client library.
  response = Unirest.post "https://russiansentimentanalyzer.p.mashape.com/rsa/sentiment/polarity/json/",
    headers: {
      "X-Mashape-Key" => "[INSERT_TOKEN_HERE]",
      "Content-Type" => "application/json",
      "Accept" => "text/plain"
    },
    parameters: { :text => text, :object_keywords => "", :output_format => "" }.to_json

  # Guard against a body that did not parse into a Hash (e.g. an error page).
  sentiment = response.body.is_a?(Hash) ? response.body["sentiment"] : nil
  # Interpolation avoids a TypeError when the label is missing.
  puts "get_sentiment, text=#{text} SENTIMENT=#{sentiment}"

  case sentiment
  when "POSITIVE" then 1
  when "NEGATIVE" then -1
  else 0
  end
end


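file_name = ARGV[0]
# The block form of File.open closes the file once parsing is done.
@doc = File.open(file_name) { |f| Nokogiri::XML(f) }
columns = @doc.xpath("//database/table/column")
# -2 is a sentinel meaning "no sentiment computed yet for the current tweet".
sentiment_tag = -2
columns.each do |column|
  # The "text" column precedes the entity columns of the same table.
  if column['name'] == 'text'
    sentiment_tag = get_sentiment(column.content)
  end

  # Only the first non-NULL target entity of each tweet receives the flag.
  if target_entities.include?(column['name'])
    if sentiment_tag > -2 and column.content != 'NULL'
      puts "updating " + column['name'] + " with " + sentiment_tag.to_s
      column.content = sentiment_tag.to_s
      sentiment_tag = -2
    end
  end
end

# Overwrite the input file with the updated XML.
File.open(file_name, 'w') { |f| f.write(@doc.to_xml) }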
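
The update rule at the heart of the script can also be isolated from the HTTP and XML handling so that it can be checked offline. The following is a minimal sketch, assuming each table has already been read into a Hash of column name to content. The helper names `label_to_flag` and `tag_row` are hypothetical, and unlike the script above this variant writes the flag into every non-NULL target column rather than only the first one.

```ruby
# Numeric flags used in the track data; any label not listed maps to 0.
LABEL_TO_FLAG = { 'POSITIVE' => 1, 'NEGATIVE' => -1 }.freeze

# Convert an RSA API label into the numeric flag.
def label_to_flag(label)
  LABEL_TO_FLAG.fetch(label, 0)
end

# Return a copy of the row in which every target entity column that is
# not 'NULL' carries the given flag; all other columns stay untouched.
def tag_row(row, target_entities, flag)
  row.each_with_object({}) do |(name, value), tagged|
    mentioned = target_entities.include?(name) && value != 'NULL'
    tagged[name] = mentioned ? flag.to_s : value
  end
end

# One table from the training data above, reduced to three columns.
row = {
  'text' => 'Отказ от повышения налогов сохранит и даже ускорит рост ВВП РФ - Sberbank CIB',
  'sberbank' => '1',
  'vtb' => 'NULL'
}

tagged = tag_row(row, %w[sberbank vtb], label_to_flag('NEGATIVE'))
puts tagged['sberbank']
puts tagged['vtb']
```

Keeping the rule in a pure function like this makes it possible to try different policies for tweets that mention several entities without spending API calls.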

Recognizing addresses in Russian-language texts

We are happy to announce the launch of a new text processing API: StreetDetector API. The system extracts streets and building numbers from heterogeneous texts in Russian.

Key features:

  1. Support for Russian morphology.
  2. Recognition of addresses in different variations: Ленинский 22; ул. Льва Толстого, 16
  3. Extraction of all addresses in a given text:

У Басманного тупика пробка. На проезде Апакова д.5 ремонт дороги.

[
  {
    "buildingNumber": "",
    "streetName": "Басманный тупик"
  },
  {
    "buildingNumber": "5",
    "streetName": "Апакова, проезд"
  }
]

We hope StreetDetector API will be useful to developers of all kinds of systems that deal with texts (user reviews, official documents and so on), and that the free trial of 300 messages will be enough to assess the quality of the API.
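
A response in the shape shown above can be turned into printable address strings with a few lines of Ruby. This is only a sketch of client-side post-processing: the helper name `format_addresses` is hypothetical, the call to the API itself is not shown, and the only thing assumed about the response is the two fields that appear in the example.

```ruby
require 'json'

# Convert a StreetDetector-style JSON response into display strings,
# appending the building number only when one was detected.
def format_addresses(response_json)
  JSON.parse(response_json).map do |address|
    street = address['streetName']
    number = address['buildingNumber'].to_s
    number.empty? ? street : "#{street}, #{number}"
  end
end

# The response from the example above, copied as a string.
response_json = <<~JSON
  [
    { "buildingNumber": "", "streetName": "Басманный тупик" },
    { "buildingNumber": "5", "streetName": "Апакова, проезд" }
  ]
JSON

puts format_addresses(response_json)
```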

ApacheCon North America 2015 and supporting open source

At SemanticAnalyzer we use a number of open source tools and systems to build solutions for our clients. It was our pleasure to attend the recent ApacheCon in Austin, Texas. We met a bunch of new open source developers as well as folks from well-known companies and organizations such as Pivotal, Google, A9, Cloudera, SkyMind, LinkedIn, Microsoft, IBM, NASA and RedHat. The list is endless.

The weather was warm and the city welcoming.

Austin downtown

Most of the time went to engineering and scientific talks on a vast array of open source and Apache topics. Besides, it was the 20th birthday of the Apache HTTP Server!

Apache HTTP Server 20th birthday

Roman Shaposhnik of Pivotal gave a talk on the Apache Incubator: Where It Is Coming From and Where It Is Going. The Apache Incubator is the place where many big Apache projects appear and graduate, and some don’t.

Roman Shaposhnik on Apache Incubator

But that’s fine, because we are all after the quality of the code, the community around it and adoption by companies, organizations and individuals. The talk focused on how a project gets accepted to the Incubator, how it is assessed during incubation and what formal grading and reporting is involved. And project chickens and pigs. Check the slides to see what that means :)

Shane Curcuru gave guidance on the very important topic of How to Keep Your Apache Project’s Independence. I think this topic is equally important for any open source project, would you agree, Shane? Trademarks and branding, keeping your project in shape by actively seeking new contributors for long-standing bugs, dealing with difficult parties (corporations), legal action, talking to outside lawyers and much more featured in Shane’s intriguing presentation, which revealed the inner workings of running an Apache project from the product point of view.

Radek Maciaszek of DataMine Lab talked about streaming data and dealing with it in R and Apache Storm. Streaming data arises in problems like fraud detection, online advertising and network traffic generally. Usually, when you deal with data, the first priority is to study its properties: find outliers, the mass of values and noise. The interesting part of Radek’s presentation was beta distributions, which are especially useful for analyzing patterns in streaming data. Prototyping with R and Storm looked rather easy, and the recommended package to use is: http://cran.r-project.org/web/packages/Storm.

IBM Watson researchers and engineers had a few slots reserved, which our CEO Dmitry Kan moderated. Modeling through concepts, semantics and search is so close to what we do at SemanticAnalyzer that we were especially motivated to attend these presentations. It turned out that early versions of IBM Watson would run for dozens of minutes and sometimes dozens of hours. This was not at all acceptable for the real-time nature of the Jeopardy game, so it had to be scaled out. UIMA DUCC was employed for the processing pipeline, since Apache UIMA was already in use for semantic enrichment of data and reasoning. Here is the live demo of UIMA DUCC: http://uima-ducc-demo.apache.org:42133/jobs.jsp

Andriy Redko shared really practical know-how for embedding search capabilities into your web app. This is especially useful if you have built an API and would like to provide search over the processed data. All this can be achieved with no sweat using Apache CXF and Lucene. The demo was impressive, with real-time PDF-to-text conversion with Tika, indexing with Lucene and searching in a friendly UI. Check it out.

Chris Mattmann of NASA gave a very dynamic breadth-first talk about various Apache projects that deal with data extraction and analysis. In his talk “If You Have The Content, Then Apache Has The Technology!” Chris walked the audience through the projects Apache has to offer for your content, be it data extraction, data representation (like triples), data mining or machine learning. They were examined from a very practical viewpoint: how quickly can you build them and start using them for the task at hand? Some projects received a thumbs up from the Apache Tika creator! Some could use help with improvements. Very practical and useful. And thanks a lot, Chris, for the pull request on luke that you sent prior to the conference!

The new rocking open source technology, currently incubating at Apache, is Apache Ignite. Dmitriy Setrakyan of GridGain presented it in a very informative and structured fashion. Apache Ignite is essentially an in-memory data fabric that, according to Dmitriy, runs as fast on virtualized servers (read AWS) as it does on bare metal. The same technology was also presented in a keynote by Nikita Ivanov, CTO at GridGain.

There was a lot more packed into 7 (!) parallel session tracks than can possibly be covered here. Go check the talk schedule yourself and enjoy the videos and slides.

The culmination of the 3-day conference came in the form of 5-minute lightning talks hosted by Jim Jagielski and Shane Curcuru, warmed up by great beer. Despite the beer, it was not that easy to stand up in front of an audience of 100+. Dmitry Kan decided to present on luke and announce its Elasticsearch support, which was enabled during the conference. Here is the video, we hope you enjoy it. Fast-forward to 6:32 to listen to Dmitry’s presentation on luke:

ApacheCon was an extraordinary event from the technical standpoint, and a very warm, friendly and relaxed one on the human and social networking side. We highly recommend participating in the next ApacheCon Europe in Budapest in September-October 2015!

Automatic sentiment analysis for Russian: video

In the previous post we wrote about our joint talk with YouScan at AI Ukraine’14 on SentiScan, a system for automatic sentiment analysis of Russian. The video of the talk has now arrived.

Watch, comment, like and be sure to share with friends:

Need to label sentiment automatically? We can help with that! You can test our technology right now by clicking the button:

Mashape

Our entity-level sentiment API handles both small and substantial daily volumes of messages, from tweets to blog posts and articles.

Integration into your text analytics system is very simple, and here is the documentation: