Category Archives: Talks / presentations

Finding sentiment in Ruby

Dialogue is the largest conference on computational linguistics in Russia. Historically it has been supported by ABBYY, Yandex and Moscow State University, as well as by the Higher School of Economics and the Moscow Institute of Physics and Technology. This year the conference hosts a sentiment analysis track. In this post we show the training / test format of the tweets and illustrate how they can be analyzed with our RSA API in Ruby.

The code in this post uses a Mashape key token, which can be obtained by registering a user account with Mashape. After registering, sign up for the freemium plan of the RSA API. You will then have a token that uniquely identifies your access to this exact API under this exact subscription plan.


The training and test data provided by the sentiment track organizers look as follows, illustrated with a single tweet: “Отказ от повышения налогов сохранит и даже ускорит рост ВВП РФ – Sberbank CIB” (“Refusing to raise taxes will preserve and even accelerate the growth of Russia's GDP - Sberbank CIB”).

      <table name="bank_train_2016">
            <column name="id">70</column>
            <column name="twitid">492546512652500000</column>
            <column name="date">1406267214</column>
            <column name="text">Отказ от повышения налогов сохранит и даже ускорит рост ВВП РФ - Sberbank CIB</column>
            <column name="sberbank">1</column>
            <column name="vtb">NULL</column>
            <column name="gazprom">NULL</column>
            <column name="alfabank">NULL</column>
            <column name="bankmoskvy">NULL</column>
            <column name="raiffeisen">NULL</column>
            <column name="uralsib">NULL</column>
            <column name="rshb">NULL</column>
      </table>

The task is to analyze for sentiment the entries in XML tags named "text" and to update the target bank name entity with a -1 (NEGATIVE), 0 (NEUTRAL) or 1 (POSITIVE) flag.

The following code reads an XML file given as the first command-line parameter and the type of entities (banks or telecom) as the second, and updates the input file with sentiment values calculated automatically via the RSA API.

require 'rubygems'
require 'nokogiri'
require 'unirest'

if ARGV.length < 2
  puts "Need xml file as input and type of entities: banks or telecom"
  exit 1
end

supported_entities = ['banks', 'telecom']
supported_entities_telecom = ['beeline', 'mts', 'megafon', 'tele2', 'rostelecom', 'komstar', 'skylink']
supported_entities_banks   = ['sberbank', 'vtb', 'gazprom', 'alfabank', 'bankmoskvy', 'raiffeisen', 'uralsib', 'rshb']

entities_type = ARGV[1]
if not supported_entities.include?(entities_type)
  puts "Unsupported entities type requested. Supported ones are: " + supported_entities.to_s
  exit 1
end

if entities_type == 'banks'
  target_entities = supported_entities_banks
elsif entities_type == 'telecom'
  target_entities = supported_entities_telecom
else
  puts "FATAL ERROR: requested unsupported entities type: " + entities_type
  exit 1
end

def get_sentiment(text)
  # These code snippets use an open-source library (unirest).
  # Insert the RSA API endpoint URL from its Mashape page below.
  response ="[INSERT_RSA_API_URL_HERE]",
    headers: {
      "X-Mashape-Key" => "[INSERT_TOKEN_HERE]",
      "Content-Type" => "application/json",
      "Accept" => "text/plain"
    },
    parameters: { :text => text, :object_keywords => "", :output_format => "" }.to_json)

  puts "get_sentiment, text=" + text + " SENTIMENT=" + response.body["sentiment"]

  if response.body['sentiment'] == "POSITIVE"
    return 1
  elsif response.body['sentiment'] == "NEGATIVE"
    return -1
  else
    return 0
  end
end

file_name = ARGV[0]
@doc = Nokogiri::XML(
columns = @doc.xpath("//database/table/column")
sentiment_tag = -2
columns.each { |column|
  if column['name'] == 'text'
    sentiment_tag = get_sentiment(column.content)
  end
  if target_entities.include?(column['name'])
    if sentiment_tag > -2 and column.content != 'NULL'
      puts "updating " + column['name'] + " with " + sentiment_tag.to_s
      column.content = sentiment_tag
      sentiment_tag = -2
    end
  end
}, 'w') { |f| f.write(@doc) }

Recognizing addresses in Russian-language texts


We are pleased to announce the launch of a new text-processing API: the StreetDetector API. The system extracts street names and building numbers from heterogeneous Russian-language texts.

Key features:

  1. Support for Russian morphology.
  2. Recognition of addresses in various forms: Ленинский 22; ул. Льва Толстого, 16.
  3. Extraction of all addresses in a given text:

У Басманного тупика пробка. На проезде Апакова д.5 ремонт дороги. (“There is a traffic jam near Басманный тупик. Road repairs at проезд Апакова, 5.”)

[
  { "buildingNumber": "",  "streetName": "Басманный тупик" },
  { "buildingNumber": "5", "streetName": "Апакова, проезд" }
]
We hope that the StreetDetector API will be useful to developers of all kinds of systems dealing with text (user reviews, official documents, etc.), and that the free trial of 300 messages will be enough to assess the quality of the API.
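For Ruby users, here is a minimal sketch of consuming a StreetDetector response. The endpoint and exact response envelope are not shown in this post, so the JSON below simply mirrors the example output above; treat the assumed shape (a plain array of address objects) as an illustration, not official client code.

```ruby
require 'json'

# Example output of StreetDetector for the text above; the envelope
# (a plain JSON array of address objects) is an assumption based on
# the sample shown in this post.
response_body = <<~JSON
  [
    { "buildingNumber": "",  "streetName": "Басманный тупик" },
    { "buildingNumber": "5", "streetName": "Апакова, проезд" }
  ]
JSON

addresses = JSON.parse(response_body)
# Keep only fully resolved addresses (those with a building number).
resolved = addresses.reject { |a| a['buildingNumber'].empty? }
puts { |a| "#{a['streetName']}, #{a['buildingNumber']}" }
```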

ApacheCon North America 2015 and supporting open source

At SemanticAnalyzer we use a number of open source tools and systems to build solutions for our clients. It was our pleasure to attend the recently held ApacheCon in Austin, Texas. We met a bunch of new open source developers as well as folks behind well-known companies such as Pivotal, Google, A9, Cloudera, SkyMind, LinkedIn, Microsoft, IBM, NASA and RedHat. The list is endless.

The weather was warm and the city welcoming.

Austin downtown

Most of the time was spent at the various engineering and scientific talks covering a vast array of open source & Apache topics. Besides, it was the 20th birthday of the Apache HTTP Server!

Apache HTTP Server 20th birthday

Roman Shaposhnik of Pivotal gave a talk on the Apache Incubator: Where It Is Coming From and Where It Is Going. The Apache Incubator is the place where many big Apache projects appear and graduate and.. some don’t.

Roman Shaposhnik on Apache Incubator

But that’s fine, because we are all after quality of code, the community around it, and adoption by companies, organizations and individuals. The talk focused on how a project gets accepted into the Incubator, how it is assessed during incubation, and what the formal grading / reporting looks like. And project chickens and pigs. Check the slides to see what that means :)

Shane Curcuru gave guidance on the very important topic of How to Keep Your Apache Project’s Independence. I think this topic is equally important for any open source project, would you agree, Shane? Trademarks and branding, keeping your project in shape by actively seeking new contributors for long-standing bugs, dealing with difficult parties (corporations), legal action, talking to outside lawyers and much more in Shane’s very intriguing presentation, which uncovers the inner workings of running an Apache project from the product point of view.

Radek Maciaszek of DataMine Lab talked about streaming data and dealing with it in R and Apache Storm. Streaming data arises in problems like fraud detection, online advertising and network traffic in general. Usually, when you deal with data, the foremost target is to study the properties of your data and find the outliers, the mass of values and the noise. The interesting part of Radek’s presentation was beta distributions, which are especially useful for analyzing the patterns of streaming data. Prototyping with R and Storm looked rather easy, and the recommended package is available on CRAN (http://cran.r-project.org).

IBM Watson researchers and engineers had reserved a few slots, which our CEO Dmitry Kan moderated. Modeling through concepts, semantics and search is so close to what we do at SemanticAnalyzer that we were especially motivated to attend these presentations. It turned out that early versions of IBM Watson would run for dozens of minutes and sometimes dozens of hours. This was not at all acceptable for the realtime nature of the Jeopardy! game, and so it had to be put to scale. UIMA DUCC was employed for the processing pipeline, since Apache UIMA was already used for semantic data enrichment and reasoning. Here is the live demo of UIMA DUCC:

Andriy Redko shared some really practical fu for embedding search capabilities into your web app. This is especially useful if you’ve built an API and would like to provide search over the processed data. All this can be achieved with no sweat using Apache CXF and Lucene. The demo was impressive: realtime PDF-to-text conversion with Tika, indexing with Lucene and searching in a friendly UI. Check it out.

Chris Mattmann of NASA gave a very dynamic breadth-first talk about various Apache projects that deal with data extraction / analysis. In his talk “If You Have The Content, Then Apache Has The Technology!” Chris walked the audience through the projects Apache has to offer for your content: be it data extraction, data representation (like triples), data mining / machine learning, etc. They were looked at from a very practical viewpoint: how quickly you can build them and jump in to use them for the task at hand. Some projects received a thumbs up from the Apache Tika creator! Some could use help improving. Very practical and useful. And thanks a lot, Chris, for the pull request on luke you sent prior to the conference!

The new rocking open source technology, currently incubating at Apache, is Apache Ignite. Dmitriy Setrakyan of GridGain presented it in a very informative and structured fashion. Apache Ignite is essentially an in-memory data fabric that, according to Dmitriy, runs as fast on virtualized servers (read: AWS) as it does on bare metal. The same technology was also presented in a keynote by Nikita Ivanov, CTO of GridGain.


There was a lot more packed into the 7 (!) parallel session tracks than can possibly be covered here. Go check the talk schedule yourself and enjoy the videos / slides.

The culmination of the 3-day conference came in the form of 5-minute lightning talks hosted by Jim Jagielski and Shane Curcuru, warmed up by great beer. Despite the beer, it was not that _easy_ to stand up in front of a 100+ audience. Dmitry Kan decided to present on luke and announce its elasticsearch support, which got enabled during the conference. Here is the video; we hope you enjoy it. Fast-forward to 6:32 to hear Dmitry’s presentation on luke:

The ApacheCon was an extraordinary event from the technical standpoint, and a very warm, friendly and relaxed one on the human and social networking side. We highly recommend participating in the next ApacheCon Europe in Budapest, September-October 2015!

Automatic sentiment analysis for Russian: the video

In the previous post we wrote about our joint talk with YouScan at AI Ukraine’14 on SentiScan, our automatic sentiment analysis system for Russian. The video of the talk is now available.

Watch, comment, like and be sure to share with your friends:

Need your sentiment annotated automatically? We have just the thing for you! You can test our technology right now by clicking the button:


Our object-level sentiment annotation API supports both small and substantial daily message volumes: from tweets to blog posts and articles.

Integration into your text analytics system is extremely simple, and here is the documentation:
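As a quick illustration, assuming the same response shape as in the Ruby example from the “Finding sentiment in Ruby” post above (a JSON body with a “sentiment” field), mapping an API response to a numeric label takes only a few lines:

```ruby
require 'json'

# Map the API's sentiment string to the -1 / 0 / 1 convention used
# in the Dialogue track example above. The response shape is assumed
# from the Ruby example earlier in this archive.
def sentiment_flag(response_json)
  case JSON.parse(response_json)['sentiment']
  when 'POSITIVE' then 1
  when 'NEGATIVE' then -1
  else 0
  end
end

sentiment_flag('{"sentiment":"POSITIVE"}')  # => 1
```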

SemanticAnalyzer & YouScan at AI Ukraine’14

SemanticAnalyzer took part in AI Ukraine’14, an international conference on artificial intelligence held in Kharkiv.

Conference audience

It was a wonderful opportunity to meet in person specialists in machine learning, recurrent neural networks, computational linguistics, recommender systems and other adjacent areas of science and practice.

Our CEO Dmitry Kan gave a joint talk with Leonid Litvinenko, CTO of YouScan, about SentiScan, the system we developed for monitoring sentiment in social media.

Dmitry Kan, CEO of SemanticAnalyzer (photo: Alexander Panchenko)

Leonid Litvinenko, CTO of YouScan

In the talk we focused on the two main messages we wanted to convey to the audience:

1. Rule-based algorithms for determining sentiment towards a monitored object provide the control needed in production. At the same time, there are workable machine learning methods for extending sentiment lexicons and for running linguistic studies of new data in search of sentiment trends. And at the rule level, for the most difficult cases (see the slides), among other things a multipass (multi-pass algorithm) always does the job. You remember the multipass Leeloo had in The Fifth Element, right?

Leeloo multipass

2. Evaluating the quality of a sentiment analysis algorithm (and of any AI algorithm in general) is no less important than the algorithm itself. We test quality in a wide variety of ways, including the well-known precision & recall as well as less well-known metrics such as moving average precision & moving average recall, plus sampling over the most frequent n-grams. YouScan’s engineers built an evaluation system with a web interface and a gold test set. To find the optimal algorithm configuration we use A / B testing of the parameter sets that affect quality: handling of titles, emoticons, sentiment booster words, etc. The SentiScan system receives a live stream of sentiment label corrections from customers, which drives regular improvements to the algorithm.
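The evaluation code itself is not public, but one plausible reading of “moving average precision” can be sketched in Ruby: precision for a target sentiment label computed over a sliding window of recent predictions. The metric name comes from the talk; the implementation below is purely illustrative.

```ruby
# Illustrative sketch only: precision for one sentiment label over a
# sliding window of (predicted, gold) label pairs. Windows containing
# no predictions of the target label yield nil.
def moving_precision(predicted, gold, label, window)
  (0..predicted.size - window).map do |i|
    pairs = predicted[i, window].zip(gold[i, window])
    tp = pairs.count { |p, g| p == label && g == label }  # true positives
    fp = pairs.count { |p, g| p == label && g != label }  # false positives
    tp + fp > 0 ? tp.to_f / (tp + fp) : nil
  end
end

# Stream of labels: 1 = POSITIVE, 0 = NEUTRAL, -1 = NEGATIVE.
moving_precision([1, 1, 0, 1], [1, 0, 0, 1], 1, 2)  # => [0.5, 0.0, 1.0]
```

Plotting such a series over the incoming correction stream is one simple way to watch for quality regressions over time.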

Dmitry Kan, CEO of SemanticAnalyzer Group

After the talk there were many questions from the audience about the details of the algorithm, as well as about how SentiScan handles specific domains (e.g. banking).

Dmitry talks about applying deep learning to the sentiment analysis task

We are waiting for the video; in the meantime, you can flip through the slides once more:

The conference was an excellent opportunity to talk and exchange experience with everyone interested in computational linguistics and text analytics. The quality of the questions from the audience, at our talk and others, clearly showed how well prepared the attending specialists were.

We hope our talk was useful to the young generation of computational linguists who are trying, or want to try, their hand at building applied natural language analysis systems.

Thanks to the AI Ukraine’14 organizers for a great AI event and atmosphere, and hopefully see you next year!


SemanticAnalyzer is part of YouScan’s success story

YouScan, our partner, won in the Growth nomination (success stories) at WebReady, the largest international web contest and investor forum in Russia, and received an award for the LeadScanner project at the first Startup AddVenture conference in Kiev. Congrats to the team! SemanticAnalyzer is honored to be part of this success!

Articles in Russian:

Training on NLProc and Machine Learning

SemanticAnalyzer just delivered an in-person training on NLProc (arguably a better abbreviation for natural language processing than NLP) and machine learning for OK (a Russian Facebook-like social network owned by Mail.Ru Group) in Saint Petersburg, Russia. OK has a nice office not far from the Petrogradskaya subway station in Saint Petersburg (the central office is in Moscow).

OK office SPb

If you feel like your project is somewhat stuck and needs a fresh look, or you need to widen your knowledge in NLProc and / or machine learning, feel free to contact us at info[@]. At the moment folks at SemanticAnalyzer can do this in Europe / the western part of Russia. At SemanticAnalyzer we also offer a full package of services for natural language processing development, in case the expertise isn’t available in house. This includes project scoping, breaking the work down into technical tasks, time estimation, development, testing / evaluation and delivery.