In today’s post, we will introduce our MTA (Multicriterial Text Analysis) software. The MTA product significantly helps users with decisions in the area of shopping for various products and services.
The product aims to help users get their head around the large amounts of opinions published on the internet on specific goods or services which they would like to buy or use. User reviews and ratings are scattered on various discussion forums, product review websites and portals dedicated to specific areas. It is difficult and time-consuming for an ordinary user to look up this information, familiarise with it and make own opinion on it.
To collect data, we use a set of tools (crawlers) to download user reviews and articles about the selected group of products or services. These crawlers are adjusted to the structure of defined websites from which they collect relevant data that can be helpful for analysis of topics and sentiments. We have a set of crawlers through which we have already downloaded more than a million user reviews.
When collecting data, we usually face several problems . One of the biggest ones is related to varied ways of naming products on different websites. Even though it is an identical product, there are distinctions in the name, which makes the product identification complicated. For instance, the product “Canon EOS 600D” is listed in all of the following sales names:
- “DSLR Canon EOS 600D camera”,
- “Canon EOS 600D SLR digital camera”,
- “Digital camera Canon EOS 600D SLR (18 mpx, 7,6 cm (3″) flip screen, Full HD”
- “Digital DSLR camera Canon EOS 600D (18 megapixels, 7,6cm (3inches) display, APS-C CMOS sensor, WLAN with NFC, Full HD, Digic 7) kit incl. EF-S 18-55mm, 1:4,0 – 5,6 IS STM, black”
It is important to correctly recognise which names identify the same product and connect the product published reviews. We use methods of machine learning in this process.
For further analysis, it is necessary to modify the obtained reviews. The first step is to divide them into individual sentences which usually include independent topics. Furthermore, we transform words into their basic form and remove diacritics. Additionally, it is beneficial to remove words which do not bare any required information value (such as prepositions, conjunctions etc.). To do this, we use our own POS analyser which assigns the word class to words in the sentence and we also use a dataset with stop words created by our own means. Documents edited this way are transformed into vector form, using Tf-idf methodology.
To analyse large amounts of unstructured data, we use methods of machine learning. Using these, we identify the most discussed topics in the data and we determine reviewers’ positive or negative sentiment towards individual features of the products. Using cluster methods (k-means), we divide reviews into clusters with the same topics. We are able to successfully identify clusters with a high degree of internal integrity where identified topics highly correlate with main parameters of the examined product segment. These clusters, created for a particular segment, based on professional articles, are further used to classify reviews of the individual products.
The easiest way how we present results of text analysis is a static report. This output includes product names, their discussed features and statistics on how often are the listed features perceived positively or negatively.
* excellent image sensor resolution,
* excellent focus sensitivity,
* comfortable grip,
* unrivalled image quality,
* rear buttons backlit,
* 4k uhd video 1920 x 1080 / record slow motion,
* pleasantly surprised with nikon d850,
* well-managed noise level 6400,
* gb high consumption,
* more expensive lenses,
* in order to utilise potential, it’s necessary to have adequate lenses, which means the best ones available,
* price quality doesn’t come cheap.
We are currently developing an interactive website application as well as an app for mobile devices. At the same time, for easy integration into already existing solutions, there will be API with regularly updated data.
Do not hesitate to contact us for more information or to provide us with feedback.
Data for text analysis is often available only in web presentations in an unstructured form. How to get the data as easily as possible?
For the purpose of downloading text from websites, there are specialised tools called scrapers or crawlers. For some programming languages, there are frameworks which considerably simplify the creation of the scraper tool for individual websites. We use one of the most popular frameworks called Scrapy written in Python.
As a practical example, we can mention a tool for collecting subsidy incentives as supporting documents for Dotační manager (Grant Manager), the largest portal about grants in the Czech Republic, which unites “calls” from various public sources. The tool automatically looks through the structure of a portal, such as Agency for business and innovations, finds a website of s subsidy call and mechanically processes it into a structured format. The tool can be run repeatedly so that it can also catch newly published calls. The whole tool including the source code is available at http://git.pef.mendelu.cz/MTA/oppik-scraper/.
The above example is fairly simple, however, the practice tends to be more complicated. Website structure of every portal varies, often, it is not unified even within one portal, it changes in time etc. In order not to write similar tools for individual sources again and again, we are developing our own robust crawler which is able to automatically get text data from various sources.
As we mentioned in the previous post, our team is working on a project to help you make decisions about buying different products and services. We try to help users create an objective view of the specific items they want to buy by analyzing published reviews of other users. Currently, we’ve downloaded enough reports and product articles in Czech and English language to analyze individual views. In the first phase it was necessary to adapt the obtained texts to the form suitable for analysis.
It was necessary to divide the documents into individual sentences, because users often present more ideas in one document and evaluate more criteria. The next step was to remove insignificant words that do not bring any or just little information value. For example, clutches, prepositions, web addresses, and so on. In this step, we also used our own POS analyzer, which assigns the words in sentence word types, and our own dataset with stop words. In particular, nouns, adjectives and verbs were interesting for us. Subsequently, we worded the words into their basic shapes, by specifying the roots of words.
We have transformed the edited documents into vector shape using tf-ifd and then split them into clusters with the same themes using k-means methods. We have managed to identify approximately diversified clusters with a high degree of internal integrity. Identified topics were related to the main parameters of the product segment surveyed.
The clusters created for the whole segment, based on expert articles, were then used to classify product reviews. From identified clusters for individual reviews, we chose those with the highest predictive value – and are presented as a suitable representative for a given set of reviews. The result of the analysis is shown in the example below.
We will inform you about the milestones we have achieved in analyzing text reviews. Let’s take a look at our research.
Our team is currently working on a project to help make decisions about buying different products. A huge amount of opinions and reviews of individual products can be found on the Internet.
These user reviews are distributed across a variety of discussion forums, product rating sites, or specific portals. For a regular user, it’s difficult to find the information needed, get a look at them, and make its own opinion.
In order to analyze large amounts of unstructured data, we have decided to use machine learning methods. We want to use these data to identify topics that are important to users and to determine their positive or negative attitudes towards individual product features.
We are currently working on creating crawlers for downloading user reviews and articles about the selected product group. These crawlers are tailored to the structure of specific sites. Crawlers from these sites get relevant data that can help in analyzing themes and attitudes. So far, we have created eight crawlers, which have helped us to download about half a million user reviews and expert articles about two thousand products in two languages (Czech and English).
We had to deal with several issues when acquiring the data. One of the main ones is the different way of labeling products on different sites. Although it is an identical product, there are differences in names that complicate product pairing. Another problem is limiting the number of accesses to some sites in the form of code captcha. The last issue that needs to be solved is the changing web structure that causes crawlers to fail.
We have a practically closed first phase of the project in which we have defined the task of creating data acquisition tools for subsequent analysis. In the next phase, using machine learning methods, we will work to uncover the topics discussed and attitudes of users.