In today’s post, we introduce our MTA (Multicriterial Text Analysis) software. MTA helps users make better purchasing decisions across a wide range of products and services.
The product aims to help users make sense of the large number of opinions published on the internet about goods or services they are considering. User reviews and ratings are scattered across discussion forums, product review websites and portals dedicated to specific areas. For an ordinary user, looking up this information, reading through it and forming an opinion is difficult and time-consuming.
To collect data, we use a set of tools (crawlers) to download user reviews and articles about a selected group of products or services. Each crawler is tailored to the structure of a specific website, from which it collects data relevant to topic and sentiment analysis. With this set of crawlers we have already downloaded more than a million user reviews.
When collecting data, we usually face several problems. One of the biggest is the varied naming of products on different websites. Even when the product is identical, its name differs from site to site, which complicates product identification. For instance, the product “Canon EOS 600D” appears under all of the following sales names:
- “DSLR Canon EOS 600D camera”,
- “Canon EOS 600D SLR digital camera”,
- “Digital camera Canon EOS 600D SLR (18 mpx, 7,6 cm (3″) flip screen, Full HD”,
- “Digital DSLR camera Canon EOS 600D (18 megapixels, 7,6cm (3inches) display, APS-C CMOS sensor, WLAN with NFC, Full HD, Digic 7) kit incl. EF-S 18-55mm, 1:4,0 – 5,6 IS STM, black”
It is important to correctly recognise which names identify the same product and to link the published reviews to that product. We use machine-learning methods in this process.
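Our production matching uses machine learning; as a minimal sketch of why the variants above can be linked at all, a token-overlap (Jaccard similarity) baseline already works well on the shared tokens “Canon”, “EOS” and “600D”. The function names and the threshold below are illustrative assumptions, not our actual model:

```python
# Hypothetical baseline: match product names by token overlap (Jaccard
# similarity). This only illustrates the idea; the production pipeline
# uses machine learning on top of such features.
import re

def tokens(name: str) -> set[str]:
    """Lowercase a product name and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))

def jaccard(a: str, b: str) -> float:
    """Share of tokens the two names have in common."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def same_product(a: str, b: str, threshold: float = 0.3) -> bool:
    """Hypothetical decision rule: names above the threshold match."""
    return jaccard(a, b) >= threshold

print(same_product("DSLR Canon EOS 600D camera",
                   "Canon EOS 600D SLR digital camera"))  # True
```

Real listings add noise (kit lenses, display sizes, colours), which is exactly where a learned model outperforms a fixed threshold.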
For further analysis, the obtained reviews need to be preprocessed. The first step is to split them into individual sentences, since each sentence usually covers an independent topic. We then transform words into their base form and remove diacritics. It is also beneficial to remove words that do not carry any useful information (prepositions, conjunctions etc.); for this we use our own POS analyser, which assigns a word class to each word in the sentence, together with a stop-word list we created ourselves. Documents preprocessed this way are transformed into vector form using the TF-IDF method.
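The TF-IDF step itself can be sketched with the standard library alone. The lemmatisation, diacritics removal and POS-based stop-word filtering from our pipeline are omitted here, and the sample documents are invented for illustration:

```python
# Minimal TF-IDF sketch: weight each term by how frequent it is in a
# document (TF) and how rare it is across documents (IDF).
import math
from collections import Counter

def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} vector per tokenised document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors

docs = [
    ["camera", "excellent", "sensor"],
    ["camera", "noisy", "sensor"],
    ["battery", "excellent"],
]
vecs = tfidf(docs)
# "noisy" appears in only one document, so it is weighted higher
# than "camera", which appears in two of the three.
```

This down-weighting of ubiquitous terms is what makes the resulting vectors useful for the topic clustering described next.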
To analyse large amounts of unstructured data, we use machine-learning methods. With these, we identify the most discussed topics in the data and determine reviewers’ positive or negative sentiment towards individual product features. Using clustering methods (k-means), we group reviews into clusters sharing the same topic. We are able to identify clusters with a high degree of internal cohesion, whose topics correlate strongly with the main parameters of the examined product segment. These clusters, built for a particular segment from professional articles, are then used to classify reviews of individual products.
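The clustering step can be sketched with scikit-learn; this is an assumed minimal setup with invented sentences, not our production configuration:

```python
# Sketch: cluster review sentences by topic with TF-IDF + k-means.
# Sentences, cluster count and parameters are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "excellent image sensor resolution",
    "excellent image quality and sensor",
    "battery drains quickly",
    "weak battery life drains fast",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

# Two clusters: one about image quality, one about battery life.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
# Sentences about the same topic land in the same cluster.
```

In practice the number of clusters is chosen per product segment, and the cluster centroids learned from professional articles serve as the classifier for new user reviews.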
The simplest way we present the results of text analysis is a static report. This output includes product names, their discussed features and statistics on how often the listed features are perceived positively or negatively. For instance, extracted features for a single product may look like this:
* excellent image sensor resolution,
* excellent focus sensitivity,
* comfortable grip,
* unrivalled image quality,
* rear buttons backlit,
* 4k uhd video 1920 x 1080 / record slow motion,
* pleasantly surprised with nikon d850,
* well-managed noise level 6400,
* gb high consumption,
* more expensive lenses,
* in order to utilise potential, it’s necessary to have adequate lenses, which means the best ones available,
* price quality doesn’t come cheap.
We are currently developing an interactive web application as well as an app for mobile devices. At the same time, for easy integration into existing solutions, there will be an API with regularly updated data.
Do not hesitate to contact us for more information or to provide us with feedback.
Data for text analysis is often available only on websites, in unstructured form. How can we obtain this data as easily as possible?
Specialised tools called scrapers or crawlers exist for downloading text from websites. For some programming languages, there are frameworks which considerably simplify writing a scraper for a particular website. We use one of the most popular of these, the Scrapy framework, written in Python.
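At its core, a scraper fetches pages and extracts text from their HTML. The extraction half can be sketched framework-free with Python’s standard library; Scrapy replaces this boilerplate with CSS/XPath selectors and adds crawling, scheduling and politeness policies on top. The `class="review"` markup below is a hypothetical example, not any real site’s structure:

```python
# Framework-free sketch of the extraction step: pull review texts out
# of an HTML page. The "review" CSS class is an invented example.
from html.parser import HTMLParser

class ReviewExtractor(HTMLParser):
    """Collect the text inside <div class="review"> elements."""
    def __init__(self):
        super().__init__()
        self.depth = 0          # nesting depth inside a review div
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if dict(attrs).get("class") == "review":
                self.depth = 1
                self.reviews.append("")
            elif self.depth:
                self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.reviews[-1] += data

page = """
<html><body>
  <div class="review">Excellent image quality.</div>
  <div class="ad">Buy now!</div>
  <div class="review">Battery life is short.</div>
</body></html>
"""
extractor = ReviewExtractor()
extractor.feed(page)
print(extractor.reviews)
# ['Excellent image quality.', 'Battery life is short.']
```

In Scrapy the same extraction collapses to a one-line selector inside a spider’s `parse` method, which is exactly the convenience the framework provides.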
As a practical example, we can mention a tool for collecting subsidy incentives as supporting material for Dotační manager (Grant Manager), the largest Czech portal about grants, which aggregates “calls” from various public sources. The tool automatically walks through the structure of a portal, such as that of the Agency for Business and Innovations, finds the web page of a subsidy call and mechanically processes it into a structured format. The tool can be run repeatedly, so it also catches newly published calls. The whole tool, including source code, is available at http://git.pef.mendelu.cz/MTA/oppik-scraper/.
The above example is fairly simple; in practice, things tend to be more complicated. Website structure varies from portal to portal, is often not uniform even within a single portal, and changes over time. To avoid writing similar tools for each source again and again, we are developing our own robust crawler that can automatically extract text data from varied sources.