In today’s post, we introduce our MTA (Multicriterial Text Analysis) software. MTA helps users make decisions when shopping for products and services.
The product aims to help users get their heads around the large number of opinions published on the internet about the specific goods or services they would like to buy or use. User reviews and ratings are scattered across discussion forums, product review websites and portals dedicated to specific areas. For an ordinary user, it is difficult and time-consuming to look this information up, become familiar with it and form an opinion of their own.
To collect data, we use a set of tools (crawlers) that download user reviews and articles about a selected group of products or services. Each crawler is adapted to the structure of a defined website, from which it collects the data relevant to the analysis of topics and attitudes. With these crawlers, we have already downloaded more than a million user reviews.
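To illustrate what "adapted to the structure of a website" can mean, here is a minimal sketch of a site-specific extractor built on Python's standard `html.parser`. The page markup, the `review-text` class name and the `extract_reviews` helper are assumptions made up for this example; they do not describe our actual crawlers, and the download step is left out.

```python
from html.parser import HTMLParser


class ReviewExtractor(HTMLParser):
    """Collects the text of every <p class="review-text"> element.

    The class name "review-text" is a hypothetical convention of one site;
    each crawler would use the selectors of its own target website.
    """

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._buffer = None  # text chunks while inside a review paragraph

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "review-text" in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Collapse line breaks and repeated spaces into single spaces.
            self.reviews.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


def extract_reviews(html):
    parser = ReviewExtractor()
    parser.feed(html)
    parser.close()
    return parser.reviews


# Made-up page fragment standing in for a downloaded product page.
SAMPLE_PAGE = """
<html><body>
  <h1>Canon EOS 600D</h1>
  <p class="review-text">Great image quality,
     but the optics are expensive.</p>
  <p class="ad">Buy now!</p>
  <p class="review-text">Comfortable grip.</p>
</body></html>
"""

print(extract_reviews(SAMPLE_PAGE))
```

Keeping the site-specific part this small means that supporting a new website mostly comes down to writing a new extractor class.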
When collecting data, we usually face a few problems. One of the biggest is that different websites label the same product in different ways. Even though the product is identical, its name varies, which complicates product identification. For instance, the product “Canon EOS 600D” is listed under all of the following sales names:
- “Canon EOS 600D reflex camera”,
- “Canon EOS 600D SLR digital camera”,
- “Digital camera Canon EOS 600D SLR (18 mpx, 7,6 cm (3″) pivoting display, Full HD)”,
- “Digital single-lens reflex camera Canon EOS 600D (18 megapixels, 7,6cm (3inches) display, APS-C CMOS sensor, WLAN with NFC, Full HD, Digic 7) kit incl. EF-S 18-55mm, 1:4,0 – 5,6 IS STM, black”.
It is important to recognise correctly which names identify the same product and to assign the published reviews to that product. We use machine learning methods in this process.
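The matching problem can be illustrated with a much simpler baseline than machine learning: check whether all tokens of a known product name occur in the listing. The sketch below is only that baseline, not our production method; the `CATALOGUE` list and the function names are assumptions introduced for the example.

```python
import re


def tokens(name):
    """Lowercase a name and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", name.lower())


def match_product(listing, catalogue):
    """Return the catalogue name whose tokens all occur in the listing.

    If several catalogue names qualify, the one with the most tokens wins.
    None is returned when nothing matches.
    """
    listing_tokens = set(tokens(listing))
    best = None
    for canonical in catalogue:
        canonical_tokens = set(tokens(canonical))
        if canonical_tokens <= listing_tokens:
            if best is None or len(canonical_tokens) > len(set(tokens(best))):
                best = canonical
    return best


# Hypothetical catalogue of canonical product names.
CATALOGUE = ["Canon EOS 600D", "Canon EOS 6D", "Nikon D850"]

listings = [
    "Canon EOS 600D reflex camera",
    "Canon EOS 600D SLR digital camera",
    "Digital camera Canon EOS 600D SLR (18 mpx, 7,6 cm (3″) pivoting display, Full HD",
]
for listing in listings:
    print(match_product(listing, CATALOGUE))
```

A rule like this breaks down as soon as a shop abbreviates or misspells the model name, which is one reason a learned model is the more robust choice.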
For further analysis, the collected reviews have to be preprocessed. The first step is to divide them into individual sentences, which usually cover independent topics. We then transform words into their base form and remove diacritics. It is also useful to remove words that do not bear any useful information (such as prepositions and conjunctions). To do this, we use our own POS analyser, which assigns a word class to each word in the sentence, together with a dataset of word traces that we created ourselves. Documents prepared this way are converted into vector form using TF-IDF weighting.
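The preprocessing steps can be sketched with the Python standard library alone. This is a simplified illustration under several assumptions: the stop-word list is a tiny made-up sample that replaces POS-based filtering, the sentence splitter is a naive regular expression, the reduction of words to their base form is omitted because it relies on our own tools, and the weighting uses one common TF-IDF variant (relative term frequency times `log(N / df)`).

```python
import math
import re
import unicodedata
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be larger.
STOP_WORDS = {"a", "an", "and", "but", "in", "is", "of", "on", "the", "to", "with"}


def split_sentences(review):
    """Split after '.', '!' or '?' followed by whitespace (a naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", review.strip()) if s]


def strip_diacritics(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(sentence):
    """Lowercase, remove diacritics, tokenise and drop stop words."""
    words = re.findall(r"[a-z0-9]+", strip_diacritics(sentence).lower())
    return [w for w in words if w not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per tokenised document."""
    n = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


review = "The autofocus is fast. The café photos look great! Battery life is short."
sentences = split_sentences(review)
documents = [preprocess(s) for s in sentences]
vectors = tf_idf(documents)
print(documents)
```

Note that a term occurring in every document gets a weight of zero under this variant, which is the intended effect: it carries no information for telling topics apart.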
To analyse large amounts of unstructured data, we use machine learning methods. With them, we identify the most discussed topics in the data and determine the reviewers’ positive or negative attitude towards individual features of the products. Using clustering methods (k-means), we divide reviews into clusters that share the same topic. We are able to identify clusters with a high degree of internal cohesion, whose topics relate to the main parameters of the product segment under examination. The clusters created for a particular segment, based on professional articles, are then used to classify reviews of individual products.
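For readers unfamiliar with k-means, the following is a minimal pure-Python sketch of the algorithm on toy two-dimensional vectors. In practice the inputs would be the TF-IDF vectors of the review sentences and a library implementation would be used; the toy data, the function names and the deterministic "first k points" initialisation are assumptions chosen to keep the example short and reproducible.

```python
import math


def distance(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def k_means(points, k, iterations=20):
    """Cluster dense vectors and return (labels, centroids).

    Initialisation takes the first k points, which keeps the sketch
    deterministic; real code would use random or k-means++ seeding.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            new_centroids.append(mean(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # converged: the centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Toy vectors: weight of an "image" term and of a "price" term per sentence.
points = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.0], [0.0, 0.8], [1.0, 0.2], [0.2, 1.0]]
labels, centroids = k_means(points, k=2)
print(labels)
```

Sentences about image quality and sentences about price end up in separate clusters here because their vectors point in different directions.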
The simplest way we present the results of the text analysis is a static report. This output includes product names, their discussed features and statistics on how often the listed features are perceived positively or negatively. The following list shows an example of extracted features:
* excellent image sensor resolution,
* excellent focus sensitivity,
* comfortable grab,
* unrivalled image quality,
* rear buttons backlit,
* 4k uhd video 1920 x 1080 / record slow motion,
* pleasantly surprised with nikon d850,
* well-managed sum values 6400,
* gb high consumption,
* more expensive optics,
* use of potential is needed to have adequate quality optics which means the best mani,
* price quality is not free of charge.
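The statistics in such a report could be aggregated roughly as in the sketch below. It assumes that earlier steps have already produced (product, feature, polarity) triples; the sample mentions are made up for illustration and are not results of our analysis, and the function names are hypothetical.

```python
from collections import defaultdict


def build_report(mentions):
    """Count positive and negative mentions per (product, feature) pair."""
    stats = defaultdict(lambda: {"positive": 0, "negative": 0})
    for product, feature, polarity in mentions:
        stats[(product, feature)][polarity] += 1
    return dict(stats)


def format_report(stats):
    """Render one text line per (product, feature) pair, sorted by name."""
    lines = []
    for (product, feature), counts in sorted(stats.items()):
        total = counts["positive"] + counts["negative"]
        share = 100 * counts["positive"] / total
        lines.append(
            f"{product} | {feature}: {counts['positive']}+ / "
            f"{counts['negative']}- ({share:.0f}% positive)"
        )
    return "\n".join(lines)


# Made-up mentions, for illustration only.
mentions = [
    ("Nikon D850", "image quality", "positive"),
    ("Nikon D850", "image quality", "positive"),
    ("Nikon D850", "price", "negative"),
    ("Nikon D850", "image quality", "negative"),
]
stats = build_report(mentions)
print(format_report(stats))
```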
We are currently developing an interactive web application as well as an app for mobile devices. In addition, to allow easy integration into existing solutions, there will be an API with regularly updated data.
Do not hesitate to contact us for more information or to provide us with feedback.