From outer space, the Earth looks like a perfect sphere: "structured". But as the view descends closer and closer to the surface, the sphere first flattens out and then gives way to irregular terrain, varying in height, width and depth: "unstructured".
The analogy may not map perfectly onto data science, but to a large extent it captures the reality of unstructured data.
More than 80% of data, in the form of emails, intranet entries, reports, purchase orders, invoice receipts and the like, is unstructured. With the advent of the Internet and social networking, we have added more unstructured information in the last two to five years than was generated in all the time since Homo sapiens evolved. On top of this, text, videos, images, audio files and GPS data all add to the heap of unstructured data every minute.
The question remains the same: does this information hold any value?
Hidden Value of Unstructured Data
A lot has been written and said about exploring unstructured data and the value derived from it. There is certainly enormous value hidden in it; the challenge is how to extract that information and channel it to your benefit.
Unstructured data is both an opportunity and a challenge for an enterprise: an opportunity because it can provide detailed business insight, and a challenge because the tools, infrastructure, technology and expertise required to mine it are complex, rare and expensive.
Hence a balance must be struck when an enterprise decides to dive deep into the sea of unstructured data mining. A clear and quantifiable business objective for data mining is the most important first step in this direction.
The choice of resources, in terms of software technology, personnel, infrastructure and timing, is an investment that should be aligned with that business objective.
For this discussion, we will restrict ourselves to the value hidden in text, unlocked through text mining, also known as text analytics.
The Science of Text Mining
In simple terms, text mining is the process of deriving high-quality information from textual data. The manual extraction of information from text began in the mid-1980s, but with improvements in technology we now have software tools that perform this task with speed and accuracy.
Each document, be it an email or a web page, is ultimately a sequence of characters and words.
Characters from an alphabet make a word, combinations of words make a sentence, sentences make paragraphs, and paragraphs eventually make chapters, articles and so on.
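This character-to-word-to-sentence hierarchy is exactly where machine processing of text begins. As a minimal sketch (using naive, illustrative splitting rules, not those of any particular tool), a document can be broken into sentences and then into words:

```python
import re

def tokenize(text):
    """Split raw text into sentences, then each sentence into words."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Keep only alphabetic word characters (and apostrophes) as "words".
    return [re.findall(r"[A-Za-z']+", s) for s in sentences if s]

doc = "The payment failed twice. Please fix my account!"
print(tokenize(doc))
# [['The', 'payment', 'failed', 'twice'], ['Please', 'fix', 'my', 'account']]
```

Real NLP tokenizers handle abbreviations, punctuation and multiple languages far more carefully, but the principle is the same: impose structure on a raw character stream.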
But how do you extract information from a language? How can a statistical procedure be deployed to assess a KPI? How can literature be treated as a science?
Extensive research has been performed on languages and computers, especially on English as the language of business, and as a result an entirely new practice vertical has emerged: Natural Language Processing, or NLP. An offshoot of Artificial Intelligence, it sets out the basis for how a computer interprets and interacts with human (natural) language.
Using a Software Tool
Numerous software tools built on these concepts are available in the market today to perform the text mining task. Let us understand how these tools work with an example.
Suppose a company 'X' that provides payment platform services to consumers and businesses wants to evaluate the satisfaction level of its customers and identify the area of maximum improvement in the 'eyes of the customer'. The data for this analysis is sourced from customer emails, transcripts of call-center interactions, service satisfaction surveys, and complaints and feedback.
Text from these sources is accumulated and fed to an efficient parsing engine, such as IBM SPSS Text Modeler, to get an initial sense of the sentiment.
The tool sorts the input text data (emails, transcripts, survey feedback) into various categories, reporting the number of positive, negative, neutral, technical and other system-generated categories.
The analyst can merge or split categories, add or remove product names, define rules for categorization, edit the system's categorization rules, and change the corpora or libraries accordingly to classify the documents.
Categorization is the process of collating text based on keywords that identify a category. There may be Level 1 categories, Level 2 categories, and so on.
The output of this categorization task shows the strength of the sentiments. A typical output is a web diagram of categories, in which the strength of correlation between two categories is indicated by the thickness of the line joining them. In this example the interpretation would be: most customers are complaining about Accounts, Payments, Time, Agents and Delivery, with Accounts and Payments the worst. A customer who is unhappy about their account is unhappy about the payment as well.
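The "thickness" of such a connecting line typically reflects how often two categories occur together in the same document. A minimal sketch of that co-occurrence count, on hypothetical per-document labels, might look like:

```python
from itertools import combinations
from collections import Counter

# Hypothetical per-document category labels from a categorization step.
labelled_docs = [
    {"Account", "Payment"},
    {"Account", "Payment", "Time"},
    {"Delivery"},
    {"Account", "Payment"},
]

# Count how often each pair of categories appears in the same document;
# higher counts would be drawn as thicker connecting lines.
pairs = Counter()
for cats in labelled_docs:
    pairs.update(combinations(sorted(cats), 2))

print(pairs.most_common(1))
# [(('Account', 'Payment'), 3)]
```

Here the Account–Payment pair dominates, which is precisely the kind of signal the web diagram makes visible at a glance.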
Drilling down further to the Level 2 categorization may reveal that, under the Account category, the most-discussed topic is the "account opening process", and so on.
These are insights that cannot be derived from number crunching alone: numbers will only convey a high or low Customer Satisfaction Score (CSAT), not the cause behind a falling score.
Lexical analysis, used as part of text data mining, uncovers this value and helps an organization focus its energies on correcting the worst faults rather than trying to address every issue. A focused approach yields rich dividends in a short time, at the least cost of repair.
In the next article we will discuss how statistical techniques such as Principal Component Analysis can help identify the problem areas whose correction will have the maximum impact on CSAT improvement.