Text analysis. Text mining. Text analytics.
Applied text analysis 1
Structured and non-structured data
• Structured data: data that have been organized in a repository
such as a database, so that their elements can be accessed by
effective analysis and processing methods (e.g., an SQL table).
• Non-structured data: data that have no predefined structure or
model, or that are not organized in a predefined way, which makes
them hard to process using traditional computational methods
(e.g., news articles and customer complaints).
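The contrast can be sketched in a few lines of Python. This is an illustration only: the `complaints` table, its columns, and the sample texts are hypothetical and not taken from the course material.

```python
import sqlite3

# Structured data: a hypothetical "complaints" table with named, typed columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE complaints (id INTEGER, product TEXT, days_open INTEGER)")
conn.executemany(
    "INSERT INTO complaints VALUES (?, ?, ?)",
    [(1, "router", 3), (2, "modem", 10), (3, "router", 7)],
)
# The schema lets us state a precise condition on the attributes.
structured_hits = conn.execute(
    "SELECT id FROM complaints WHERE product = 'router' AND days_open > 5"
).fetchall()
conn.close()

# Non-structured data: similar information buried in free text (made-up examples).
texts = [
    "My router has been dead for a week and nobody called me back.",
    "The modem works fine now, thanks.",
]
# Without a schema, a naive keyword test is about all we can do before any analysis.
keyword_hits = [i for i, t in enumerate(texts) if "router" in t.lower()]
```

The SQL query can filter on how long a complaint has been open, whereas the keyword test cannot tell that "for a week" expresses the same kind of fact. Recovering that is the job of text analysis.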
Search and discovery
A search task is goal-oriented: you must provide a clear criterion to receive the results that you need (e.g., a
condition that must be met by the data attributes).
A discovery task is opportunistic by nature: you don’t know in advance what you are looking for, so hypotheses about the
data are explored automatically to discover new opportunities in the form of hidden (latent) patterns, which can be
interesting and novel.
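The difference can be sketched as follows. The documents and the helper names `search` and `discover_pairs` are hypothetical, and counting co-occurring word pairs is only one simple stand-in for pattern discovery.

```python
from collections import Counter
from itertools import combinations

# Made-up mini collection of customer comments.
docs = [
    "late delivery damaged box",
    "damaged box refund requested",
    "late delivery refund requested",
    "friendly staff fast delivery",
]

def search(documents, term):
    """Goal-oriented: return indices of documents that meet an explicit criterion."""
    return [i for i, d in enumerate(documents) if term in d.split()]

def discover_pairs(documents, min_count=2):
    """Opportunistic: count word pairs that co-occur in a document, keep frequent ones."""
    counts = Counter()
    for d in documents:
        words = sorted(set(d.split()))
        counts.update(combinations(words, 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}

hits = search(docs, "refund")    # we knew what to ask for
patterns = discover_pairs(docs)  # we did not specify any term in advance
```

In the search case the user supplies the term. In the discovery case, frequent pairs such as ("box", "damaged") emerge from the data without being requested.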
Text mining and text analytics
• Text mining and text analytics are highly interchangeable terms.
• Text mining is the automated process of examining large collections of documents or corpora
to discover patterns or insights that may be interesting and useful (Ignatow & Mihalcea, 2017;
Struhl, 2015; Zhai & Massung, 2016). For this, text mining identifies facts, relationships, and
patterns that would otherwise be buried in textual data (Atkinson & Pérez, 2013). This
information can be converted to a structured form that can be later analyzed and integrated
with other types of systems (e.g., business intelligence, databases, and data warehouses).
• On the other hand, text analytics synthesizes the results of text mining so that they can be
quantified and visualized in a way that supports decision-making, producing actionable
insights, so text mining encompasses broader aspects than text analytics.
• The applications of text analytics in industrial and business areas are many, including
• document clustering,
• text categorization,
• information extraction to populate databases,
• text generation,
• association discovery, etc.
Linguistic problems
• Since the goal is to automatically analyze textual information sources
that are written in natural language by humans, computational
methods (Jurafsky et al., 2014) must be able to address three key
linguistic problems:
• Ambiguity: lexical (i.e., a single word with more than one meaning), syntactic
(i.e., a single sentence that has several possible grammatical structures),
semantic (i.e., a sentence with several possible interpretations), and
pragmatic (i.e., a sentence with several possible contexts to determine its
intention).
• Dimensionality
• Linguistic knowledge
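A minimal sketch of two of these problems, assuming that "dimensionality" refers to the large, sparse vector space obtained when every distinct word becomes a feature (an interpretation the slide does not spell out). The sentences are made up for illustration.

```python
docs = [
    "the bank approved the loan",
    "we sat on the river bank",
    "the loan officer left the bank",
]

# Vocabulary: one dimension per distinct word across the whole collection.
vocabulary = sorted({w for d in docs for w in d.split()})

def to_vector(doc):
    """Represent a document as word counts over the shared vocabulary."""
    words = doc.split()
    return [words.count(term) for term in vocabulary]

vectors = [to_vector(d) for d in docs]

# Fraction of zero entries: most dimensions are unused by any given document.
zeros = sum(v.count(0) for v in vectors)
sparsity = zeros / (len(vectors) * len(vocabulary))

# Lexical ambiguity: both senses of "bank" fall on the same dimension.
bank = vocabulary.index("bank")
same_feature = vectors[0][bank] == vectors[1][bank]
```

Even three short sentences yield ten dimensions, half of them empty per document, and the word-level representation cannot separate the financial "bank" from the river "bank".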
Text mining areas
• Natural-Language Processing (NLP) provides theories, models, and
methods so that a computer can understand natural language (written
or spoken) at different linguistic levels (i.e., phonetic, morphological,
lexical, syntactic, semantic, discursive, and pragmatic). In practice,
NLP techniques focus on creating systems that process textual
information in order to make it accessible to other computer
applications.
• Information retrieval (IR) retrieves information from specific
unstructured data sources (e.g., texts, images, and videos) in
response to a certain input query, using some measure of relevance
(i.e., importance) with respect to that query, in order to make the
results available to other tasks and applications. In IR methods and
models, NLP plays a fundamental role in characterizing and
“understanding” some elements of the information (e.g., documents)
that is retrieved.
• Machine learning (ML) is an AI area that provides
computational techniques allowing a computer to learn how to
perform a task based on experience (Wilmott, 2020; Mohri et al.,
2018). Thus, an ML system improves its performance with experience,
without the need to write explicit rules or models. Models
automatically created by ML are thus capable of generalizing
behaviors for unknown cases, improving the performance of certain
tasks.
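The idea of ranking documents by a measure of relevance to a query can be sketched with a simple TF-IDF score. This is one common choice, assumed here for illustration; the slides do not prescribe a specific measure, and the documents and function names are hypothetical.

```python
import math
from collections import Counter

docs = [
    "customer complaint about late delivery",
    "news report on the local election",
    "complaint about a damaged delivery box",
]
tokenized = [d.split() for d in docs]

def idf(term):
    """Smoothed inverse document frequency: rarer terms weigh more."""
    df = sum(1 for words in tokenized if term in words)
    return math.log((1 + len(tokenized)) / (1 + df)) + 1.0

def score(query, words):
    """Relevance of one document: sum of tf * idf over the query terms."""
    tf = Counter(words)
    return sum(tf[t] * idf(t) for t in query.split())

def rank(query):
    """Return document indices sorted from most to least relevant."""
    scores = [score(query, words) for words in tokenized]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

ranking = rank("damaged delivery")
```

The third document matches both query terms and ranks first, the first document matches only "delivery", and the election report scores zero.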
Tasks and applications
• Text clustering
• Information extraction
• Text categorization
• Relationship inference
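As an illustration of one of these tasks, text categorization can be sketched with a small multinomial Naive Bayes classifier learned from labeled examples. The training texts and labels are made up, and Naive Bayes is only one of many ML methods that could be used here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled examples; a real task would need far more data.
training = [
    ("great service very happy", "positive"),
    ("happy with the fast delivery", "positive"),
    ("terrible service very slow", "negative"),
    ("slow delivery and rude staff", "negative"),
]

word_counts = defaultdict(Counter)  # label -> word frequencies
label_counts = Counter()            # label -> number of training documents
for text, label in training:
    word_counts[label].update(text.split())
    label_counts[label] += 1
vocabulary = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Pick the label with the highest log-probability (Laplace smoothing)."""
    best_label, best_score = None, -math.inf
    for label in label_counts:
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / len(training))
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

prediction = classify("very happy with the service")
```

No rule such as "happy means positive" is written by hand: the classifier derives word weights from the examples, which is the sense in which an ML system learns from experience.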
The text analytics process
SUMMARY
• Text analytics is the science of examining large collections of
documents (corpora) written in natural language to discover
interesting and ideally actionable patterns.
• To make this possible, text analytics combines techniques and models
from NLP, ML, and IR.
• This combination allows performing tasks such as document
categorization, text clustering, specific information extraction from
documents, association discovery, topic detection, etc.
• These tasks are the basis for cross-cutting applications in all domains,
both in the private and public spheres, from scientific applications to
industrial and business applications.
QUESTIONS
• 1. What are the main differences between a text analytics application and an NLP application?
• 2. What difficulties does linguistic ambiguity cause in a text analytics or text mining task?
• 3. List two differences between analyzing informal texts on social media and analyzing formal news
texts.
• 4. How does the dimensionality of documents affect the performance of a text analytics task?
• 5. Describe two problems that can arise when analyzing documents using only lexical analysis, that is,
at the word level.
• 6. Consider two applications that handle textual information: one that allows hotel reservations to be
made through natural language, and another that detects the names of personalities in news texts.
Which of them requires NLP tools, and which requires a textual analysis method?
• 7. How can an ML method help a text analytics task?
• 8. You need to automatically group all the news reaching your email and then store it in specific folders.
What analysis methodology would you use, a clustering method or a categorization method?
• 9. What’s the fundamental difference between an IR task and an information extraction task?
• 10. What type of features could be selected as input to a text mining or text analytics task?
• 11. In the text mining process, the evaluation of patterns discovered by some analysis task is essential. In what ways could
these patterns be evaluated in order to generate insights?
• 12. Describe two types of patterns that can be discovered in text analytic tasks.
• 13. In a textual analysis task, such as sentiment classification on social networks, you could simply use keywords as
features, so that an automatic classifier can determine whether an opinion expresses positive or negative sentiment.
What problem will we encounter if such an application uses only this type of feature?
• 14. State which of the following applications use NLP models and which use text analytics approaches:
• Simple document search engine
• Keyword-based sentiment classifier
• Rules-based spam filter
• Virtual tutor that helps a child understand math
• Assess quality of texts written by job applicants
• Classify medical diagnoses
• 15. You know there is an important link between an organization X and a person Y in a set of news articles. What text
analytics approach would you take to determine the specific link that exists?
• 16. A service company receives many written complaints (no more than 2–3 paragraphs each) through its web portal. What kind
of analysis method would you use to generate statistics about the clients’ complaints for a given date?
• 17. You have a large database of invention patent descriptions and want to determine whether a new patent “application”
that comes to you is similar to one that you already have in your database. What text analytics and/or NLP approach would
you use to address this problem?
References
• Atkinson-Abutridy, J. (2022). Text Analytics: An Introduction to the Science and
Applications of Unstructured Information Analysis. Chapman & Hall.
• Text Mining and Analytics. https://www.youtube.com/watch?v=Uqs0GewlMkQ&list=PLLssT5z_DsK8Xwnh_0bjN4KNT81bekvtt