Digital Libraries - Data, Information, and Knowledge
Fabio Crestani
Sally Jo Cunningham (Eds.)
LNCS 10647
Digital Libraries:
Data, Information, and Knowledge
for Digital Lives
19th International Conference
on Asia-Pacific Digital Libraries, ICADL 2017
Bangkok, Thailand, November 13–15, 2017, Proceedings
Lecture Notes in Computer Science 10647
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, Lancaster, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Zurich, Switzerland
John C. Mitchell
Stanford University, Stanford, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbrücken, Germany
More information about this series at http://www.springer.com/series/7409
Editors
Songphan Choemprayong
Chulalongkorn University
Bangkok
Thailand
Sally Jo Cunningham
University of Waikato
Hamilton
New Zealand
Fabio Crestani
University of Lugano
Lugano
Switzerland
LNCS Sublibrary: SL3 – Information Systems and Applications, incl. Internet/Web, and HCI
While the number of digital collections has been increasing steadily and across diverse
practices, there are some concerns regarding the relevance and value to society at large
of the efforts to expand, enhance, and sustain these collections. These concerns call
for discussions and exemplifications of how research efforts and practices in digital
libraries improve the quality of human life in all dimensions, such as education,
business, socialization, public administration, culture, and humanities. In addition,
these questions initiate a quest to discover novel methods in producing, managing,
analyzing, and storing digital collections as well as to deliver state-of-the-art services in
a complex, connected, and ever-changing environment that matter to our daily lives.
The annual International Conference on Asia-Pacific Digital Libraries (ICADL)
series is a significant forum that provides opportunities for researchers, educators, and
practitioners to exchange their research results, innovative ideas, service experiences,
and state-of-the-art developments in the field of digital libraries. The theme of ICADL
2017 was “Data, Information, and Knowledge for Digital Lives,” open to all
opportunities that illustrate how digital libraries, digital collections, and corresponding
methods would lead to better lives.
Since the first ICADL in 1998, the conference has grown to become one of the
premier forums in the digital library community. Based on the success of the first 18
ICADL conferences, the 19th ICADL conference was hosted by the Department of
Library Science, Faculty of Arts, Chulalongkorn University, Bangkok, Thailand. This
year the conference was co-located with the 8th Asia-Pacific Conference on Library
and Information Education and Practice (A-LIEP) under the collective title
“International Forum on Data, Information, and Knowledge for Digital Lives.” Hosting these
conferences together in the heart of Bangkok brought together a diverse group of
academic and professional community members from all parts of the world to exchange
their cutting-edge knowledge, experience, and practices in various relevant issues in
digital libraries and other related fields.
The submissions to ICADL 2017 covered a wide spectrum of topics from various
areas, including information visualization, data mining/extraction, cultural heritage
preservation, personalized service and user modeling, novel library content and use
environments, electronic publishing, preservation systems and algorithms, social
networking and information systems, Internet of Things, cloud computing and
applications, mobile services, interoperability issues, open source tools and systems, security
and privacy, multi-language support, metadata and cataloguing, search, retrieval, and
browsing interfaces to all forms of digital content, e-Science/e-Research data and
knowledge management, and cooperative service and community service.
The keynote speakers of ICADL 2017, as part of the International Forum, included
Prof. Chayodom Sabhasri from Chulalongkorn University (Thailand), Prof. Makiko Miwa
from the Open University of Japan, and Prof. Jane Greenberg from Drexel University
(USA).
VI Preface
ICADL 2017 received 51 submissions from 21 countries. Each paper was carefully
reviewed by the Program Committee members. Finally, 21 full papers and six short
papers were selected. On behalf of the Organizing and Program Committees of ICADL
2017, we would like to express our appreciation to all the authors and attendees for
participating in the conference. We also thank the sponsors, Program Committee
members, external reviewers, supporting organizations, and volunteers for making the
conference a success. Without their efforts, the conference would not have been
possible.
ICADL 2017 was organized by the Department of Library Science, Faculty of Arts,
Chulalongkorn University.
Workshop Chair
Marut Buranarach National Electronics and Computer Technology Center,
Thailand
Organizing Committee
Pimrumpai Premsmit Chulalongkorn University, Thailand
Somsak Sriborisutsakul Chulalongkorn University, Thailand
Oranuch Sawetrattanasatian Chulalongkorn University, Thailand
Chindarat Berphan Chulalongkorn University, Thailand
Songphan Choemprayong Chulalongkorn University, Thailand
Sorakom Dissamana Chulalongkorn University, Thailand
Duangnate Vongpradhip Chulalongkorn University, Thailand
Nenuphar Supavej Chulalongkorn University, Thailand
Wachiraporn Klungthanaboon Chulalongkorn University, Thailand
VIII Organization
Program Committee
Maristella Agosti University of Padua, Italy
Hugo Alatrista-Salas Universidad del Pacífico, Peru
Marut Buranarach National Electronics and Computer Technology Center,
Thailand
Nisachol Chamnongsri Suranaree University of Technology, Thailand
Youngok Choi Catholic University of America, USA
Gobinda Chowdhury Northumbria University, UK
Milena Dobreva UCL Qatar, Qatar
Supol Durongwatana Chulalongkorn University, Thailand
Nicola Ferro University of Padua, Italy
Schubert Foo Nanyang Technological University, Singapore
Edward Fox Virginia Polytechnic Institute and State University,
USA
Dion Hoe-Lian Goh Nanyang Technological University, Singapore
Additional Reviewers
Vichita Jienjitlert
Yufeng Ma
Panuakdet Suwannatat
Chih-Jau Wang
Contents
Mobile Applications
Social Media
User Behaviors
Video Seeking Behavior of Young Adults for Self Directed Learning
Cliff Loke, Schubert Foo, and Shaheen Majid
1 Introduction
The world is becoming an increasingly complex place, where information needs are not
always simple to satisfy – even by sophisticated information retrieval algorithms over
large digital libraries with carefully curated content. In this work, we introduce the
novel problem of ‘claim-based queries’ and show how to use focused indexing in
digital libraries to reliably capture claims and subsequently answer respective queries.
So, what are claim-based queries? To get an intuition, consider the following
example: a user interested in medical research may raise the general question of “which
medication should be taken to alleviate a headache?” At first, the question may strike
one as a bit naïve, since the answer will obviously be quite complex: there exist several
medications with different pros and cons depending on the specific problem setting.
Indeed, the main challenge of this example is that any ‘good’ answer has to deal with
knowledge that is open to discussion and is highly dependent on some context missing
in the question. In any case, users will need at least three steps to satisfy their query:
1. Find out what medications to alleviate a headache actually do exist (the entity space
for possible answers),
2. Find documents, e.g. research papers, where each medication has been applied in
particular problem settings (the contextual space for the above entities), and
3. Given all these documents, analyze them to decide which medication best fits the
user's own particular context (a selection or ranking method).
We see two basic requirements for any retrieval system to solve the problem. First,
it needs to operationalize the notion of a claim-based query, and second, it needs high
quality content as input. While the first part is indeed quite problematic, the second part
may be solved by digital libraries offering high quality content, often curated by
peer-review. However, a key semantic metadata element for such a system, the central
claim(s) of each paper, is usually not available. Providing this crucial element is the
focus of this paper.
Previous work in the field of argumentation mining has shown the potential of
algorithms to automatically identify argumentative structures such as claims from
clearly structured online debate forums and from persuasive essays on various topics [1].
However, is it possible to find a solution for scientific collections, too? In this paper, we
focus on the proper identification of claims in research papers. We concentrate our
efforts to answer the following questions: How difficult is the task? Is the claim of a
research paper usually in a single sentence or can it stretch over several sentences? Can
extractors reliably identify claims?
Addressing this challenge, this work focuses on the automatic identification of
claims in research papers in an unsupervised fashion. Previously, we have shown the
key role that claims can play for Digital Libraries [2], in particular how they can assist
peer-review to support high quality content. In this work, we introduce a novel
integration of neural embedding representations of words within a technique that identifies
claims in scientific articles. We test our approach on a representative corpus of PubMed
articles from more than 1,000 different journals that have claims annotated.
The paper is organized as follows: Sect. 2 provides definitions and the problem
statement that we aim to solve. Section 3 reviews related work. In Sect. 4, we first
provide an analysis of the corpus used to assess the difficulty of the task. In particular,
we perform an explorative analysis to answer whether the number of sentences in a
claim varies, and whether specific vocabulary patterns at the beginning and ending of
claims exist. Section 5 provides details on our experimental setup and discusses our
findings. Finally, we draw conclusions in Sect. 6 and point to future work.
In this section, we introduce the idea behind claim-based queries. We provide
definitions and the problem statement that we aim to solve in this paper. In general, a
claim-based query is a query that represents a specific and complex type of information
need: a question whose answer is subject to discussion. In particular, this type of
question follows a problem-solution pattern for which more than one solution
exists. For example, “which medication should I take to alleviate a headache?” In this
case, ‘medication’ is the solution, and ‘headache’ is the problem. Moreover, specific
instances of medication could solve this particular problem. Each sentence in which an
association between a specific instance of the solution and the problem appears is what
A New Challenge for Digital Libraries 5
we have called a claim. In this work, we argue that to answer claim-based queries, the
proper identification of claims is a fundamental first step.
We focus on the medical domain; thus, more specifically, the relationship part
of the claim will be a relationship in which the consumption of a product, a drug, a
substance, etc., has an effect on a given disease. We recognize that health
information is complicated and thus, as a first attempt, we assume that the
claims can be found by identifying the sentences that correspond to the main
contributions of a paper. Therefore, the challenge of automatically identifying this type of
sentence is the focus of this paper. More formally, we are given a collection of $m$
documents (research papers) from a digital library $D = \{d_1, \ldots, d_m\}$, where each
document is represented as a sequence of sentences. Our task is then:
Problem Statement. (Claim detection in research papers). Given a collection of
documents $D$ and a pair of entities $e_1, e_2$, we intend to identify automatically from each
document in $D$ the sentence(s) $\{s_1, \ldots, s_n\}$ where $e_1, e_2$ are related, with the constraint
that the sentence(s) belong to the set of the main contribution(s) of the paper. We
approach the claim detection problem by breaking it down into two tasks:
1. Identification of the sentence(s) that represent the main contribution of a paper.
2. Identification of the sentence(s) of 1 where the entities $e_1, e_2$ are found.
To address task 1, for a given $s_i$ in $d$, and for each $d \in D$, we determine whether the
given sentences should be considered as the claim of $d$. To generate such a binary
decision, we perform a claim detection process $\mathrm{claim}(d)\ \forall d \in D$ formalized in the
following expression:
Task 2 is trivial once task 1 has been solved: it is only a pruning process to consider
the sentence(s) where entities $e_1, e_2$ appear. For completeness, we summarize in
Algorithm 1 how to solve the claim detection problem. However, in the following
section, we describe the main contribution of this paper: step 4. In particular, we aim at
performing this step in an unsupervised fashion.
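The two tasks above can be sketched as a small pipeline. This is only an illustrative sketch, not the authors' Algorithm 1: the function names `detect_claims` and `rank_by_length`, the substring-based entity matching, and the length-based stand-in ranker are all assumptions made for the example. Any sentence ranker, such as TextRank, could be passed in as `rank_sentences`.

```python
from typing import Callable, List, Sequence


def detect_claims(
    sentences: Sequence[str],
    e1: str,
    e2: str,
    rank_sentences: Callable[[Sequence[str]], List[int]],
    k: int = 3,
) -> List[str]:
    """Return the top-k ranked sentences that mention both entities."""
    # Task 1: keep the k highest-ranked sentences as main-contribution candidates.
    top_indices = rank_sentences(sentences)[:k]
    candidates = [sentences[i] for i in sorted(top_indices)]
    # Task 2: prune to the candidates in which both entities appear
    # (case-insensitive substring match, an assumption of this sketch).
    return [
        s for s in candidates
        if e1.lower() in s.lower() and e2.lower() in s.lower()
    ]


def rank_by_length(sentences: Sequence[str]) -> List[int]:
    """Hypothetical stand-in ranker: indices of the longest sentences first."""
    return sorted(range(len(sentences)), key=lambda i: -len(sentences[i]))


doc = [
    "Headache is common.",
    "We found that aspirin reduces headache severity in adults.",
    "More studies are needed.",
]
claims = detect_claims(doc, "aspirin", "headache", rank_by_length, k=2)
```

Keeping the ranker as a parameter separates the unsupervised ranking step from the entity-based pruning, which mirrors the split into task 1 and task 2.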
6 J.M. González Pinto and W.-T. Balke
3 Related Work
Our work builds on the Argumentation Mining field where researchers study the
identification of argumentative structures in some given text. For instance, in [8]
rhetorical roles of sentences were investigated to classify academic citations with
respect to the citation effect, in particular how a citation fits the argumentative
structure. As features, they investigated the type of subject of the sentence,
the citation type, the semantic class of main verb, and a list of indicator phrases that
were manually evaluated. Work in [9, 10] studied persuasive essays from the discourse
structure perspective. They introduced an approach to identify argumentative discourse
structures. In their work, components such as claims and premises, and how they are
connected with argumentative relations were studied. The researchers classified a pair
of argument components as either support or non-support to identify the structure of
argumentative discourse. After evaluating several classifiers, novel feature sets were
proposed including structural, lexical, syntactic, and contextual features. In [11] a
classification of argumentative sentences was introduced, namely four categories: none,
major claim, claim, and premise. They used a supervised machine learning approach to
learn these categories automatically, achieving a 0.72 macro-F1-score. In the work of
[12] the idea of claim detection given a particular context was introduced. In particular,
the work used annotated data from Wikipedia to assess a supervised machine learning
approach. Another interesting approach was proposed by [13], where a method that
used structured parsing information detected claims without requiring contextual
information. In [14], a relation-based approach to Argumentation Mining was
introduced, in particular for the extraction of argumentative relations. The researchers
introduced a detailed use case where pairs of sentences were annotated to focus on
identifying argumentative relations.
Particularly related to our work, in [1] the TextRank algorithm was used to detect
argumentative components in an online debating forum and persuasive essays. What
distinguishes our approach is that we incorporate two key components into the
algorithm: different similarity metrics and an embedding representation of
sentences based on word2vec. In [15], researchers elaborated on the appropriate annotation
scheme for argumentation mining. In particular, they studied the educational domain
using German newspaper editorials from the Web and English documents from forums,
comments, and blogs. They found that the choice of the argument components depends
on several different factors and structures used for expressing argumentation, thus no
argumentation scheme fits all the possible applications where Argumentation Mining
may play an important role. In [16], the IBM Haifa Research Group collected
context-dependent claims and evidence (facts) relevant to a given topic from Wikipedia
pages. The researchers classified evidence into three types: study, expert and anecdotal
using manually curated data from Wikipedia.
4 Dataset
The primary focus of our experiments is to determine how successfully the
TextRank algorithm, an unsupervised approach, can perform the task of claim detection
in scientific articles, in particular in the medical domain. To do so, we perform
experiments on a PubMed corpus extracted using the following query pattern in
PubMed “(help AND prevent) OR (lower AND risk) OR (increase OR increment AND
risk) OR (decrease OR diminish AND risk) OR (factor AND risk) OR (associated AND
risk)” as in [17]. Out of more than 1M articles retrieved, we used a sample of 10,000
that featured abstract and conclusion as metadata elements. We did so because the
sentences in the conclusion metadata are considered as our ground truth. In this work,
we hypothesized that the sentences in the conclusion metadata are a good indicator of
the main contribution of the paper. Unfortunately, we cannot use the MeSH terms
of the documents as ground truth because they are not sentences expressing the main
contribution of the papers. Thus, the sentences of the abstract section and of the
conclusions section constitute the set of sentences that the TextRank algorithm uses as
input. Moreover, we will refer to the conclusions as the claims of the papers hereafter.
In this section, we report results of an exploratory analysis of our corpus. In
particular, we wanted to understand how diverse the content of the available
metadata is. We shed light on the following
questions: (1) What is the distribution of the number of sentences of a claim across
different journals? (2) What is the specific vocabulary at the beginning and ending of
claims?
Let us start with our first question: whether the number of sentences containing
claims differs across journals. Among the 1,000 different journals from
our query pattern, we found that only 3% of the journals use on average between 3 and 5
sentences to represent the claim of a research paper. In other words, the majority of
the journals use between 1 and 3 sentences.
In Fig. 1 we see a box plot with the mass of the mean number of sentences falling
between 1 and 3 sentences. Concretely, each dot represents a different journal and the
x-axis features the average number of sentences that we found in the metadata that
corresponds to the claims of the papers.
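The per-journal statistic behind this analysis could be computed along the following lines. This is a minimal sketch under stated assumptions: the input format (pairs of journal name and claim length in sentences), the function name `mean_claim_length_by_journal`, the inclusive 3-to-5 bounds, and the toy records are all hypothetical and not taken from the corpus.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def mean_claim_length_by_journal(
    records: Iterable[Tuple[str, int]]
) -> Dict[str, float]:
    """Average number of claim sentences per journal.

    Each record is (journal name, number of sentences in the claim of one paper).
    """
    lengths = defaultdict(list)
    for journal, n_sentences in records:
        lengths[journal].append(n_sentences)
    return {journal: mean(values) for journal, values in lengths.items()}


# Toy records, invented for illustration only.
records = [
    ("Journal A", 1), ("Journal A", 3),
    ("Journal B", 4), ("Journal B", 5),
    ("Journal C", 2),
]
averages = mean_claim_length_by_journal(records)

# Share of journals whose average falls between 3 and 5 sentences
# (inclusive bounds are an assumption of this sketch).
share_3_to_5 = sum(1 for v in averages.values() if 3 <= v <= 5) / len(averages)
```

Each value in `averages` corresponds to one dot in a plot such as Fig. 1.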
Let us continue with our second question: What is the specific vocabulary at the
beginning and ending of claim(s)? To answer this question, we investigated the
bigrams most frequently used at the beginning and ending of the claims sections. In
particular, we used the median position of the bigrams within the claims sections. In
Fig. 2, we plot bigrams used at least 50 times at the beginning and at the end of the
claims section. It seems that there exist some text patterns that can help in the
implementation of an algorithm for automatically detecting claims in medical research
papers.
Fig. 2. Bigrams used at the beginning (left side of the graph) or end (right side of the graph) of
the claim(s) section.
In Fig. 2, the x-axis represents the median position of bigrams within claims.
Basically, the plot divides the bigrams of the claims sections of the papers into two
main groups: first, those whose median position is below 50% (beginning),
and second, those whose median position is above 50% (end). For
instance, the bigram “is warrant” appears at the end of the claims sections,
corresponding to a median position of 91.1%. Building on these insights, in the next section
we proceed to provide details of the actual implementation of our approach.
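The median-position analysis can be illustrated with a short sketch. The exact definition of position used for Fig. 2 is not given in the text, so the sketch assumes the relative position of a bigram is its index among the bigrams of a claim, scaled to 0-100%. The function name `bigram_median_positions`, the pre-tokenized input, and the toy claims are hypothetical.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Sequence, Tuple


def bigram_median_positions(
    claims: Sequence[List[str]], min_count: int = 50
) -> Dict[Tuple[str, str], float]:
    """Median relative position (in percent) of each frequent bigram."""
    positions = defaultdict(list)
    for tokens in claims:
        n_bigrams = len(tokens) - 1
        if n_bigrams < 1:
            continue  # a claim with fewer than two tokens has no bigram
        for i in range(n_bigrams):
            bigram = (tokens[i], tokens[i + 1])
            # 0% = first bigram of the claim, 100% = last bigram of the claim.
            relative = 100.0 * i / (n_bigrams - 1) if n_bigrams > 1 else 0.0
            positions[bigram].append(relative)
    # Keep only bigrams seen at least min_count times.
    return {b: median(p) for b, p in positions.items() if len(p) >= min_count}


# Toy claims, invented for illustration; min_count is lowered accordingly.
claims = [
    "these results suggest a benefit further research is warrant".split(),
    "these results suggest lower risk more work is warrant".split(),
]
medians = bigram_median_positions(claims, min_count=2)
beginning = [b for b, m in medians.items() if m < 50]
ending = [b for b, m in medians.items() if m > 50]
```

With the threshold of 50 occurrences mentioned above, `min_count=50` would be used on the full corpus.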
5 Experiments
In this section, we report the results of our experiments. Because the number of
sentences in the conclusions shows diversity (see Sect. 4), we also vary the number of
sentences in our experiments to evaluate the performance of the implementations of
TextRank. For each experiment, we choose a different number of sentences to
return, so as to cover most of the cases we found in our exploratory
analysis. Moreover, for each number of sentences we run eight different
implementations of TextRank. The implementations differ in two fundamental aspects: the
similarity metric used by the algorithm, and whether the implementation performs
dimensionality reduction of the embedding space or not. For dimensionality reduction,
we use principal component analysis (PCA) [18].
Furthermore, one of the implementations uses a Bag of Words (BOW) model with
the cosine similarity as the similarity metric. We use that simple implementation to
determine whether the use of word embeddings makes a difference for this particular task.
To compare the variations of the algorithm, we evaluate whether a sentence returned
by TextRank is in the conclusions metadata. If it is contained, we consider the
sentence as correctly identified; otherwise, it is considered incorrect. Thus, we report
accuracy as the measure of success of the different variations of the algorithm. In the
following, we describe the variations of TextRank we evaluate.
1. BOW + TF-IDF: uses a bag of words model with TF-IDF [18] to compute cosine
similarity between the sentences.
2. Embedding: uses cosine as the similarity metric. Each sentence is represented as the
sum of the individual word vectors in a 200-dimensional space.
3. Embedding + Hellinger: uses the Hellinger similarity metric.
4. Embedding + PCA + Cosine: uses PCA dimensionality reduction in the word
vectors. Each sentence is a sum of vectors of its individual words, but in a reduced
space. Uses the Cosine similarity metric.
5. Embedding + PCA + Hellinger: similar to (4) but uses the Hellinger similarity
metric.
6. Embedding + PCA + 2-Norm Diff: similar to (4) but uses the Euclidean distance of
the difference of the vectors that represent each sentence as the similarity metric.
7. Embedding + PCA + 2-Norm Avg: similar to (4) but uses the Euclidean distance of
the average of the vectors that represent each sentence as the similarity metric.
8. Embedding + PCA + 2-Norm Diff & Avg: similar to (4) but using the
concatenation of the vectors that represent the differences and the average word2vec vectors
of the sentences.
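To make variation 2 more concrete, the following sketch ranks sentences with a weighted TextRank iteration over cosine similarities of summed word vectors. It is an illustrative sketch, not the implementation evaluated here: the toy 3-dimensional vectors stand in for 200-dimensional word2vec vectors, clipping negative similarities to zero is an assumption, and the names `sentence_vector`, `cosine`, and `textrank` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def sentence_vector(
    tokens: Sequence[str], word_vectors: Dict[str, List[float]], dim: int
) -> List[float]:
    """Represent a sentence as the sum of its word vectors (unknown words skipped)."""
    vec = [0.0] * dim
    for token in tokens:
        for j, value in enumerate(word_vectors.get(token, [])):
            vec[j] += value
    return vec


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def textrank(
    vectors: Sequence[Sequence[float]], damping: float = 0.85, iterations: int = 100
) -> List[float]:
    """Weighted TextRank scores: (1 - d) + d * sum_j (w_ji / sum_k w_jk) * score_j."""
    n = len(vectors)
    # Edge weights: cosine similarity, clipped at zero (assumption), no self-loops.
    w = [
        [max(cosine(vectors[i], vectors[j]), 0.0) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_weight = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0
            )
            for i in range(n)
        ]
    return scores


# Toy word vectors, invented for illustration.
word_vectors = {
    "aspirin": [1.0, 0.0, 0.0],
    "reduces": [0.5, 0.5, 0.0],
    "headache": [0.9, 0.1, 0.0],
    "weather": [0.0, 0.0, 1.0],
}
sentences = [["aspirin", "reduces", "headache"], ["aspirin", "headache"], ["weather"]]
vectors = [sentence_vector(s, word_vectors, 3) for s in sentences]
scores = textrank(vectors)
top_k = sorted(range(len(scores)), key=lambda i: -scores[i])[:2]
```

Swapping `cosine` for another similarity function is the only change needed to express the other variations in the list, which is why the metric is isolated in its own function.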
For the PCA variations, to determine the number of components to use, we use a
measure known as ‘explained variance’, which can be calculated from the respective
eigenvalues. Concretely, the explained variance tells us how much information can be
attributed to each of the principal components. We experiment with different variances
to empirically select the number of components and report the best results in this work.
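The explained variance of component $i$ is its eigenvalue divided by the sum of all eigenvalues, $\lambda_i / \sum_j \lambda_j$. A minimal sketch of choosing the number of components from a target cumulative explained variance follows; the function name `components_for_variance`, the example eigenvalues, and the 0.85 target are assumptions for illustration, not the values used in the experiments.

```python
from typing import List, Sequence


def explained_variance_ratios(eigenvalues: Sequence[float]) -> List[float]:
    """Share of the total variance attributed to each principal component."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return [value / total for value in ordered]


def components_for_variance(eigenvalues: Sequence[float], target: float) -> int:
    """Smallest number of leading components whose cumulative share reaches target."""
    cumulative = 0.0
    ratios = explained_variance_ratios(eigenvalues)
    for count, ratio in enumerate(ratios, start=1):
        cumulative += ratio
        if cumulative >= target:
            return count
    return len(ratios)


# Hypothetical eigenvalues of the covariance matrix of the word vectors.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
ratios = explained_variance_ratios(eigenvalues)
n_components = components_for_variance(eigenvalues, target=0.85)
```

Here the first three components account for 87.5% of the variance, so three components satisfy the 0.85 target. Trying several targets and keeping the one with the best downstream accuracy corresponds to the empirical selection described above.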
To clarify our findings, we first provide an analysis of cases where the ground truth
consists of two sentences and second, all cases where the ground truth has three
sentences.
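The accuracy figures reported for different values of k can be read in more than one way. The sketch below shows one plausible reading, stated here as an assumption: a document counts as correct if at least one of the top-k sentences returned by TextRank belongs to the ground-truth conclusion sentences. The function name `accuracy_at_k` and the toy rankings are hypothetical.

```python
from typing import List, Sequence, Set


def accuracy_at_k(
    ranked: Sequence[List[int]], gold: Sequence[Set[int]], k: int
) -> float:
    """Fraction of documents with at least one gold sentence among the top k.

    ranked[d] lists sentence indices of document d, best first;
    gold[d] is the set of indices of its ground-truth claim sentences.
    """
    hits = 0
    for doc_ranked, doc_gold in zip(ranked, gold):
        if any(index in doc_gold for index in doc_ranked[:k]):
            hits += 1
    return hits / len(ranked)


# Toy rankings and ground truth, invented for illustration.
ranked = [[0, 3, 1], [2, 0, 1], [1, 2, 0]]
gold = [{3, 4}, {1}, {1, 2}]
acc_by_k = {k: accuracy_at_k(ranked, gold, k) for k in (1, 2, 3)}
```

Under this reading accuracy can only grow with k, which matches the trend in the tables that follow.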
Let us start with the first case. We can observe from Table 1 that all the variations
of TextRank using an embedding representation of the sentences outperform the Bag of
Words model representation. This was expected, because word embeddings capture
semantic and syntactic features that are absent from the Bag of Words model. What is
interesting to notice is that a sum over the word vectors of a sentence preserves these
properties.
Table 1. Accuracy of the different variations of TextRank to identify claims. The value of k
represents the number of sentences used to compute the accuracy
TextRank variation k=2 k=3 k=4
BOW + TFIDF 0.338 0.466 0.582
Embedding 0.418 0.566 0.662
Embedding + Hellinger 0.433 0.562 0.659
Embedding + PCA + Cosine 0.463 0.609 0.701
Embedding + PCA + Hellinger 0.383 0.510 0.613
Embedding + PCA + 2-Norm Diff 0.339 0.500 0.638
Embedding + PCA + 2-Norm Avg 0.393 0.550 0.679
Embedding + PCA + 2-Norm Diff & Avg 0.378 0.535 0.662
With respect to the similarity metric in the embedding space when PCA was not
applied, the cosine similarity outperforms the Hellinger similarity by a very small
margin when the number of top sentences returned by TextRank is k = 3 or k = 4, but
the Hellinger similarity is the better choice when k = 2. Thus, for our particular task of
retrieving claims in an unsupervised fashion, we consider both similarity metrics
equally valuable. However, when using PCA, the cosine similarity clearly outperforms
the other metrics in Table 1. In fact, this particular implementation of the TextRank
algorithm delivers the overall best results. Moreover, with PCA the Hellinger metric was
consistently outperformed by the implementation that uses the norm between the average
representations of the vectors as the similarity metric.
Our finding is in line with the work of [7], where a similar representation of the
sentences performed on par with more computationally expensive deep learning models of
sentences in the task of document classification. As expected, the performance of all
the implementations increases as we increase the number of sentences that the algorithm returns.
Nevertheless, considering that the ground truth consists of only two sentences, we
can observe that all the implementations performed poorly on the task.
Let us continue with the experiments that correspond to cases where the number of
sentences in the ground truth is three. We present the results in Table 2. Similar to what
we observe in Table 1, any embedding representation outperforms the Bag of Words
model. With respect to the similarity metric when PCA was not used, we cannot see a
clear winner between the Hellinger and the Cosine similarity metrics. However, when we
perform PCA on the word vectors, the Cosine similarity clearly outperforms the
Hellinger similarity metric. Nevertheless, a fundamental difference between Tables 1
and 2 is that the method with the best results in Table 2 is not the Cosine similarity with
PCA but rather the 2-Norm distance using the average vector of the sentences (for
k = 4, the Diff & Avg variant is marginally ahead).
Discussion. In summary, we found that using an embedding representation of the
sentences had a positive impact on our particular task. Furthermore, when
dimensionality reduction with PCA was applied to the word vectors, we obtained better
results. Moreover, we also observed that as the parameter k, which represents the number
of sentences to extract, is increased, an embedding representation with dimensionality
reduction delivered the best results. In practice, we will have to make a decision
Table 2. Accuracy of the different variations of TextRank for the second test case to identify
claims. The value of k represents the number of sentences used to compute the accuracy
TextRank variation k=2 k=3 k=4
BOW + TFIDF 0.548 0.685 0.789
Embedding 0.652 0.775 0.858
Embedding + Hellinger 0.659 0.765 0.857
Embedding + PCA + Cosine 0.729 0.814 0.900
Embedding + PCA + Hellinger 0.631 0.723 0.821
Embedding + PCA + 2-Norm Diff 0.709 0.821 0.884
Embedding + PCA + 2-Norm Avg 0.746 0.844 0.904
Embedding + PCA + 2-Norm Diff & Avg 0.735 0.840 0.908
regarding the number of sentences the algorithm should return. This aspect of the
algorithm remains as a parameter that practitioners have to set empirically. We
observed that the approach shows potential to solve the claim detection problem in the
medical domain. However, more work needs to be done to improve the quality of the
results. In particular, for Digital Libraries where high quality is essential, we consider
that the current accuracy should be improved. One particular way to improve the
approach that we are currently considering is the use of attention mechanisms such as
the one in [19]. With such an approach, the model of the sentences could be more
robust to different word orders and in turn might increase the quality of the results.
6 Conclusions
In this work, we have introduced the novel problem of claim-based queries and argued
how digital libraries can be enabled to solve it. One of the key parts of our solution to
the problem, the automatic identification of claims in an unsupervised fashion, was
investigated and evaluated in detail in this paper. In particular, we studied the use of
TextRank, a graph-based algorithm, for the novel task of extracting claims from medical
scientific articles. We performed a series of experiments in which we incorporated
representations of sentences based on word embeddings using word2vec with different similarity
metrics, with and without dimensionality reduction using PCA. The representation of
sentences using PCA turned out to provide the best results in our evaluation, with an
accuracy of over 70%. We evaluated our approach on a crawled corpus from PubMed and
used all available manually assigned metadata as ground truth.
Although our results look encouraging for focused indexing of the claims found in
a digital collection, in future work we need to further improve the unsupervised
detection of claims. In particular, we would like to incorporate word order in the model
that represents the sentences. Moreover, towards our goal of enabling digital libraries to
answer claim-based queries, we would like to study the impact of claim indexing to
investigate the features that can help to rank documents given a claim-based query.
References
1. Petasis, G., Karkaletsis, V.: Identifying argument components through TextRank. In: ACL,
pp. 94–102 (2016)
2. González Pinto, J.M., Balke, W.-T.: Can plausibility help to support high quality content in
digital libraries? In: TPDL 2017 – 21st International Conference on Theory and Practice of
Digital Libraries, Thessaloniki, Greece (2017)
3. Mihalcea, R., Tarau, P.: TextRank: bringing order into texts. In: Proc. EMNLP, vol. 85,
pp. 404–411 (2004)
4. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and
phrases and their compositionality. In: NIPS, pp. 1–9 (2013)
5. Mikolov, T., Corrado, G., Chen, K., Dean, J.: Efficient estimation of word representations in
vector space. In: Proceedings of the International Conference on Learning Representations
(ICLR 2013), pp. 1–12 (2013)
6. Collobert, R., Weston, J.: A unified architecture for natural language processing. In:
Proceedings of the 25th International Conference on Machine Learning - ICML 2008,
pp. 160–167 (2008)
7. Lev, G., Klein, B., Wolf, L.: In defense of word embedding for generic text representation.
In: Biemann, C., Handschuh, S., Freitas, A., Meziane, F., Métais, E. (eds.) NLDB 2015.
LNCS, vol. 9103, pp. 35–50. Springer, Cham (2015). doi:10.1007/978-3-319-19581-0_3
8. Teufel, S.: Argumentative Zoning: Information Extraction from Scientific Text (1999)
9. Stab, C., Gurevych, I.: Identifying argumentative discourse structures in persuasive essays.
In: Proceedings of 2014 Conference on Empirical Methods on Natural Language Processing,
pp. 46–56 (2014)
10. Stab, C., Kirschner, C., Eckle-Kohler, J., Gurevych, I.: Argumentation mining in persuasive
essays and scientific articles from the discourse structure perspective. In: CEUR Workshop
Proceedings (2014)
11. Stab, C., Gurevych, I.: Annotating argument components and relations in persuasive essays.
In: Proceedings of COLING 2014, 25th International Conference on Computational
Linguistics: Technical Papers, pp. 1501–1510 (2014)
12. Levy, R., Bilu, Y., Hershcovich, D., Aharoni, E., Slonim, N.: Context dependent claim
detection. In: International Conference on Computational Linguistics, pp. 1489–1500 (2014)
13. Lippi, M., Torroni, P.: Context-independent claim detection for argument mining. In: IJCAI
International Joint Conference on Artificial Intelligence, pp. 185–191 (2015)
14. Carstens, L., Toni, F.: Towards relation based argumentation mining. In: Proceedings of the
2nd Workshop on Argumentation Mining, pp. 29–34 (2015)
15. Habernal, I., Eckle-Kohler, J., Gurevych, I.: Argumentation mining on the web from
information seeking perspective. In: Proceedings of the Workshop on Frontiers and
Connections between Argumentation Theory and Natural Language Processing, pp. 26–39
(2014)
16. Rinott, R., Dankin, L., Alzate, C., Khapra, M.M., Aharoni, E., Slonim, N.: Show me your
evidence – an automatic method for context dependent evidence detection. In: EMNLP,
pp. 440–450 (2015)
17. Ciccarese, P., Wu, E., Wong, G., Ocana, M., Kinoshita, J., Ruttenberg, A., Clark, T.:
The SWAN biomedical discourse ontology. J. Biomed. Inform. 41, 739–751 (2008)
18. Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge
University Press, Cambridge (2014)
19. Li, J., Luong, M.-T., Jurafsky, D.: A Hierarchical Neural Autoencoder for Paragraphs and
Documents, pp. 1106–1115 (2015)
Application of k-Step Random Walk Paths
to Graph Kernel for Automatic Patent
Classification
1 Introduction
Research and development activities undertaken by industries, research institutes,
and universities often produce patent applications, which serve as an institutional
performance measure. Before a patent can be granted, the application must go
through a process called classification: a patent document must be assigned to a
particular category according to its field and content. This process is done
manually by the applicant and the examiner, who examine and analyze which category
is appropriate for each patent. Manual classification is a major challenge
because categorizing the vast quantity of granted patents and patent applications
in patent offices is time-consuming and labor-intensive. Automated classification
of patent applications into a particular patent classification system
is still a challenge for many practical applications [1].
Computing similarities between structured objects is interesting, and graphs
offer a natural way to represent structured objects. Citation between patent
© Springer International Publishing AG 2017
S. Choemprayong et al. (Eds.): ICADL 2017, LNCS 10647, pp. 14–29, 2017.
https://doi.org/10.1007/978-3-319-70232-2_2
documents refers to the link between the citing and the cited document. A graph
can represent a patent citation network in which nodes correspond to patent numbers
and edges correspond to citations. In a patent citation network, we can classify
a patent document into a classification scheme by comparing how alike the
patent citation graphs are, that is, by computing their similarities.
Patents are linked with other patents through citations. A citation graph
therefore carries much information about the connections among patents, and these
links reflect relationships among the contents of patent documents. A patent
citation graph can be extracted and used in the classification task. Citation data
have been shown to be useful in other research on classifying linked documents [2].
Most current research treats the assignment of a patent document to a unique class
as a text categorization problem [3–5]. These works exploited content features
from the whole patent document text. Several works used kernel-based methods
to classify patent documents and took advantage of the patent citation graph [6,7].
These works introduced feature-based and citation-based approaches that exploit
citation links to improve patent classification performance.
Citation-based features have not been fully explored for this problem. A few
previous studies used kernel-based methods to capture the structure of patent
citation networks [6,7] by using a k-step random walk paths algorithm with a single
step. Most previous studies applied kernel-based methods to patent datasets on a
specific topic, such as nanotechnology patents [6]. Better results
were obtained by combining several approaches and features [4,7,8].
In this paper, we apply the k-step random walk paths algorithm to calculate
kernel values for each pair of patents and use an SVM classifier for the
classification task. We used four standard evaluation metrics, namely accuracy,
precision, recall, and F-measure, to evaluate the performance of the SVM classifiers.
The idea of using k-step random walk paths was inspired by patent citation
network-based approaches to automatic patent classification [6,7]. The method is
based on the property of neighborhood in a graph. The main contributions of our
work are:
– providing a simple approach to automatic patent classification that exploits a
patent citation network based method, and
– applying a technique for subgraphing a large patent graph to represent citation
graph information.
The remainder of this paper is organized as follows. Section 2 describes related
work needed for the discussion in this paper. Section 3 explains the experiments we
conducted to investigate k-step random walk paths in patent citation graphs and
reports how we obtained our empirical results. Section 4 presents results and
discussion, and Sect. 5 concludes this paper.
2 Related Work
2.1 Automatic Patent Classification
Many of the patent analysis studies reviewed by Abbas et al. [1], including those on
automatic patent classification, used datasets issued by the US Patent Office. For example,
16 B. Nugroho and M. Aritsugi
Hall et al. [2] reported a database of citations among U.S. patents that is widely
accessible for research and development activities. The dataset consists of about
3 million patents and 32 million citations in the form of an edge list with weights,
stored as a data frame. We used this dataset for our experiment.
Zhang et al. [9] surveyed technological directions in patent mining. Their study
summarized investigations of multiple research questions related to
patent documents, including patent retrieval, patent classification, and patent
visualization. Most patent analysis tasks consider Patent Retrieval (PR)
as their foundation. Shalaby and Zadrozny [10] presented an extensive overview of
PR methods and approaches. The overview covered the issue of transferring recent
successes and maturity in information retrieval applications to PR. Performance
in automatic PR is essential for interactive search tools that provide cognitive
assistance to patent professionals with minimal effort. Tasks related to PR include
patent valuation, litigation, and licensing; the overview also highlighted potential
opportunities and broad directions for computational scientists. Text mining
(i.e., extracting keywords from patent documents) has recently been applied to
patent analysis. Noh et al. [11] examined keyword selection strategies for
applying text mining to patent data. The strategies involved four factors, i.e.,
the element of a patent document, the selection method, the number of keywords,
and the data format transformation. The four factors were evaluated and compared
using k-means clustering and entropy values in experiments based on
an orthogonal array of the four factors.
Kumar et al. [3] reported content classification of patent applications using the
Probabilistic Latent Semantic Analysis (PLSA) technique. PLSA was used to build
an indexer that marked documents and generated a fitted model for automatic
document indexing. To compact the term-document matrix and the co-occurrence
matrix, they used a singular value decomposition model with some hidden
categories. To determine the hidden categories, they computed the probabilities
of extracted words appearing in a particular patent document and hidden class, as
well as the probability of a patent document containing a hidden class, and
then applied the expectation maximization algorithm to develop clusters. D’hondt
et al. [12] investigated how different representations of patent documents can
improve patent classification. Using the Linguistic Classification System (LCS),
they carried out an extensive analysis of the class models created by the
classifiers to examine which types of phrases are most informative for patent
classification. The LCS was developed for comparing different text representations.
Three classifier algorithms are usable in it, including Naive Bayes and Balanced
Winnow. In our paper, instead of using content (Abstract, Claims, and Description)
obtained by text extraction from the patent document, we exploit one feature
embedded in the citation graph to do the classification task.
Nguyen et al. [4] proposed a method for Graph-Embedded-Tree-based ontol-
ogy construction. The method promoted domain knowledge from a codifica-
tion in the patent classification process. The ontology consists of four types of
concept, namely Class, Document, Phrase, and Term that define their seman-
tic information to give the classifier better analysis capability whenever the
Li et al. [6] optimized patent citation networks as a classification tool using graph
kernels. The kernel functions were constructed based on the document features
and citations. Kernel matrices of citation information were intended to capture
the citation information effectively with two conditions: (1) the scope of the
cited documents, and (2) mentioned document features. By considering these
two conditions, four different kernels were introduced to capture patent citation
information, i.e. bibliographic coupling kernel, labeled co-reference kernel, graph
overlap kernel and labeled citation graph kernel. They also introduced a linear
text kernel matrix that used text from patent abstract to represent the entire
patent content and captured that information.
In the citation network-classification approach, each patent document has
a citation network with cited vertex designated by its class. Calculating simi-
larities between citation networks and those of other patents already classified
into USPC categories leads to identifying a patent’s class. The similarity of two
patent citation graphs is calculated by comparing their random walk paths. This
approach employs a three-stage, kernel-based technique for patent classification:
data acquisition and parsing, kernel construction, and classifier training. SVM
was used as the kernel machine. The kernel value is calculated in the following
equation:
$$K(G_1, G_2) = \sum_{h} \sum_{h'} k(h, h')\, P(h \mid G_1)\, P(h' \mid G_2)$$
where $G_1$ and $G_2$ symbolize the patent citation graphs associated with two
patents, $h$ and $h'$ are the random walk paths in the respective graphs, and
$P(h \mid G_1)$ and $P(h' \mid G_2)$ denote the probabilities of the random walk
paths that exist in the citation networks. If $h$ and $h'$ are identical,
$k(h, h') = 1$; otherwise, $k(h, h') = 0$.
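The kernel above can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the text: each graph is given as an adjacency dictionary plus a vertex-to-class-label dictionary, a path h is the sequence of class labels visited by a k-step walk, the walk starts at a uniformly chosen vertex, and each step moves to a uniformly chosen neighbor. The function names `path_distribution` and `random_walk_kernel` are hypothetical.

```python
from collections import defaultdict


def path_distribution(adj, labels, k):
    """Return P(h | G) for every label path h of a k-step random walk.

    Assumptions (not from the paper): uniform start vertex, uniform choice
    among neighbors; walks that hit a vertex without neighbors are dropped.
    """
    n = len(adj)
    # state: (current vertex, label path so far) -> probability
    states = {(v, (labels[v],)): 1.0 / n for v in adj}
    for _ in range(k):
        nxt = defaultdict(float)
        for (v, path), p in states.items():
            for u in adj[v]:
                nxt[(u, path + (labels[u],))] += p / len(adj[v])
        states = nxt
    dist = defaultdict(float)
    for (_, path), p in states.items():
        dist[path] += p
    return dict(dist)


def random_walk_kernel(g1, g2, k):
    """Sum P(h|G1) * P(h'|G2) over identical paths, i.e. k(h, h') = 1 iff h == h'."""
    p1 = path_distribution(*g1, k)
    p2 = path_distribution(*g2, k)
    return sum(p * p2[h] for h, p in p1.items() if h in p2)
```

Because k(h, h') is 1 only for identical paths, the double sum reduces to a single sum over the paths that occur in both graphs, which is what the last line computes.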
The kernel matrix is used by the SVM classifier to generate a classification model.
The kernel matrix is an augmented matrix of the patent similarity vectors of all
patents in the training set and their class labels. The label is 1 if the patent
belongs to the given class; otherwise, it is −1.
This labeling is the so-called one-against-rest model for the SVM. To handle
multiclass classification with m classes (m > 2), m(m − 1)/2 binary
classifiers are trained, and the chosen class is defined by a voting scheme. For each
particular class, a well-trained SVM model is used to predict whether a query patent
belongs to the class. The final predicted class is then determined by applying a
“winner-takes-all” strategy to the SVM models of all the classes. In our study,
we use the same strategy to predict patent classes.
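As a small illustration of the labeling and the winner-takes-all step, the following sketch builds the +1/−1 labels for one class and picks the class whose binary model returns the largest decision value. The function names and the decision values in the usage note are hypothetical; the SVM training itself is not shown.

```python
def one_vs_rest_labels(classes, target):
    """Label each instance +1 if it belongs to `target`, otherwise -1."""
    return [1 if c == target else -1 for c in classes]


def winner_takes_all(scores):
    """Return the class whose binary SVM gives the largest decision value.

    `scores` maps a class name to the decision value that the SVM trained
    for that class assigns to a single query patent.
    """
    return max(scores, key=scores.get)
```

For example, `winner_takes_all({"26": -0.3, "27": 0.8, "54": 0.1})` selects class `"27"`.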
Liu and Shih [7] proposed a hybrid patent classification approach that combines
a patent network based classification method with three conventional
classification methods. The approach aimed to analyze query patents and
predict their classes. The patent graphs were built from relationship metrics
among patent documents extracted from the patent metadata. A classification
method with a modified k-nearest neighbor classifier analyzed all reachable
vertices in the patent graph and calculated their relevance to the query
patent to predict the query patent’s class. The approach merges content-based,
citation-based, and metadata-based classification methods into a hybrid
classification method. In this paper, we limit our method to citation
network based patent classification.
3 Application to Study
We did experiments to investigate the application of k-step random walk paths
algorithm to classify patent documents by exploiting patent citation graphs.
The idea of using k-step random walk paths was inspired by patent citation
network-based approaches to automatic patent classification [6,7]. We trained
the classifier using the kernel matrix of the data instances in the training dataset.
In this study, we used an SVM as the classifier because of its proven performance in
previous studies [6,7,14,15].
We employed a subgraph technique based on the property of neighborhood in
a graph. The neighborhood of order n of a vertex v includes all vertices
whose distance from v is at most n. For example, order 0 is always v itself,
order 1 is v plus its immediate neighbors, order 2 is order 1 plus the immediate
neighbors of the vertices in order 1, etc. [16].
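The neighborhood definition can be sketched as a breadth-first search that stops expanding once the requested order is reached. This is an illustration only, not the igraph implementation; it assumes an undirected adjacency dictionary (citation direction ignored), and the function name `neighborhood` is chosen here for convenience.

```python
from collections import deque


def neighborhood(adj, v, order):
    """Return the set of vertices within `order` steps of v (v included)."""
    seen = {v: 0}  # vertex -> distance from v
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if seen[u] == order:
            continue  # do not expand beyond the requested order
        for w in adj.get(u, ()):
            if w not in seen:
                seen[w] = seen[u] + 1
                queue.append(w)
    return set(seen)
```

On the path graph 1–2–3–4, the neighborhood of vertex 1 grows from {1} at order 0 to {1, 2} at order 1 and {1, 2, 3} at order 2.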
For our purposes, a similarity measure is a function that associates a numeric
value with a pair of patent citation graphs, such that a higher
value indicates a closer likeness between the graphs. There is a positive relation
between a kernel matrix and a distance-based similarity matrix. We use a general
framework of algorithms adapted from [17], as described below.
Training Algorithm
Testing Algorithm
were deleted from the dataset. We created subgraphs based on the number of
patent for each class.
The patent documents represented in kernel matrices were divided into two
sets by random sampling in each iteration: (a) a training set (80% of the
collected dataset) containing the patent documents whose classes were known
and (b) a test set (20% of the collected dataset) containing patent documents
whose classes were to be determined. The summary of each dataset is described
in Table 1.
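A random 80/20 split of this kind can be sketched as follows. The function name `split_indices` and the `seed` parameter are assumptions made for the illustration; the text does not describe how the sampling was implemented.

```python
import random


def split_indices(n, train_fraction=0.8, seed=None):
    """Randomly split range(n) into training and test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(n * train_fraction))
    return idx[:cut], idx[cut:]
```

With a precomputed kernel matrix, the training indices would select both rows and columns for training, while the test rows are paired with the training columns at prediction time.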
Experiment Steps
Performance Metrics
We used four standard evaluation metrics, i.e., accuracy, precision, recall, and
F-measure to evaluate the performance of the classifiers. The metrics have been
widely used in information retrieval and machine learning studies. The evaluation
metrics equations are as follows:
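$$\mathrm{accuracy} = \frac{\sum \mathrm{diag}}{N}$$
$$\mathrm{precision} = \frac{\mathrm{diag}}{\mathrm{colsums}}$$
$$\mathrm{recall} = \frac{\mathrm{diag}}{\mathrm{rowsums}}$$
$$F_1 = 2 \times \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
where N is the number of instances, diag is the number of correctly classified
instances per class (summed over all classes for accuracy), rowsums is the number
of instances per class, and colsums is the number of predictions per class.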
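The four metrics can be computed directly from a confusion matrix, as in the sketch below. It assumes that rows hold the actual classes and columns the predicted classes, which matches the definitions of rowsums and colsums; returning 0.0 for a class with no instances or no predictions is a choice made here, not something the text specifies.

```python
def per_class_metrics(cm):
    """Compute accuracy and per-class precision, recall, and F1.

    cm[i][j] = number of instances of actual class i predicted as class j.
    """
    n = sum(sum(row) for row in cm)
    diag = [cm[i][i] for i in range(len(cm))]
    rowsums = [sum(row) for row in cm]
    colsums = [sum(col) for col in zip(*cm)]
    accuracy = sum(diag) / n
    # Guard against empty rows or columns (assumed convention: score 0.0).
    precision = [d / c if c else 0.0 for d, c in zip(diag, colsums)]
    recall = [d / r if r else 0.0 for d, r in zip(diag, rowsums)]
    f1 = [2 * p * r / (p + r) if p + r else 0.0
          for p, r in zip(precision, recall)]
    return accuracy, precision, recall, f1
```

For the two-class matrix `[[8, 2], [1, 9]]`, accuracy is 17/20 and the recall of the first class is 8/10.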
Table 2 shows an overview of our citation graphs. From these statistics we can
observe that the maximum number of steps required to cross the graph is eight for
g1, nine for g5, and ten for g7, which would seem to indicate graphs
without a lot of clustering. The Avg. Path Length values are fairly low
relative to the total number of nodes: on average, it takes close to two steps
(2.136 for g1, 2.399 for g5, and 2.348 for g7) to reach any other node
in the graphs. We would then anticipate a lower average path length, as a higher
proportion of members would have first degree connections.
Avg. Clustering Coefficient is a measure that determines the percentage of
available triplets that are fully closed. From Table 2, we observe that the mea-
sures are 0.936 for g1, 0.902 for g5 and 0.923 for g7. In the case of our patent
Statistic                     g1      g5      g7
Average Degree                20.63   2.542   2.324
Network Diameter              8       9       10
Modularity                    0.936   0.902   0.923
Connected Component           124     129     431
Avg. Clustering Coefficient   0.093   0.08    0.092
Avg. Path Length              2.136   2.399   2.348
citation graph, almost exactly 90% of the triplets are closed; in the
remaining 10%, two of the three edges are connected but the third edge
is missing. The modularity statistics split these graphs into ten distinct clusters.
This might be satisfactory for our purpose; if we need further splits, we can
employ one of the dedicated clustering algorithms. A quick way to determine whether
the number of nodes in each cluster is adequate is to color the graph by
modularity class. Figure 1 illustrates the complete graph of g7.
For calculating the kernel value, we added the class name (USPC) of each patent
as a vertex attribute. We generated a list of igraph objects by splitting
the graph into one subgraph per vertex with neighborhood order n = 3, as shown in
Figs. 2 and 3. From Fig. 2, we can observe that the g1 dataset is clustered into
nine clusters. The clusters indicate the number of patent classes in this dataset.
As confirmed in Table 1, we have nine classes in g1 with the described USPC codes¹.
Although a few nodes fall into clusters of a different color, the patent citation
graph is a meaningful basis for the classification task.
¹ http://www.ibiblio.org/patents/classes.html
⎢ ⎥
234 ⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎥
⎢ ⎥
245 ⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎥
26
⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 82 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎥
⎢ ⎥
260
⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎥
⎢ ⎥
27
⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 76 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎥
⎢ ⎥
289 ⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎥
⎢ ⎥
291 ⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 ⎥
⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎥
295 ⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎥
⎢ ⎥
298 ⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎥
⎢ 0 87 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎥
300 ⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 35 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎥
⎢ ⎥
334 ⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎥
⎢ ⎥
412 ⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 89 0 0 0 27 0 0 0 0 0 0 0 0 ⎥
⎢ ⎥
413 ⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 50 0 0 0 1 0 0 0 0 0 0 0 ⎥
449
⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 28 0 0 0 0 0 0 0 0 0 0 ⎥
⎢ ⎥
453
⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎥
⎢ ⎥
462 ⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎥
⎢ 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 ⎥
470 ⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 43 0 0 0 0 0 0 0 ⎥
⎢ ⎥
476 ⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎥
⎢ 0 0 0 0 0 0 0 0 0 0 87 0 0 0 0 0 0 ⎥
527 ⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 39 0 0 0 0 0 ⎥
⎢ ⎥
54 ⎢
⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 81 0 0 0 0 ⎥⎥
69 ⎢ 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎥
⎢ ⎥
79 ⎢ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 ⎥
⎣ ⎦
86 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 64 0
87 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 46
5 Conclusions
References
1. Abbas, A., Zhang, L., Khan, S.U.: A literature review on the state-of-the-art in
patent analysis. World Pat. Inf. 37, 3–13 (2014)
2. Hall, B.H., Jaffe, A.B., Trajtenberg, M.: The NBER Patent Citations Data File:
Lessons, Insight and Methodological Tools (2001)
3. Kumar, R., Math, S., Tripathi, R.C., Tiwari, M.D.: Patent classification of the
new invention using PLSA. In: Proceedings of the First International Conference
on Intelligent Interactive Technologies and Multimedia, pp. 222–225 (2010)
4. Nguyen, H.M., Phan, C.P., Nguyen, H.Q.: GeTCo: an ontology-based approach for
patent classification search. In: iiWAS 2016 Proceedings of the 18th International
Conference on Information Integration and Web-based Applications and Services,
pp. 241–244 (2016)
5. Shih, M.J., Liu, D.R.: Patent classification using ontology-based patent network
analysis. In: Proceedings Pacific Asia Conference on Information Systems PACIS
2010, Taipei, pp. 962–972 (2010)
6. Li, X., Chen, H., Zhang, Z., Li, J.: Automatic patent classification using citation
network information: an experimental study in nanotechnology. In: Proceedings of
the 7th ACM/IEEE Joint Conference on Digital Libraries - Building & Sustaining
the Digital Environment, pp. 419–427 (2007)
7. Liu, D.R., Shih, M.J.: Hybrid-patent classification based on patent-network analy-
sis. J. Am. Soc. Inform. Sci. Technol. 62(2), 246–256 (2011)
8. Stutzki, J., Schubert, M.: Geodata supported classification of patent applications.
In: GeoRich 2016 Proceedings of the Third International ACM SIGMOD Workshop
on Managing and Mining Enriched Geo-Spatial Data, pp. 4:1–4:6 (2016)
9. Zhang, L., Li, L., Li, T.: Patent mining: a survey. ACM SIGKDD Explor. Newsl.
16(2), 1–19 (2015)
10. Shalaby, W., Zadrozny, W.: Patent retrieval: a literature review. arXiv preprint,
January 2017
11. Noh, H., Jo, Y., Lee, S.: Keyword selection and processing strategy for applying
text mining to patent analysis. Expert Syst. Appl. 42(9), 4348–4360 (2015)
12. D’hondt, E., Verberne, S., Koster, C., Boves, L.: Text representations for patent
classification. Comput. Linguist. 39(3), 755–775 (2013)
13. Fall, C.J., Torcsvari, A., Benzineb, K., Karetka, G.: Automated categorization in
the international patent classification. ACM SIGIR Forum 37(1), 10–25 (2003)
14. Li, Y., Bontcheva, K.: Adapting support vector machines for F-term-based classi-
fication of patents. ACM Trans. Asian Lang. Inf. Process. 7(2), 1–19 (2008)
15. Wu, C.H., Ken, Y., Huang, T.: Patent classification system using a new hybrid
genetic algorithm support vector machine. Appl. Soft Comput. 10(4), 1164–1177
(2010)
16. Csardi, G., Nepusz, T.: The igraph software package for complex network research.
InterJ. Complex Syst. 1695, 1–9 (2006)
17. Diego, I.M.D., Muñoz, A., Moguerza, J.M.: Methods for the combination of kernel
matrices within a support vector framework. Mach. Learn. 78(1–2), 137–174 (2010)
18. Sugiyama, M.: graphkernels: Graph Kernels. R package version 1.2 (2017)
19. Karatzoglou, A., Smola, A., Hornik, K., Zeileis, A.: kernlab - an S4 package for
kernel methods in R. J. Stat. Softw. 11(9), 1–20 (2004)
Detecting Target Text Related to Algorithmic
Efficiency in Scholarly Big Data
Using Recurrent Convolutional
Neural Network Model
1 Introduction
digital archives on monthly and yearly basis [2]. Bhatia et al. [3] estimated that
approximately 900 algorithms have been published in major computer science con-
ferences during the years 2005-2009. This clearly shows that researchers are working
actively to propose new algorithmic solutions or to improve the existing ones. There is
always a possibility that new algorithms may help to improve the existing deployed
techniques. Therefore, it is understandable to assume that there is a need for the
researchers generally and the computer scientists specifically to always keep them-
selves informed about new algorithms and solutions related to their technologies.
The exponential growth of the academic research community and the resulting
published literature has made it difficult for a human to keep abreast of all the
related research, proposed algorithms, and their reported results on specific datasets.
Digital libraries such as Google Scholar, PloS One, and CiteSeerX have efficient search
capabilities that help users find relevant research literature. However, they have
intrinsic search limitations because they apply simple, traditional text matching to
user queries without fully understanding the context and semantics of the text. Recently,
a few research studies have investigated the possibility of building a
management system and a search engine for algorithms [4, 5]. However, such a
search mechanism merely matches search queries against textual metadata, such as
caption text and reference text. Generally, algorithmic solutions
are evaluated on particular datasets and have various computational costs and evaluation
results. An algorithmic technique with a lower computational cost and better evaluation
results is considered efficient.
To report experimental and evaluation results, authors use plain text
and sub-objects (such as figures and tables). The text written about the reported results
contains more details about the effectiveness of the deployed algorithm and provides a
context that helps to interpret it. The following is an example of text pertaining to the
performance of an algorithm, taken from one of the publications in our dataset: “…We
have evaluated the LDA-SVD multi-document summarization algorithm by considering
both cases of removing stop-words and not removing stop-words from the computed
and the model summaries. Table 2 tabulates the ROUGE-1 recall values and its 95%
confidence interval…”. Previously proposed text matching techniques, such as
bag-of-words or bag-of-n-grams, latent Dirichlet allocation [6], and mutual information,
completely fail to capture the semantics and word sequence of the text. These two
features, text semantics and word sequence, are essential for effectively extracting
the portion of the text where the performance of the related algorithm is discussed.
Other text matching techniques, such as high-order n-grams (5-gram, 6-gram) and tree
kernels, may help to capture text semantics and contextual information, but
they still fail to fully capture the sentence’s context, which may heavily
affect the classification accuracy.
In this paper, we propose a novel model for automatic detection of text from
scientific publications pertaining to the discussion of algorithms, in terms of their
effectiveness like precision, recall or f-measure etc. We tap into the advancement in
deep learning and create sentence representations using word embeddings. The rep-
resentation is fed into the Recurrent Convolutional Neural Network (RCNN) [7]
classification algorithm, allowing us to accurately find the ‘evaluation results related
text lines’ in full text documents. Finally, we evaluate our proposed method using a
32 I. Safder et al.
dataset of 258 manually annotated scholarly documents from the CiteSeerX repository.
After 100 training epochs, our model achieves 76% training accuracy, whereas we
report 77.65% f-measure and 76.35% accuracy on testing data.
2 Literature Review
The literature review has been categorized into two subsections; the first one discusses
the related work on information extraction in academic articles to enhance digital
repositories and search engine capabilities for important sub-objects (tables, figures,
algorithms) that are found in research articles. The second one is concerned with deep
learning based techniques which gave us the inspiration to employ such algorithms for
related target text extraction in research articles.
3.1 Data
The dataset consists of 258 scholarly articles selected from the CiteSeerX repository
[4]. In total, there are 37,000 text lines in our dataset. The data
was manually annotated by four human experts, who identified 2,331 text lines as target
lines; thus, only 6.3% of the lines contain target text that conveys information about
the efficiency of the corresponding algorithm.
3.2 Approach
Figure 1 gives the high-level architecture of our proposed system for target-text
extraction, named evaluation metrics detection (EMD). Our proposed system takes
scholarly documents in Portable Document Format (PDF) as input, since a large share of
the documents in digital libraries is in PDF format. In the first step, the PDF document
is converted into plain text using the PDFBox library (https://pdfbox.apache.org/). The
extracted text is passed to the document-segmentation module for section extraction,
inspired by Tuarob [22]. Further, we preprocess the extracted sections for cleaning
purposes. Finally, the text is input to the RCNN-based target-text-line classification system.
Our EMD method uses an RCNN to capture the text semantics for target text line
classification. It takes the preprocessed text of the related sections as a sequence of
words w1, w2, w3, …, wn and outputs the class of the text. Afterwards, the probability
function p(k|D, θ) is used to find the probability that a text line belongs to the class
of target lines or not. The following preprocessing steps are taken:
Standard Section Extraction. Generally, scholarly articles are organized into standard
sections (i.e., Abstract, Introduction, Background, Related Work, etc.). Sections
such as Methodology, Results, Experiments, and Abstract have a very high probability
of containing result-related discussion. Therefore, section extraction is a crucial
task for our EMD method. We employ a rule-based technique [22] for section
extraction. This technique helps us to keep only the related
sections (i.e., Methodology, Results, Experiments, Abstract, etc.) and discard the
non-important sections where the chance of finding target lines is minimal or close to
none (i.e., Introduction, Related Work, References, Acknowledgement, etc.). Further, text
cleaning is performed to remove headers and footers, the paper title, author
affiliations, etc. Lastly, the cleaned text of the related sections is given as input
to the RCNN model.
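The keep-or-discard idea can be sketched with a simple heading filter. This is not the rule-based technique of [22]; the keep and drop lists are taken from the section names mentioned above, while the heading pattern, the constants `KEEP`, `DROP`, and `HEADING`, and the function name are assumptions made for the illustration.

```python
import re

# Section names taken from the text; matching is by lowercase prefix.
KEEP = ("abstract", "methodology", "results", "experiments")
DROP = ("introduction", "related work", "references", "acknowledgement")
# Assumed heading shape: optional section number, then letters and spaces only.
HEADING = re.compile(r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?([A-Za-z][A-Za-z ]*?)\s*$")


def keep_relevant_sections(lines):
    """Return only the lines that fall under a kept section heading."""
    kept, keeping = [], False
    for line in lines:
        m = HEADING.match(line)
        title = m.group(1).lower() if m else ""
        if title.startswith(KEEP + DROP):
            # A known heading switches the filter on or off.
            keeping = title.startswith(KEEP)
            continue  # the heading itself is not a candidate target line
        if keeping:
            kept.append(line)
    return kept
```

A heuristic like this misfires on body lines that consist only of letters and happen to start with a section name, so a production rule set would need additional cues such as font size or numbering.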
The Recurrent Convolutional Neural Network Model. Figure 1 shows the detailed
architecture of our RCNN-based [7] approach. The representation of a word is the
combination of the word and its context. The bidirectional nature of the RCNN generates
the representation yi(wi) of each word wi, which captures the context and semantic
meaning of the word. To do so, it first calculates the vectors cl(wi) and cr(wi) (Eqs. 1
and 2, respectively), which contain the context of the words to the left and right of wi.
Here, W(l) and W(r) are matrices used to transform the context between the hidden
layers. The matrices W(sl) and W(sr) are used to combine the left and right word contexts
with the current word. Similarly, e(wi-1) is a real-valued word embedding vector of word
wi-1. The final representation of word wi in Eq. 3 is obtained by combining the left and
right contexts. The cl and cr vectors are computed by the model in forward and backward
passes. Afterwards, a linear transformation (Wx + b) with a tanh activation function is
applied to add nonlinearity (see Eq. 4).
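Equations 1 to 4 are not reproduced in this text. For orientation, the corresponding definitions in the RCNN model of Lai et al. [7] are given below, on the assumption that the present approach uses them unchanged. The symbols f (a nonlinear activation function) and x_i (the combined representation of word w_i) follow [7] and do not appear in the description above.

```latex
\begin{align}
c_l(w_i) &= f\big(W^{(l)} c_l(w_{i-1}) + W^{(sl)} e(w_{i-1})\big) \tag{1}\\
c_r(w_i) &= f\big(W^{(r)} c_r(w_{i+1}) + W^{(sr)} e(w_{i+1})\big) \tag{2}\\
x_i &= [\,c_l(w_i);\ e(w_i);\ c_r(w_i)\,] \tag{3}\\
y_i^{(2)} &= \tanh\big(W^{(2)} x_i + b^{(2)}\big) \tag{4}
\end{align}
```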
The next layer in the network is the max pooling layer (see Fig. 1), which is applied
on the $y_i^{(2)}$ vectors and represents the most important and significant features
for the text representation after evaluating each and every semantic factor. The pooling
layer is used to convert text of varying length into a fixed-length vector that
represents only the most significant information from the full text. Equation 5 shows
the max pooling layer, which is applied element-wise and picks the maximum element of
the $y_i^{(2)}$ vectors at each position. Lastly, a single fully connected (FC) hidden
layer as output layer with Softmax activation function is applied to compute the
probability (Eqs. 6 and 7).
$$y^{(3)} = \max_{i=1}^{n} y_i^{(2)} \qquad (5)$$
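As a small illustration of Eq. 5 and of the Softmax step, the following sketch pools made-up word vectors element-wise and normalizes two made-up class scores. The numbers, the function names, and the omission of the fully connected layer's weights are all simplifying assumptions, not details of the trained model.

```python
import math


def max_pool(word_vectors):
    """Element-wise maximum over a list of equal-length vectors (Eq. 5)."""
    return [max(column) for column in zip(*word_vectors)]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three words, each with a 4-dimensional latent vector (made-up numbers).
y2 = [
    [0.1, -0.5, 0.3, 0.0],
    [0.7, 0.2, -0.1, 0.4],
    [-0.2, 0.6, 0.2, 0.1],
]
y3 = max_pool(y2)            # fixed-length representation of the whole text
probs = softmax([1.0, 0.0])  # two hypothetical scores: target vs. non-target
```

Whatever the number of words, `y3` always has as many entries as one word vector, which is what allows a fixed-size output layer to follow.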
The experiments are run to detect target text lines (those that convey information about
the efficiency of the corresponding algorithm) using the RCNN. All experiments are
performed on an Nvidia Titan 750 GPU with 2 GB memory, running the Ubuntu operating
system. We use the Python Chainer library (https://chainer.org/) to implement the RCNN
model.
$$\theta = \{E, b^{(2)}, b^{(4)}, c_l(w_1), c_r(w_n), W^{(2)}, W^{(4)}, W^{(l)}, W^{(r)}, W^{(sl)}, W^{(sr)}\} \qquad (8)$$
$$\theta \mapsto \sum_{c \in D} \log p(\mathrm{class}_c \mid D, \theta) \qquad (9)$$
Our dataset suffers from the class imbalance problem, since target text constitutes a
very small portion of each document. This imbalance can adversely affect the
classification results. To deal with this problem, we incorporated the following two
balancing approaches: (a) Random Over-sampling (ROver): minority class instances are
randomly replicated until the positive and negative classes have equal numbers of
instances. (b) Random Under-sampling (RUnder): majority class instances are randomly
excluded until the positive and negative classes have equal numbers of instances.
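Both balancing strategies are simple enough to sketch with the Python standard library. The function names, the toy data, and the fixed seed below are illustrative assumptions; the sketch also assumes the minority class is strictly smaller than the majority class.

```python
import random


def random_oversample(minority, majority, rng):
    """Replicate random minority items until both classes have equal size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra, list(majority)


def random_undersample(minority, majority, rng):
    """Randomly drop majority items until both classes have equal size."""
    return list(minority), rng.sample(majority, len(minority))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
positives = ["target line %d" % i for i in range(3)]
negatives = ["other line %d" % i for i in range(10)]

over_pos, over_neg = random_oversample(positives, negatives, rng)
under_pos, under_neg = random_undersample(positives, negatives, rng)
```

Over-sampling keeps all data at the cost of duplicated positives, whereas under-sampling discards negatives; which trade-off is preferable depends on how much data is available.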
Fig. 2. Training accuracy of the RCNN-based EMD method over 100 epochs
After applying the balancing techniques, we get 4337 positive samples, i.e. the
target lines, and 4770 negative instances, i.e. text lines without target text and with
no discussion of the evaluation results of the algorithm. The data is split into 70% for
training and 30% for testing. Figure 2 shows the training accuracy of our EMD method
over 100 epochs to depict the behavior of our model during training. The y-axis shows
the training accuracy and the x-axis shows the epochs.
Note that the network hyperparameters are assigned as follows: hidden layer size
(H) is set to 1000, learning rate to 0.01, vocabulary size (V) to 3000, and the number
of training epochs to 100.
Table 1. Precision (Pr.), recall (Re.), F-measure (F1), and accuracy (Acc.) scores for
the RCNN and baseline methods.
Method  Model     Pr.   Re.   F1    Acc.
EMD     Baseline  0.42  0.08  0.14  0.69
EMD     RCNN      0.79  0.77  0.77  0.76
The results clearly show that contextual and semantic information helps our model
perform better than the traditional keyword-matching approach. Table 2 shows some
examples of lines of result-related text that our model classified correctly and
incorrectly. One limitation of our proposed technique is the use of the same embedding
vectors for English-language text and for numeric figures found in text lines, e.g. “…
Implemented technique achieve precision 60.5 and recall 50.4….”. Currently, the
proposed model can only understand the contextual meaning by looking at
English-language text. However, the numeric text may also contain useful information
regarding the performance of the respective algorithms.
5 Concluding Remarks
In this paper we have proposed the use of word embeddings and a recurrent convolutional
neural network model to discover and retrieve sentences in a document that convey
information about the effectiveness (such as precision, recall, and F-measure) of the
corresponding algorithm. This information could be used by algorithm searchers to
further narrow down the desired algorithms using ‘performance’ as a criterion. In the
future, we plan to employ natural language processing and machine learning techniques to
extract numeric representations of an algorithm’s performance. This information will
enable direct comparison between algorithms. Furthermore, we plan to investigate the
possibility of extracting other algorithm-specific metadata such as run-time complexity,
input, output, and compatible data structures. Note that the dataset and code to
reproduce the results can be accessed at the following URL:
https://github.com/slab-itu/rcnn_icadl_2017.
References
1. Khabsa, M., Giles, C.L.: The number of scholarly documents on the public web. PLoS ONE
9(5), e93949 (2014)
2. ArXiv stats. https://arxiv.org/stats/monthly_submissions. Accessed 17 July 2017
3. Bhatia, S., Tuarob, S., Mitra, P., Giles, C.L.: An algorithm search engine for software
developers. In: Proceedings of the 3rd International Workshop on Search-Driven Develop-
ment: Users, Infrastructure, Tools, and Evaluation, pp. 13–16. ACM (2011)
4. Tuarob, S., Bhatia, S., Mitra, P., Giles, C.L.: AlgorithmSeer: a system for extracting and
searching for algorithms in scholarly big data. IEEE Trans. Big Data 2(1), 3–17 (2016).
IEEE
5. Tuarob, S., Bhatia, S., Mitra, P., Giles, C.L.: Automatic detection of pseudocodes in
scholarly documents using machine learning. In: Document Analysis and Recognition
(ICDAR), pp. 738–742 (2013)
6. Hingmire, S., Chougule, S., Palshikar, G.K., Chakraborti, S.: Document classification by
topic labeling. In: Proceedings of the 36th International ACM SIGIR Conference on
Research and Development in Information Retrieval, pp. 877–880. ACM (2013)
7. Lai, S., Xu, L., Liu, K., Zhao, J.: Recurrent convolutional neural networks for text
classification. In: AAAI, vol. 333, pp. 2267–2273 (2015)
8. Coüasnon, B., Lemaitre, A.: Recognition of tables and forms. In: Doermann, D., Tombre, K.
(eds.) Handbook of Document Image Processing and Recognition, pp. 647–677. Springer,
London (2014). doi:10.1007/978-0-85729-859-1_20
9. Chen, S.Z., Cafarella, M.J., Adar, E.: Searching for statistical diagrams. Frontiers of
Engineering, National Academy of Engineering, pp. 69–78 (2011)
10. Kataria, S., Browuer, W., Mitra, P., Giles, C.L.: Automatic extraction of data points and text
blocks from 2-dimensional plots in digital documents. In: AAAI 2008, vol. 8, pp. 1169–1174
(2008)
11. Siegel, N., Horvitz, Z., Levin, R., Divvala, S., Farhadi, A.: FigureSeer: parsing result-figures
in research papers. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS,
vol. 9911, pp. 664–680. Springer, Cham (2016). doi:10.1007/978-3-319-46478-7_41
12. Fang, J., Gao, L., Bai, K., Qiu, R., Tao, X., Tang, Z.: A table detection method for multipage
Pdf documents via visual seperators and tabular structures. In: International Conference on
Document Analysis and Recognition (ICDAR), pp. 779–783. IEEE (2011)
13. Liu, Y., Bai, K., Mitra, P., Giles, C.L.: TableSeer: automatic table metadata extraction and
searching in digital libraries. In: Proceedings of the 7th ACM/IEEE-CS Joint Conference on
Digital Libraries, pp. 91–100. ACM (2007)
14. Mitra, P., Giles, C.L., Sun, B., Liu, Y., Jaiswal, A.R.: Scientific data and document
processing in chemxseer. In: AAAI Spring Symposium: Semantic Scientific Knowledge
Integration, pp. 51–56 (2008)
15. Khabsa, M., Treeratpituk, P., Giles, C.L.: AckSeer: a repository and search engine for
automatically extracted acknowledgments from digital libraries. In: Proceedings of the 12th
ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 185–194. ACM (2012)
16. Bhatia, S., Mitra, P.: Summarizing figures, tables, and algorithms in scientific publications to
augment search results. ACM Trans. Inf. Syst. (TOIS) 30(1), 3 (2012)
17. Mikolov, T., Yih, W.T., Zweig, G.: Linguistic regularities in continuous space word
representations. In: hlt-Naacl, vol. 13, pp. 746–751 (2013)
18. Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint
arXiv:1408.5882 (2014)
19. Tang, D., Qin, B., Liu, T.: Document modeling with gated recurrent neural network for
sentiment classification. In: EMNLP, pp. 1422–1432 (2015)
20. Socher, R., Huang, E.H., Pennington, J., Manning, C.D., Ng, A.Y.: Dynamic pooling and
unfolding recursive autoencoders for paraphrase detection. In: Advances in Neural
Information Processing Systems, pp. 801–809 (2011)
21. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural
language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011)
22. Tuarob, S., Mitra, P., Giles, C.L.: A hybrid approach to discover semantic hierarchical
sections in scholarly documents. In: Document Analysis and Recognition (ICDAR),
pp. 1081–1085. IEEE (2015)
Semantic Facettation in Pharmaceutical
Collections Using Deep Learning for Active
Substance Contextualization
1 Introduction
1 https://www.nlm.nih.gov/mesh/intro_trees.html.
2 https://www.whocc.no/atc_ddd_index/.
3 http://www.ahfsdruginformation.com/ahfs-pharmacologic-therapeutic-classification/.
4 https://www.drugbank.ca/.
publications, giving users easy access to hidden entity semantics for digital library
searches. Moreover, these facets can be automatically derived without expensive
manual curation.
The paper is organized as follows: Sect. 2 revisits related work. Section 3 details
our method for facettation of drugs, accompanied by an extensive evaluation against
curated classification systems in Sect. 4. We close with conclusions in Sect. 5.
2 Related Work
Capturing semantically meaningful similarities for scientific entities has long been an
active field of research. Today, the most recognized systems are still to a large degree
manually maintained to guarantee usage experience and to provide a reliable
foundation for value-adding services and research planning. While the current explosion
of scientific results clearly calls for automation, the quality of resources cannot be
compromised, i.e. a high degree of precision has to be maintained. The most prominent
classification systems (later used as ground truth) for pharmaceutical uses are:
• The Anatomical Therapeutic Chemical (ATC) Classification System. ATC subdivides
drugs according to their therapeutic uses and chemical features. Maintained
by the World Health Organization (WHO), it is currently the most used drug
classification system and serves as an important source for tasks such as drug
repurposing and drug therapy composition [7].
• The Medical Subject Headings (MeSH). MeSH is a controlled vocabulary and
serves as general classification system for biomedical documents in MEDLINE
maintained by the National Library of Medicine (NLM). MeSH descriptors are
organized in 16 main categories, e.g. category C for diseases and D for drugs,
further divided in finer levels (subgroups) leading to a hierarchical structure.
• The American Hospital Formulary Service (AHFS). AHFS distinguishes drugs
according to their pharmacologic and therapeutic effect with a focus on drug
therapies. Like ATC and MeSH, AHFS shows a hierarchical structure.
Manual drug annotation may yield superior quality, but it is also associated with high
costs. Therefore, in recent years many approaches to annotate drugs automatically
have been designed. In general, these approaches rely on a blend of machine learning,
information retrieval, and information extraction techniques. To annotate properties in
pharmaceutical texts reliably, a wide variety of methods has been devised. For instance,
[7] employs support vector machines to predict ATC class labels for as yet unclassified
drugs and shows that, given rich training sets, document-based classification can
actually outperform classifications performed on chemical structures only. For the same
task, [8] shows the power of text mining to create enriched drug fingerprints and, after
some manual curation, their subsequent benefit for retrieval. In [9] an approach for the
automatic annotation of biomedical documents with MeSH terms is presented. Different
classification systems are compared to reproduce manual MeSH annotations.
With classification accuracies of already around 80%, all of the above
document-based approaches show the benefits and general applicability of text mining
for entity metadata enrichment. Thus, a domain-specific contextualization of entities in
scientific digital libraries seems appealing. To find central topics in documents, two
major approaches have been used: latent semantic analysis (LSA [10]) performs singular
value decompositions over term-document matrices to get topics as linear combinations
of vocabulary terms. Latent Dirichlet Allocation (LDA [11]) sees documents
as mixtures of different topics, where each term’s generation is attributable to one of
the document’s topics. Since both models show problems in NLP tasks like polysemy
detection or syntactic parsing, word embeddings [12], which quantify and categorize
semantic similarities between linguistic items based on their distributional
properties in large samples of language data, have recently been proposed as a powerful
deep learning alternative. Therefore, in the following we will rely on word embeddings
as the state-of-the-art method for entity contextualization and, in particular, will use
the Word2vec Skip-Gram model implementation from the open source
Deep-Learning-for-Java5 library.
The basic idea of our approach is to create a new contextualized facet for entity-based
search in scientific digital libraries: in particular, a selection of entities closely
related to the search entity. For actually building contextualized facets, any
corpus of scientific documents can be used, but normally the selection of the document
base for subsequent embedding strictly reflects the type of entities under scrutiny. For
example, in the case of pharmaceutical entities such as active ingredients, the National
Library of Medicine’s PubMed collection would be a good candidate.
After the initial crawling step the following process can be roughly divided into
four sub-steps:
1. Preprocessing of crawled documents. After the relevant documents have been crawled,
classical IR-style text preprocessing is needed, i.e. stop-word removal and stemming.
The preprocessing mainly helps to reduce the vocabulary size, which leads to
improved performance as well as improved accuracy. Due to their low discriminating
power, all words occurring in more than 50% of the documents are removed.
These are primarily words used often in general texts, such as ‘the’ or ‘and’, as well
as terms used frequently within a domain (as expressed by the document base), e.g.,
‘experiment’, ‘molecule’, or ‘cell’ in biology. Stemming further reduces the
vocabulary size by unifying all inflections of terms. A variety of stemmers for
different applications is readily available.
2. Creating word embeddings for entity contextualization. Currently, word embeddings
[12] are the state-of-the-art deep learning technique to map terms into a
multi-dimensional space (usually about 200-400 dimensions are created), such that
terms sharing the same context are grouped more closely. According to the
distributional hypothesis, terms that frequently share the same context in larger
samples of language data in general also share similar semantics (i.e. have similar
meaning). In this sense, word embeddings group entities sharing the same context, and
thus collecting the nearest embeddings of some search entity leads to a group of
entities sharing similar semantics.
5 https://deeplearning4j.org/.
3. Filtering according to entity types. The computed word embeddings comprise at
this point a large portion of the corpus vocabulary. This means that for each vocabulary
term there is exactly one word vector representation as output of the previous
step. Each vector representation starts with the term followed by individual values
for each dimension. In contrast, classical facets only display information of the
same type, such as publication venues, (co-)authors, or related entities like genes or
enzymes. Thus, for the actual building of facets, only vector representations of
the same entity type are needed. Here, dictionaries are needed to sort through the
vocabulary for each type of entity separately. The dictionaries can either be directly
gained from domain ontologies, e.g. MeSH for illnesses, be identified by
named entity recognizers, e.g. the Open Source Chemistry Analysis Routines
(OSCAR, see [13]) for chemical entities, or be extracted from open collections
in the domain, like the DrugBank for drugs.
4. Clustering entity vector representations. The last step is preparing the actual
facettation of entities closely related to some search entity. To do this, we first
consolidate the individual document spaces of the filtered entities by multidimensional
scaling (reducing their dimensionality to about 100-150). This steep dimensionality
reduction removes noise and enables a meaningful subsequent clustering.
We then apply a k-means clustering technique on all representations and determine
good cluster sizes: in our approach, optimal cluster sizes are not decided by a fixed
threshold, but by an analysis of intra-cluster vs. inter-cluster similarity.
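Steps 1 to 3 can be illustrated with a toy sketch. It is not the implementation used here: stemming and the Word2vec training itself are omitted, the 2-dimensional vectors are made-up stand-ins for learned embeddings, and all function names are hypothetical.

```python
import math
from collections import Counter


def drop_frequent_terms(documents, max_doc_ratio=0.5):
    """Step 1: remove terms occurring in more than max_doc_ratio of the documents."""
    doc_freq = Counter(term for doc in documents for term in set(doc))
    limit = max_doc_ratio * len(documents)
    return [[t for t in doc if doc_freq[t] <= limit] for doc in documents]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_entities(query, embeddings, dictionary, top_n=2):
    """Steps 2-3: rank dictionary entities by cosine similarity to the query."""
    candidates = [t for t in embeddings if t in dictionary and t != query]
    candidates.sort(key=lambda t: cosine(embeddings[query], embeddings[t]), reverse=True)
    return candidates[:top_n]


docs = [["the", "drug", "acyclovir"], ["the", "cell", "drug"], ["the", "virus"]]
filtered = drop_frequent_terms(docs)

# Toy 2-dimensional vectors standing in for learned word embeddings.
vectors = {
    "acyclovir": [1.0, 0.1],
    "valacyclovir": [0.9, 0.2],
    "ibuprofen": [0.1, 1.0],
    "cell": [0.8, 0.1],  # not a drug, so the dictionary filters it out
}
drugs = {"acyclovir", "valacyclovir", "ibuprofen"}
neighbours = nearest_entities("acyclovir", vectors, drugs)
```

Note that ‘cell’ lies close to ‘acyclovir’ in the toy space but never appears in the result, because the entity dictionary restricts the facet to a single entity type.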
While the basic algorithm promises to be applicable for a wide variety of domains,
testing its effectiveness in creating high quality entity facets needs a domain specific
focus. The following section evaluates our approach in a pharmaceutical use case.
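The intra-cluster vs. inter-cluster comparison mentioned in step 4 is not spelled out in detail. One plausible reading, shown here purely as an assumption, is to compare the mean pairwise cosine similarity inside clusters with the mean similarity across clusters and to prefer a number of clusters for which the gap is large. The function names and toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_similarities(clusters):
    """Return (intra, inter): mean cosine similarity within and across clusters."""
    intra, inter = [], []
    labelled = [(label, vec) for label, vecs in enumerate(clusters) for vec in vecs]
    for (la, va), (lb, vb) in combinations(labelled, 2):
        (intra if la == lb else inter).append(cosine(va, vb))
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Two toy clusters of 2-dimensional entity vectors.
clusters = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
intra, inter = mean_similarities(clusters)
```

In a real setting this comparison would be repeated for several values of k on the k-means output; the sketch only evaluates one fixed clustering.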
For the evaluation, we will first describe our pharmaceutical text corpus and basic
experimental set-up decisions. Moreover, we perform a ground truth comparison and
show the meaningfulness of the facets automatically derived by our facettation method:
we compare results with the three established classification systems from Sect. 2.
6 https://www.ncbi.nlm.nih.gov/pubmed/.
pharmaceutical entity under consideration should be ‘high enough’ because with more
training data, contexts that are more accurate can be learned, yet the computational
complexity grows. Thus, we decided to use the 1000 most relevant abstracts for each
entity according to the relevance weighting of PubMed’s search engine.
Query Entities. As query entities for the evaluation, we randomly selected 275 drugs
from the DrugBank7 collection. We ensured that each selected drug featured at least
one class label in ATC, MeSH, or AHFS, and occurred in at least 1000 abstracts on
PubMed. Thus, our final document set for evaluation contained 275,000 abstracts. As
ground truth, all class labels were crawled from both DrugBank and the MeSH
thesaurus.8 For example, all retrieved classes for the drug ‘Acyclovir’ are shown in
Table 1. Since the hierarchical structure of all classification systems is too
fine-grained for our purposes, we remove all finer levels before assigning the
respective class label to each drug. For example, one of the ATC classes for the drug
‘Acyclovir’ is ‘D06BB53’. The first letter indicates the anatomical main group, where
‘D’ stands for ‘dermatologicals’. The next level consists of two digits, ‘06’,
expressing the therapeutic subgroup ‘antibiotics and chemotherapeutics for
dermatological use’. Each further level classifies the object even more precisely,
until the finest level usually uniquely identifies a drug.
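Removing the finer levels of an ATC code amounts to truncating the code string. A minimal sketch, assuming the usual ATC layout of one letter for the first level and two further digits for the second level (the helper name is hypothetical):

```python
def atc_level(code, level):
    """Truncate an ATC code to its first (1 char) or second (3 chars) level."""
    # level 1: anatomical main group, level 2: therapeutic subgroup
    lengths = {1: 1, 2: 3}
    return code[:lengths[level]]


first = atc_level("D06BB53", 1)   # anatomical main group
second = atc_level("D06BB53", 2)  # therapeutic subgroup
```

For the ‘Acyclovir’ example this yields ‘D’ on the first level and ‘D06’ on the second; MeSH tree numbers and AHFS classes would need their own truncation rules.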
7 https://www.drugbank.ca/.
8 https://meshb.nlm.nih.gov/search.
9 https://lucene.apache.org/.
10 https://deeplearning4j.org/word2vec.
11 http://algo.uni-konstanz.de/software/mdsj/.
12 http://commons.apache.org/proper/commons-math/.
and/or many small clusters will result in unsatisfactory usage experience in the
respective faceted interface.
• Semantic suitability: The selected entities per facet should be clearly justified by the
underlying document collection. Since there are different document-centered
approaches, a quantitative comparison regarding a ground truth is needed.
Semantic Accuracy of the Facettation: In our first experiment, we test the semantic
accuracy of our facettation, i.e. how well the entities in each cluster reflect a common
topic. Since this obviously depends on cluster sizes (smaller clusters inherently
show higher purity) and on the granularity of the topic (in the sense of semantic
distances), we vary both the number of clusters in the clustering procedure and the
granularity of the topics (first-level vs. second-level accuracy). As ground truth, we
use only the categories given by the three largest pharmaceutical classification systems
ATC, MeSH, and AHFS (see Sect. 2). Please note that this ground truth restriction is
overly strict on document-centered contextualization, since commonly understood
contexts reflected in literature might not be reflected by any of the three systems.
Thus, our experiments can be seen as a worst-case boundary for our approach.
First, we quantify the accuracy in terms of precision/recall and F-measures on the
top categorization level only. We use the standard method for clustering accuracy
described in [14]. Because facets should tend towards higher precision for improved
user experience, we report both F1- and F0.5-scores. We vary the number of clusters
(k) in our k-means clustering between 10 and 80. Since the randomly chosen query
entities might not be evenly distributed over the respective categories chosen as
majority labels, we compare our approach against a baseline of clusters in which items
have been randomly exchanged between clusters. If there were clearly dominant
categories, such a random baseline would show high accuracies.
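To make the two scores concrete, the sketch below uses the pair-counting formulation of clustering precision and recall, which is one of the evaluation methods covered in [14]. Whether exactly this variant was used here is an assumption, as are the single label per drug, the toy data, and the function names.

```python
from itertools import combinations


def pair_precision_recall(assignments):
    """Pair-counting precision and recall for a clustering.

    `assignments` is a list of (cluster_id, class_label) pairs, one per item.
    A pair of items is a true positive if both share cluster and class.
    """
    tp = fp = fn = 0
    for (c1, l1), (c2, l2) in combinations(assignments, 2):
        same_cluster, same_class = c1 == c2, l1 == l2
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
    return tp / (tp + fp), tp / (tp + fn)


def f_beta(precision, recall, beta):
    """Weighted harmonic mean; beta < 1 weights precision more than recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy clustering: (cluster id, hypothetical top-level class label) per drug.
items = [(0, "D"), (0, "D"), (1, "D"), (1, "D"), (2, "J"), (2, "J")]
p, r = pair_precision_recall(items)
f1 = f_beta(p, r, 1.0)
f05 = f_beta(p, r, 0.5)
```

In this toy case the clusters are pure but class ‘D’ is split over two clusters, so precision is perfect while recall suffers; the F0.5-score penalizes this less than the F1-score, which matches the stated preference for precision in facets.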
Figure 1 shows averaged results of 30 independent runs for each number of
clusters. As could be expected, precision steeply increases for higher numbers of
clusters (i.e. small cluster sizes), whereas recall decreases the more clusters are
built. However, the F-scores show a clear optimum at 25 clusters (F1-score) and 35
clusters (F0.5-score). Hence, our approach prefers smaller cluster sizes (on average
8-10 entities per facet), in stark contrast to the random baseline, which always prefers
the smallest number of clusters possible. Moreover, our approach’s F-scores constantly
outperform the baselines with 0.55 (F1-score) and 0.65 (F0.5-score), reaching precisions
beyond 80%. Thus, surprisingly, our generalist approach is even comparable in overall
accuracy to approaches specifically designed to predict ATC or MeSH classifications, as
reported in Sect. 2.
We repeated the above experiments for the second layer of granularity in the
classification systems and achieved quite similar results (graphs have been omitted for
space reasons), again clearly outperforming the baseline. Of course, with finer
granularity the relative size of clusters has to be expected to be much lower. However,
again measuring the F0.5-score, we achieved best results with a moderate 97 clusters at
an accuracy level of still 0.61. This is only 0.04 lower than at the first level of
granularity. For the F1-score, best results were achieved with 69 clusters at an
accuracy level of 0.55.
Fig. 1. Comparison of contextualized facettation (red) and random clustering (blue). (Color
figure online)
Fig. 2. Average cluster sizes on first level granularity for the majority label compared to ATC,
MeSH, and AHFS.
On the second level of granularity (see Fig. 3) the medians of the distributions are
noticeably lower, as was to be expected for a higher number of clusters (k = 97). Still,
our approach’s distribution again closely resembles the distributions of the respective
classification systems. Moreover, in contrast to MeSH and AHFS, our approach avoids
empty clusters and shows fewer outliers with large cluster sizes, quite similar to the
ATC classification system.
Looking at the provenance of majority cluster labels, we find that on top-level
granularity the majority labels chosen for each cluster on average come 60.3% from
ATC classes, 34.3% from MeSH tree classes, and 5.4% from AHFS classes. For second-level
granularity, we get 51.8% from ATC, 36.8% from MeSH, and 11.4% from AHFS.
Thus, our contextualization approach does indeed reflect different semantics as given
by the individual, manually created classification systems.
Semantic Suitability of the Facettation: In our last experiment, we compare the
clustering accuracy of our approach with the accuracy achieved by classical IR
techniques based on term frequencies. Hence, we computed a TF-IDF-weighted vector space
model on all pharmaceutical texts in our selected document corpus for the 275 query
entities, again followed by a k-means clustering step. We then compared the respective
accuracies of the two methods with respect to the three manual classification systems as
ground truth.
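For readers unfamiliar with the baseline, a TF-IDF weighting can be sketched as below. The exact TF-IDF variant used in the experiment is not stated, so the raw term frequency times log(N/df) formula, the toy documents, and the function name are assumptions.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        term_freq = Counter(doc)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


docs = [
    ["acyclovir", "herpes", "virus", "virus"],
    ["ibuprofen", "pain", "inflammation"],
    ["valacyclovir", "herpes", "virus"],
]
weights = tf_idf_vectors(docs)
```

Unlike word embeddings, these vectors have one dimension per vocabulary term and only reward exact term overlap, which is the main conceptual difference between the two methods being compared.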
In the clustering step for the top-level granularity, TF-IDF also shows its highest
accuracy values at 35 clusters and thus seems quite suitable for the task.
Fig. 3. Average cluster sizes on second level granularity for the majority label compared to
ATC, MeSH, and AHFS.
of granularity the different majority labels are moderately distributed, meaning that
no categorization type dominates the overall facettation. Thus, our contextualization
approach does reflect different semantics as given by the individual, manually curated
categorization systems. This in turn shows that a facet consists of a composition of
different categorization systems, in which the facet elements (active ingredients) share
similar semantics. In our pharmaceutical case, the facettation can be a suitable
alternative to manually curated categorization systems, which are expensive and in most
cases incomplete. Moreover, we also demonstrated that our facettation is balanced and
does not generate extreme distributions of cluster sizes, since small clusters (cluster
size < 3) as well as very large clusters are quite rare. Thus, it reflects a given
distribution with respect to the different categorization systems, and therefore facets
have a similar size compared to the categories of manually curated categorization
systems. Finally, we tested the semantic suitability of the facettation by comparing it
with classical IR techniques. Our approach outperformed the TF-IDF-weighted vector space
model by up to 30%. Therefore, our deep learning-based approach is a suitable
alternative to classic IR-style frequency-based approaches.
In addition to the statistical evaluation presented in this paper, we also questioned
domain experts for a first interpretation of our facettation. Surprisingly, they found
hidden semantics for some of the low-accuracy facets. This may indicate that our
facettation technique is able to discover hidden active ingredient contexts. A better
understanding of such hidden contexts would be interesting. However, labeling of
facets was not considered in this paper. Such a labeling would prove quite
useful for an interpretation of the individual facets and could lead to a better
understanding of our facettation.
References
1. Willett, P., Barnard, J.M., Downs, G.M.: Chemical similarity searching. J. Chem. Inf.
Comput. Sci. 38(6), 983–996 (1998)
2. Tönnies, S., Köhncke, B., Balke, W.T.: Taking chemistry to the task: personalized queries
for chemical digital libraries. In: Proceedings of the ACM/IEEE Joint Conference on Digital
Libraries (JCDL 2011), Ottawa, Canada (2011)
3. Wishart, D.S., Knox, C., Guo, A.C., Shrivastava, S., Hassanali, M., Stothard, P., Chang, Z.,
Woolsey, J.: DrugBank: a comprehensive resource for in silico drug discovery and
exploration. Nucleic Acids Res. 34(1), D668–D672 (2006). Database issue
4. Sacco, G.M., Tzitzikas, Y.: Dynamic Taxonomies and Faceted Search: Theory, Practice, and
Experience. Springer, Heidelberg (2009). doi:10.1007/978-3-642-02359-0
5. Köhncke, B., Balke, W.-T.: Context-sensitive ranking using cross-domain knowledge for
chemical digital libraries. In: Aalberg, T., Papatheodorou, C., Dobreva, M., Tsakonas, G.,
Farrugia, C.J. (eds.) TPDL 2013. LNCS, vol. 8092, pp. 285–296. Springer, Heidelberg
(2013). doi:10.1007/978-3-642-40501-3_29
6. Gonzalez Pinto, J.M., Balke, W.T.: Demystifying the semantics of relevant objects in
scholarly collections: a probabilistic approach. In: Proceedings of the ACM/IEEE Joint
Conference on Digital Libraries (JCDL), Knoxville, TN, USA (2015)
7. Gurulingappa, H., Kolárik, C., Hofmann-Apitius, M., Fluck, J.: Concept-based
semi-automatic classification of drugs. J. Chem. Inf. Model. 49(8), 1986–1992 (2009)
8. Dunkel, M., Günther, S., Ahmed, J., Wittig, B., Preissner, R.: SuperPred: drug classification
and target prediction. Nucleic Acids Res. 36(suppl 2), W55–W59 (2008)
9. Trieschnigg, D., Pezik, P., Lee, V., De Jong, F., Kraaij, W., Rebholz-Schuhmann, D.: MeSH
Up: effective MeSH text classification for improved document retrieval. Bioinformatics
25(11), 1412–1418 (2009). Oxford University Press
10. Dumais, S.T.: Latent semantic analysis. In: Annual Review of Information Science and
Technology (ARIST), Association for Information Science & Technology, vol. 38, no.
1 (2004)
11. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3(Jan),
993–1022 (2003). MIT Press
12. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of
words and phrases and their compositionality. In: Proceedings of the Annual Conference on
Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA (2013)
13. Jessop, D.M., Adams, S.E., Willighagen, E.L., Hawizy, L., Murray-Rust, P.: OSCAR4: a
flexible architecture for chemical text-mining. J. Cheminform. 3(1), 41 (2011). Springer
14. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval.
Cambridge University Press, Cambridge (2008)
15. Borg, I., Groenen, P.J.: Modern Multidimensional Scaling: Theory and Applications.
Springer, Heidelberg (2005). doi:10.1007/0-387-28981-X
Cultural Heritage and Indigenous
Knowledge
A Foundry of Human Activities
and Infrastructures
In [1] and related studies we explored indexing digitized historical newspapers. It was
difficult to index the articles for retrieval or, even, to unambiguously identify what text
should be treated as an article. Thus, we proposed the development of knowledge-rich
“community models” to improve retrieval. Many aspects of infrastructure associated
with everyday activities and infrastructure generally can be described with such
community models. Such models would cover both tangible and intangible cultural
heritage such as pottery, clothing, dance, and religious traditions.
This work is parallel to a proposal we have made for direct representation of
scientific research results [4]. However, there are additional challenges for descriptions
of culture and history because of the lack of consensus about the definitions for social
entities and because there are disagreements about the details of cultures and histories.
Nonetheless, as information scientists, we believe that it is useful to develop frame-
works for articulating and exploring the possibilities. Ultimately such frameworks
should support tools both for the public and for scholars.
1
In some cases, “object-oriented” simply means entity or object-based. We use “object-oriented” in
the stronger, programming-language sense of objects that include specific processes and procedures.
2
The descriptions of Thick Entities we envision are analogous to the descriptions of Model or
Reference Organisms. The latter often includes anatomies (i.e., partonomies) and, less often,
descriptions of related Procedures and Mechanisms.
3
A Mechanism describes how a Process is implemented. A Procedure is like a workflow with flow
control and decision points. There is no direct way for BFO to represent control statements such as
loops and conditionals needed for Procedures, although it is possible to represent control statements
with OWL on an ad hoc basis and to use those representations in combination with BFO. A pure
BFO modeling language should be developed that, like the C programming language, is
self-compiling.
The distinction in some object-oriented languages between “private methods” and “public
methods” can also be applied to Thick Entities. Private methods are those which interact internally
only with other Parts of a given Thick Entity whereas Public Methods support interaction with other
Thick Entities.
4
obofoundry.org, obofoundry.org/docs/OperationsCommittee.html.
infrastructure objects, their use and their interaction with other objects. The role of
entities in collections and records is important but secondary [6].
The contents of historical newspapers and other historical records for small towns often
describe entities and activities which are routine, even mundane. A partial list of such
entities and activities includes roads, farming, fishing, blacksmithing, weaving, coin
minting, pottery making, and bookbinding. Each of these is associated with specific
types of objects and procedures.
There are many levels for representing and modeling everyday human activities. At
a general level, we might describe infrastructures and technologies for supporting basic
human needs such as food and shelter. Such models could be increasingly refined as
they are applied to specific scenarios. While Aristotle focused on Universals as natural
entities, BFO has included human artifacts related to scientific research such as flasks.
We further extend the scope of Universals to include all types of human artifacts.
As noted above, we also propose using model-oriented Thick Entities for these
descriptions. Thick Entities would include Processes and Procedures. There is a
complex web of interactions in Processes and Procedures. For instance, farming pro-
cedures are affected by the availability of different metals for plows. Similarly, the
introduction of a train line may dramatically affect a community (cf., [1]).
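The idea of a Thick Entity that bundles Parts with Procedures can be sketched in code. The following is a minimal illustration only, not a proposed implementation: the class names `Plow` and `Farm`, the method names, and the rule that a wooden plow needs two passes are all assumptions made for the example. It shows a Procedure with a loop and a decision point, and the "private" versus "public" method distinction applied to a Thick Entity.

```python
from dataclasses import dataclass, field


@dataclass
class Plow:
    """A Part of the farm; its material affects how the Procedure runs."""
    material: str  # illustrative values: "wood" or "iron"

    def passes_needed(self) -> int:
        # Assumption for illustration: a non-iron plow needs two passes per field.
        return 1 if self.material == "iron" else 2


@dataclass
class Farm:
    """A hypothetical Thick Entity: Parts plus the Procedures that use them."""
    plow: Plow
    fields: list[str]
    log: list[str] = field(default_factory=list)

    def _till(self, field_name: str) -> None:
        # "Private" method: interacts only with Parts of this Thick Entity.
        for n in range(1, self.plow.passes_needed() + 1):
            self.log.append(f"till {field_name} (pass {n})")

    def prepare_fields(self) -> list[str]:
        # "Public" Procedure: a workflow with a loop and a decision point.
        for name in self.fields:
            if name.startswith("fallow"):
                self.log.append(f"skip {name}")
                continue
            self._till(name)
        return self.log
```

Swapping `Plow("wood")` for `Plow("iron")` changes the steps recorded by the same Procedure, which mirrors the point that farming procedures depend on the metals available for plows.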
The development of a large and internally consistent collection of infrastructure
entities will require a major effort that is in its early stages. Ontologies and other
controlled vocabularies have been developed for many entities and functions; for
instance, FGDC (fgdc.gov) provides descriptions for highways. Similarly, standard
descriptions for Mechanisms and Procedures such as from the Handbook of Synthetic
Organic Chemistry could also be included. Some aspects of Human Activities and
Infrastructures (such as farming or silkworm cultivation) could be linked in the OBO.
Ultimately the foundries should be unified.
While the previous section focused on material technologies and infrastructures, ulti-
mately, it will not be possible to separate the technologies and infrastructures from their
interaction with social activities. Social structures may be considered as entities in a
social ontology.5
5
Smith (Social Objects, http://ontology.buffalo.edu/socobj.htm) claims that social entities are entirely
consistent with the BFO framework. There has been significant work on social ontology by some of
the designers of the BFO framework but there does not yet seem to have been a concerted effort to
directly integrate that work into the BFO. Much of the discussion about social ontologies for BFO
has focused on commitments and obligations [21]. Other specific proposals have focused on con-
tracts, economics, and social aspects of medicine [14].
60 R.B. Allen et al.
There are many examples of the interaction of material infrastructure with social
entities. For example, textiles play an important role in traditional Thai society [8]; the
fabrics are integral to courtship, marriage, death, and a variety of Buddhist rituals.
A structured description of the materials and technologies would include aspects of
fabrics and weaving tools and techniques as well as their role in society.
In addition to tangible cultural heritage such as Thai silks, some cultural heritage
like dance and music can be both tangible and intangible. For instance, musical
instruments are Continuants whereas musical performances are Occurrents. Moreover,
music also has a social dimension. For example, descriptions of Korean music (gugak)
need to include social distinctions between different genres (e.g., folk music vs. court
music) [15].
In many contexts, the models of the social framework would generally be from the
perspective of the participants. For historical local newspapers [1], we would generally
follow the presentation of the newspaper editors in developing models of schools,
businesses, government, churches, and families. Of course, there are frequently alter-
native interpretations beyond the normative descriptions. Therefore, flexible frame-
works would need to be developed to present and contrast differing viewpoints.
We have described how material infrastructures are interrelated and dependent on each
other. Beyond those simple descriptions, we can explore claims about the relationships
among components of the material and social infrastructures at the level of both
Universals and Particulars. However, in many cases, the relationships are complex and
not susceptible to proof. For instance, culture can be described as a web of relationships
[11]. We need to develop a flexible framework for making claims and demonstrations
about possible relationships and mechanisms (cf., [9, 10]) as well as showing the
arguments and evidence for those claims.
To understand the relationships among Universal Entities in the physical world, we
turned to natural science [4]. For social entities, we could turn to social science. This is
reasonable since we accept the position that social entities are “real”. Moreover, to the
extent that social science makes causal predictions, we can use those predictions to
confirm the validity of Entities. For physical phenomena, this type of confirmation of
Entities is known as scientific warrant. Because there is more uncertainty about social
science models, we may express our lower level of confidence for social entities by
referring instead to consistency warrant.
In sociology, there are several grand theories, or major theoretical frameworks.
Social ontology is a central aspect of each of these theories because they propose
theoretical constructs and relationships among the constructs. Here, we focus on Par-
sons’s AGIL [17] which asserts that there are four essential attributes a society must
have to endure:
• Adaptive: This describes the need to adjust to the environment. Both shelter and
farming would be considered as part of the Adaptive dimension. It covers many of
the human needs identified by [16].
• Goal Oriented: This requires specification and accomplishment of social goals, and
would include regulations, laws, and politics.
• Integration: This describes cohesion of the social group such as through family,
religion, and language.
• Latency: A social group must renew its customs, knowledge, and values for the
next generation through education.
Parsons’s work is an application of Systems Theory to sociology (cf., [5]) and is
often described as structure-functionalist6. Following our analysis of functionality, we
propose that the Function of an Independent Continuant produces (or prevents) a State
Change in a specific Independent Continuant-Process pair [5].7 Thus, we might say:
• The Function of a ladle is to carry liquids.
• The Functions of Court music are to entertain and to impress guests. (Integration)
• The Function of certain types of physical structures (e.g. a house) is to shelter the
inhabitants. (Adaptive)
• The Function of the education subsystem is to transmit knowledge. (Latency)
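The Function statements above could be recorded as structured claims. The sketch below is one possible encoding under stated assumptions: the `FunctionClaim` fields, the `AGIL` enumeration, and the particular target and process strings chosen for each example are our own illustrative choices and are not defined by BFO or by Parsons.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AGIL(Enum):
    ADAPTIVE = "Adaptive"
    GOAL_ORIENTED = "Goal Oriented"
    INTEGRATION = "Integration"
    LATENCY = "Latency"


@dataclass(frozen=True)
class FunctionClaim:
    bearer: str                        # Independent Continuant that has the Function
    target: str                        # Independent Continuant whose State changes
    process: str                       # Process paired with the target
    prevents: bool = False             # True if the Function prevents the State Change
    dimension: Optional[AGIL] = None   # AGIL dimension, when one applies

    def describe(self) -> str:
        verb = "prevents" if self.prevents else "produces"
        return f"{self.bearer} {verb} a State Change in ({self.target}, {self.process})"


# Hypothetical encodings of the four examples in the list above.
CLAIMS = [
    FunctionClaim("ladle", "liquid", "carrying"),
    FunctionClaim("court music", "guests", "entertaining", dimension=AGIL.INTEGRATION),
    FunctionClaim("house", "inhabitants", "sheltering", dimension=AGIL.ADAPTIVE),
    FunctionClaim("education subsystem", "knowledge", "transmitting", dimension=AGIL.LATENCY),
]


def by_dimension(dimension: AGIL) -> list[str]:
    """Return the bearers whose Function is assigned to the given AGIL dimension."""
    return [c.bearer for c in CLAIMS if c.dimension is dimension]
```

Keeping the AGIL dimension optional lets a claim such as the ladle example stand without committing it to any one of Parsons's categories.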
This description of Parsons’s work just scratches the surface; for instance, he has an
extensive discussion of the function of the family. It may be possible to develop a
structured version of his entire framework. However, we should also note that among
sociologists, there is disagreement about the value of the AGIL system.
Our emphasis on realism for social entities is also relevant to anthropology. We
might first focus on the social science perspective on anthropology rather than the
humanities perspective [18] (cf., [20]). Thus, we might emphasize archaeology and
physical anthropology. Nonetheless, many entities and social activities such as rituals
and icons that are the subject of anthropology clearly have deep symbolic, aesthetic,
and emotional significance which we would need to account for.
6 Models of Particulars
6
A full Functionalist model could have a web of Functions that address Needs. Mechanisms which
satisfy Needs may themselves generate new Needs. BFO seems to lean toward a Structuralist view
but its inclusion of Procedures with an object-oriented flavor suggests it could become more
Functionalist.
7
For an internally consistent ontology/model, all terms in the definitions should also be included in the
ontology.
widely criticized. Instead, Roberts [19] proposes that most major historical events (e.g.,
revolutions) do not have a single over-arching covering theory but are composed of
smaller events each of which can be accounted for with covering theories. [12] makes a
similar point, that claims about causal relationships among social phenomena need to
be supported by models of mechanisms for how the entities interact.8
Because social situations are complex and because Thick Entities are generally
composed of many parts it may be difficult to confirm causal processes. For instance,
while it is easy to believe that the prosperity in the Roman Empire during the reign of
the Antonine Emperors was due to their good policies [13], we cannot make that case
with scientific rigor. After documenting the evidence, we may apply the generalization
only while retaining some caution about it.
Just as [4] proposed a variety of interrelated repositories for scientific research, similar
interlocking repositories will be needed to complement the foundry of everyday
activities and infrastructures. There would be several layers of knowledge resources:
• Ontology and Model Foundry: The everyday Human Activities and Infrastruc-
tures Foundry would include not only ontologies but also models of Thick Entities.
The complete Foundry will require details of many different types of Procedures. In
addition to the ontologies, the Foundry might include Reference Models such as of
Bronze Age communities or Midwestern U.S. towns.
We may not have full confidence in some of the Universals because there are
competing theoretical frameworks. Thus, we may allow alternative representations
using several of those frameworks. Related to this, we may apply a weaker con-
sistency warrant rather than scientific warrant as a criterion for inclusion.
Ontologies based on the BFO can be considered a type of classification system;
after all, each BFO ontology is a taxonomy. A collection of BFO ontologies (i.e., a
Foundry) can be viewed as an entity-based faceted classification.9
• Models of Particulars: See Sect. 6 above.
• Primary Source Materials: [3] called for cleaned and consistent repositories of
historical source material. Moreover, these materials should have standard
markup. In addition, databases of locations, climate, records, economic data, census
reports, and sports scores can also be coordinated with the Foundry ontologies and
models.
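The view of a Foundry as an entity-based faceted classification can be illustrated with a small index. This is a sketch under assumptions: the record identifiers, facet names, and values below are hypothetical, and a real Foundry would link to ontology terms rather than plain strings.

```python
from collections import defaultdict

# Hypothetical records: each links an Object to other entity types (the facets).
RECORDS = [
    {"id": "r1", "Object": "plow", "Location": "Norfolk", "Process": "farming"},
    {"id": "r2", "Object": "loom", "Location": "Bangkok", "Process": "weaving"},
    {"id": "r3", "Object": "plow", "Location": "Bangkok", "Process": "farming"},
]


def build_index(records):
    """Map each (facet, value) pair to the set of record ids that carry it."""
    index = defaultdict(set)
    for rec in records:
        for facet, value in rec.items():
            if facet != "id":
                index[(facet, value)].add(rec["id"])
    return index


def select(index, **facets):
    """Return sorted ids of records matching every requested facet value."""
    sets = [index.get((f, v), set()) for f, v in facets.items()]
    return sorted(set.intersection(*sets)) if sets else []
```

For example, `select(build_index(RECORDS), Object="plow", Location="Bangkok")` narrows the collection by two entity facets at once.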
8
Much of what is termed systems analysis appears focused more on process re-engineering than on
systematic analysis of existing systems. Case studies can support what might properly be considered
as systems analysis. Specifically, convergent case studies can be useful to evaluate possible causal
mechanisms [12].
9
The links of other entities (such as Locations, Dependent Continuants, and Processes) to the Object
form a sort of faceting. Indeed, it is easy to see the similarity to Ranganathan’s PMEST and to
FrameNet’s Frame Elements [2]. However, such entity-based faceting should be distinguished from
other faceted classification systems which are subject based.
8 Discussion
We have examined issues for collecting and coordinating applied ontologies and
models for Everyday Human Activities and Infrastructures. These ontologies and
models build on the rigorous semantics of the BFO and extend the constraints of BFO
to everyday infrastructures and then to social and cultural descriptions. To do that, we
relax some of the constraints but we expect that these will be flagged appropriately.
This effort is as much about developing a useful information resource as about
maintaining the purity of the ontological framework.
References
1. Allen, R.B.: Toward an Interactive Directory for Norfolk, Nebraska: 1899–1900. IFLA
Newspaper and Genealogy Section Meeting, Singapore (2013). arXiv:1308.5395
2. Allen, R.B.: Frame-Based Models of Communities and Their History. In: Nadamoto, A.,
Jatowt, A., Wierzbicki, A., Leidner, J.L. (eds.) SocInfo 2013. LNCS, vol. 8359, pp. 110–
119. Springer, Heidelberg (2014). Histoinformatics. doi:10.1007/978-3-642-55285-4_9
3. Allen, R.B.: Issues for direct representation of history. In: ICADL 2016, pp. 218–224.
doi:10.1007/978-3-319-49304-6_26
4. Allen, R.B.: Rich semantic models and knowledgebases for highly-structured scientific
communication (2017). arXiv:1708.08423
5. Allen, R.B.: Rich semantic modeling, in preparation
6. Allen, R.B., Song, H., Lee, B.E., Lee, J.Y.: Describing scholarly information resources with
a unified temporal map. In: ICADL 2016, pp. 212–217. doi:10.1007/978-3-319-49304-6_25
7. Arp, R., Smith, B., Spear, A.D.: Building Ontologies with Basic Formal Ontology. MIT
Press, Cambridge (2015). http://purl.obolibrary.org/obo/bfo/Reference
8. Conway, S.: Thai Textiles. British Museum Press, London (1992)
9. Chu, Y.M., Allen, R.B.: Formal representation of socio-legal roles and functions for the
description of history. In: TPDL, pp. 379–385 (2016). doi:10.1007/978-3-319-43997-6_30
10. Diamond, J.: Guns, Germs, and Steel. Norton, New York (1997)
11. Gasser, L.: Information and collaboration from a social/organizational perspective. In: Nof,
S.Y. (ed.) Information and Collaboration Models of Integration, pp. 237–261. Kluwer, the
Netherlands (1994)
12. George, A.L., Bennett, A.: Case Studies and Theory Development in the Social Sciences.
MIT Press, Cambridge (2004)
13. Gibbon, E.: The History of the Decline and Fall of the Roman Empire (1845).
www.gutenberg.org/files/731/731-h/731-h.htm
14. Jansen, L.: Four rules for classifying social entities. In: Hagengruber, R., Riss, U. (eds.)
Philosophy, Computing and Information Science, pp. 189–200. Pickering & Chatto, London
(2014)
15. Lee, B.W., Lee, Y.S. (eds.): Music of Korea. National Center for Korean Traditional
Performing Arts, Seoul (2007)
16. Maslow, A.H.: A theory of human motivation. Psychol. Rev. 50, 370–396 (1943)
17. Parsons, T.: The Structure of Social Action. Free Press, Boston (1968)
18. Peregrine, P., Moses, Y.T., Goodman, A., Lamphere, L., Peacock, J.L.: What is science in
anthropology? Am. Anthropol. 114, 593–597 (2012). doi:10.1111/j.1548-1433.2012.01510.x
19. Roberts, C.: The Logic of Historical Explanation. Pennsylvania State University Press, State
College (1995)
20. Schilbrack, K.: A realist social ontology of religion. J. Relig. 27, 161–178 (2017).
doi:10.1080/0048721X.2016.1203834
21. Smith, B.: Searle and de Soto: the new ontology of the social world. In: Smith, B., Mark, D.,
Ehrlich, I. (eds) The Mystery of Capital and the Construction of Social Reality, Open Court
(2015). http://ontology.buffalo.edu/document_ontology/Searle&deSoto.pdf
22. Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L.J.,
Eilbeck, K., Ireland, A., Mungall, C.J., Leontis, N., Rocca-Serra, P., Ruttenberg, A.,
Sansone, S.A., Scheuermann, R.H., Shah, N., Whetzel, P.L., Lewis, S.: The OBO foundry:
coordinated evolution of ontologies to support biomedical data integration. Nat. Biotechnol.
25, 1251–1255 (2007). doi:10.1038/nbt1346
Conceptualising the Digitisation
and Preservation of Indigenous Knowledge:
The Importance of Attitudes
1 Introduction
Digital technologies that are used for generating, collecting, managing and preserving
heritage knowledge are developing so rapidly that both information professionals and
the institutions involved in the management of heritage resources are concerned about
possible gaps that can occur in the development of memory for the future. Some
custodians of heritage resources and owners of cultural knowledge also fear they may
lose their heritage resources through digitisation. Consequently, memory institutions
face various challenges in their attempt to collaborate with people from source
communities and other key stakeholders to digitise and preserve cultural knowledge.
Memory institutions, including libraries and museums, have become keen on
opportunities to engage with potential partners and collaborators to develop heritage
digitisation and preservation programmes. Examples of memory institutions’ involve-
ment with source communities include the Europeana project where various memory
institutions across Europe collaborated to facilitate an innovative cultural knowledge
transfer. The national library, national archive and national museum of New Zealand
provide further examples through their work to preserve the cultural heritage of
New Zealand. Through the efforts of these cultural institutions, New Zealand has now
established a National Digital Heritage Archive (NDHA). The American Memory and
the Australian Digital Collections are other examples of national digital memory pro-
jects that have all been developed through collaborations with people from source
communities and key stakeholders.
The achievements of these cultural institutions are nevertheless not without chal-
lenges, especially as the digital technologies landscape is evolving quickly and memory
institutions need to keep abreast. For instance, many archives, libraries and museums
are exploring and experimenting with the use of social media and Web 2.0 technologies
to enable participatory digital cultural heritage and indigenous knowledge management
and are battling a range of issues [1]. For such participatory construction of cultural
knowledge to be successful and to ensure the effective involvement of multiple source
cultural communities in heritage digitisation and preservation, the various challenges
and underlying issues need to be identified and dealt with. Appropriate planning,
policies and strategies need to be developed and sufficient resources allocated
accordingly. Also, there is the need to create more awareness among key stakeholders
of the benefits of digital preservation [2]. In this paper, we examine various factors that
influence the digitisation of cultural heritage resources and the challenges faced by
memory institutions. We also seek to outline and discuss the differences in the chal-
lenges and issues faced by memory institutions in developed and developing countries.
The discussion in this paper is guided by the following research questions:
• What are the contemporary challenges faced by memory institutions in digitising
cultural heritage knowledge for preservation purposes, in both developing countries
and developed countries?
• Are there any differences in challenges faced in these different contexts?
For our discussions in this paper, we use cases from New Zealand and Australia as
examples from developed countries, and cases from Ghana as examples from a
developing country perspective.
2 Literature Review
Preservation is a very complex phenomenon. Krtalic and Hassenay [3] observe
that some of the contradictory issues affecting preservation management arise from the
material properties of heritage resources themselves, environmental changes, funding
possibilities, legal documents, selection criteria, user needs, presentation possibilities,
cultural and historical values, national and international context, just to mention a few.
These issues present significant challenges to memory institutions in their quest to
undertake digitisation and digital preservation programmes and projects. Krtalic and
Hassenay categorised these issues into five main clusters: strategic and theoretical,
economic and legal, educational, technical and operational, and cultural and social [3].
These clusters of factors, according to Krtalic and Hassenay [3], provide starting points
for improving the organisation of systematic preservation and management of heritage
resources in developing country contexts.
Issues and challenges faced by memory institutions, including libraries, archives
and museums, in digitising cultural heritage knowledge for preservation purposes can
be seen from the stage of setting up the institution. For instance, various researchers in
the literature have discussed the challenges that affect the establishment of digital
libraries [4–7]. Others have also looked at issues affecting the setting-up of digital
museums [8–10] and problems relating to digital archives and records management
[11–13]. From the outset, there are concerns about how to effectively set up the technical
infrastructure, how to build the digital collection, how to fund the digitisation, dealing
with metadata concerns, naming, identifiers, copyright issues, intellectual property
issues, just to mention a few. It is important to identify potential problems and to
understand how to deal with the specific issues around those problems [14]. Hence, in
this section, we will explore the challenges that have been discussed within the liter-
ature relating to how memory institutions engage with their stakeholders (including
source communities) to digitise heritage resources for preservation purposes so that
those issues can be highlighted for further empirical study and understood more deeply.
All cultural heritage institutions, including galleries, libraries, archives and muse-
ums (GLAM) play a role in information management by handling specific aspects in
the preservation of cultural heritage of society. The roles performed by each member of
the GLAM work towards the management of a common heritage for preservation
purposes. In view of this, Mallan [15] notes that the term collective memory has
increasingly been used in the last decade to help with the recognition of the roles
played by the various institutions in the GLAM sector. Each of these institutions is also
taking advantage of the improvements in digital technologies to incorporate
digitisation in its activities to ensure the effective preservation and provision of
access to cultural heritage resources.
As Huvila [16] observes, there has been an increasing political interest in memory
institutions and the roles they play in contemporary society in the last couple of
68 E. Boamah and C.L. Liew
The growing number of immigrants from other cultures may find it difficult to fully
appreciate the digitised Māori collection within its Mātauranga Māori vision.
Another recent study analysed the impacts of digitised te reo Māori archival collections
on their users. The study's main aim was to examine the use of the archive and to
develop a methodology that can be used in impact assessment for digital collection
providers [29]. This study shows that some members of source communities who
donated their heritage resources to be digitised and included in the te reo collection
required assurance that their taonga (cultural treasures) would not be violated, which
called for culturally appropriate ways of involving source communities in the digitisation
of their cultural knowledge. It was also evident that some members of the source
communities were distrustful of online information and did not trust people outside
their culture to have full access to their cultural knowledge. Crookston et al. describe
this lack of trust and its impact on cultural heritage digitisation:
For indigenous communities, the stigma of being “the other” in research presents an obstacle
to researchers looking to involve themselves in indigenous knowledge acquisition. Yet, through
respectful means, and genuine collaboration, more dynamic and trusted research can
eventuate [29].
This study also reveals a mixed reaction to digitising cultural knowledge. While
some believed that digitisation enhances access to their culture, others worried
that digitisation degrades the original versions of the heritage resources and causes them
to lose some of the wairua (spirit) one could get from learning the same information from
a kaumātua (elder). They believe that some of the cultural documents carry the voices and
sounds of those who may have passed. Digitising them deprives them of this quality
[29]. These beliefs present additional considerations that need to be addressed appro-
priately in the digitisation of indigenous knowledge for preservation.
In the developing world, Africa is one of the most deprived areas while being rich
in cultural heritage. Memory institutions in Africa therefore have a huge task to protect
the memories of their countries for future society. Nevertheless, they face many
challenges when it comes to digitising cultural heritage resources for preservation
purposes. Many authors have written about some of these challenges in the region. For
instance, Boamah and Tackie looked at the state of digital heritage resources management
in Ghana and found that memory institutions in Ghana operated in poor conditions
[30]. Asamoah, Akusah, and Mensah [31] argued that the poor state of memory
institutions in Ghana was due to lack of funding from government, which is the main
financier of memory institutions in the country. Also, Sigauke and Nengomasha [32]
examined the challenges faced by the National Archives of Zimbabwe and found that this
important national memory institution did not have any rigorous laid-down procedure
to follow in managing and preserving its national historical records and other
valuable information materials. It was clear from Sigauke and Nengomasha’s [32] study that
understanding of the technologies used for digitisation was a problem for professionals.
In addition, Samir, Sharkas, Adly, and Nagi [33] observed that the Alexandria
Library and other memory institutions in Egypt faced a huge challenge of managing the
digitisation of more than 800,000 pages of press articles. Thus, they developed a
digitisation workflow that greatly supported the digitisation process. Yet, memory
institutions in Egypt still face the challenge of associating the accessible online archive
with a multidimensional search engine. Although many people are showing interest in,
and using the new digital technologies, access to these technologies is difficult for
memory institutions in Africa. Lack of funding is identified as the main hindrance
to acquiring the technology [31].
There are also access issues that relate to how to use the technology. People with the
skills to operate digital technologies are lacking. Training in ICT for professionals in
memory institutions is not encouraging in most countries in Africa. It has been found
that most people in Africa lose the desire to train in ICT because of the challenges
they face in accessing the technologies. Also, other related challenges such as lack of
access to electricity to operate digital technologies put most people off spending
their scarce resources to train in skills they may find difficult to put to use. In the last
couple of years, Ghana for instance attained international recognition for the use of the
term “dumsor” (which describes the daily on-and-off electricity supply and power rationing)
because of the number of times people searched for the term on Google and
discussed it on Wikipedia. As a result of dumsor, many institutions that rely on electricity
to operate their equipment were badly affected. Most memory institutions lost the
few digital technologies (such as computers, scanners, printers, etc.) at their disposal
because of inconsistent power supply, affecting most of the digitisation projects, which
are in their very early stages [34].
Certain attitudes by stakeholders who are involved in memory institutions and their
activities also create challenges. In most developing countries, there is little stakeholder
interest in investments in digitisation and digital preservation. Most of this lack
of interest can be attributed to the fact that there are other pressing needs that
take priority in the use of scarce resources. For instance, in places where people are
struggling to get enough food, clean water, good education, proper health-care, etc.,
decision makers will give little attention to investments in digitisation projects, and
memory institutions have to make do with what is made available to them. In most of
these developing countries, activities of memory institutions rely on funding from
governments.
Apart from this, there is lack of collaboration among the GLAM institutions.
Libraries do their own things separate from what archives do and museums do their
own things separate from what galleries do. When exploring digitisation initiatives in
Malaysia, Zuraidah [35] observed that such a situation of individual digitisation processes
can result in duplication and ineffectiveness in the management and preservation
of digital resources. In agreement, Boamah [34] indicates that there is a crucial need for
an orderly scholarly investigation to understand the nature and state of the management
of digitisation initiatives by cultural heritage institutions in Ghana. In addition, there is
disregard for Information and Cultural Heritage Management laws. Boamah [34]
noted that, a little over a decade ago, organisations in the UK for instance were
observed to be motivated to embark on digital preservation activities by factors
including legal requirements, accountability, protecting the long-term view, protecting
investments, enabling future reuse opportunities, fear of losing information, user
expectations of information and business efficiency [36]. But while the fear in the UK
72 E. Boamah and C.L. Liew
motivates the institutions to digitise and preserve their heritage resources, the fear in
developing countries rather hinders digitisation because, owners of cultural heritage
resources also fear that the may lose ownership of their cultural heritage forever.
Hence, they resist submitting their heritage objects and resources to memory institu-
tions to be digitised.
There are also tensions among information management authorities and decision makers
over information management issues and projects that could lead to the digitisation of
heritage resources in developing countries. In Ghana for instance, the Ghana Library
Board (GLB) executes national library responsibilities and the Ghana Library
Association (GLA) is the leading professional association for library and information
professionals in the country. There is also the Ghana Library Authority, which has a
mandate to develop and oversee library-related activities, policies and projects. Top
decision makers in these organisations do not agree on the establishment of a national
library for Ghana. Their views on the nature and purpose of a national library for
Ghana conflict, creating tension between them. This tension is preventing the Ghanaian
government from releasing funds and rolling out plans for the development of a national
library for the country, and is hence stalling all-important projects, including the
development of a national heritage digitisation programme [34].
The tension among decision makers in memory institutions is very similar to the issues
of power structure in indigenous African communities. African traditional systems are
based on customary leadership and kinship. Traditional leaders, including kings,
chiefs, clan heads and traditional priests, have control of the various heritage
resources in their traditional areas. In Ghana for instance, each of the over 100
tribal groups has its specific heritage resources that are controlled by the
traditional leaders. This separate control over heritage resources makes it difficult
for memory institutions to collect and build a national heritage collection that can be
digitised to establish a national digital memory. The issue becomes even more complex
when a dominant tribe controls most of a country’s heritage resources. Boamah, for
instance, found that one of the many challenges hindering Ghana from establishing a
national heritage repository is that most of the country’s heritage resources are
controlled by the Asante tribe, the biggest traditional group. But the Asante are not
prepared to release their heritage resources to be made national because they want the
resources to still be recognised as Asante’s. The other cultural groups also feel that
the Asante will feel supreme if their heritage resources are used as national heritage.
These tensions (which have their roots in ancient tribal wars) have created a lot of
animosity and bitterness among the traditional groups. They are also creating fear of
the permanent loss of tribal heritage resources in the various tribal groups [34].
There appear to be no formidable long-term digital preservation strategies and policies
in place, at either national or institutional level, to keep African heritage materials
available and accessible in the future. Le Roux [37] analysed a body of literature and
suggests a need for a working group to look into the real need for a policy on digital
preservation in Africa. When Imo and Igbo [38] reviewed institutional policies and the
management of institutional repositories in Nigerian universities, they found that out
of 129 registered university libraries in Nigeria, only one had some policy for the
management of its institutional repository. Observing that fast-improving technology
and its concomitant enhancement of digital resources have brought about a myriad of new
challenges for memory institutions generally, McGreal [39] analysed the specific
challenges faced by academic librarians and found that many academics are reluctant to
offer their research output to be published in electronic journals.
Boamah [34] found that while Ghanaian university libraries had good institutional
repositories and some policies around their use, there was no institutional policy for
any of the public memory institutions. Ghana has a policy on ICT for accelerated
development (ICT4AD), which is the main policy that relates to digital activities in the
country. However, the policy lacks associated strategies and has many issues affecting
its implementation [34]. For instance:
• it lacks achievable goals and targets, because the milestones set in the policy were
not achieved
• there are no strategies for DPM accompanying the policy; none of the strategies
that come with the policy focus on information management
• the policy lacks adequate resources for its effective implementation
• there is no complementary policy to provide the multipath actions and outcomes which
the PSR troika model suggests as the effective way of achieving policy goals
• as a result of political influences, the ICT policy in Ghana was not collaboratively
developed, so it lacks the input of all stakeholders and largely reflects the interests
of some key players
• incumbent governments are not willing to continue with policies initiated by
previous governments
• relevant stakeholders are not even aware of the policy
• it is not reviewed on an ongoing basis to meet current needs
• it is not effectively promoted [34].
Most of the issues facing memory institutions are common to developed and developing
countries. Nevertheless, there are specific issues faced by memory institutions in
developing countries that are uncommon among their counterparts in developed countries.
Table 1 outlines a summary of the various challenges facing memory institutions in
developed and developing countries.
6.1 Strategic
Within the Krtalić and Hasenay [3] preservation management model, the strategic and
theoretical component provides basic elements to consider for any effective digital
preservation management programme. Factors in this category influence the planning
and development of strategies and policies to manage a preservation programme using
contemporary skills, ideas and knowledge, following good-practice preservation
activities. These strategies and policies include those developed at both national
and institutional levels within a country. Attention should be paid to the national
context within which memory institutions in the country operate: how preservation is
organised at the national level, how national-level plans and strategies affect
institutional-level preservation activities, and how institutions collaborate with
national and international stakeholders to enable an effective preservation programme
[3]. Analysis of the literature reveals that most developed countries have strong
policies and strategies at both national and institutional levels. These policies and
strategies provide a general context in which their memory institutions operate, which
has enabled them to achieve progress in their digitisation and digital preservation
management activities. There is also a defined corpus of knowledge about preservation,
taking into account specific practical digitisation activities [25, 27–29]. In
contrast, the literature also shows that most developing countries do not have strong
national and institutional policies and strategies for their preservation management
[32, 34], which means there is a lack of a defined context for memory institutions to
operate in. This also makes it difficult for them to cooperate with one another and
collaborate effectively to develop digital preservation programmes.
6.3 Educational
The factors influencing the educational component of the conceptual preservation
management framework relate to elements that help to define a body of knowledge
about preservation within a particular country’s context. Some of these issues include
how key stakeholders within a country come together to create a forum to discuss what
ideas about preservation management are necessary to develop knowledge for inclusion
in their educational curriculum. In New Zealand for instance, professional bodies like
LIANZA, ARANZ, RIMPA and the National Digital Forum meet to discuss the areas of
preservation management that are necessary for education and training. The
professional bodies develop bodies of knowledge around those areas and provide
professional training for their members. Educational institutions collaborate with
employers, professionals and other key stakeholders to incorporate the necessary areas
in their curriculum. The situation is different in developing countries, where
institutions generally lack a trained workforce for preservation management.
7 Conclusion
References
1. Liew, C.L.: Digital cultural heritage 2.0: a meta-design consideration. In: 8th International
Conference on Conceptions of Library and Information Science, Information Research,
vol. 18, no. 3, Copenhagen, Denmark. 19–22 August 2013 (2013). http://www.informationr.
net/ir/18-3/colis/paperS03.html
2. Boamah, E., Liew, C.L.: Involving source communities in the digitisation and preservation
of indigenous knowledge. In: 18th International Conference on Asia-Pacific Digital
Libraries, ICADL 2016, Proceedings, Tsukuba, Japan, 7–9 December 2016, pp. 21–36
(2016)
3. Krtalić, M., Hasenay, D.: Exploring a framework for comprehensive and successful
preservation management in libraries. J. Doc. 68(3), 353–377 (2012). doi:10.1108/
00220411211225584
4. Cleveland, G.: Digital libraries: definitions, issues and challenges (1998). https://www.ifla.
org/archive/udt/op/udtop8/udtop8.htm
5. Greenstein, D.: Digital libraries and their challenges. Libr. Trends 49(2), 290–303 (2000)
6. Kuny, T., Cleveland, G.: The digital library: myths and challenges. IFLA J. 24(2), 107
(1998). doi:10.1177/034003529802400205
7. Mishra, R.K.: Digital libraries: definitions, issues, and challenges. Innov. J. Educ. 4(3), 1–3
(2016)
8. Museum of New Zealand Te Papa Tongarewa: Planning a new museum-how good is your
idea? Setting up a new museum (2007). https://www.tepapa.govt.nz/sites/default/files/32-
planning-a-museum.pdf
9. Rosati, E.: Copyright issues facing early stages of digitization projects. Mobile collection
projects (2013). http://www.digitalhumanities.cam.ac.uk/Copyrightissuesfacingearlystagesof
digitizationprojects.pdf
10. Simon, N.: Museum 2.0: what are the most important problems in our field (2011). http://
museumtwo.blogspot.co.nz/2011/10/what-are-most-important-problems-in-our.html
11. Holcomb, J.L.: Preserving digital archives, preserving cultural memory. J. Assoc. Hist.
Comput. 3, 3 (2000). http://hdl.handle.net/2027/spo.3310410.0003.320
12. Picot, A.: Digital archives: Here’s the problem – how would YOU address it? Report back
from the Recordkeeping Roundtable Workshop at #ARANZASA in Christchurch, NZ.
(2014). https://rkroundtable.org/2014/10/14/digital-archives-heres-the-problem-how-would-
you-address-it/
13. Wright, J.: Challenges of appraising records in the digital age. In: The Bigger Picture: Exploring
Archives and Smithsonian History (2012). https://siarchives.si.edu/blog/challenges-appraising-
records-digital-age
14. Kaiser, J.F.: Richard Hamming: “You and Your Research”. Transcript of the Bell
Communication Research Colloquium Seminar (1986). http://www.cs.virginia.edu/~robins/
YouAndYourResearch.html
15. Mallan, K.: Is digitisation sufficient for collective remembering? Access to and use of
cultural heritage collections. Can. J. Inf. Libr. Sci. 30(3/4), 201–220 (2006)
16. Huvila, I.: Archives, libraries and museums in the contemporary society: perspectives of the
professional. In: iConference 2014 (2014). doi:10.9776/14032
17. Cathro, W.: Collaboration across the collecting sectors (2010). https://www.nla.gov.au/
content/collaboration-across-the-collecting-sectors
18. Cimasi, R.J., Zigrang, T.A.: The four pillars of healthcare: part IV technological
advancements in the healthcare industry. In: The Value Examiner, January/February 2017,
pp. 17–24 (2017). AN: 121978019
19. Walker, I., Vlok, A.J., Kamat, A.: A double-edged sword: the effect of technological
advancements in the management of neurotrauma patients. Br. J. Neurosurg. 31(1), 89–93
(2017). doi:10.1080/02688697.2016.1220504
20. Burris, C.: Technical services and digital literacy. Technicalities 37(1), 13–16 (2017). AN
120933179
21. Kumar, N.: Transforming libraries through information and communication technology. Int.
J. Inf. Dissemination Technol. 6(3), 174–178 (2016). AN 119763982
22. Barker, S.K.: New opportunities for research libraries in digital information and knowledge
management: challenges for the mid-sized research library. J. Libr. Adm. 46(1), 65–74
(2007). doi:10.1300/J111v46n01_05
23. Peacock, A.: Auckland a melting pot - ranked world’s fourth most cosmopolitan city (2016).
http://www.stuff.co.nz/auckland/75964986/Auckland-a-melting-pot-ranked-worlds-fourth-
most-cosmopolitan-city
24. Ladiges, C., Bruenig, M.: Australian museums must innovate or risk becoming ‘digital
dinosaurs’ (2014). http://www.csiro.au/en/News/News-releases/2014/Australian-museums-
risk-becoming-digital-dinosaurs
25. Sansom, M.: Go digital or die, Australia’s cultural institutions told (2014). http://www.
governmentnews.com.au/2014/09/digitalise-die-australias-cultural-institutions-told/
26. Carnaby, P.: Libraries as a common denominator: a view from New Zealand of the citizen,
country and global perspective. Electron. Libr. Inf. Syst. 43(3), 251–263 (2009)
27. National Library of New Zealand: Digitisation Strategy 2014–2017 (2014). https://natlib.
govt.nz/about-us/strategy-and-policy/digitisation-strategy-2014-2017
28. Moran, J.: Born digital in New Zealand report of survey results (2017). https://natlib.govt.nz/
files/reports/research-borndigital2017-report.pdf
29. Crookston, M., Oliver, G., Tikao, A., Diamond, P., Liew, C.L., Douglas, S.L.: Kōrero Kitea:
Ngā hua o te whakamamatitanga: the impact of digitised te reo archival collections (2016).
https://interparestrust.org/assets/public/dissemination/Korerokiteareport_final.pdf
30. Boamah, E., Tackie, S.N.B.: The state of heritage resources management in Ghana. In: 4th
International conference on African Digital Libraries and Archives. (ICADLA). University
of Ghana, 29 May 2015 (2015). http://hdl.handle.net/10539/18396
31. Asamoah, C., Akusah, H., Mensah, M.: Funding memory institution in Ghana: the case of
Public Records and Archives Administration Department (PRAAD) (2015). http://
wiredspace.wits.ac.za/xmlui/bitstream/handle/10539/18395/Asamoah_Akussah_Mensah_
Final.pdf?sequence=1&isAllowed=y
32. Sigauke, D.T., Nengomasha, C.T.: Challenges and prospect facing the digitisation of
historical records for their preservation within the national archives of Zimbabwe. In: 2nd
International Conference on African Digital Libraries and Archives (ICADLA) at The
University of Johannesburg, South Africa (2014). http://wiredspace.wits.ac.za/handle/10539/
11533
33. Samir, A., Sharkas, A., Adly, N., Nagi, M.: Digital preservation: handling large collections
case study: digitizing Egyptian press archive at Centre for Economic, Judicial, and Social
Study and Documentation (CEDEJ). In: 4th International conference on African Digital
Libraries and Archives (2014). http://hdl.handle.net/10539/18635
34. Boamah, E.: Towards effective management and preservation of digital cultural heritage
resources: an exploration of contextual factors in Ghana (2014). http://hdl.handle.net/10063/
3270
35. Zuraidah, A.M.: The state of digitisation initiatives by cultural institutions in Malaysia; an
exploratory survey. Libr. Rev. 56(1), 45–60 (2007)
36. Waller, M., Sharpe, R.: Mind the gap; Assessing digital preservation needs in the UK (2006).
http://www.dpconline.org/search.html?q=1975+NASA+data+problem
37. Le Roux, A.: Indigenous knowledge in a virtual context: sustainable digital preservation, a
literature review. In: 4th International Conference on African Digital Libraries and Archives
(ICADLA), University of Ghana, 29 May 2015 (2015). http://hdl.handle.net/10539/18402
38. Imo, N.T., Igbo, H.U.: Institutional policy and management of institutional repositories in
Nigerian universities. In: 4th International Conference on African Digital Libraries and
Archives (ICADLA), University of Ghana, 29 May 2015 (2015). http://hdl.handle.net/
10539/18399
39. McGreal, R.: Stealing the goose: copyright and learning. In: Caswell, J., Haschak, P.G.,
Sherman, D. (eds.) New Challenges Facing Academic Librarian Today: Electronic Journals,
Archival Digitisation, Documents Delivery, Etc. Lewiston, New York (2005)
A Metadata Model to Organize Cultural
Heritage Resources in Heterogeneous
Information Environments
1 Introduction
Cultural Heritage is a unique asset that belongs to a certain community. In the
networked information environment developed on the Web, there are large digitized
collections of cultural heritage, which we refer to as Digital Archives in this study.
Cultural heritage, which may be digital or non-digital, is described through information
expressed as metadata. This study mainly focuses on organizing this Cultural Heritage
Information (CHI) from the viewpoint of a data model for CHI.
CHI is considered difficult to deal with from the viewpoint of interoperability on the
Web because of its heterogeneity. Nevertheless, memory institutions, for instance
Libraries, Archives and Museums (LAM or MLA), accept this challenge and take part in
this process. They collect cultural heritage resources and digitize them, organize the
digital cultural heritage resources as a part of their collections, and provide these
information resources to their patrons via the Web and/or their in-house services.
With the advancement of Web technologies, LAMs are now trying to find novel approaches
to link these collections built by individual institutions and present them as a
single, complete information portal. The Europeana data portal, which was developed to
collect and disseminate digital cultural heritage on the Web, is a typical example of
such an effort. Europeana uses a model-based information aggregation process, and its
model is known as the Europeana Data Model (EDM) [11].
The CHI situation in developing regions such as South and Southeast Asia is not as
bright as in developed countries. In our previous study, we identified several basic
problems related to LAM in the region [20]. South and Southeast Asian memory
institutions also take part in CHI creation, management and dissemination, but they
stand as individual data silos without any interconnection. Moreover, the lack of
widely accepted standards for sharing information among the institutions creates many
barriers to linking information on the Web. Metadata is known as a key technology to
lower these barriers.
Metadata at LAM is basically created for every item in their collections. On the
other hand, there are many Web resources created by third parties, e.g., Wikipedia,
which is an encyclopedia very widely used by end-users on the Web. Those Web resources
help many end-users understand contextual information about cultural heritage
resources. It is crucial for LAM to link their metadata to those Web resources, so that
information provided by third parties adds value to the CHI provided by LAM. A
significant problem here is that the objectives of LAM metadata description and those
of Web resources are quite different. One of the primary contributions of this study is
to clearly identify the objectives of metadata description to help link LAM metadata
and Web resources based on the One-to-One Principle of Metadata [15].
This paper tries to investigate the existing CHI issues learned in our previous study
[20] and proposes a suitable model to collect and enrich the CHI. The model is designed
to provide a generalized framework to describe digital collections of cultural heritage
objects. This model can be introduced as a generalized model because it is essential to
capture both tangible and intangible heritage assets, and because the generalized model
helps connect heterogeneous LAM’s metadata and Web resources. The proposed model
is aligned with renowned cultural heritage models and is evaluated through use case
scenarios related to the cultural heritage domain.
1 http://pro.carare.eu/doku.php?id=support:metadata-schema.
2 http://www.getty.edu/research/publications/electronic_publications/cdwa/.
3 http://network.icom.museum/cidoc/working-groups/lido/what-is-lido/.
84 C. Wijesundara et al.
iv. MIDAS Heritage: A British cultural heritage standard for recording information
on buildings, monuments, archaeological sites, shipwrecks and submerged land-
scapes, parks and gardens, battlefields, artefacts and ecofacts.4
v. Object ID: A standard for recording essential information about archaeological,
artistic and cultural objects in order to facilitate their identification in case of theft.5
vi. SPECTRUM: A standard that describes how to manage collections and what to do
with artefacts at each stage of their lifecycle in a collection.6
In addition, there are many local standards developed by individual countries
depending on their own institutional requirements. Data standards from other domains,
for example Dublin Core, MODS (Metadata Object Description Schema) and VRA (Visual
Resources Association) Core Categories, are also used in the CHI domain where
necessary. For instance, the Cultural Heritage Metadata Task Group of the Dublin Core
Metadata Initiative (DCMI) tried to identify the challenges of metadata for cultural
heritage by developing a simple cross-community metadata model for cultural heritage
objects. They also intended to give a recommendation for the development of DCMI
Application Profiles based on this task [7].
Finally, there are Data Models specifically designed for the CHI arena, which can be
used to organize data and define their relationships on par with real-world entities.
Some of these well-known CHI models will be discussed in Sect. 3.1 of this paper.
Nevertheless, none of these standards can completely cover the entire cultural
heritage domain to describe its properties. The previously mentioned standards were
developed to capture tangible objects only. Thus, there is a gap between tangible and
intangible heritage data standards which has yet to be filled.
LAMs mostly record CHI related to a single object. In particular, museums, which
collect most tangible cultural heritage objects, present their information as single
items. Intangible cultural heritage, by contrast, is difficult to express as
individual items. Moreover, intangible cultural heritage can be realized only if it is
recorded. Memory institutions cannot curate a concept such as a skill or a performance
related to an intangible cultural heritage, but they can use various media to capture
intangible heritage and keep the results as individual records. Although tangible and
intangible heritage are distinguished, these assets are sometimes interrelated.
Unfortunately, current CHI provided on the Web by various means does not deliver such
contextual information to patrons.
Based on this observation about CHI, this study defines a data model for digital
archives of cultural heritage based on the One-to-One Principle of Metadata [15],
which should be applicable to any type of cultural heritage – either tangible or
intangible, movable or immovable. We discuss the proposed model by comparing it
with CIDOC-CRM and FRBRoo, both of which are known as standard ontologies for
describing museum resources.
4 http://heritage-standards.org.uk/midas-heritage/.
5 http://archives.icom.museum/object-id/.
6 http://collectionstrust.org.uk/spectrum/.
A Metadata Model to Organize Cultural Heritage Resources 85
3 Related Works
3.1 Models for Data Organization
Research related to model-based data organization is varied. Some is directly
connected with CHI and some is related to other information domains such as
bibliographic or geographic data. In addition, the technologies behind the data
organization are also an important consideration.
Europeana Data Model (EDM). Europeana is an ideal example of model-based CHI
organization and aggregation. It is a large-scale CHI portal dedicated to
aggregating, enriching and disseminating digital cultural heritage across memory
institutions in the European Union. At present, it connects over 3,000 institutions
across Europe, and these institutions contribute their resources to the Europeana data
portal. The portal is based on the Europeana Data Model (EDM), which supports and
manages the functionality of the system. The EDM Primer states “EDM is not built on any
particular community standard but rather adopts an open, cross-domain Semantic
Web-based framework that can accommodate the range and richness of particular community
standards such as LIDO for museums, EAD for archives or METS for digital libraries” [11].
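As a rough illustration of what model-based aggregation can look like, the sketch below writes one aggregated record as plain subject-predicate-object triples. The `edm:` and `ore:` terms are taken from the EDM vocabulary; the example.org identifiers, the provider name and the `objects` helper are hypothetical and chosen only for this illustration. It is not Europeana's actual ingestion code.

```python
# A minimal sketch, assuming a record is held as simple triples.
# All example.org URIs and "Example Museum" are hypothetical.
CHO = "http://example.org/cho/1"          # the cultural heritage object
AGG = "http://example.org/aggregation/1"  # the provider's aggregation
WEB = "http://example.org/images/1.jpg"   # a digital representation

triples = [
    (CHO, "rdf:type", "edm:ProvidedCHO"),
    (WEB, "rdf:type", "edm:WebResource"),
    (AGG, "rdf:type", "ore:Aggregation"),
    (AGG, "edm:aggregatedCHO", CHO),   # which object the aggregation is about
    (AGG, "edm:isShownBy", WEB),       # which Web resource shows the object
    (AGG, "edm:dataProvider", "Example Museum"),
]


def objects(subject, predicate):
    """Return every object value stated for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


shown_by = objects(AGG, "edm:isShownBy")
```

Keeping the object, its digital representation and the aggregation as three separate resources is what lets a portal combine descriptions from different community standards around the same object.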
CIDOC-Conceptual Reference Model (CRM). This can be identified as an onto-
logical approach to harmonize the cultural heritage resources. CIDOC-CRM “… pro-
vides definitions and a formal structure for describing the implicit and explicit concepts
and relationships used in cultural heritage documentation” [6]. The latest published
version of the CIDOC-CRM consists of 94 classes and 168 properties [13]. Primarily,
this model enables information exchange and integration between heterogeneous
sources of CHI. It provides the semantic definitions and clarifications needed to trans-
form disparate and localized information sources into a coherent global resource.
key issue to aggregate metadata for a single cultural heritage object and to link
item-based metadata and Web resources.
Figure 1 represents the CHDE model, which defines instances and their relationships in
a cultural heritage collection. The CHDE model has two main divisions: the Physical
Space and the Digital Space. Whether tangible or intangible, any cultural heritage can
physically exist or occur in the real world. On the other hand, LAMs develop their
digital collections simply by digitizing those physical instances. CHDE defines one
metadata description for each of these instances based on the One-to-One Principle.
Thus, CHDE is designed to clearly split metadata for a physical object and metadata for
its digitized object.
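The split between metadata for a physical object and metadata for its digitized object can be sketched as follows. This is only an illustrative reading of the One-to-One Principle: the `MetadataRecord` class, its field names, the `digitize` helper and the identifiers are all hypothetical and are not part of the CHDE specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetadataRecord:
    """One description of exactly one instance (hypothetical structure)."""
    identifier: str
    title: str
    space: str                            # "physical" or "digital"
    media_format: str                     # e.g. "printed photo", "JPEG"
    converted_from: Optional[str] = None  # identifier of the source record


def digitize(physical: MetadataRecord, new_id: str, media_format: str) -> MetadataRecord:
    """Create a separate record for the digitized object instead of
    overwriting the description of the physical one."""
    if physical.space != "physical":
        raise ValueError("only physical recording objects can be digitized")
    return MetadataRecord(
        identifier=new_id,
        title=physical.title,
        space="digital",
        media_format=media_format,
        converted_from=physical.identifier,
    )


# Hypothetical example: a printed photograph and its scan get two records.
photo = MetadataRecord("rec-001", "Parade photograph", "physical", "printed photo")
scan = digitize(photo, "dig-001", "JPEG")
```

The physical record keeps its own format and identifier, while the digital record points back to it, so the two descriptions never have to be merged into one.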
Fig. 1. CHDE model to organize digital resource of cultural heritage
[Labels recoverable from the figure: Physical Space (Cultural Heritage; Curated Objects
comprising Recording Objects for Motion/Still Image, Sound/Speech and Textual
Description); Digital Space (Digital Resources comprising Digital Objects for
Motion/Still Image, Sound/Speech and Textual Description; Curated Digital Instance);
relations represented-by/represents, converted-to/converted-from, member-of/aggregates
and aggregated-as/related-to.]
Starting from the bottom (Fig. 1), the Physical Space can have Curated Objects, which
consist of various types of physical Recording Objects recorded on different Mediums.
Conventional LAMs collect these Recording Objects, such as Image, Sound or Textual
formats, which express cultural heritage. The upper half of the model embodies the
Digital Space, which belongs to the networked digital environment. The records in those
physical mediums can be converted into digital records known as Digital Objects. For
instance, a printed photo and a VHS video can be converted to JPEG and MPEG formats. In
addition, there can be born-digital resources, for example Virtual Reality (VR) data
and digital photos, which can be created directly from a cultural heritage object in
the Physical Space. The CHDE model assumes one or more recording objects for a physical
cultural heritage object, from which the digital objects placed in the digital space
are created. Those recording objects may be analog or digital. Subsequently, all these
digital objects created or converted from the physical objects are organized as an
archived collection of Digital Resources, which we call a digital archive. The
uppermost circle, labeled Curated Digital Instance, which acts as an aggregated
instance in Digital Space, is created from cultural heritage
Fig. 3. CHDE model replaced by “Kandy Esala Perehara”
Figure 3 represents the CHDE model applied to a real-world example, “Kandy Esala
Perehara”, which is a historical festival in Sri Lanka. From bottom-right, “Kandy
Esala Perehara” represents the main intangible cultural event. This is a religious
parade performed by dancers and musicians along with decorated elephants. This
intangible entity can have many instantiations, such as Performance in 2016, which is a
Temporal instantiation. Performers denotes the dancers and musicians who were involved
in the same performance.
Fig. 3 then shows the physical records related to the Performance in 2016, such as a
video/audio tape, the parade schedule as a leaflet and a printed photograph of the
parade. In the Digital Space, we can find similar resources in digital formats,
primarily hosted by LAM. We can then present the “Kandy Esala Perehara” as a Curated
Digital Instance by aggregating metadata of these resources. In Fig. 3, the
bottom-right oval represents a tangible object which is interconnected with the
intangible “Kandy Esala Perehara” parade. The Waisted Drum (or
Membranophone-double-headed) used to produce music during the performance is a tangible
object in this model. In Fig. 3, the Digital Space shows a hypothetical museum
collection, because there is no existing museum in Sri Lanka that provides a digital
archive of intangible cultural heritage. This collection includes a YouTube video and
other resources accessible on the Web as virtually collected resources.
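The aggregation described for this example can be sketched in a few lines. The class names follow the labels used in the model, but the fields, identifiers, host names and the `aggregate` and `by_kind` methods are assumptions made for this illustration and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DigitalObject:
    identifier: str    # hypothetical identifier
    kind: str          # "image" (motion/still), "sound" or "text"
    description: str
    host: str          # hypothetical hosting institution or Web source


@dataclass
class CuratedDigitalInstance:
    label: str
    members: List[DigitalObject] = field(default_factory=list)

    def aggregate(self, obj: DigitalObject) -> None:
        """Add a digital object (the member-of/aggregates relation)."""
        if any(m.identifier == obj.identifier for m in self.members):
            return  # already aggregated; keep one record per object
        self.members.append(obj)

    def by_kind(self, kind: str) -> List[str]:
        """List identifiers of aggregated objects of the given kind."""
        return [m.identifier for m in self.members if m.kind == kind]


perehara = CuratedDigitalInstance("Kandy Esala Perehara")
for obj in [
    DigitalObject("do-1", "image", "Digitized video tape of the Performance in 2016", "hypothetical museum"),
    DigitalObject("do-2", "text", "Digitized leaflet with the parade schedule", "hypothetical museum"),
    DigitalObject("do-3", "image", "Digitized printed photograph of the parade", "hypothetical museum"),
    DigitalObject("do-4", "image", "Video of the parade published on the Web", "YouTube"),
    DigitalObject("do-1", "image", "Duplicate submission of the video tape", "hypothetical museum"),
]:
    perehara.aggregate(obj)
```

Each digital object keeps its own record, and the Curated Digital Instance only references them, so resources hosted by a museum and resources found on the Web can sit side by side in the same aggregation.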
The advantage of CHDE is that we can identify every object separately and create
metadata for each object, which is the primary difference from existing databases of
cultural heritage objects at LAMs. This feature is particularly crucial for collecting
various types of records about intangible cultural heritage, such as videos showing a
festival or a skill and data captured from the body motion of a dancer or a craftsman,
because we can explicitly separate a single performance from the performer's skill, which
is the body of intangible cultural heritage. Another crucial point is that most of the
metadata for those objects, in both digital and physical spaces, is available on the Web,
for example LAM metadata created for collected items, UNESCO’s Web pages, and Wikipedia
articles about cultural heritage. However, we still need to develop technologies to link
these metadata properly.
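The aggregation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class names, field names and the identifiers `rec-001` to `rec-003` are hypothetical, chosen only to show how each physical or digital resource keeps its own metadata record (in the spirit of the One-to-One Principle) while a Curated Digital Instance collects them.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """One physical or digital record with its own metadata description."""
    identifier: str   # hypothetical local identifier, not from the paper
    space: str        # "physical" or "digital"
    metadata: dict


@dataclass
class CuratedDigitalInstance:
    """Aggregates the metadata of resources related to one heritage entity."""
    label: str
    instantiation: str
    resources: list = field(default_factory=list)

    def add(self, resource):
        self.resources.append(resource)

    def aggregate(self):
        # Keep every description separate, keyed by the resource it describes.
        return {r.identifier: dict(r.metadata, space=r.space) for r in self.resources}


# Illustrative values based on the example in the text; identifiers are made up.
cdi = CuratedDigitalInstance("Kandy Esala Perehara", "Performance in 2016")
cdi.add(Resource("rec-001", "physical", {"type": "printed photograph"}))
cdi.add(Resource("rec-002", "physical", {"type": "parade schedule (leaflet)"}))
cdi.add(Resource("rec-003", "digital", {"type": "video"}))
aggregated = cdi.aggregate()
```

Keying the aggregate by resource identifier keeps the description of a record distinct from the description of the performance it documents, which is the separation the model relies on.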
Seeking interoperability between the proposed CHDE model and existing cultural
heritage models, we attempted to crosswalk CHDE main cultural heritage classes and
instantiations to CIDOC-CRM and FRBRoo classes as follows (Tables 1 and 2).
Findings from these mappings are discussed in the following section.
Table 1. Crosswalk between CHDE main heritage classes with CIDOC-CRM and FRBRoo
According to Table 1, starting from the left-hand side, the column titled Category denotes
mainly the Tangible and Intangible Cultural Heritage (TCH and ICH). The table then
identifies the main heritage classes, such as Movable cultural heritage and Oral
traditions, defined by UNESCO [18, 19]. Beyond this main division, the classes are further
divided into sub-classes, which are not shown in Table 1. Through these sub-classes,
Related Terms that explain the content of the Main Classes were selected from the AAT and
AFS thesauri [17, 14]. Subsequently, we matched these classes to CIDOC-CRM and FRBRoo
classes. Symbols used in the tables (Subclass: *, Equal Class: =, Not Equal: ≠) show the
relationships between CIDOC-CRM and FRBRoo classes.
Table 2 presents the Instantiation Classes related to Intangible Cultural Heritage.
Instance classes are realized according to questions covering the Temporal, Location,
Category, Agent and Activity Classes. In addition, another instance class, named Concept
Class, was added to represent conceptual entities.
Table 2. Crosswalk between CHDE instance classes with CIDOC-CRM and FRBRoo
6 Discussion
it was marked using the (≠) sign. A similar case can be identified in Table 1 between
E25 Man-Made Feature and F53 Material Copy. According to FRBRoo, F53 is a
subclass of E25. Nevertheless, the F53 description is about the physical material of an
information carrier such as a book or a CD [5, 13]. Therefore, it is not equal to the
CHDE Main classes such as monuments or archaeological sites.
This paper has proposed the CHDE model to collect CHI, which is primarily designed
to organize digital collections of cultural heritage. Resource identification and
integration were done in line with the One-to-One Principle of metadata, which gives a
clear discrimination between the CHI and its original object. Through the crosswalk
between the CHDE model and CIDOC-CRM and FRBRoo, we sought to identify the
CHDE classes and their relationships. It is possible to understand and express the CHDE
classes through the existing CIDOC classes as well. However, the division between
tangible and intangible cultural heritage, and between their physical and digital
resources, is not entirely expressed through these existing models. Therefore, developing
a generalized model such as the CHDE model can be a solution to distinguish the physical
and digital entities of a cultural heritage asset in diverse environments.
This would be a novel approach to the CHI domain in the region, and this paper will
be a foundation for that effort. The interrelation between intangible and tangible CHI,
the method of collecting physical and digital records and their metadata, and the
metadata aggregation process are still under investigation and will be the future
direction of this study.
Acknowledgements. This study has been partially supported by JSPS Kaken Grant-in-Aid for
Scientific Research (A) #16H01754. The authors wish to express their appreciation to professors
Atsuyuki Morishima, Mitsuharu Nagamori and all members at the Metadata Lab of the Graduate
School of Library, Information and Media Studies, University of Tsukuba, for the guidance and
support provided.
References
1. Amin, R., Baker, O.F., Deraman, A., Yatim, N.F.M.: Transforming model to meta model for
knowledge repository of Malay intangible culture heritage of Malaysia. Int. J. Electr.
Comput. Eng. 2(2), 231–238 (2012). doi:10.11591/ijece.v2i2.205
2. Bizer, C., Heath, T., Berners-Lee, T.: Linked data-the story so far. In: Semantic Services,
Interoperability and Web Applications: Emerging Concepts, pp. 205–227. IGI Global, USA
(2009)
3. Carboni, N., de Luca, L.: Towards a conceptual foundation for documenting tangible and
intangible elements of a cultural object. Digital Appl. Archaeol. Cult. Heritage 3(4),
108–116 (2016). doi:10.1016/j.daach.2016.11.001
4. Chen, Y.N., Ke, H.R.: FRBRoo-based approach to heterogeneous metadata integration.
J. Doc. 69(5), 623–637 (2013)
5. Chryssoula, B., Doerr, M., Le Bœuf, P., Riva, P. (eds.): Definition of FRBROO a
Conceptual Model for Bibliographic Information in Object-Oriented Formalism. Interna-
tional Working Group on FRBR and CIDOC CRM Harmonisation, version 2.4 (2015)
6. CIDOC-CRM. http://www.cidoc-crm.org/
7. DCMI: Cultural heritage metadata task group. http://dublincore.org/archive/mediawiki_wiki/
Cultural_Heritage_Metadata_Task_Group/
8. Hansen, H.J., Fernie, K.: CARARE: connecting archaeology and architecture in Europeana.
In: Ioannides, M., Fellner, D., Georgopoulos, A., Hadjimitsis, Diofantos G. (eds.) EuroMed
2010. LNCS, vol. 6436, pp. 450–462. Springer, Heidelberg (2010). doi:10.1007/978-3-642-
16873-4_36
9. Hu, J., Lv, Y., Zhang, M.: The ontology design of intangible cultural heritage based on
CIDOC CRM. Int. J. U-and E-Serv. 7(1), 261–274 (2014)
10. Hyvönen, E.: Publishing and using cultural heritage linked data on the semantic web. Synth.
Lect. Seman. Web: Theor. Technol. 2(1), 1–159 (2012)
11. Isaac, A. (ed.): Europeana data model primer (2013). http://pro.europeana.eu/files/Europeana_
Professional/Share_your_data/Technical_requirements/EDM_Documentation/EDM_Primer_
130714.pdf
12. Lanzi, E.: Introduction to Vocabularies: Enhancing Access to Cultural Heritage Information.
Getty Publications, Los Angeles (1999)
13. Le Boeuf, P., Doerr, M., Ore, C.E., Stead, S. (eds.): Definition of the CIDOC conceptual
reference model. ICOM/CIDOC Documentation Standards Group and CIDOC CRM Special
Interest Group, version 6.2.1 (2015)
14. Library of Congress: American Folklore Society Ethnographic Thesaurus. http://id.loc.gov/
vocabulary/ethnographicTerms.html
15. Miller, S.J.: The one-to-one principle: challenges in current practice. In: International
Conference on Dublin Core and Metadata Applications, pp. 150–164 (2010)
16. Tan, G., Hao, T., Zhong, Z.: A knowledge modeling framework for intangible cultural
heritage based on ontology. In: Knowledge Acquisition and Modeling, KAM 2009, vol. 1,
pp. 304–307. IEEE (2009). doi:10.1109/KAM.2009.17
17. The Getty Research Institute: Art & Architecture Thesaurus® Online. http://www.getty.edu/
research/tools/vocabularies/aat/index.html
18. UNESCO: Intangible cultural heritage. https://ich.unesco.org/en/what-is-intangible-heritage-
00003
19. UNESCO: Tangible cultural heritage. http://www.unesco.org/new/en/cairo/culture/tangible-
cultural-heritage/
20. Wijesundara, C., Sugimoto, S., Narayan, B., Tuamsuk, K.: Bringing cultural heritage
information from developing regions to the global information space as linked open data: an
exploratory metadata aggregation model for Sri Lankan heritage and its extension. In: 7th
Asia-Pacific Conference on Library and Information Education and Practice (A-LIEP),
pp. 117–132 (2016)
21. Winda, M., Wijesundara, C., Sugimoto, S.: Modeling digital archives of intangible cultural
heritage based on one-to-one principle of metadata. In: 8th Asia-Pacific Conference on
Library and Information Education and Practice (A-LIEP). Manuscript submitted for
publication
Data Sharing and Retrieval
Is Data Retrieval Different from Text
Retrieval? An Exploratory Study
1 Introduction
Data, regarded as the world’s most valuable resource [1], and the lifeblood of research
[2], transcends all domains of scholarship, and could take on a variety of forms including
text, sound, still images, moving images, models, games, and simulations as well as
structured databases [3, 4]. Studies show several benefits of research data sharing [5]. As
a result, governments and research funding bodies are increasingly pushing for open
access and sharing of research data, especially when such data is generated through
publicly funded research. This is all very positive, but until researchers and interested
parties are able to find and use data as and when they need it and with tolerable ease, the
vision of open access and sharing of data cannot be fully realized. Data retrieval systems
are presently still at a relatively early stage of development, with the majority of research
data repositories using the same or slightly tweaked versions of text retrieval engines for
data retrieval. The fundamental characteristics of research data, and the form of its user
interaction, differ considerably from those of research publications (text), and both
differences make it impractical to expect standard text retrieval engines to adapt well to
data. While
both text and datasets can be tagged with metadata, the task of tagging the latter is often
more complex; and unlike the indexing of research papers by services like Web of
Science, the indexing of research datasets is not standardized or controlled. One of the
key challenges of data retrieval arises from this lack of use of standard metadata and
documentation to contextualize data sufficiently for discovery and reuse [2–7]. This
paper does not aim to expound on the theoretical differences between text retrieval and
© Springer International Publishing AG 2017
S. Choemprayong et al. (Eds.): ICADL 2017, LNCS 10647, pp. 97–103, 2017.
https://doi.org/10.1007/978-3-319-70232-2_8
98 M. Bugaje and G. Chowdhury
data retrieval, but to enquire into the following research questions through an exploratory
study:
1. Are there any major differences between the search results of text retrieval and data
retrieval services, particularly in terms of:
a. The nature and volume of files retrieved; and
b. The currently supported functionalities for interacting with the retrieved files; and
2. What are the implications of the above, resource-wise and otherwise? and
3. What measures could be taken to improve the efficiency of data retrieval services?
2 Research Methodology
The research reported in this paper is based on a controlled experiment that aimed to
demonstrate some fundamental differences between text retrieval and data retrieval
from the point of view of interaction and retrieval features of a typical text retrieval
system and some commonly used data retrieval systems. Wikipedia1 organizes all
academic disciplines into 5 broad domains: Arts, Humanities, Social sciences, Natural
sciences, and Applied sciences. For the purpose of this experiment we have slightly
re-organized these further into four broad disciplines by merging together Arts and
Humanities, and having Computer & Information Science represent its parent disci-
pline of Applied Sciences. As each of the domains in the original Wikipedia classifi-
cation is still well-represented, neither reshuffle is likely to affect the results of the
experiment, being done mainly for convenience in the former case and to put the
authors’ subject knowledge to full advantage in the latter case. Five keywords and/or
phrases were selected at random from the Wikipedia homepage of each respective
discipline, and a search was conducted on the keyword/phrase in both data retrieval and
text retrieval contexts. Thomson Reuters Web of Science2 database, being the most
comprehensive database for research publications, was employed for the text retrieval
portion of the experiment; while a total of three research data repositories, viz. UK Data
Service3, DataOne4, and Dryad5 were used for the data retrieval portion. The selection
of repositories was based on recommendations by re3data.org and Nature6; i.e. UK
Data Service for Arts, Humanities, and Social Sciences data; DataOne for Natural
Sciences data; and in the absence of a special Computer & Information Sciences data
repository, Dryad, which is generalist. For both the data retrieval and text retrieval
halves of the experiment, only the first 10 items of search results were considered,
except in instances when an item so obviously departs from the intended topic, in
which case the item is skipped and the next item is considered in its stead. As we have
tried to mimic a typical search scenario of a researcher in a real world situation, the
choice of only 10 items emanates from research on user search behavior which shows
that well over half of search engine users do not go past the first page of search results
[8–10]; and 10 just happens to be the default minimum number of results on a single
page that is common to most search engines [11, 12], including, in our present case,
Thomson Reuters Web of Science and UK Data Service.
File sizes and formats were noted for each of the items considered: for publications
(text retrieval) this constitutes the full research paper; and for datasets (data retrieval)
this constitutes the dataset itself as well as all of its documentation files, if any. A
uniformity of file format was noted in the text retrieval portion of the experiment, all of
the research papers being in PDF format, viewable in the web browser, and downloadable at
the user’s discretion. The sources featured include IEEE Xplore, Sage, ScienceDirect,
Taylor & Francis, PLOS One, and JSTOR, among others. Conversely, for the data retrieval
part, over 20 different file formats were noted, notwithstanding the more or less
homogenizing effect of our decision to always give preference to non-proprietary formats
(e.g. txt, CSV, tab-delimited) over proprietary formats (e.g. Stata, SPSS, XLS, MATLAB)
wherever possible. Also, as file sizes varied for the same dataset on account of the
aforementioned multiplicity of file formats, preference was given to the rendering that is
smaller in size. Unlike their publication counterparts, datasets cannot be viewed in the
web browser, but must be downloaded before even a cursory glimpse of their content can be
had.
1 https://www.wikipedia.org.
2 https://www.webofknowledge.com/.
3 https://www.ukdataservice.ac.uk/.
4 https://www.dataone.org/.
5 datadryad.org/.
6 https://www.nature.com/sdata/policies/repositories.
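The two selection rules described in this section (take the first 10 on-topic results, and for each dataset prefer a non-proprietary format and then the smaller rendering) can be sketched as follows. This is only an illustration under stated assumptions: in the study relevance was judged by the researchers, so `is_relevant` is a hypothetical stand-in, and the `Rendering` class and the `OPEN_FORMATS` set are names introduced here, not part of the original experiment.

```python
from dataclasses import dataclass

# Non-proprietary examples named in the text; the set itself is an assumption.
OPEN_FORMATS = {"txt", "csv", "tab"}


@dataclass
class Rendering:
    """One downloadable version of a dataset."""
    file_format: str
    size_mb: float


def first_relevant(results, is_relevant, limit=10):
    """Walk the ranked results, skipping off-topic items, until `limit` are kept."""
    picked = []
    for item in results:
        if is_relevant(item):
            picked.append(item)
            if len(picked) == limit:
                break
    return picked


def preferred_rendering(renderings):
    """Prefer open formats; among equally preferred formats take the smaller file."""
    return min(
        renderings,
        key=lambda r: (r.file_format.lower() not in OPEN_FORMATS, r.size_mb),
    )


# Toy data: result 3 plays the role of an obviously off-topic item.
kept = first_relevant(range(15), lambda item: item != 3)
choice = preferred_rendering(
    [Rendering("spss", 1.0), Rendering("csv", 4.0), Rendering("txt", 2.5)]
)
```

In the toy run the skipped item is replaced by the next one in rank order, so ten items are still considered, and the open `txt` rendering is chosen even though a smaller proprietary file exists.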
3 Findings
Figure 1 shows, for each keyword in each discipline, what proportion of the total file
size retrieved constitutes research datasets and research publications. It can be seen
that the file sizes of research datasets generally and significantly exceed those of
research publications, so that, in some cases (i.e. search behavior, face recognition,
computer vision, ‘renewable energy’, and ‘ultraviolet light’), the whole appears to be
composed entirely of research datasets. That is not really so: the observation merely
demonstrates the overwhelming disproportion in average file size between research
datasets and research publications (text) in those subjects. Table 1 provides a more
accurate representation, where the average file size of a single research dataset may in
some cases be observed to amount to as much as 900 times the average file size of a
single research publication.
Other key observations of this study with regards to the first research question are:
1. Average file size of retrieved datasets is several times larger than that of retrieved
research publication files; and these in turn vary from one discipline to another.
2. Unlike publications, the retrieved datasets may be of different file types or formats.
3. Whereas research publications comprise only the publication itself, research
datasets are almost always accompanied by separate documentation files (up to 22
have been noted in this experiment). Each piece of documentation furnishes further
information about the dataset in question, and may be necessary for its potential
re-use. These documentation files tend to include code snippets, original survey
Fig. 1. The relative file size proportions for research datasets and research publications out of
the overall total file size of all the files retrieved for each keyword.
For the second research question, it may thus be argued that the currently available
research data retrieval services are unsustainable (for details on sustainability of
information see [13, 14]). They involve an unnecessarily high consumption of valuable
resources, in terms of both the network and researcher time, given the evidence from
Table 1 and the fact that research datasets must first be downloaded before their contents
can even be viewed. These unnecessary downloads of large volumes of data also consume
energy, which could have severe environmental and economic implications in terms of
server costs, etc. (see [13, 14] for details). In addition, these empty downloads present
a major stumbling block to the reliability of research data impact indicators which cite
download count as a measure of the impact, usefulness,
Table 1. Average sizes of files retrieved for research datasets and research publications.
Discipline                       Keywords                      Data Retrieval*   Text Retrieval**   Approx. ratio of text to data
Arts & Humanities                Art museums                   6.205 MB          0.820 MB           1:8
                                 Nineteenth century            2.898 MB          1.042 MB           1:3
                                 “World war”                   6.158 MB          0.508 MB           1:12
                                 Medieval                      5.158 MB          1.091 MB           1:5
                                 Popular music                 9.334 MB          1.000 MB           1:9
Social Sciences                  Unemployment                  4.729 MB          0.455 MB           1:10
                                 Cognition                     13.340 MB         1.612 MB           1:8
                                 “Labour law”                  2.827 MB          0.410 MB           1:7
                                 “Trade union”                 15.939 MB         0.748 MB           1:21
                                 Imprisonment                  2.444 MB          0.503 MB           1:5
Computer & Information Science   Search behavior               657.707 MB        0.731 MB           1:900
                                 Face recognition              1.394 GB          1.535 MB           1:908
                                 Computer vision               1.339 GB          2.782 MB           1:481
                                 Research data sharing         1.574 MB          0.521 MB           1:3
                                 Social media data             19.597 MB         1.078 MB           1:18
Natural Sciences                 Marine life                   32.318 MB         1.491 MB           1:22
                                 “Climate change”              2.808 MB          2.497 MB           1:1
                                 “Renewable energy”            766.432 MB        3.606 MB           1:213
                                 “Ultraviolet light”           496.745 MB        1.991 MB           1:250
                                 “Oxidative phosphorylation”   41.177 MB         1.895 MB           1:22
*Average File Size, inclusive of documentation
**Average File Size
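As a worked check of how the last column of Table 1 can be read, the short sketch below recomputes the approximate text-to-data ratio for a few rows. The sizes are copied from the table; treating 1 GB as 1000 MB and rounding to the nearest whole number are assumptions made here, since the paper does not state how the ratios were derived.

```python
# (data retrieval average, text retrieval average) in MB, copied from Table 1.
# The GB value is converted assuming 1 GB = 1000 MB.
TABLE_1_SAMPLE = {
    "Art museums": (6.205, 0.820),
    "Trade union": (15.939, 0.748),
    "Search behavior": (657.707, 0.731),
    "Face recognition": (1394.0, 1.535),
    "Climate change": (2.808, 2.497),
}


def data_per_text(data_mb, text_mb):
    """How many average publications fit into one average dataset, rounded."""
    return round(data_mb / text_mb)


ratios = {
    keyword: f"1:{data_per_text(data_mb, text_mb)}"
    for keyword, (data_mb, text_mb) in TABLE_1_SAMPLE.items()
}
```

Under these assumptions the recomputed values agree with the ratios printed in the table for the sampled rows, including the roughly 900-fold difference for "Search behavior".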
2. The efficiency of data retrieval systems can be improved through a combination of:
a. Providing adequate contextual information for retrieved datasets in the form of
metadata specific to that discipline and thorough documentation. However, this
would be highly resource-intensive if done manually, and therefore new auto-
mated and software-assisted means must be developed;
b. Better training for researchers on the use of metadata and other tagging methods;
c. The development of improved functionalities for data repositories, with inter-
active options allowing datasets to be previewed on the browser before
download;
d. Undertaking more research in the area of data retrieval with user-centered focus
to better understand the data seeking and use behavior of researchers, and to
develop models of users and usability for data retrieval; and
e. Building reliable weighting and ranking methods for research datasets to guide
users.
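To make recommendation (c) concrete, the following is a minimal sketch of a preview step that reads only the header and the first few rows of a tabular dataset instead of the whole file. It is an illustration, not a description of any existing repository feature: the function name `preview_rows` and the in-memory sample are hypothetical, and a real service would pass a streamed response body rather than a `StringIO` object.

```python
import csv
import io
from itertools import islice


def preview_rows(stream, max_rows=5):
    """Return the header and up to `max_rows` data rows, leaving the rest unread."""
    reader = csv.reader(stream)
    header = next(reader, None)
    rows = list(islice(reader, max_rows))
    return header, rows


# Stand-in for a large remote CSV file (hypothetical content).
sample = io.StringIO(
    "id,keyword,size_mb\n"
    + "".join(f"{i},k{i},{i * 1.5}\n" for i in range(1000))
)
header, rows = preview_rows(sample, max_rows=3)
```

Because the reader is consumed lazily, only the previewed lines need to be transferred or parsed, which is the saving in network and researcher time that the recommendation is aiming at.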
4 Conclusion
This study provides some useful insights on data retrieval, and is methodologically
designed such that the experiment could be repeated with different parameters and
variables to gain further insight. The average file size of research datasets is often
several times larger than that of research publications. Moreover, often the retrieved
datasets cannot be read or used online, but must be downloaded first before anything
can be done with them; as a result of which users end up downloading files without full
knowledge of the files’ contents or usefulness. This is further compounded by the
differences in file types, size, format, and/or documentation; all of which have major
implications for search efficiency and resource requirements. Also, besides wastefully
consuming large amounts of disk space and network resources, the unnecessary
downloading of multitudes of large datasets (with all associated documentation) falsely
inflates the download count, effectively rendering this metric unreliable as an indicator
of data impact, usefulness, or popularity. Moreover, as a prerequisite to reusing a
research dataset, the user often spends a considerable amount of time wading through
the dataset and its documentation to gain an understanding of it; an endeavor that is
made harder by the lack of a standardized tagging and documentation system for research
datasets.
Research shows that energy consumption increases with server load because energy is
consumed during both phases: while doing computing work and while waiting for database
data to arrive [15]. Hence, a reduction in the volume of data downloaded will reduce the
energy consumption of the IT infrastructure of data services as well as of universities
and research institutions, thereby reducing the environmental
user-centered, and perhaps discipline-specific data retrieval services.
References
1. The world’s most valuable resource. The Economist, p. 9, 6–12 May 2017
2. Borgman, C.L.: The conundrum of sharing research data. J. Am. Soc. Inf. Sci. Technol.
63(6), 1059–1078 (2012)
3. Borgman, C.L.: Big Data, Little Data, No Data: Scholarship in the Networked World.
The MIT Press, Cambridge (2015)
4. Borgman, C.L., Wallis, J.C., Mayernik, M.S.: Who’s got the data? Interdependencies in
science and technology collaborations. Comput. Support. Coop. Work 21(6), 485–523
(2012)
5. The Data harvest: How sharing research data can yield knowledge, jobs and growth.
An RDA Europe report, December 2014. https://rd-alliance.org/sites/default/files/
attachment/The%20Data%20Harvest%20Final.pdf. Accessed 11 June 2017
6. MacMillan, D.: Data sharing and discovery: what librarians need to know. J. Acad.
Librarianship 40(5), 541–549 (2014)
7. Wallis, J.C., Rolando, E., Borgman, C.L.: If we share data, will anyone use them? Data
sharing and reuse in the long tail of science and technology. PLoS ONE 8(7), e67332 (2013).
doi:10.1371/journal.pone.0067332
8. Jansen, B.J., Spink, A.: How are we searching the world wide web? A comparison of nine
search engine transaction logs. Inf. Process. Manag. 42(1), 248–263 (2006)
9. Spink, A., Wolfram, D., Jansen, B.J., Saracevic, T.: Searching the web: the public and their
queries. J. Am. Soc. Inf. Sci. 53(2), 226–234 (2001)
10. Richardson, M., Dominowska, E., Ragno, R.: Predicting clicks: estimating the click-through
rate for new ads. In: Proceedings of the 16th International Conference on World Wide Web,
WWW 2007, pp. 521–530 (2007)
11. Maley, C., Baum, N.: Getting to the top of Google: search engine optimization. J. Med. Pract.
Manag. MPM 25(5), 301–303 (2010)
12. Wu, M., Marian, A.: A framework for corroborating answers from multiple web sources. Inf.
Syst. 36(2), 431–449 (2011)
13. Chowdhury, G.G.: Sustainability of Scholarly Information. Facet Publishing, London (2014)
14. Chowdhury, G.G.: How to improve the sustainability of digital libraries and information
services? J. Assoc. Inf. Sci. Technol. 67(10), 2379–2391 (2016)
15. Boru, D., Kliazovich, D., Granelli, F., Bouvry, P., Zomaya, A.Y.: Energy-efficient data
replication in cloud computing datacenters. Cluster Comput. 18(1), 385–402 (2015)
Preparedness for Research Data Sharing:
A Study of University Researchers in Three
European Countries
Abstract. Many government and funding bodies around the world have been
advocating open access to research data, arguing that such open access can bring a
significant degree of economic and social benefit. However, the question remains:
do researchers themselves want to share their research data, and even if they do,
how far are they prepared to make this happen? In this paper we report on an
international survey involving university researchers in three countries, viz. UK,
France and Turkey. We found that researchers have a number of concerns about data
sharing, and in general there is a lack of understanding of the requirements for
making data publicly available and accessible. We note that significant training
and advocacy will be required to make the vision of data sharing a reality.
1 Introduction
Researchers have described data as the glue of a collaboration [1], and the lifeblood of
research [2]. Several benefits of research data sharing have been highlighted including
for example, economic growth, increased resource efficiency and securing public
support for research funding [3, 4]. It is reported that: “in the US, one study estimated
the $13 billion in government spending on the Human Genome project and its suc-
cessors has yielded a total economic benefit of about $1 trillion. A British study of its
public economic and social research database found that for every £1 invested by the
government, an economic return of £5.40 resulted” [5]. Given these benefits, various
governments and funding bodies are pushing for open access (OA) to research data. However,
take-up of the concept of OA and sharing amongst researchers has been low. This may
be attributable to a lack of skills and knowledge in making data discoverable, acces-
sible, and reusable for others’ research; or it may be attributed to issues related to trust,
reputation, and ethics [5–9].
The exploratory research reported in this paper addresses the following questions:
1. Are university researchers willing to share their research data and what concerns do
they have with regard to research data sharing?
2. Are university researchers familiar with various activities and preparation needed to
make data shareable and usable?
The UK Data Archive [4] proposes a research data lifecycle that comprises six major
sets of activities some of which, such as data creation, data access, analysis and re-use,
are undertaken or primarily driven by researchers in a specific discipline. Researchers,
therefore, also have some important roles in research data management activities.
However, not much is known about how researchers go about managing their data and
whether they are willing to share data with others outside of the immediate research
collaboration [10]. Often researchers are preoccupied by immediate issues of backing
up data rather than the longer term question of preservation [11]. In fact, the sharing of
data even within interdisciplinary projects is highly problematic [12]: incentives to
release data are lacking, the adoption of data repositories remains slow, and there
are questions regarding their design, in that they are optimised for performance rather
than scientific enquiry [2]. This is especially troublesome given the advent of “big
science” and the emergence of the “fourth paradigm”, which is a “computational
data-intensive approach to science that constitutes a new set of methods beyond
empiricism, theory and simulation” [10]. However, this appears to be less of a problem for
large, well-established and long-lasting collaborations than it is for small-scale,
short-lived collaborative projects, which are often called the “long tail of science and
technology” [10]. In these small teams, methods tend to be local and specific to the
research at hand, and reusing the data requires a great deal of contextual knowledge about
procedure. Without contextual information, where data have been separated from context,
reuse can become “difficult or impossible” [13]. Within “long tail” research projects,
data sharing is described as a “gift culture” [10] where data is bartered between
colleagues in trusted relationships. Therefore, providers of data will overcome problems
of context and documentation for trusted others, but this is clearly unsustainable in the
longer term.
One of the key challenges of data sharing is that it requires standard metadata and
documentation to contextualise data sufficiently for re-use [2] and discovery [13, 15]
outside of the collaboration for which it was intended. MacMillan [14] notes that very
few researchers (22%) use metadata, preferring to use their own laboratory standards
instead, a view supported by Carlson et al. [11]. An underlying technical issue is the
decreasing lifespan of data storage formats, which require more sustainable data
management practices [14]. Furthermore, researchers lack the data curation skills and
these are not addressed at undergraduate level [16]. There is also a lack of academic
106 G. Chowdhury et al.
credit or reward for data curation [10, 14], and for developing common data structures,
metadata formats and ontologies to support data mining [2, 13, 15]. The role of edu-
cation should not be underestimated here. In a study of researchers in the area of health
for example, 77% of researchers reported that they had, “never received any formal
training” and reported their expertise as “very low” in data management [17, p. 54].
Another survey amongst over 2000 academics and researchers from around the world
noted that, “researchers do not know how open they have made their data - 60% of
respondents are unsure about the licensing conditions under which they have shared
their data, and thus the extent to which it can be accessed or reused” [18, p.14]. This
clearly indicates that there is a significant gap in awareness and understanding which
needs to be addressed [11, 19].
3 Research Method
implemented services for research data storage, analysis and curation. There are no
units within research institutions which provide support to researchers who would like
to store and share their research data [21, 22]. However, starting from 2012, there have
been several initiatives which aim to increase awareness of the importance of the
subject and address the current situation in Turkey.
It is clear that RDM technology and policy developments are at varying levels in
the UK, France and Turkey. The choice of the universities was based on a slightly
different criterion: the three chosen universities are similar in nature, in that each
gives emphasis to both teaching and research and each has an increasing demand
for national and international research in different fields. This approach was
employed to gain a sense of the awareness of, and preparedness for, RDM amongst
academics and researchers in universities of similar nature but from three different
countries that have different levels of progress in overall RDM activities.
The survey was developed by the researchers and a pilot study was carried out first
to make sure that all questions were clear and understandable. Based on the pilot study
results, the survey instrument was developed. E-mail invitations were sent out to the
academics and researchers in the three chosen universities. There were 26 questions to
collect data on: researcher information – role, discipline, gender, experience, etc.;
nature of data collected, created, etc.; data sharing practices, concerns; familiarity with
data management practices and policies/challenges including knowledge of metadata,
training, etc. The research reported in this paper addresses only those questions in the
questionnaire that are related to data sharing practices, concerns, and researchers’
awareness of and familiarity with, for example, various RDM tools, techniques and
policies. SPSS was used to analyse the dataset, and Chi-Square tests, at the 0.05
significance level, were conducted to find correlations between researchers’ behaviour in
different areas of RDM, especially with regard to data tagging and storage, sharing and
re-use of research data, etc., and researchers’ characteristics such as country, discipline,
age, gender and years of experience.
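As an illustration of the kind of test described above, the following minimal sketch computes the Pearson Chi-Square statistic for a contingency table and compares it with the critical value at the 0.05 significance level. The study itself used SPSS; the function names and the counts in the example table are hypothetical and are not taken from the survey data.

```python
# Critical values of the Chi-Square distribution at the 0.05 significance
# level, indexed by degrees of freedom (standard table values).
CRITICAL_005 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_square_statistic(table):
    """Return (statistic, degrees_of_freedom) for a contingency table.

    `table` is a list of rows, each a list of observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def is_significant(table):
    """True if the statistic exceeds the 0.05 critical value (dof 1 to 4 only)."""
    statistic, dof = chi_square_statistic(table)
    return statistic > CRITICAL_005[dof]


# Hypothetical counts: rows are three countries, columns are
# "shares data" / "does not share data".
example = [[30, 25], [60, 9], [70, 9]]
statistic, dof = chi_square_statistic(example)
print(f"chi2({dof}) = {statistic:.3f}, significant: {is_significant(example)}")
```

A table with three rows and two columns has 2 degrees of freedom, which matches the form χ²(2) used for the country comparisons in the results.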
4 Data Sharing
Conducted in the summer of 2016, this survey received a total of 215 completed
responses. Tables 1 and 2 present the general demographic data by country and years
of experience. The OECD classification of disciplines [23] was provided as a list for the
respondents to choose from.
However, in order to be able to run correlation tests, subject categories were merged
under larger groups such as sciences, social sciences and humanities: 53% were from social
sciences, 25% from humanities and 23% from sciences. Table 3 shows the user behaviour
in relation to data sharing. Statistically significant differences were detected between
specific behaviours with regard to sharing data with others and country (C), years of
experience (E), discipline (D), e.g. sharing with own team (C: χ²(2) = 41.858,
p = 0.000; E: χ²(4) = 22.305, p = 0.000; D: χ²(2) = 9.376, p = 0.009), sharing with
researchers in the same university (C: χ²(2) = 14.382, p = 0.001; E: χ²(2) = 14.931,
p = 0.005), sharing with researchers in other institutions (C: χ²(2) = 6.419,
p = 0.040; E: χ²(4) = 24.445, p = 0.000; D: χ²(2) = 7.108, p = 0.029), and not
sharing data (C: χ²(2) = 28.539, p = 0.000; E: χ²(4) = 34.924, p = 0.000;
D: χ²(2) = 7.171, p = 0.028). A significant difference was detected between researchers’
behaviour for not sharing data and country (χ²(2) = 28.539, p = 0.000). Whilst nearly half
(45%) of the UK researchers claim that they do not collaborate in data sharing, this is
significantly less for the other two countries: approximately 13% in France and 11% in
Turkey. A statistically significant difference was also detected between sharing data with
own team and country (χ²(2) = 41.858, p = 0.000), sharing data with researchers in the
same university (χ²(2) = 14.382, p = 0.001), sharing data with researchers in other
institutions (χ²(2) = 6.419, p = 0.040) and country.
Preparedness for Research Data Sharing 109
Researchers use different coding or tagging for their datasets (Table 4). However, not
all of them are familiar with the concept of metadata, nor do they always use standard
metadata (Table 5). Nearly a third of the researchers are either uncertain or are not
familiar with the concept of metadata (Table 6). Nearly 95% of the researchers are
either uncertain or do not know whether their university has a prescribed metadata set
for uploading data onto the repository. However, nearly 60% of researchers feel that a
formal training on metadata would be useful for managing research data.
Only one metadata-related behaviour in relation to data use, viz. using datasets that are
tagged with standard metadata, had a significant correlation with researchers’ experience;
and no significant correlation was found between researchers’ metadata-related behaviour
and their gender. No significant correlations were found between researchers’ status and
tagging of datasets. Only one tagging behaviour (description of the data file,
χ²(2) = 13.048, p = 0.001) correlated with discipline: descriptions of a data file are used
most by researchers in science (48%) and least by researchers in humanities (15%).
Correlations were also detected between country and some tagging behaviours (such as no
assignment, χ²(2) = 10.559, p = 0.005; administrative information, χ²(2) = 9.318,
p = 0.009; discovery information, χ²(2) = 13.508, p = 0.001; and technical information,
χ²(2) = 14.434, p = 0.001). The number of researchers who do not assign tags and
metadata to their datasets is higher in the UK (46%); and assigning administrative (38%),
discovery (20%) and technical (15%) information to datasets is the lowest in the UK.
Only 23% of researchers agree that their university encourages OA and data sharing,
and only 31% of researchers are familiar with the OA requirements (Table 7).
Researchers have different views on the potential benefits and challenges of OA and
data sharing (Table 8); only 55% of researchers are comfortable and willing to share
research data; 67.5% of researchers perceive that data ethics could be an issue for data
sharing. Researchers do have a number of concerns for making data available in open
access mode (Table 9); and some of the key concerns of researchers include: legal and
ethical issues, misuse and misinterpretation of data, and fear of losing the scientific
edge (Table 10).
Despite various government and funding body mandates, researchers still do not appear to be
very familiar with DMPs: two-thirds or more of researchers are either uncertain or do
not know whether their institution has a DMP, and only a quarter of researchers have or
have used a DMP for their research (Table 11). However, on a positive note, 40% of
researchers believe that a DMP helps researchers manage their data. Tables 12 and 13
show that very few researchers practise or use standard file naming systems which is a
key requirement of a good data management system. Very few people had any formal
training on different aspects of data management that are essential for research data
sharing and use (Table 14):
• Only 6.5% had any formal training on DMP;
• Only 10% had a formal training on metadata;
• Only 2.8% had any training on version control, etc.
However, over 77% of researchers are willing to take formal training on these
topics. A significant correlation was observed between the researchers’ country and
their opinion about the role of the universities in recommending a standard file
naming system (χ²(8) = 41.927, p = 0.000). Turkish researchers had the highest score
(53%) in this regard. Some correlations were discovered between researchers’ country
and the use of a standard file naming system (χ²(4) = 15.711, p = 0.003), use of a
standard style for citing research data (χ²(4) = 14.214, p = 0.007), being recommended a
specific guideline for citing data by the university (χ²(4) = 29.136, p = 0.000) and
owning a unique researcher ID (χ²(4) = 13.390, p = 0.010). More than 40% of
researchers in the UK own a unique researcher ID, while this is only 17% in France. Whilst
approximately 60% of researchers in both the UK and Turkey claim that their universities
recommend some guidelines for citing data, for France it is only 15%. Nearly
half (46%) of the researchers from France also claimed that they do not use a standard
style for citing research data.
This study shows that the culture of OA and data sharing is not yet common: only
about 40% of researchers almost always or often use OA data, and only about 23%
work with datasets with restricted access. In most cases (80%) researchers have to put
in some effort before they can make use of OA data. There may be several reasons for
this: for example, data may not be tagged properly, a standard metadata set may not have
been used, or researchers may not be familiar with tagging or data management. In general,
nearly 80% of researchers do not want to share data with anyone. Less than a quarter of
researchers agree that their university encourages OA and data sharing, and only 31%
of researchers are familiar with the OA requirements of the funding bodies. Nearly 95%
of researchers are either uncertain or do not know whether their university has a
prescribed metadata set. Despite various government and funding body mandates,
the majority (about 80%) of the researchers do not want to share data with others; and the
key concerns for OA and data sharing include: legal and ethical issues, misuse and
misinterpretation of data, and fear of losing the scientific edge. In total, 40% of
researchers do not use a standard data citation style, and only 50% of universities have a
recommended citation style; 61% are familiar with the concept of DOI, but only a third
of the researchers have a unique researcher ID; and researchers do not always find
appropriate systems for version control of datasets.
Although the UK is ahead of the two other countries in terms of research and development
in RDM, the willingness for data sharing is still low: 45% of the UK
researchers claim that they do not collaborate in data sharing. UK researchers appeared
114 G. Chowdhury et al.
to be more reluctant to share data: 28% said they would make data available with
restricted access and 27% will not make data available to anyone else. They also show
the lowest score for making data available upon request (38%). Researchers in France
seem to be more willing to share their research data (74%) and they see data sharing
as less problematic (54%) compared to the other two countries. The number of researchers
who do not assign tags and metadata to their datasets is higher in the UK (46%); whilst
assigning administrative (38%), discovery (20%) and technical (15%) information to
datasets is also the lowest in the UK. However, researchers in Turkey displayed the lowest
score for familiarity with metadata. More than 40% of researchers in the UK own a unique
researcher ID, while this is only 17% in France. Nearly 60% of researchers in both the
UK and Turkey claim that their universities recommend some guidelines for citing
data, but for France it is only 15%; 46% of the researchers from France also do not use
a standard style for citing research data. Two-thirds or more of researchers are either
uncertain or do not know whether their institution has a data management plan (DMP),
and only a quarter of the researchers have used a DMP for their research. Over 70% of
researchers did not have any formal training in DMP, metadata, consistent file naming
and version control or data citation. This corroborates previous research [17] which
noted that 77% of researchers never received any formal training in data management.
Overall, this research demonstrates that a significant number of gaps exist between
researchers’ perceptions and behaviours with regard to research data creation and
sharing, and the ambition of funding bodies and academic institutions with regard to
OA data. The gap in the skill sets required for university researchers can be filled by
developing data literacy which is broadly defined as, “knowing how to select and
synthesise data and combine them with other information sources and prior knowl-
edge” [13, p. 405].
The purpose of this study was to explore whether differences exist amongst
countries, disciplines, and years of experience of the researchers with regard to their
awareness and behaviour in relation to RDM. The findings show a range of interesting
behaviours in research data sharing and various RDM practices displayed by university
academics and researchers that may provide valuable insight for the development of
data literacy training programmes. However, given the relatively small sample size and
response rate, the results, especially the comparison at country, discipline and expe-
rience level, should be taken with some caution. More detailed studies with larger and
more representative samples should be undertaken in order to make reliable compar-
isons amongst these variables.
References
1. Borgman, C.L., Wallis, J.C., Mayernik, M.S.: Who’s got the data? Interdependencies in
science and technology collaborations. Comput. Support. Coop. Work 21(6), 485–523
(2012)
2. Borgman, C.L.: The conundrum of sharing research data. J. Am. Soc. Inf. Sci. Technol.
63(6), 1059–1078 (2012)
3. Beagrie, N., Houghton, J.: The value and impact of data sharing and curation: A synthesis of
three recent studies of UK research data centres, JISC (2013). http://repository.jisc.ac.uk/
5568/1/iDF308_-_Digital_Infrastructure_Directions_Report%2C_Jan14_v1-04.pdf. Acces-
sed 11 June 2017
4. UK Data Archive: Create & manage data. Research data lifecycle. Concordat on open access
data Version 10, 17 July 2015. http://www.rcuk.ac.uk/documents/documents/
concordatopenresearchdata-pdf/. Accessed 11 June 2017
5. The Data harvest: How sharing research data can yield knowledge, jobs and growth.
An RDA Europe report, December 2014. https://rd-alliance.org/sites/default/files/
attachment/The%20Data%20Harvest%20Final.pdf. Accessed 11 June 2017
6. Faniel, I.M., Kriesberg, A., Yakel, E.: Data reuse and sensemaking among novice social
scientists. Proc. Am. Soc. Inf. Sci. Technol. 49(1), 1–10 (2012). doi:10.1002/meet.
14504901068. Accessed 11 June 2017
7. Faniel, I., Kansa, E., Kansa, S.W., Barrera-Gomez, J., Yakel, E.: The challenges of digging
data: a study of context in archaeological data reuse. In: JCDL 2013 Proceedings of the 13th
ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 295–304. ACM, New York, NY
(2013)
8. Yakel, E., Faniel, I.: Virtuous circles: circulating old data through new collaborations. In:
17th ACM Conference on Computer Supported Cooperative Work and Social Computing
Workshop: Sharing, Re-Use and Circulation of Resources in Cooperative Scientific Work.
Baltimore, MD, 15 February 2014
9. Borgman, C.L.: Big Data, Little Data, No Data: Scholarship in the Networked World.
The MIT Press, Cambridge (2015)
10. Wallis, J.C., Rolando, E., Borgman, C.L.: If we share data, will anyone use them? Data
sharing and reuse in the long tail of science and technology. PLoS ONE 8(7), e67332 (2013).
doi:10.1371/journal.pone.0067332
11. Carlson, J., Fosmire, M., Miller, C.C., Nelson, M.S.: Determining data information literacy
needs: a study of students and research faculty. Portal: Libr. Acad. 11(2), 629–657 (2011)
12. Mayernik, M.S., Wallis, J.C., Borgman, C.L.: Unearthing the infrastructure: humans and
sensors in field-based scientific research. Comput. Support. Coop. Work 22(1), 65–101
(2013). doi:10.1007/s10606-012-9178-y
13. Koltay, T.: Data literacy: in search of a name and identity. J. Documentation 71(2), 401–415
(2015). doi:10.1108/JD-02-2014-0026
14. MacMillan, D.: Data sharing and discovery: what librarians need to know. J. Acad.
Librarianship 40(5), 541–549 (2014). doi:10.1016/j.acalib.2014.06.011
15. Verbaan, E., Cox, A.M.: Occupational sub-cultures, jurisdictional struggle and third space:
theorising professional service responses to research data management. J. Acad. Librari-
anship 40(3–4), 211–219 (2014). doi:10.1016/j.acalib.2014.02.008
16. Frank, E.P., Pharo, N.: Academic librarians in data information literacy instruction: a case
study in meteorology. Coll. Res. Libr. 77(4), 536–552 (2016)
17. Federer, L.M., Lu, Y.L., Joubert, D.J.: Data literacy training needs of biomedical
researchers. J. Med. Libr. Assoc. 104(1), 52–57 (2016)
18. Fane, B., Treadway, J., Gallagher, A., Penny, D., Hahnel, M.: Open season for open data: a
survey of researchers. In: Digital Science Report, The State of Open Data: A Selection of
Analyses and Articles About Open Data, Curated by Figshare, Digital Science, pp. 12–19,
October 2016. https://figshare.com/articles/The_State_of_Open_Data_Report/4036398.
Accessed 11 June 2017
19. Prado, J.C., Marzal, M.Á.: Incorporating data literacy into information literacy programs:
Core competencies and contents. Libri 63(2), 123–134 (2013)
20. RCUK Common Principles on data Policy, Research Councils UK. http://www.rcuk.ac.uk/
research/datapolicy/. Accessed 12 Sept 2017
21. Tonta, Y.: Açık erişimin geleceği ve araştırma verilerine açık erişim, (Future of open access
and open access to research data). http://library.bilkent.edu.tr/activities/librarianship-
seminars/presentations/yasar-tonta.pptx. Accessed 12 Sept 2017
22. Aydınoğlu, A.U.: Araştırma verileri yönetimi: Türkiye, (Research data management:
Turkey). Paper Presented at the 5th National Open Access Conference in Ankara, Turkey
(2016)
23. Frascati Manual: Proposed Standard Practice for Surveys on Research and Experimental
Development, 6th edn. http://www.oecd.org/sti/inno/frascatimanualproposedstandardpractice
forsurveysonresearchandexperimentaldevelopment6thedition.htm
Lexical and Discourse Analysis
Deep Stylometry and Lexical & Syntactic
Features Based Author Attribution on PLoS
Digital Repository
1 Background
the writing style of authors. These features, termed as style markers, are considered to
differentiate the writing style of one author from another [3].
Stylometry has a broad scope for academic purposes: it has been used for author
name disambiguation [4], citation pattern matching [5], bibliometrics [6] and plagiarism
detection [7]. Recently, the field of Forensic Stylometry has been introduced, which
aims to identify the mental health of a patient by analyzing his/her writing style
[8]. In addition, it has been used to solve cybercrimes by analyzing language in
order to identify the actual author of suspicious messages, tweets, Facebook profiles,
etc. [9]. Deep learning techniques have been widely adopted for authorship identification
tasks [10], and in the field of Stylometry more generally, due to their extensive
classification power [13]. Both vocabulary-based cues and sequential patterns are provided
to the model [11, 12], which then identifies a pattern for authorship attribution.
In this paper, we address the problem of authorship attribution through Stylometry
on scientific publications downloaded from PLOS.org [14], written by more than 200
unique authors. In contrast with existing models that use features such as
average word length, most frequent words and function words [15–20], we identify the
potential changes in the writing style of different authors using lexical features (including
n-grams and word frequency) and syntactic features (including part-of-speech tags).
Using these stylometric markers, we deploy the k-Means clustering algorithm with the goal
that all papers by a given author be grouped in a single cluster. In addition, we
employ a novel long short-term memory (LSTM) based deep learning model to predict
the author of a given publication. While the unsupervised model shows that 88.17% of
authors are assigned to the correct cluster (all papers written by the same author) with at
most 0.2 coefficient of Entropy error, our LSTM based deep learning model consistently
shows above 95% accuracy across all the given testing samples of publications
written by an author.
2.1 Dataset
We download all 158,918 full-text publications available from PLOS.org [14] (as of
1 July 2015) in XML format. We then identify all 1,506 single-authored publications
in the dataset. Among these publications, we select 803 publications authored by
203 unique authors with at least two publications each. For further processing, we
extract the <body> section, which contains the full text of the publication, from each
XML file and convert it into the “.txt” file format. These “.txt” files are then named
with their respective author identification number, followed by the serial number of the
publication and the total number of available publications (e.g. AU001_1_2.txt represents
the first paper written by author #1, who has 2 publications in total). Further, we
pre-process the dataset by omitting a few seemingly superfluous elements such as tables
and function words (such as and, but, in, may, etc.). For function word removal we use
Apache OpenNLP, a machine learning based toolkit for the pre-processing of natural
language text [21]. For stemming, the Snowball stemmer is used, a common stemming
algorithm for information retrieval pre-processing [22]. Figure 1 shows the distribution
of publications with respect to authors. Our dataset ranges from 136 authors with two
publications each to a single author with 56 publications.
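The extraction and file naming steps can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' actual tooling; the function names are hypothetical, and the sketch assumes that the body element carries no XML namespace.

```python
import xml.etree.ElementTree as ET


def extract_body_text(xml_string):
    """Return the plain text of the first <body> element, or '' if absent."""
    root = ET.fromstring(xml_string)
    body = root if root.tag == "body" else root.find(".//body")
    if body is None:
        return ""
    # itertext() walks all nested elements; split/join normalises whitespace.
    return " ".join("".join(body.itertext()).split())


def make_filename(author_id, serial, total):
    """Build a name such as AU001_1_2.txt (author 1, paper 1 of 2)."""
    return f"AU{author_id:03d}_{serial}_{total}.txt"


# Hypothetical miniature article, used only to exercise the functions.
sample = (
    "<article><front>meta</front>"
    "<body><sec><p>Full <i>text</i> here.</p></sec></body></article>"
)
print(make_filename(1, 1, 2), "->", extract_body_text(sample))
```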
[Fig. 1: bar chart of the number of authors (y-axis) against the number of publications per author (x-axis).]
\[
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \quad \text{where } 0 \le J(A, B) \le 1 \tag{1}
\]
We apply n-grams to our dataset by varying the value of n from 1 to 10. Further, we
calculate the Jaccard similarity for each of the 1- to 10-grams to obtain similarity
matrices of size 803 by 803.
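A pairwise similarity matrix of this kind can be sketched as below. The sketch assumes word-level n-grams and whitespace tokenisation, neither of which is specified in the text, and the three example documents are invented.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; taken as 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(documents, n):
    """Square matrix of Jaccard similarities between all document pairs."""
    gram_sets = [word_ngrams(doc, n) for doc in documents]
    return [[jaccard(x, y) for y in gram_sets] for x in gram_sets]


docs = [
    "the cell divides rapidly",
    "the cell divides slowly",
    "genes encode proteins",
]
for row in similarity_matrix(docs, 2):
    print([round(value, 2) for value in row])
```

Calling `similarity_matrix` once for each n from 1 to 10 would give the ten matrices mentioned in the text; with 803 documents each matrix is 803 by 803.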
Word Frequencies. As another lexical feature, we use word frequencies: we obtain the
frequency of each word in a publication and arrange the words in decreasing order of
122 S.-U. Hassan et al.
their frequencies. The top fifty most frequent words of each publication
are then compared with the top fifty most frequent words of every other publication in our
dataset [20].
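The comparison of top-fifty word lists can be sketched as follows. The text does not state which measure is used to compare two lists, so the Jaccard overlap below is an assumption made for illustration, as are the function names.

```python
from collections import Counter


def top_words(text, k=50):
    """Return the set of the k most frequent words in `text`."""
    counts = Counter(text.lower().split())
    return {word for word, _ in counts.most_common(k)}


def top_word_overlap(text_a, text_b, k=50):
    """Jaccard overlap of the two top-k word sets (assumed comparison measure)."""
    a, b = top_words(text_a, k), top_words(text_b, k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


print(top_word_overlap("gene gene protein cell", "gene cell cell tissue", k=2))
```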
Frequency of Parts of Speech (POS). As a syntactic feature, we use the frequency of POS
tags, extracted using the Apache OpenNLP library [21] for tagging all 36 POS tags [23].
After POS tagging we compute the frequency of each tag. Further, we compare the
frequency of each tag across the publications in our dataset. To process the POS tag
syntactic feature, we compute the Euclidean distance between publications across all 36
POS tags. The Euclidean distance between two points is the square root of the sum of
the squared differences between their coordinates, as shown in Eq. 2. Here $q_i$ and
$p_i$ are the frequencies of the same POS tag $i$ in publications $q$ and $p$, and $i$
ranges from 1 to $n = 36$.
\[
d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \tag{2}
\]
We obtain multiple distance matrices of size 803 by 803. For n-grams: 10 matrices,
one for each of the 1- to 10-grams. For the remaining features: one matrix for the POS
tag frequencies and one matrix for the comparison of the top 50 most frequent words.
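The POS-based distance matrix can be sketched as below. The vectors in the example have three entries instead of 36 and contain invented frequencies; the function names are illustrative.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length frequency vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


def distance_matrix(vectors):
    """Square matrix of pairwise Euclidean distances."""
    return [[euclidean(p, q) for q in vectors] for p in vectors]


# Hypothetical POS tag frequencies for three publications (3 tags, not 36).
pos_frequencies = [[12, 30, 7], [10, 28, 9], [40, 5, 1]]
for row in distance_matrix(pos_frequencies):
    print([round(value, 2) for value in row])
```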
Evaluation Indices for Clustering Accuracy. Finally, to evaluate the clustering
accuracy, we use Dunn’s index to evaluate our k-Means based model. In addition, we
employ an Entropy based measure to evaluate the effectiveness of the unsupervised
clustering models used to group the publications that belong to a given author. The
well-known Dunn’s index identifies sets of clusters that are cohesive and well separated
from each other, such that the means of different clusters are sufficiently far apart
compared to the intra-cluster variance, as shown in Eq. 3, where $X_{i,j}$ is the average
dissimilarity between clusters $i$ and $j$ and $Y_k$ is the average dissimilarity within
cluster $k$.
\[
\text{Dunn's} = \frac{\min(X_{i,j})}{\max(Y_k)} \tag{3}
\]
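Eq. 3 can be sketched in code as follows. The sketch follows the averaged form given above (average dissimilarity between and within clusters) rather than the classical Dunn index based on minimum inter-cluster distance and maximum cluster diameter. The function name, the handling of singleton clusters and the toy data are assumptions.

```python
from itertools import combinations


def dunn_index(dist, labels):
    """Ratio of the smallest average between-cluster dissimilarity to the
    largest average within-cluster dissimilarity.

    `dist` is a square dissimilarity matrix, `labels[i]` the cluster of item i.
    Singleton clusters have no within-cluster pairs and are skipped.
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    def mean(values):
        return sum(values) / len(values)

    between = [
        mean([dist[a][b] for a in clusters[i] for b in clusters[j]])
        for i, j in combinations(clusters, 2)
    ]
    within = [
        mean([dist[a][b] for a, b in combinations(members, 2)])
        for members in clusters.values()
        if len(members) > 1
    ]
    if not between or not within or max(within) == 0:
        raise ValueError("index undefined for this clustering")
    return min(between) / max(within)


# Toy example: four points on a line forming two tight, well-separated clusters.
points = [0, 1, 10, 11]
toy_dist = [[abs(a - b) for b in points] for a in points]
print(dunn_index(toy_dist, [0, 0, 1, 1]))
```

Larger values indicate clusters that are far apart relative to their internal spread.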
Entropy measures the spread of information: the greater the value of Entropy, the wider
the spread. For instance, if all the publications that belong to an author
are grouped in a single cluster, then the contribution to the Entropy index for this cluster
would be 0, implying the correct grouping of related publications into a single cluster.
In contrast, if all the publications of an author are uniformly spread over all 203 clusters,
then the value of the Entropy index would be 1, which is evidence of maximum error in
grouping. Equation 4 shows the Entropy index, where $n$ is the number of clusters, $P(x_i)$
denotes the probability of a publication being in a given cluster $i$, and $H(A)$ is the
Entropy of a given author. The value of Entropy ranges from 0 to 1.
\[
H(A) = -\sum_{i=1}^{n} P(x_i) \log_n P(x_i) \tag{4}
\]
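The Entropy index of a single author can be sketched as below, assuming that $P(x_i)$ is estimated as the share of the author's publications assigned to cluster $i$. The function name and the example assignments are hypothetical.

```python
import math
from collections import Counter


def author_entropy(cluster_ids, n_clusters):
    """Entropy (log base n_clusters) of one author's publications over clusters.

    `cluster_ids` lists the cluster assigned to each of the author's papers.
    """
    if n_clusters < 2:
        raise ValueError("need at least two clusters")
    total = len(cluster_ids)
    entropy = 0.0
    for count in Counter(cluster_ids).values():
        p = count / total
        entropy -= p * math.log(p, n_clusters)
    return entropy


print(author_entropy([7, 7, 7, 7], n_clusters=203))   # all papers in one cluster
print(author_entropy([7, 7, 7, 12], n_clusters=203))  # one paper misplaced
```

Clusters that contain none of the author's publications contribute nothing to the sum, so only the occupied clusters need to be visited.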
\[
\sigma(z)_j = \frac{e^{z_j}}{\sum_{l=1}^{k} e^{z_l}} \quad \text{for } j = 1, \ldots, k \tag{5}
\]
where $z$ is the vector of inputs to the output layer, $j$ indexes the output units and
$k$ is the number of output units. Here $e^{z_j}$ is the exponential function, which
increases the probability assigned to the maximum value from the previous layer. The
values of $\sigma(z)_j$ are real numbers between 0 and 1. In our case, this function is
used to represent a categorical distribution. For better
understanding, Fig. 3 shows the flow chart of the employed LSTM based model.
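The softmax function of Eq. 5 can be sketched in a few lines. This is a standalone illustration rather than the code of the LSTM model; subtracting the maximum before exponentiating is a common numerical safeguard that does not change the result.

```python
import math


def softmax(z):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    shift = max(z)  # avoids overflow in exp(); the ratios are unchanged
    exps = [math.exp(value - shift) for value in z]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical output-layer inputs for three candidate authors.
scores = [1.0, 3.0, 0.5]
probabilities = softmax(scores)
predicted = probabilities.index(max(probabilities))
print(probabilities, "-> predicted class", predicted)
```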
Baseline Stylometric Model. Further, to compare the results of our LSTM based
model with a baseline, the following steps were followed. First, we compute the word
frequencies of the training and testing sets (i.e. publications) of a given author. We
choose the top 40% most frequent words from the training and testing sets and mark them
as the true class. A confusion matrix is then computed and used to calculate the evaluation
indices. Finally, the results of our baseline method are compared with those of the
state-of-the-art LSTM based deep learning model.
This section presents the results of our deployed unsupervised clustering technique and
supervised deep learning models.
[Figure: histogram of the number of authors (y-axis) by Entropy value (x-axis, 0 to 1).]
[Figure: learning accuracy and learning loss of the LSTM model over 100 epochs.]
For testing, we use the remaining 50% of the data corresponding to each individual author.
Table 2 shows the average recall, precision and accuracy on the testing data for our LSTM
model. Here the LSTM model is evaluated by minimizing the categorical cross-entropy
loss. In addition, we show the evaluation metrics corresponding to our baseline
method. The LSTM model shows a very encouraging average accuracy of 0.96, while the
precision and recall indices are reported as 0.92 and 0.93, respectively.
Overall, the results indicate that our model can easily distinguish one writing style
from another. These scores also confirm the validity of the results obtained from
k-Means.
4 Concluding Remarks
References
1. Juola, P.: Authorship attribution. Found. Trends® Inf. Retrieval 1(3), 233–334 (2008)
2. Rudman, J.: Non-traditional authorship attribution studies: Ignis Fatuus or Rosetta stone?
Bull. (Bibliograph. Soc. Aust. NZ) 24(3), 163 (2000)
3. Stamatatos, E.: A survey of modern authorship attribution methods. J. Assoc. Inf. Sci.
Technol. 60(3), 538–556 (2009)
4. Smalheiser, N.R., Torvik, V.I.: Author name disambiguation. Annu. Rev. Inf. Sci. Technol.
43(1), 1–43 (2009)
5. Gipp, B., Meuschke, N.: Citation pattern matching algorithms for citation-based plagiarism
detection: greedy citation tiling, citation chunking and longest common citation sequence.
In: Proceedings of the 11th ACM Symposium on Document Engineering, pp. 249–258
(2011)
6. Bergsma, S., Post, M., Yarowsky, D.: Stylometric analysis of scientific articles. In:
Proceedings of the 2012 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, pp. 327–337 (2012)
7. Eissen, Z.M.S., Stein, B.: Intrinsic plagiarism detection. In: European Conference on
Information Retrieval, pp. 565–569 (2006)
8. Smith, M.W.: Forensic stylometry: a theoretical basis for further developments of practical
methods. J. Forensic Sci. Soc. 29(1), 15–33 (1989)
9. Zheng, R., Qin, Y., Huang, Z., Chen, H.: Authorship analysis in cybercrime investigation.
In: Chen, H., Miranda, R., Zeng, Daniel D., Demchak, C., Schroeder, J., Madhusudan, T.
(eds.) ISI 2003. LNCS, vol. 2665, pp. 59–73. Springer, Heidelberg (2003). doi:10.1007/3-
540-44853-5_5
10. Wang, L.Z.: News authorship identification with deep learning. https://cs224d.stanford.edu/
reports/ZhouWang.pdf. Accessed 4 Jan 2017
11. Macke, S., Hirshman, J.: Deep Sentence-Level Authorship Attribution. https://cs224d.
stanford.edu/reports/MackeStephen.pdf. Accessed 5 Feb 2017
12. Liu, P., Qiu, X., Huang, X.: Recurrent neural network for text classification with multi-task
learning (2016)
13. Surendran, K., Harilal, O.P., Hrudya, P., Poornachandran, P., Suchetha, N.K.: Stylometry
detection using deep learning. In: Behera, H., Mohapatra, D. (eds.) Computational
Intelligence in Data Mining, pp. 749–757. Springer, Singapore (2017). doi:10.1007/978-
981-10-3874-7_71
14. PLOS.org. https://plos.org/. Accessed 3 Jan 2017
15. Nirkhi, M.S.: Stylometric approach for author identification of online messages. Int.
J. Comput. Sci. Inf. Technol. 5(5), 6158–6159 (2014)
16. Mustafa, T.K., Mustapha, N., Azmi, M.A., Sulaiman, N.B.: Dropping down the maximum
item set: improving the stylometric authorship attribution algorithm in the text mining for
authorship investigation. J. Comput. Sci. 6(3), 235 (2010)
17. Chakraborty, T.: Authorship identification in Bengali literature: a comparative analysis.
arXiv preprint arXiv:1208.6268 (2012)
18. Bozkurt, I.N., Baglioglu, O., Uyar, E.: Authorship attribution. In: 22nd International
Symposium IEEE Computer and Information Sciences, pp. 1–5 (2007)
19. Eder, M.: Style-markers in authorship attribution a cross-language study of the authorial
fingerprint. Stud. Pol. Linguist. 6(1), 99–114 (2011)
20. Voyer, D.: Word frequency and laterality effects in lexical decision: right hemisphere
mechanisms. Brain Lang. 87(3), 421–431 (2003)
21. OpenNLP. https://opennlp.apache.org/. Accessed 1 Feb 2017
22. Porter, M.F.: Snowball: a language for stemming algorithms. snowball.tartarus.org/texts/
introduction.htm. Accessed 17 June 2017
23. List of part-of-speech tags. https://www.ling.upenn.edu/courses/Fall_2003/ling001/p-enn_
treebank_pos.html. Accessed 17 June 2017
24. Bagnall, D.: Author identification using multi-headed recurrent neural networks. arXiv
preprint arXiv:1506.04891 (2015)
Automatic Answering Method Considering
Word Order for Slot Filling Questions
of University Entrance Examinations
1 Introduction
Recently, automatic answering technologies such as question answering (QA)
have attracted attention as a technology to satisfy various information requests
from users. Questions handled by QA can be categorized into two types: the factoid
type, which requires a fact expressed in a few words, such as a person’s name, as the
answer, and the non-factoid type, which requires definitions or procedures to be explained
in the answer. Much research has been conducted on factoid type QA so far, but
it is difficult to say that these technologies can adequately respond to the diverse
and complicated questions that arise in realistic situations, including the university
entrance examinations handled in NTCIR-13 QA Lab-3 (see footnote 1).
Figure 1 shows an example of factoid type questions in the university entrance
examination world history problems of Japan. If we exclude multiple-choice ques-
tions, they can be classified into slot-filling type and response type. In general
1 NTCIR-13: http://research.nii.ac.jp/ntcir/ntcir-13/.
© Springer International Publishing AG 2017
S. Choemprayong et al. (Eds.): ICADL 2017, LNCS 10647, pp. 128–141, 2017.
https://doi.org/10.1007/978-3-319-70232-2_11
factoid type QA, response type questions as shown in Fig. 1(b) are often targeted.
In previous studies, many QA systems answer such questions automatically by
acquiring various clues from the question and selecting the
answer based on them. For example, predicting the answer category, such as the
name of a person or the name of a region, is often introduced in conventional
methods. However, these methods assume the sentence structure of restriction-type
questions, meaning that they cannot be applied to
slot-filling questions as they are. For this reason, it is necessary to introduce a
suitable method into QA systems in order to answer slot-filling questions automatically.
Figure 2 shows the basic processing steps of factoid type QA. When a
question is input, the modules are executed in the order “question analysis”,
“document retrieval”, “answer candidate extraction”, and “answer candidate
evaluation” to output the final answer. Obtaining clues from the question, as
mentioned above, corresponds to the question analysis module, and selecting the
answer based on those clues is performed in the answer candidate evaluation
module. Hence, these two modules are where a method suited to slot-filling
questions needs to be introduced.
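The four-module flow described above can be sketched as a chain of functions. This is only a minimal illustration under stated assumptions: the function names, the data passed between modules, and the toy implementations are hypothetical and are not the authors' system.

```python
from typing import Callable, Dict, List


def answer_question(
    question: str,
    analyze: Callable[[str], dict],
    retrieve: Callable[[dict], List[str]],
    extract: Callable[[dict, List[str]], List[str]],
    evaluate: Callable[[dict, List[str], List[str]], Dict[str, float]],
) -> str:
    """Run the four modules in order and return the best-scoring candidate."""
    clues = analyze(question)                        # question analysis
    documents = retrieve(clues)                      # document retrieval
    candidates = extract(clues, documents)           # answer candidate extraction
    scores = evaluate(clues, candidates, documents)  # answer candidate evaluation
    return max(scores, key=scores.get)               # raises ValueError if no candidate


# Toy stand-ins for each module (hypothetical, for illustration only).
def toy_analyze(question: str) -> dict:
    return {"terms": [t for t in question.split() if t != "[SLOT]"]}


def toy_retrieve(clues: dict) -> List[str]:
    corpus = ["The Franks tribe founded a kingdom", "Rome fell in 476"]
    return [d for d in corpus if any(t in d.split() for t in clues["terms"])]


def toy_extract(clues: dict, documents: List[str]) -> List[str]:
    # Candidates are capitalised words that do not already occur in the question.
    words = {w for d in documents for w in d.split()}
    return sorted(w for w in words if w[0].isupper() and w not in clues["terms"])


def toy_evaluate(clues: dict, candidates: List[str], documents: List[str]) -> Dict[str, float]:
    # Score each candidate by its frequency in the retrieved documents.
    return {c: float(sum(d.split().count(c) for d in documents)) for c in candidates}


answer = answer_question(
    "The [SLOT] tribe founded a kingdom",
    toy_analyze, toy_retrieve, toy_extract, toy_evaluate,
)
```

Passing the modules as arguments keeps the control flow fixed while allowing the question analysis and candidate evaluation steps to be swapped, which is where the paper introduces its changes.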
In this paper, we propose an automatic answering method considering word
order for the slot filling questions in the university entrance examination world
history problems, based on the process of Fig. 2.
2 Related Research
Much research on QA systems has been conducted so far [1–3]. In NTCIR,
question answering tasks have been held frequently. The system developed by
Murata et al. [3] gave the best results at NTCIR-5 QAC-3 (see footnote 2) for
factoid type QA. A feature of this system is that the evaluation scores of an
answer candidate obtained from multiple documents are added up to give its final
score. The system also estimates the answer category using a rule-based method.
That is, even a rule-based method can achieve a certain correct answer rate in
factoid type QA. However, as detailed in Sect. 4.1, for slot-filling problems the
rule definitions are expected to become complicated and the range they can cover
to be limited, because questions in university entrance examinations use complex
phrases and the structure of the question sentence differs from that used in
conventional factoid type QA.
Research on QA systems for university entrance examination questions
has also been conducted in recent years. Sakamoto et al. [4] treat response
type questions and slot-filling type questions alike, as questions answered with
a word. The basic processing steps of their system also follow Fig. 2.
In the question analysis module, the system predicts the answer category, such as
the name of a person or the name of a region, by focusing on the interrogative of the
question, and also estimates the question’s focus, such as the name of a king or
the name of a god, by focusing on the word just before the interrogative (just after
the interrogative in the case of English). In the answer candidate evaluation module,
each answer candidate is scored by how frequently the candidate word appears in the
source documents and by how well the candidate matches the answer category
and the question’s focus. Finally, the word with the highest score is output as the
answer. For slot-filling type questions too, we expect that the correct answer rate
can be improved if the answer category can be predicted. However, the sentence
structure differs between the response type and the slot-filling type, so the same
method as Sakamoto et al. cannot be used.
There are also various studies on distributed representations of words.
The word2vec tool developed by Mikolov et al. [5] is well known as a tool for learning
distributed representations of words. The Continuous Bag-of-Words (CBOW) model
is one of the learning models used in this tool. In this model, the central word
is predicted from its surrounding words, which is similar to the way humans
think when answering slot-filling type questions. It is, however, difficult to
predict a central word that suits the context, because the CBOW model does not
consider word order. Moreover, word2vec is not necessarily intended for
the task of predicting the central word from its surrounding words with the model
it generates.
Ariga et al. [6] proposed new learning models that add word order information
to the CBOW model. Specifically, they proposed two models: the Left and
Right (LR) model, which distinguishes between the words before and after the central
2 NTCIR-5: http://research.nii.ac.jp/ntcir/ntcir-ws5/.
word, and the Word Order (WO) model, which distinguishes all positions of the
surrounding words. They reported that the proposed models improved the accuracy
of predicting the central word. Unlike the CBOW model, the WO model
is given word order information, which makes it easier to predict a central
word that suits the context.
The QA system in this paper follows the basic steps shown in Fig. 2. This
section explains the specific steps used by the system. Building on that, Sect. 4
explains the proposed method to be incorporated into the system.
We prepared several dictionaries and knowledge sources for developing
automatic answering of world history problems.
4 Proposed Method
4.1 Prediction of Answer Category and Category Mismatch
Judgement
As explained in Sect. 2, the correct answer rate is expected to improve if the
system can accurately predict the answer category of slot-filling type questions.
A rule-based method is a basic way to estimate the answer
category. However, such a method requires humans to find patterns and create
rules while referring to past questions, and it is extremely difficult to guarantee
that the rules cover new phrases. Therefore, we propose using a
word prediction model to estimate the answer category for any
given phrase in the question.
Before executing the automatic answering process, we construct a word
prediction model that predicts a center word from its surrounding words and their
word order.
4
Apache Solr : http://lucene.apache.org/solr/.
[Fig. 4. Outline of the layers of the CBOW and WO models: middle layer H and output layer t.]
Figure 4 shows the outline of each layer for both the CBOW and WO models. Here t
is the center word to be predicted, and t ± x denotes the xth surrounding
word after or before t, respectively. Unlike the CBOW model, the WO model
produces the vector H in the middle layer while keeping the positional
relationship of the words, meaning that t can be predicted with the word order
taken into account.
The word prediction model is constructed based on the WO model. When
the surrounding words are input to this model, candidates for the center word
are output in descending order of likelihood. The number of words to be input
depends on the window size x used at model construction.
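The difference between the two middle layers can be illustrated with a small sketch. It assumes that CBOW averages the context vectors and that positions are kept by concatenating them slot by slot; the concatenation is only one simple way to preserve order and may differ from the actual WO model of Ariga et al. [6]. The embeddings and words are hypothetical.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical 2-dimensional embeddings, chosen only for illustration.
EMBEDDINGS: Dict[str, Vector] = {
    "king": [1.0, 0.0],
    "of": [0.0, 1.0],
    "the": [0.5, 0.5],
    "franks": [0.25, 0.75],
}


def cbow_hidden(context: List[str]) -> Vector:
    """Average the context vectors: the result ignores word order."""
    dim = len(next(iter(EMBEDDINGS.values())))
    total = [0.0] * dim
    for word in context:
        for i, value in enumerate(EMBEDDINGS[word]):
            total[i] += value
    return [value / len(context) for value in total]


def wo_hidden(context: List[str]) -> Vector:
    """Concatenate the context vectors: each position keeps its own coordinates."""
    hidden: Vector = []
    for word in context:
        hidden.extend(EMBEDDINGS[word])
    return hidden


# Context slots t-2, t-1, t+1, t+2 for window size x = 2, and a reordered copy.
original = ["king", "of", "the", "franks"]
shuffled = ["franks", "the", "of", "king"]

same_for_cbow = cbow_hidden(original) == cbow_hidden(shuffled)
same_for_wo = wo_hidden(original) == wo_hidden(shuffled)
```

In this sketch the averaged vector is identical for both orderings, while the concatenated vector changes, which is the property the text attributes to the WO model.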
[Figure: the candidates output by the word prediction model (Candidates A to F) are checked for whether they are nouns and assigned an answer category such as Person or Location; a candidate receives no category when it is not a word related to world history.]
$$
f_{\mathrm{backward}}(w) =
\begin{cases}
b & \text{(if the backward string of } w \text{ matches the word next to the slot)} \\
0 & \text{(otherwise)}
\end{cases}
\qquad (2)
$$
For example, as shown in Fig. 3 (D), b is added to the score because the backward
part of the candidate word is the string meaning “tribe”, which appears immediately
after the slot.
In university entrance examinations, it is very unlikely that
words already appearing in the question sentences are the correct answers.
Accordingly, as one of the indicators explained in Sect. 3.6, we introduce an
index based on non-existence word judgement, as shown in Eq. (3). The positive
value c is added to the score if w does not exist in the instruction part or the
context part.
$$
f_{\mathrm{existence}}(w) =
\begin{cases}
c & \text{(if } w \text{ does not exist in the question sentences)} \\
0 & \text{(otherwise)}
\end{cases}
\qquad (3)
$$
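Equations (2) and (3) can be written as two small scoring functions. The sketch below is a hedged reading of the definitions: treating "backward string matches" as a suffix test and "exists in the question" as a substring test are assumptions, as are the function and argument names. The defaults b = 30 and c = 50 are the values reported in Sect. 5; the English example strings are hypothetical.

```python
def f_backward(candidate: str, word_after_slot: str, b: float = 30.0) -> float:
    """Eq. (2): add b when the end of the candidate matches the word after the slot."""
    # Assumption: "matches" is read as a suffix test.
    if word_after_slot and candidate.endswith(word_after_slot):
        return b
    return 0.0


def f_existence(candidate: str, question_text: str, c: float = 50.0) -> float:
    """Eq. (3): add c when the candidate does not already appear in the question."""
    # Assumption: "exists" is read as a substring test on the question text.
    return c if candidate not in question_text else 0.0


def indicator_score(candidate: str, word_after_slot: str, question_text: str,
                    base_score: float = 0.0) -> float:
    """Add both indicators to a base score supplied by the rest of the system."""
    return (base_score
            + f_backward(candidate, word_after_slot)
            + f_existence(candidate, question_text))


question = "Rome was sacked by the [SLOT] tribe"
score_new = indicator_score("Germanic tribe", "tribe", question)
score_seen = indicator_score("Rome", "tribe", question)
```

A candidate that ends with the word after the slot and is absent from the question receives both bonuses, whereas a word copied from the question receives neither.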
5 Experiments
5.1 Experiment 1: Change in Precision Due to Different Word
Prediction Models
Purpose. The prediction accuracy of the category is expected to change
depending on the parameters set when the model is constructed and on
the number of words the model outputs. In this experiment, we look for more
appropriate conditions, such as parameter settings.
Method. First, Table 2 shows the list of conditions for constructing the model.
The format of the model name is a combination of the learning model and
the window size. The learning model is the name of the learning model used to
Result. The accuracy of the predicted categories under each
condition is shown in Table 3 and Fig. 6.
They indicate that x = 4 for the CBOW model and x = 5 for the WO model gave more
accurate results than the other window sizes. For CBOW-4, the MAP value with
five output candidates is higher than that with one. For WO-5, the
MAP value decreases as the number of output candidates increases from 1 to 5
to 10.
5 NTCIR-12: http://research.nii.ac.jp/ntcir/ntcir-12/.
[Fig. 6. MAP (vertical axis, 0.0 to 0.5) of each CBOW and WO model (window sizes 3 to 7) for the top 1, top 5, and top 10 output candidates.]
Method. The data set is the same as the one used in Sect. 5.1. In
experiment 2, the following 54 questions were used.
– Hokkaido University, 2011 (9 questions)
– Chuo University, 2011 (15 questions)
– Waseda University, 2011 (8 questions)
– Kyoto University, 2011 (22 questions)
the value of the parameter that makes the correct answer rate highest when
automatically answering the data set of questions used in Sect. 5.1. The result
showed that the correct answer rate became highest when a = 10, b = 30, c = 50.
Therefore, these values are used for parameters in the following experiments.
Finally, Table 5 shows the definitions of correct and wrong answers in this
experiment. Each question is classified into one of the sub classes depending on
the relationship between the scoring result and the true correct word.
6 Discussion
6.1 Discussion 1: Predictive Accuracy of Answer Category
In experiment 1, it was found that the accuracy of the answer category prediction
greatly varies depending on the method of constructing word prediction models
and the number of output words.
In terms of window size, the accuracy is highest
when the window size is four for CBOW, and when it is five in the
case of WO. If the window size is too small, the accuracy is thought to
decrease because the number of clues is reduced. In contrast, if the size is too
large, the accuracy is also thought to decrease because, although the number of
clues increases, more irrelevant words are included.
As for CBOW, the highest accuracy was obtained with the CBOW-4 model with
five output words, where the MAP was 0.343. By contrast, for WO, the
highest accuracy was achieved by the WO-5 model with one output word, where
the MAP was 0.404. Therefore, the prediction accuracy improves when
word order is taken into consideration. Additionally, WO has high accuracy when
the number of output words is one. This suggests that the category prediction
accuracy can be improved further by strengthening the WO learning.
Experiment 2 examined how the correct answer rate of automatic
answering changes with the target document set to be searched and with
the word prediction model.
As for the document set to be retrieved, the overall correct answer rate is
higher with the paragraph set than with the sentence set. Specifically, there was no
improvement in the “single” correct answer rate, whereas the “same rate” correct
answer rate became higher. This is because the documents related
to the question sentence tend to be ranked with a higher BM25 score in the
paragraph set than in the sentence set, and because many answer candidates w
have the same base score.
Regarding the word prediction model, the correct answer rate
was higher when word order was not taken into consideration. This
is probably because the accuracy of the current category prediction itself
is low, so incorrect categories are predicted and adversely affect the scoring.
7 Conclusion
In this paper, we proposed an automatic answering method for the slot filling
questions in the university entrance examination world history problems.
In the experiments, we examined the accuracy of the category prediction
using word prediction models, the correct answer rate of automatic answering with
different methods, and the effect of each indicator in the answer candidate
evaluation module. As a result, we confirmed that the accuracy of category
prediction improves for models that consider word order.
In the future, in order to improve the accuracy of the category prediction, we
plan to improve the method of constructing models and propose new methods.
We will also consider structuring slot-filling questions and introducing a new
scoring indicator that takes advantage of its characteristics.
References
1. Ferrucci, D., Brown, E., Chu-Carroll, J., Fan, J., Gondek, D., Kalyanpur, A.A.,
Lally, A., Murdock, J.W., Nyberg, E., Prager, J., et al.: Building Watson: an
overview of the DeepQA project. AI Mag. 31(3), 59–79 (2010). doi:10.1609/aimag.
v31i3.2303
2. Iyyer, M., Boyd-Graber, J.L., Claudino, L.M.-B., Socher, R., Daumé III, H.:
A neural network for factoid question answering over paragraphs. In: EMNLP,
pp. 633–644 (2014)
3. Murata, M., Utiyama, M., Isahara, H.: Japanese question-answering system using
decreased adding with multiple answers at NTCIR 5. In: NTCIR-5 Workshop
Meeting (2005)
4. Sakamoto, K., Ishioroshi, M., Matsui, H., Jin, T., Wada, F., Nakayama, S., Shibuki,
H., Mori, T., Kando, N.: Forst: question answering system for second-stage
examinations at NTCIR-12 QA lab-2 task. In: 12th NTCIR Conference on Evaluation
of Information Access Technologies, pp. 467–472 (2016)
5. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word
representations in vector space. CoRR (2013)
6. Ariga, S., Tsuruoka, Y.: Synonym extension of words according to context by
vector representation of words (in Japanese). In: 2015 The Association for Natural
Language Processing, pp. 752–755 (2015)
7. Sato, T.: Neologism dictionary based on the language resources on the web for
MeCab (2015)
8. Kimura, T., Nakata, R., Miyamori, H.: KSU team’s multiple choice QA system
at the NTCIR-12 QA lab-2 task. In: 12th NTCIR Conference on Evaluation of
Information Access Technologies, pp. 437–444 (2016)
9. Robertson, S., Zaragoza, H., et al.: The probabilistic relevance framework: BM25
and beyond. Found. Trends Inf. Retr. 3(4), 333–389 (2009)
10. Tokui, S., Oono, K., Hido, S., Clayton, J.: Chainer: a next-generation open source
framework for deep learning. In: Proceedings of Workshop on Machine Learning
Systems in NIPS (2015)
A Pilot Study on Comparing and Extracting
Impact Relations
Abstract. Documents often contain knowledge about who did what to whom
under what conditions, which can be expressed as relations between two entities.
Impact relations between two entities express the impacts of one entity on the
other. The goal of this pilot study is to examine whether impact relations are
similar across domains or not, and investigate how to extract impact relations
from unstructured documents using existing techniques. Impact relations in
three domains – medical science, international relations, and environmental
science (particularly oil spill) are collected and examined. Impact relations
account for a significant percentage of semantic relations in all the three
domains. The three domains share a common set of impact relations, and each
pair of domains shares a significant number of common impact relations.
An approach to applying the knowledge to extract impact relations from
environmental science documents is proposed. The common impact relations
and synonyms of two very different domains can be applied to extract impact
relations of a third domain.
1 Introduction
Impact relation is a new concept we use in this pilot study. Extraction of concepts and
relations is studied in the natural language processing and knowledge organization
communities. Relation extraction is the task of extracting semantic relations between
two entities from unstructured documents. Impact relations between two entities
express the impacts of one entity on the other. Human beings are naturally interested in
the impact of an event or activity, such as the impact of the 2010 Gulf of Mexico
Deepwater Horizon oil spill incident on the coastal states in the United States of
America, and the impact of climate change on the earth.
Relations between entities are usually represented as verb phrases. Impact relations
are represented by a group of verb phrases. People in different domains tend to be
interested in different things and their impacts. For example, economists discuss
economic events (e.g., the end of quantitative easing may raise interest rates), and
medical professionals care about drugs and diseases (e.g., a drug is used to treat a
disease). Understanding the major activities and their impacts in a profession is one of
the major tasks of the profession. Understanding and extracting impact relations from
unstructured documents aims to extract knowledge (in the form of relations between
two entities) from documents in a digital library and to help professionals cope with
the information explosion problem.
The goal of this pilot study is to examine whether impact relations differ
across domains, and to investigate how to extract impact relations from
unstructured documents using existing techniques. To test the new concept, we collected
the impact relations in three domains – medical science, international relations, and
environmental science (particularly oil spill) – examined whether impact relations
are domain sensitive, and then proposed applying this knowledge to extract
impact relations from environmental science (particularly oil spill related) documents.
3 Related Work
This section provides a brief review of impact relation lexicons and the main
approaches to information extraction and relation extraction.
Relation extraction aims to extract the semantic relations between two entities. It
has two settings. The first setting is to extract all relationships between a given
(marked) pair of entities in a natural language document. There are three methods:
feature-based methods, kernel-based methods, and rule-based methods [16].
Feature-based methods extract a flat set of features for use by conventional classifiers
such as decision trees or SVMs [22]. Kernel-based methods use kernel functions to
capture the similarity between two structures, such as trees and graphs, for use by an
SVM classifier to predict the relation type [16]. Rule-based methods create
propositional and first-order rules over structures around the two entities [16]. The second
setting is to extract entity pairs in a corpus given a relation type. There is no labeled
unstructured training data in this setting. Instead, we are given a corpus with a set of
relation types, the entity types that form the arguments of these relation types, and a
seed set of relation-entity pair examples indicating that an entity pair has a specified
relation. There are three steps to solving the problem in this setting [16]. The first step
is to learn extraction patterns from seed triplets (i.e., relation-entity pair examples) by
bootstrapping [14, 16]. The second step is to apply the learned extraction patterns to
extract candidate entity pairs that support the given relation types. The third step is to
validate the extracted relations using additional statistical tests.
A bootstrapping-based information extraction system [14] requires only a small
number of seed triplets, which are used to generate extraction patterns, which in turn
extract new triplets from the corpus. This approach is to be used in this study.
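The three bootstrapping steps can be illustrated on a toy corpus. The sketch below is not the system of [14]: patterns are simply the words found between two seed entities, and the statistical validation of step three is replaced by a minimum-support threshold, which is an assumption made for brevity. The corpus, seed, and all names are hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple

Triplet = Tuple[str, str, str]  # (entity 1, relation phrase, entity 2)


def learn_patterns(corpus: List[str], triplets: Set[Triplet],
                   min_support: int = 1) -> Set[str]:
    """Steps 1 and 3: collect phrases linking known entity pairs, keep supported ones."""
    support: Dict[str, Set[Tuple[str, str]]] = {}
    for arg1, _, arg2 in triplets:
        regex = re.compile(re.escape(arg1) + r" (.+?) " + re.escape(arg2))
        for sentence in corpus:
            match = regex.search(sentence)
            if match:
                support.setdefault(match.group(1), set()).add((arg1, arg2))
    # Stand-in for statistical validation: require enough distinct entity pairs.
    return {p for p, pairs in support.items() if len(pairs) >= min_support}


def apply_patterns(corpus: List[str], patterns: Set[str]) -> Set[Triplet]:
    """Step 2: use each learned phrase to extract candidate entity pairs."""
    found: Set[Triplet] = set()
    for phrase in patterns:
        regex = re.compile(r"(\w+) " + re.escape(phrase) + r" (\w+)")
        for sentence in corpus:
            for arg1, arg2 in regex.findall(sentence):
                found.add((arg1, phrase, arg2))
    return found


def bootstrap(corpus: List[str], seeds: Set[Triplet], rounds: int = 2) -> Set[Triplet]:
    """Alternate pattern learning and extraction for a fixed number of rounds."""
    known = set(seeds)
    for _ in range(rounds):
        known |= apply_patterns(corpus, learn_patterns(corpus, known))
    return known


corpus = [
    "oil contaminates water",
    "dispersant contaminates sediment",
    "dispersant damages sediment",
    "oil damages wildlife",
]
seeds = {("oil", "contaminates", "water")}
after_one_round = bootstrap(corpus, seeds, rounds=1)
after_two_rounds = bootstrap(corpus, seeds, rounds=2)
```

In the first round the seed yields the phrase "contaminates" and a second entity pair; in the second round that new pair yields the phrase "damages", which in turn brings in a triplet that shares no words with the seed relation.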
This study has three objectives: (1) to collect and compare the impact relations in three
domains – medical science, international relations, and environmental science (related
to oil spill, particularly), (2) to apply the knowledge to extract impact relations and
entities from oil spill related documents, and then to expand the oil spill entity lexicon
and impact relation lexicon, and (3) to evaluate the lexicons with annotated documents.
We selected these three domains because they are different domains (although medical
science and environmental science share some common topics) and we have linguistic
resources in these domains.
To fulfill the first objective, we collected the CAMEO lexicon, the Oil Spill
Semantic Relation Taxonomy (OSSRT), and the UMLS semantic relation types, which are
used as a small semantic relation lexicon in this study. By comparing each pair of the
three lexicons or taxonomies, we generated a common impact relation lexicon for each
pair. We hypothesized that the common impact relations that appear in two
very different domains (e.g., medical science and international relations) should also
appear in a third domain (i.e., the oil spill domain). We implemented the following two
research tasks to test the hypothesis:
a. studying whether impact relations in different disciplines are mainly different or
similar by comparing the impact relation lexicons in the three disciplines.
b. studying whether the common impact relation lexicon that is generated from
CAMEO and UMLS can be applied to the oil spill domain.
As the result of our oil spill topic map project, we have created a set of triplets
(entities and relations) in the oil spill domain [23]. Therefore, we use the oil spill
domain as a test bed. To fulfill the second objective, we plan to supply the common
impact relation lexicon and the oil spill entity lexicon to the TextRunner information
extraction system to verify whether a relation exists between two entities. TextRunner
can extract the relationship between two entities from the Web with 80.4% accuracy
[15]. It takes three query terms: Argument 1, Predicate, and Argument 2. A program
can be written to formulate a query using a pair of arguments and a predicate, issue the
query to TextRunner, and crawl the result page, which presents the relation between the
two arguments, and example sentences. The HTML file will be processed to identify
the relations between the two arguments.
If an impact relation between two entities is extracted, its existence is verified. The
verified impact relations can be added to the oil spill impact relation lexicon. The
extended impact relation lexicon and one entity of the oil spill triplets can be supplied
again to TextRunner to extract the other entity. The extracted entity can be added to the
oil spill entity lexicon. This bootstrapping process is described in Fig. 1.
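The two passes of this lexicon expansion can be sketched independently of any particular extraction system. The code below does not model TextRunner's interface: the two callables are hypothetical stand-ins for "does this relation hold between these two entities" and "which second arguments are found for this entity and relation", backed here by a small set of made-up facts.

```python
from typing import Callable, Set, Tuple

Triplet = Tuple[str, str, str]  # (argument 1, predicate, argument 2)


def expand_lexicons(
    entities: Set[str],
    common_relations: Set[str],
    relation_exists: Callable[[str, str, str], bool],
    find_second_argument: Callable[[str, str], Set[str]],
) -> Tuple[Set[str], Set[str]]:
    """Return the verified impact relations and the expanded entity lexicon."""
    # Pass 1: a common impact relation is verified when it is found
    # between two entities that are already in the entity lexicon.
    verified = {
        relation
        for arg1 in entities
        for arg2 in entities
        if arg1 != arg2
        for relation in common_relations
        if relation_exists(arg1, relation, arg2)
    }
    # Pass 2: supply one entity and a verified relation to find the other entity.
    expanded = set(entities)
    for arg1 in entities:
        for relation in verified:
            expanded |= find_second_argument(arg1, relation)
    return verified, expanded


# Hypothetical facts standing in for what an extraction system would return.
FACTS: Set[Triplet] = {("oil", "affect", "shoreline"), ("oil", "affect", "seabirds")}


def toy_exists(arg1: str, relation: str, arg2: str) -> bool:
    return (arg1, relation, arg2) in FACTS


def toy_find(arg1: str, relation: str) -> Set[str]:
    return {a2 for (a1, rel, a2) in FACTS if a1 == arg1 and rel == relation}


relations, entity_lexicon = expand_lexicons(
    {"oil", "shoreline"}, {"affect", "prevent"}, toy_exists, toy_find
)
```

Here "affect" is verified between two known entities and then used to discover a new entity, while "prevent" is never verified and so contributes nothing in the second pass.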
To fulfill the third objective, for evaluation purpose, we plan to manually annotate a
small set of oil spill related documents that present impact relations between entities.
The TABARI coding system [24] can be customized to code the oil spill documents using
the oil spill entities and impact relation lexicons. Text2Onto [25] can also be used to
extract entities and relations. Based on our approach, providing any bootstrapping-based
information extraction system with the oil spill entities and any three impact relation
lexicons (including OSSRT) will tell the value of the lexicons and the accuracy of the
system.
UMLS has defined 54 semantic relation types [5, 6], and 11 (or 20%) of them are direct
impact relation types. CAMEO has 1,835 effective verb phrases [9], and 540 (or
29.4%) of them are direct impact relations. The Oil Spill Semantic Relation Taxonomy
(OSSRT) has 900 semantic relations, and 263 (or 29.2%) of them are direct impact
relations. Therefore, direct impact relations account for about 20-30% of semantic
relations in the three domains respectively. Impact relations play a significant role in all
the three domains.
By comparing the direct impact relations in CAMEO and OSSRT, we found 72
common direct impact relations and 75 direct impact relation synonyms, so 55.9% of
the direct impact relations and synonyms in OSSRT appear in CAMEO, and 27.2% of the
direct impact relations and synonyms in CAMEO appear in OSSRT. Synonyms were
judged according to the Merriam-Webster online dictionary [26]. Ten out of 11 (91%)
UMLS direct impact relations appear in OSSRT. They are affect, be result of, bring
about, cause, disrupt, interact with, manage, treat, prevent, and produce. The only
verb that does not appear in OSSRT is complicate. Eight out of 11 (73%) UMLS direct
impact relations appear in CAMEO. They are affect, cause, complicate, disrupt,
interact with, manage, prevent, and produce. The UMLS verb phrases that do not
appear in CAMEO are be result of, bring about, and treat.
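The percentages and overlaps reported above can be re-derived from the counts and verb lists given in the text. The sketch below uses only those figures; the variable and function names are introduced here for illustration.

```python
# The 11 UMLS direct impact relation types named in the text.
UMLS = {
    "affect", "be result of", "bring about", "cause", "complicate", "disrupt",
    "interact with", "manage", "treat", "prevent", "produce",
}
UMLS_IN_OSSRT = UMLS - {"complicate"}
UMLS_IN_CAMEO = UMLS - {"be result of", "bring about", "treat"}


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Share of direct impact relations within each lexicon.
share = {
    "UMLS": pct(11, 54),
    "CAMEO": pct(540, 1835),
    "OSSRT": pct(263, 900),
}

# CAMEO versus OSSRT: 72 common relations plus 75 synonyms.
shared = 72 + 75
ossrt_in_cameo = pct(shared, 263)
cameo_in_ossrt = pct(shared, 540)

# UMLS coverage in the other two lexicons.
umls_in_ossrt = pct(len(UMLS_IN_OSSRT), len(UMLS))
umls_in_cameo = pct(len(UMLS_IN_CAMEO), len(UMLS))

# Relations common to UMLS and CAMEO that are absent from OSSRT.
missing_from_ossrt = UMLS_IN_CAMEO - UMLS_IN_OSSRT
```

The last set contains only "complicate", so seven of the eight relations shared by UMLS and CAMEO also occur in OSSRT.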
This indicates two direct findings. First, different domains share a significant number of
direct impact relations, although every domain may have some unique impact relations,
such as “extradite” and “assault” in CAMEO. Second, the hypothesis that the common
impact relations of two very different domains (i.e., medical science and international
relations) should also appear in a third domain (i.e., oil spill) is not fully supported,
because “complicate”, which appears in UMLS and CAMEO, does not appear in OSSRT.
A probable reason is that the OSSRT lexicon is not big enough and has
missed the verb “complicate.” However, the hypothesis can be revised as follows:
most of the common impact relations and synonyms of two very different domains
should appear in a third domain. Consequently, the common impact relations and
synonyms of two very different domains can be used to guide the extraction of impact
relations of a third domain. Furthermore, two semantic relation lexicons (i.e., CAMEO
and OSSRT) contain semantic relations that are not impact relations. The non-impact
relations of the two domains can also be used to guide the extraction of impact relations
of a third domain because they are less likely to be impact relations of the third domain
and so can usually be ignored if they are extracted by the impact relation
extraction system.
The findings suggest three generalized hypotheses. First, there is a common set of
direct impact relations across different domains. Second, whether impact relations are
domain sensitive or not may depend on the scope of the domain. The impact relations
of a narrow-scope domain (e.g., medical science) may not be as sensitive to the domain
as a wide-scope domain (e.g., international relations). Third, every domain may have
some unique impact relations.
This pilot study compares impact relations of three domains: medical science (UMLS),
international relations (CAMEO), and oil spill (OSSRT). Impact relations account for a
significant percentage (about 20-30%) of semantic relations in the three domains. Most
(73-91%) of the UMLS impact relations appear in CAMEO and OSSRT whereas
OSSRT and CAMEO share a significant number (27-47%) of common direct impact
relations. Each domain may have some unique impact relations. The scope of a domain
may affect the size of its semantic relations. A wide-scope domain may have a bigger
set of direct impact relations than a narrow-scope domain. The study indicates a revised
hypothesis that most of the common impact relations and synonyms of two very
different domains should also appear in a third domain. Consequently, the common
impact relations and synonyms of two very different domains can be applied to guide
the extraction of impact relations from unstructured documents of a third domain
because they are very likely to appear in the third domain. The semantic relations of
two different domains that are not impact relations can also be applied to guide the
extraction of impact relations in the third domain because they are less likely to be
impact relations of the third domain.
The research project is at its initial stage, and much work remains to be completed in
the future. The project has three objectives. For the first objective, the UMLS semantic
types need to be expanded into an impact relation lexicon. For the second and third
objectives, the impact relation extraction approach needs to be implemented and
evaluated.
References
1. Bertaud, V., et al.: The value of using verbs in Medline searches. Med. Inf. Internet Med. 32
(2), 117–122 (2007)
2. Green, R.: A relational thesaurus: modeling semantic relationships using frames. Annu. Rev.
OCLC Res., 94–97 (1996)
3. Wang, Y., et al.: Relational thesauri in information retrieval. JASIS 36(1), 15–27 (1985)
4. Swanson, D., Smalheiser, N.: Implicit text linkages between Medline records: using
Arrowsmith as an aid to scientific discovery. Libr. Trends 48(1), 48–59 (1999)
5. UMLS: About the UMLS (2016). https://www.nlm.nih.gov/research/umls/about_umls.html.
Accessed 9 Sept 2017
Abstract. In this study, we measure the discourse scale of tweet sequences and
observe their characteristics, for 80 × 3 Japanese Twitter accounts that deal
with books, films, and other interests. For each account, a sequence of 3,000
tweets is regarded as the overall textual unit for which the discoursal scale is
evaluated. To measure the discourse scale, we first selected 50 words that we
call “discourse keywords” and observed how they occur in each of the Twitter
accounts. The results showed that the discourse scale is about 15 tweets,
regardless of their interests.
1 Introduction
In this study, we analyse the discourse scale of tweet sequences and examine their
characteristics, for Japanese Twitter accounts that have a record of referencing books,
films, and other interests (interests other than books and films).
Existing work on Twitter has mainly focused on its two characteristic features, i.e.
network connections among accounts (follower and followee relationships) and tweets
(through RTs, replies, and likes) [5, 15, 25, 32], and transitions of tweet topics in time
scale, the typical application of which is trend detection [12, 23, 27]. Some textual and
multimodal characteristics of tweet texts have also been observed and analysed [10,
24], either descriptively or in relation to such applications as maximising dissemination
of information. Emotion detection has recently been an important topic [8, 26, 30].
Characteristics of individual Twitter accounts have been observed from the point of
view of posting behaviour, user profiles, and follower-followee characteristics,
including identification of spam accounts [4, 33].
To the best of our knowledge, however, there has been little work, if any, that
analysed the discoursal characteristics of tweets for different Twitter accounts or
account types. By “discoursal characteristics”, we mean treating the set of all tweets
of an account, in their temporal order, as one textual unit and observing its discoursal
features such as coherency. We observe that some people successively post tweets on
the same topic, following up previous tweets with later tweets and constructing an
argument about the topic. This is especially observable in Japanese tweets, which can
contain nearly twice as much information as English tweets due to the nature of
character sets [22]. These tweets are often posted in the form of self-replies, but this is
not necessarily the case. Taking into account this kind of successive tweeting
behaviour, we set out to observe how different people produce different discoursal
scales using their Twitter accounts. Some people may tweet on different topics in
succession, while others may construct a coherent discourse over a span of successive
tweets.
The present study is descriptive, but it can contribute to some Twitter-based
applications. For instance, the authors are currently developing a book-recommendation
system using Twitter [31]. The system delivers to users a package of information related to
books mentioned in the timelines of Twitter accounts registered by the system’s users.
What sort of book-related information should be packaged into the unit of information
to be delivered to users is a question that involves difficult decisions. One hypothesis
is that those who tend to follow Twitter accounts that have heavy and long discoursal
scales are more likely to accept in-depth, analytical information about books, while
those who prefer “lighter” accounts do not want heavy loads of information accom-
panying the book information that the system provides. We recognise that knowing the
discoursal characteristics of Twitter accounts is not only interesting on its own but could
potentially be useful to a range of applications, such as ours, since the discoursal
characteristics can be utilised for user profiling.
The rest of the paper is organised as follows. In Sect. 2, we briefly introduce related
work from the view of discourse analysis methods. In Sect. 3, we define the concept of
discourse scale and introduce indices to measure the discourse scale. We also elaborate
on how we actually apply these indices in measuring the discourse scale of tweet
sequences. In Sect. 4, we discuss the result of the analysis. Section 5 concludes the
study.
2 Related Work
3 Method
We investigated the “discourse scale”—which can be observed from the point of view
of the number of successive tweets that constitute a coherent unit of topical discourse—
of Japanese Twitter accounts. We chose Twitter accounts that explicitly list books as
one of their interests, as we are developing the book recommendation system men-
tioned earlier. In particular, we are interested in whether book lovers have a charac-
teristic discourse scale. For purposes of comparison, we also analyse Twitter accounts
that deal with film, and accounts that deal with other interests. Films are chosen as a
similar media-related interest to books or reading. The accounts enjoying other interests
are collected for a simulation of average Twitter users because interests other than
books and films can cover almost any type of interest. This will show whether a
difference in interests relates to discourse scales of accounts or not.
We explain below how we chose accounts and collected basic tweet data, taking
book-related Twitter accounts as an example. From approximately 1,000 Japanese
Twitter accounts that state that they are book lovers in their profile, or accounts whose
profile (or user name) contains both “interest” and “reading” in Japanese, we selected 80
accounts randomly. Note that the initial 1,000 accounts were already biased by
Twitter’s recommendation algorithm as related to the starting account we prepared. For
each of these accounts, we collected 3,000 recent original tweets and selected the 50
most frequently occurring content words (nouns, verbs, and adjectives) in the 3,000
tweets1. Thus each account has a different set of 50 content words2. Hereafter, we refer
to these as “discourse keywords”. Note also that the discourse keywords may not
necessarily be related to books or reading, but this is valid because what we are
concerned with is the discoursal scale of Twitter accounts, and not the discoursal scale
for book-related content.
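The keyword selection step can be sketched as follows. This is a minimal illustration, assuming that the tweets have already been tokenised and POS-tagged upstream (for example by MeCab) into (surface, pos) pairs; the function name, the tag labels, and the stop-word set are hypothetical and not taken from our implementation.

```python
from collections import Counter

# Hypothetical tag labels; a real tagger would emit its own tag set.
CONTENT_POS = {"noun", "verb", "adjective"}

def discourse_keywords(tagged_tweets, stop_words=frozenset(), top_n=50):
    """Return the top_n most frequent content words of one account.

    tagged_tweets: list of tweets, each a list of (surface, pos) pairs.
    """
    counts = Counter(
        surface
        for tweet in tagged_tweets
        for surface, pos in tweet
        if pos in CONTENT_POS and surface not in stop_words
    )
    return [word for word, _ in counts.most_common(top_n)]
```

Calling the function once per account yields a separate keyword set for each account, which matches the per-account design described above.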
Twitter accounts that deal with films were collected in the same manner. Those that
deal with other interests were regarded as accounts whose profile contains the word
“interest” but not “reading” or “films” in Japanese. For each account, intervals and
discoursal spans are observed for each of the 50 words. We explain below how we
measured the interval and the discoursal span.
1 In order to infer the Part-of-Speech tags of Japanese words, we adopted the Japanese morphological
analyser MeCab (https://github.com/taku910/MeCab), with a dictionary enhanced for neologisms
frequently appearing online [29]. The version we used was released on 24th April 2017.
2 In any set of content words, we removed Japanese stop-words suitable for content analysis [14],
which enables us to exclude delexical words among nouns, verbs, and adjectives.
Measuring Discourse Scale of Tweet Sequences: A Case Study 153
due partly to space limitations and partly to the fact that we are interested in gaining
insight into the general discoursal characteristics of Twitter accounts that deal with
books in contrast to accounts that deal with different interests such as films or other
interests, we focus on the differences between different groups, i.e. those accounts that
are interested in books, in films, and in other interests. For that, we only give summary
figures of the intervals. That is, we first obtain the mean intervals of 50 discourse
keywords per user, and then further calculate the mean of these mean intervals among
each user group. We also calculate their maximum, minimum, quantiles, and standard
deviation values.
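The two-level averaging can be sketched as below. This is only an illustration under stated assumptions: an interval is taken to be the number of tweets between two successive tweets containing the keyword, each tweet is represented as a collection of words, and all function names are hypothetical.

```python
import statistics

def keyword_intervals(tweets, keyword):
    """Gaps (in tweets) between successive tweets that contain keyword."""
    positions = [i for i, words in enumerate(tweets) if keyword in words]
    return [b - a for a, b in zip(positions, positions[1:])]

def user_mean_interval(tweets, keywords):
    """Mean over keywords of each keyword's mean interval for one user."""
    means = []
    for kw in keywords:
        gaps = keyword_intervals(tweets, kw)
        if gaps:  # a keyword seen fewer than twice has no interval
            means.append(statistics.mean(gaps))
    return statistics.mean(means) if means else None

def group_summary(user_means):
    """Descriptive statistics over the per-user means of one group."""
    q1, q2, q3 = statistics.quantiles(user_means, n=4, method="inclusive")
    return {
        "mean": statistics.mean(user_means),
        "std": statistics.stdev(user_means),
        "min": min(user_means),
        "25%": q1, "50%": q2, "75%": q3,
        "max": max(user_means),
    }
```

The summary dictionary has the same rows as the descriptive statistics reported for each user group; the quantile method is an assumption and other conventions would give slightly different quartiles.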
Table 1 shows the summary figure for the accounts dealing with books, films, and
other interests.
Table 1. Descriptive statistics on user-wide mean values of mean tweet intervals among the top
50 frequent content words in the accounts dealing with books, films, and other interests.
        Books    Films    Other
Mean    63.17    64.91    60.32
Std.    31.11    38.60    27.58
Min.     5.55     9.67     6.67
25%     42.98    44.09    40.16
50%     56.96    57.19    59.27
75%     80.39    78.50    73.53
Max.   153.60   317.64   135.20
Fig. 1. Tweet length per discourse unit (x-axis) and DMI (y-axis) of book lover users.
Fig. 2. Tweet length per discourse unit (x-axis) and DMI (y-axis) of film lover users.
The discoursal span for a word w for each user account is calculated as follows:
1. Divide the 3,000 tweets into P units.
2. For each word w among the 50 most frequent words (discourse keywords):
a. Calculate the mutual information between w and J for the original text:
\[
MI(w; J) = p(w) \sum_{j=1}^{P} p(j \mid w) \log_2 \frac{p(w)\, p(j \mid w)}{p(w)\, p(j)}
         = p(w) \sum_{j=1}^{P} p(j \mid w) \log_2 \frac{p(j \mid w)}{p(j)};
\]
b. Calculate the mutual information between w and J for randomly reordered text:
\[
\langle MI(w; J) \rangle = p(w) \sum_{j=1}^{P} \langle \hat{p}(j \mid w) \rangle \log_2 \frac{\langle \hat{p}(j \mid w) \rangle}{p(j)};
\]
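Steps (a) and (b) can be sketched in code as follows. This is a rough illustration under several assumptions of ours: p(w) is the relative frequency of w among all tokens, p(j|w) is the share of the occurrences of w that fall in unit j, p(j) is the share of all tokens that fall in unit j, and the bracketed quantity is approximated by averaging the mutual information itself over a fixed number of random token shuffles, which is a simplification of the bracketed estimate. Function names and the number of trials are hypothetical.

```python
import math
import random

def mutual_information(units, word):
    """MI(w; J) for one word over a list of units (each a list of tokens)."""
    total_tokens = sum(len(u) for u in units)
    counts = [u.count(word) for u in units]
    n_w = sum(counts)
    if n_w == 0 or total_tokens == 0:
        return 0.0
    p_w = n_w / total_tokens
    mi = 0.0
    for unit, c in zip(units, counts):
        if c == 0:
            continue  # terms with p(j|w) = 0 contribute nothing
        p_j_given_w = c / n_w
        p_j = len(unit) / total_tokens
        mi += p_j_given_w * math.log2(p_j_given_w / p_j)
    return p_w * mi

def shuffled_mutual_information(units, word, trials=20, seed=0):
    """Average MI over random reorderings that keep the unit sizes fixed."""
    rng = random.Random(seed)
    sizes = [len(u) for u in units]
    tokens = [t for u in units for t in u]
    total = 0.0
    for _ in range(trials):
        rng.shuffle(tokens)
        reshuffled, start = [], 0
        for size in sizes:
            reshuffled.append(tokens[start:start + size])
            start += size
        total += mutual_information(reshuffled, word)
    return total / trials
```

A word concentrated in a few units yields a larger value on the original text than on the shuffled text, whereas an evenly spread word yields values near zero in both cases.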
4 Discussion
First, from Table 1, we can say that there is little outstanding difference among user
groups of different interests. This result describes the distributions of mean tweet
intervals at which the top 50 frequent content words appear within each user’s tweet
sequences. Roughly speaking, for an average user in each user group, an average discourse
keyword appears once in a sequence of around 60 tweets. Second, from Table 2, we
found that the user group of film lovers has relatively longer mean discourse spans than
the others, whereas book lovers and accounts with other interests have similar distri-
butions. We chose these interests for comparative purposes, and the accounts dealing
with other interests can be regarded as a group of average users because people with
any specific interests excluding books and films cover a much wider variety than
people who specifically like books or films. Thus, the results can be interpreted as
signifying that film lovers on Twitter behave differently from average accounts while book
lovers behave more similarly. We informally observed that the discourse keywords
among the accounts dealing with films tend to contain more film-related words than
those among the other two user groups. That is, accounts listing books as their interest
seem to mention book/reading-related words much less in comparison to accounts that
like films.
Another finding of the discourse span analysis is that the outlier accounts with
high DMI values were all bot accounts or accounts that mainly tweet
via tweet-automation services. This is because such accounts tend to pack many
words into one tweet, which results in larger texts within each discoursal unit of tweets in
the calculations of Sect. 3.2.
The figures of discourse spans suggest that users in each group can be segmented
into several groups according to their DMI values. These segmentations can be related
to users’ tweeting behaviours such as tweet frequencies per day.
156 S. Yada and K. Kageura
5 Conclusion
In this study, we analysed the discourse scale of tweet sequences and examined their
characteristics, for Japanese Twitter accounts that declare their interests to be books,
films, and some other interests. We prepared 80 accounts for each of the above three
groups and gathered 3,000 tweets per user. Applying discourse scale calculation to the
top 50 content words (discourse keywords) of each account, we found that, regardless
of their interests, Twitter users seem to mention their favourite topics at intervals of
around 60 tweets.
We plan to conduct further analyses of the discourse scale of Twitter accounts, aspects
of which remain to be clarified in this research. We will examine the effects of individual
discourse keywords in terms of their types and topics. We are also going to investigate
the characteristics of users segmented by tweet frequency within a given time span.
References
1. Adams, P.H., Martell, C.H.: Topic detection and extraction in chat. In: 30th International
Conference on Software Engineering, pp. 581–588 (2008)
2. Barzilay, R., Elhadad, M.: Using lexical chains for text summarization. In: ACL Workshop
on Intelligent Scalable Text Summarisation, pp. 111–121 (1997)
3. de Beaugrande, R., Dressler, W.U.: Introduction to Text Linguistics. Longman, London
(1981)
4. Benevenuto, F., Haddadi, H., Gummadi, K.: The world of connections and information flow
in Twitter. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 42(4), 991–998 (2012)
5. Bi, B., Cho, J.: Modeling a retweet network via an adaptive Bayesian approach. In: 25th
International World Wide Web Conference, pp. 459–469 (2016)
6. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3(4–5),
993–1022 (2003)
7. Blei, D.M., Lafferty, J.: Topic models. In: Srivastava, A., Sahami, M. (eds.) Text Mining:
Classification, Clustering, and Applications, pp. 71–93. CRC, London (2009)
8. Bollen, J., Mao, H., Pepe, A.: Modeling public mood and emotion: Twitter sentiment and
socio-economic phenomena. In: Fifth International AAAI Conference on Weblogs and
Social Media, pp. 450–453 (2011)
9. Brown, G., Yule, G.: Discourse Analysis. Cambridge University Press, Cambridge (1983)
10. Can, E.F., Oktay, H., Manmatha, R.: Predicting retweet count using visual cues. In: 22nd
ACM International Conference on Information and Knowledge Management, pp. 1481–
1484 (2013)
11. Dascalu, M.: Analyzing Discourse and Text Complexity for Learning and Collaborating.
Springer, Heidelberg (2014)
12. Guzman, J., Poblete, B.: On-line relevant anomaly detection in Twitter stream: an efficient
bursty keyword detection model. In: 19th ACM SIGKDD Conference on Knowledge
Discovery and Data Mining, pp. 31–39 (2013)
13. Halliday, M.A.K., Hasan, R.: Language, Context and Text. Deakin University Press,
Geelong (1985)
14. Kokubu, H., Yamazaki, H., Nosaka, M.: Japanese stopword list making for keyword
extraction suitable for semantic interpretation. Trans. Japan Soc. Kansei Eng. 12, 511–518
(2013). [in Japanese]
15. Luo, Z., Osborne, M., Tang, J., Wang, T.: Who will retweet me? Finding retweeters in
Twitter. In: 36th International ACM SIGIR Conference on Research and Development in
Information Retrieval, pp. 869–872 (2013)
16. Mani, I., Bloedorn, E., Gates, B.: Using cohesion and coherence models for text
summarisation. AAAI technical report (1998)
17. Marcu, D.: The theory and practice of discourse parsing and summarization. MIT Press,
Cambridge, Mass (2000)
18. Mathioudakis, M., Koudas, N.: TwitterMonitor: trend detection over the Twitter stream. In:
2010 ACM SIGMOD International Conference on Management of Data, pp. 1155–1158
(2010)
19. Montemurro, M.A., Zanette, D.: Towards the quantification of the semantic information
encoded in written language. Adv. Complex Syst. 13(2), 135–153 (2009)
20. Montemurro, M.A., Zanette, D.: The statistics of meaning: Darwin Gibbon Moby Dick.
Significance 6(4), 165–169 (2014)
21. Montemurro, M.A.: Quantifying the information in the long-range order of words: semantic
structures and universal linguistic constraints. Cortex 55, 5–16 (2014)
22. Neubig, G., Duh, K.: How much is said in a Tweet? A multilingual, information-theoretic
perspective. In: AAAI Spring Symposium: Analyzing Microtext, pp. 32–39 (2013)
23. Paris, C., Wan, S.: Listening to the community: social media monitoring tasks for improving
government services. In: The ACM CHI Conference on Human Factors in Computing
Systems, pp. 2095–2100 (2011)
24. Paris, C., Thomas, P., Wan, S.: Differences in language and style between two social media
communities. In: 6th International AAAI Conference on Weblogs and Social Media (2012)
25. Pezzoni, F., An, J., Passarella, A., Crowcroft, J., Conti, M.: Why do I retweet it? An
information propagation model for microblogs. In: 5th International Conference on Social
Informatics, pp. 360–369 (2013)
26. Roberts, K., Roach, M.A., Johnson, J., Guthrie, J., Harabagiu, S.M.: EmpaTweet: annotating
and detecting emotions on Twitter. In: 8th International Conference on Language Resources
and Evaluation, pp. 3806–3813 (2012)
27. Sakaki, T., Toriumi, F., Matsuo, Y.: Tweet trend analysis in an emergency situation. In:
ACM the Special Workshop on Internet and Disasters, no. 3 (2011)
28. Silber, G., McCoy, K.: Efficiently computed lexical chains as an intermediate representation
for automatic text summarization. Comput. Linguist. 28(4), 487–496 (2003)
29. Toshinori, S.: Neologism dictionary based on the language resources on the web for Mecab
(2015)
30. Wan, S., Paris, C.: Understanding public emotional reactions on Twitter. In: 9th
International AAAI Conference on Weblogs and Social Media (2015)
31. Yada, S.: Development of a book recommendation system to inspire “infrequent readers”.
In: 16th International Conference on Asia-Pacific Digital Libraries, pp. 399–404 (2014)
32. Yang, Z., Guo, J., Cai, K., Tang, J., Li, J., Zhang, L., Su, Z.: Understanding retweeting
behaviors in social networks. In: 19th ACM International Conference on Information and
Knowledge Management, pp. 1633–1636 (2010)
33. Zhao, D., Rosson, M. B.: How and why people Twitter: the role that micro-blogging plays in
informal communication at work. In: ACM International Conference on Supporting Group
Work, pp. 243–252 (2009)
Mobile Applications
Tracking Smartphone App Usage
for Time-Aware Recommendation
1 Introduction
In recent years, with the substantial growth of the smartphone app market, millions
of people use smartphones as their primary device not only for communication,
but also for accessing information regarding bus schedules, maps, events, news
or even for entertainment and other specialized apps. As a result, people
carry their smartphones and use them throughout the day. The interaction
of users with their smartphones can be a rich source of information about
users’ habits and interests and can have applications in recommending apps to
users [12], retrieving a desired app and showing it on the lock screen just when
the user needs it (similarly to what personal assistants such as Apple’s Siri do),
or even logging one’s life to summarize one’s activities in order to increase
one’s self-awareness and self-organization. The latter example has been the focus
of several productivity apps.
© Springer International Publishing AG 2017
S. Choemprayong et al. (Eds.): ICADL 2017, LNCS 10647, pp. 161–172, 2017.
https://doi.org/10.1007/978-3-319-70232-2_14
162 S.A. Bahrainian and F. Crestani
“An empirical study conducted by the Yahoo Aviate team shows that on
average there are 96 apps installed on each mobile device. This large number of
apps installed calls for the design of new paradigms for the management of the
installed apps” [8].
Therefore, it is necessary to organize apps as a form of personal information
management, so that the right app can be accessed at the right time, or to rearrange
the apps such that the ones that are more likely to be used are easily accessible.
Furthermore, such data can be utilized for Just-In-Time Information
Retrieval (JITIR), e.g., predicting the next app that a user is going to use and
showing it on the home screen just before the user tries to access it. Such prediction
can be useful because: (1) a user who has hundreds of apps installed on her smartphone
would not need to go through the cumbersome process of finding an app
on her device, as it could be opened or placed in a notification bar at the right
time; (2) modeling a user and showing how she behaves can bring increased
self-awareness; (3) in case a user is distracted and forgets to interact with a certain
app, it can aid her memory and remind her about the event. The latter motivation
originates from studies that show the effectiveness of presenting information
to a user which could serve as memory cues [5,6].
Baeza-Yates et al. [3] note that “given the large number of installed apps
and the limited screen size of mobile devices, it is often tedious for users to search
for the app they want to use. Although some mobile OSs provide categorization
schemes that enhance the visibility of useful apps among those installed, the
emerging category of homescreen apps aims to take one step further by auto-
matically organizing the installed apps in a more intelligent and personalized
way”. We follow the same aim in this paper. Most of the current models for pre-
dicting app usages focus primarily on frequency and co-occurrence of patterns
in order to recommend apps that are likely to be used at a specific time. How-
ever, the decisions that people make in exhibiting certain behaviors and habits
are also influenced by complex cognitive memory functions in their minds. A
person’s memory can recall certain behaviors at different times depending on
time, mood, the surrounding environment, etc. Thus, we design a time-aware
app recommendation system based on the psychology of human memory [13]
that tracks a person’s app usage log in order to assist her with organizing apps.
In this paper we specifically would like to solve a problem which is formally
defined as: given the app usage behavior of a user from time slices t1 , t2 , . . . , tn
and the corresponding contexts, predict the app that the user will use during a
future time slice. That is:
where t shows a specific time and c represents a specific context for time t. Thus,
the aim is to find an app which maximizes the probability of it being used by a
user under a certain context and time.
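Under the assumption that a denotes a candidate app and A the set of apps installed by the user (both symbols are hypothetical and introduced only for illustration), the objective described here can be sketched as:

\[
\hat{a} = \arg\max_{a \in A} P(a \mid t, c)
\]

Here t and c are the time and context defined above, and \hat{a} is the app to be recommended.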
This goal is based on the belief that users have repeating behaviors in using
certain apps at a specific time and under a specific context. As an example,
Fig. 1 shows the behavior of a user in using communication apps. From the
figure we observe that this user has significantly more communications around
noon, in the afternoons and in the evenings. From this we can infer that this
user is maintaining regular office hours. Additionally, we observe that the user
is more inclined to use communication apps on Fridays and Saturdays. We
notice all of this just by looking at the statistical pattern of this user. We therefore
hypothesize that by looking at the usage patterns of all apps that this user has
installed on her smartphone, we would be able to predict the next app that she
is going to use. Therefore, our goal is to develop a model that can predict the
app usage behavior of a user. Such models can be beneficial in assisting users
in their everyday lives by sending them relevant information and notifications.
Moreover, a person’s memory could be augmented with relevant information just
when it is needed. This goal motivates this paper. We present
a model that not only can predict the app usage behavior of a user but also, if
trained with a different dataset, can predict other aspects of a user’s behavior.
Fig. 1. The behavior of a user from the Frappe dataset in using communication apps
The remainder of this paper is organized as follows: Sect. 2 briefly discusses
previous related work. Furthermore, Sect. 3 describes two state-of-the-art meth-
ods that we use in this paper as baselines. Section 4 presents our proposed model.
Section 5 describes our experimental setup and initial evaluation. Finally, Sect. 6
concludes this paper and presents insights into future work.
2 Related Work
Predicting the next app usage has been studied before in the literature. Do et al.
[12] proposed applying the Author Topic Model (ATM) to the problem of next
app usage prediction. They used the Nokia Challenge dataset [15] in their work.
They showed that this model was capable of modeling users’ app usage behavior
and effectively predicting future app usages. They use a bag-of-apps approach with
the aim of discovering the level of phone usage over specific times of a day, using
the probabilistic ATM to represent each user as a mixture of different patterns.
We use this model as a baseline for our work. In Sect. 3 we describe the ATM
model in more detail.
Another related work in this domain is [3], which studies how to improve
homescreen apps’ usage experience through a prediction mechanism that
shows users which app they are going to use in the immediate future. The
prediction technique they propose is based on a set of features representing
the real-time spatiotemporal contexts sensed by their homescreen app. They
model the prediction of the next app as a classification problem and propose a
personalized method to solve it.
Furthermore, Baltrunas et al. [8] carried out a related in-the-wild study of
users’ app usage patterns, which indicates that contextual variables such as location
and time are very important signals for modeling app usage and providing
recommendations. They also report that feedback collected in their
small-scale user study shows that, while users understand the value of
context-dependent adaptation, their expectations in this regard are also very high.
They provide a set of lessons learned which outline important considerations for
designing, deploying and evaluating mobile context-aware recommender systems
in-the-wild with real users.
Another study [9], which also motivates studying users’ app usage patterns,
aims at modeling the life cycle of a user using a certain app. This life cycle
includes phases such as first view, installation, direct usage and long-term usage.
Based on the user’s app usage behavior, the authors then try to recommend new apps
to the users. To achieve this goal, they designed a usage-centric evaluation
considering different phases of application engagement.
Furthermore, [10] carried out a large-scale deployment-based research study
that logged detailed application usage information from over 4,100 users of
Android-powered mobile devices. They also study app usage patterns based on
contextual factors such as time of the day and location. They present some
interesting findings based on their large-scale user study. For instance, they found
that people are most likely to use news apps in the mornings and games at
night. They also mention that communication applications are almost always
the first used upon a device’s waking from sleep.
Deerwester et al. [11] propose applying the Singular Value Decomposition
(SVD) method to a very different but related problem. SVD, or Latent Semantic
Indexing (LSI), has been extensively used in the literature to identify semantic
patterns in a dataset in order to recommend items to users [19]. SVD-based models
have long been state of the art in collaborative recommender systems. Therefore,
we use SVD as a second baseline for identifying the times of the day at which a user
is more likely to use a certain app.
Use of user profiling and personalization has been explored in the past,
not only for app recommendation but also in other domains. For instance a
3 Baselines
3.1 Author Topic Model
Since it was shown that ATMs are effective in addressing the problem of next
app prediction [12], we include this model to be compared against our model as
a baseline. In the following we briefly explain the ATM model.
In other words, let us say we divide the different times of a day over the different days
of a week into multiple time buckets. Assume that in each of the time buckets
a user is required to use a different app. The ATM model can identify which
apps are actually triggered as a function of a specific time bucket and which
app usages are just random occurrences which could be seen as noise. Hence, it
assigns those apps that strongly correlate with a certain time and context to its
corresponding time bucket.
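As an illustration of such time buckets, the sketch below maps a timestamp to a (weekday, slot) pair. The bucket width of four hours and the function name are hypothetical choices of ours; the text does not fix a particular granularity.

```python
from datetime import datetime

def time_bucket(ts, hours_per_bucket=4):
    """Map a timestamp to a (weekday, slot) bucket; weekday 0 is Monday."""
    return ts.weekday(), ts.hour // hours_per_bucket
```

Under this scheme, each bucket plays the role that an author plays in the original topic model, and the apps used within a bucket play the role of the words written by that author.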
Figure 2 illustrates the graphical model of ATM. In the figure, x indicates
the time bucket responsible for a given app chosen from ad . Each time bucket is
associated with a distribution over topics θ, chosen from a symmetric Dirichlet
(α) prior. The mixture of weights corresponding to the chosen author is used to
select a topic z, and a word (i.e. app word) is generated based on the distribution
φ drawn from a symmetric Dirichlet (β). For further details about this model
and our implementation of it, we refer to [12]. The only difference between our
implementation and that of [12] is that we use variational inference as opposed
to Gibbs sampling.
Figure 3 shows the matrix decomposition of SVD. In this case k topics will
be extracted from t words presented in matrix T. Furthermore, the diagonal of
matrix I presents the rank or strength of each topic. Hence, using this simple
matrix decomposition technique, the dimensionality of the documents in a corpus
is reduced and topics are extracted. Additionally, matrix D presents the
similarity of each document to each topic. One advantage of this model is that it is
deterministic, meaning that the resulting topics are always the same in different
runs of the algorithm on the same dataset. Another advantage of this model is
that it is simple to interpret and understand.
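In matrix form, and assuming a hypothetical term-by-document matrix X with t rows (words) and d columns (documents), where the symbols X and d are introduced here only for illustration, the rank-k decomposition described above can be sketched as:

\[
X_{t \times d} \approx T_{t \times k} \, I_{k \times k} \, \left( D_{d \times k} \right)^{\top}
\]

Here the diagonal entries of I are the topic strengths, the columns of T relate words to topics, and the rows of D relate documents to topics.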
Analogous to the ATM, we train one LSI model per user. After a model is
trained, we query it with unseen test data to compute a ranked list of
similar app usage entries under the model. As will be explained later, for
each ranked list we take the 5 highest-ranked app names to examine the
correctness of the result. We then compare these highest-ranked apps with the
ground truth data to assess the performance of this model. This process is
explained in detail in Sect. 5.
4 Methodology
In this section we introduce a novel method for predicting the next app that
one is likely to use. Our method considers the passage of time in order to model
change in the behavior of a user. We want to identify those user behaviors that
have persisted over time. That is because a behavior which is both frequent and
persistent over time is more likely to be repeated in the future.
We base our model on the SVD/LSI topic model. However, by using a time
series we model the changes in behavior of a user with respect to a certain app
over time.
For this purpose we break down the data containing sequences of app usage
of one user into n partitions or time slices. Then we apply an SVD/LSI to each
partition. By doing so, we aim to model a user in different time slices and
then aggregate the results in a time-series fashion. The rationale behind our
method is that applying a single SVD/LSI to the entire dataset will result in
a global model of the user. Such a model would treat each app usage as a word
in a bag-of-words model. Furthermore, the time factor is not considered at all.
Therefore, the strategy behind our model assumes that there might be some
local changes in a user’s behavior due to certain needs or state of mind. Under
our model, there is the assumption that a user does not necessarily behave the
same over different consecutive time slices.
As explained, SVD/LSI can identify the most frequent patterns of a user.
By modeling the most salient behavioral patterns of a user’s app usage behavior
over time, we can further identify those behaviors that are established over time.
In simple terms, our model assumes that there is a high chance that a user would
repeat a behavior that not only has been frequent in the entire dataset, but also
is persistent over time. We visualize our model in Fig. 4.
After n models are trained for each user for n consecutive time slices, we
query each of the n models with an unseen test data which describes a context
\[
P_{w,c} = \sum_{w_i \in v^{(n)}} \sum_{n=1}^{n} e^{-(n-1+\lambda)} \ast \left| P(w_{i,n}) \right| \tag{2}
\]
where n is the time slice sequence number, wi,n is the probability of app name
w derived from the nth time slice. The resulting constructed vector is an average
representation of probability of all app names present in all n models where
the most persistent behaviors are weighted higher. Finally, λ is the persistence
(i.e. establishment of a behavior) rate which models an exponential factor of
time. The use of an exponential factor is due to findings of psychology research
[13] which show that forgetting is an exponential function of time. This finding
has been used by information retrieval researchers [16] who have modeled user
behavior over time. Therefore, since every time a user repeats a behavior it shows
that the user’s memory recalls this behavior, we multiply it by an exponential
factor of time. In this research we set λ to 1.5. As future work, we plan to test
the effect of this parameter.
Finally, based on the final ranked list of apps computed by the above equa-
tion, we can predict the app usage behavior of users. A higher rank shows a
higher likelihood of a user using the corresponding app at the time specified in
the query.
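The aggregation in Eq. (2) can be sketched as follows. This is a minimal illustration, not our implementation: it assumes that each of the n per-slice models has already returned a score for every app name it knows, that slices are numbered from 1, and that an app missing from a slice contributes nothing. The function name and the input layout are hypothetical; the default λ = 1.5 is the value stated above.

```python
import math

def aggregate_slices(slice_scores, lam=1.5):
    """Combine per-slice app scores with the weight exp(-(n - 1 + lam)).

    slice_scores: list of dicts (one per time slice, in order), each
    mapping an app name to the score returned by that slice's model.
    Returns (app, score) pairs sorted from most to least likely.
    """
    totals = {}
    for n, scores in enumerate(slice_scores, start=1):
        weight = math.exp(-(n - 1 + lam))
        for app, p in scores.items():
            totals[app] = totals.get(app, 0.0) + weight * abs(p)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

An app that receives a score in several slices accumulates several weighted terms, so behaviors that persist across slices rise in the final ranking.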
5 Experimental Setup
5.1 Dataset Description
We use the dataset presented in [8] for our experiments. The original dataset
contains over 96,000 entries of app usage from 957 users. However, in our analysis
of the dataset we observed that many of the users have only a
few app usage entries. Therefore, we reduced the dataset to those users with at
least 200 app usage entries.
The resulting dataset contains 69,787 app usage entries belonging to 176
users. For the experiments reported in this paper we used this dataset. A full
description of the dataset is available in [8]. Out of all the attributes collected
for each app usage we are only interested in the day of the week, time of the
day, whether or not the day is a weekend, location, weather, category of app and
the app name. For training the models we use these entries, but for testing the
models we issue queries without the app categories and app names.
As explained earlier, we treat the data of each user independently from other
users. That is, for each user we train a separate model. In order to train and
test each model we need to divide the data of each user into a training set and
a test set. Since we are dealing with sequential data, we cannot divide the
data merely on a random basis. Instead, we take a 20% sample of the data in the
original sequence and use it as test data. That is, out of every 5 app usage
entries we put aside the last entry for testing the trained models. The
remainder of the data is then used for training the models.
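The sequential hold-out described above can be sketched as follows. This is
an illustration rather than the implementation used in the experiments: the
function name and the block_size parameter are hypothetical, and keeping a
trailing incomplete block in the training set is an assumption.

```python
def sequential_split(entries, block_size=5):
    """Split a time-ordered list of entries into training and test sets.

    The last entry of every full block of `block_size` consecutive entries is
    held out for testing; everything else, including a trailing incomplete
    block (an assumption), stays in the training set.
    """
    train, test = [], []
    for i, entry in enumerate(entries):
        if (i + 1) % block_size == 0:
            test.append(entry)
        else:
            train.append(entry)
    return train, test


# Ten entries in their original order: entries 5 and 10 become test data.
train, test = sequential_split(list(range(1, 11)))
```

Calling sequential_split(entries, block_size=3) gives the one-in-three
hold-out used in the second experiment of Sect. 5.3.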
The data is also pre-processed such that all blank spaces within app names and
categories are removed and all letters are lower-cased. Additionally, if an
attribute always remained the same, e.g. a user always stayed in Spain, that
attribute was removed from the data as a stop word. These steps are necessary
for training the topic models effectively.
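A minimal sketch of these pre-processing steps is given below. It assumes one
dictionary per app usage entry with hypothetical attribute names, and, as a
simplification, it normalizes every attribute value rather than only app names
and categories.

```python
def normalize(value):
    """Remove all blank spaces from a value and lower-case it."""
    return "".join(str(value).split()).lower()


def preprocess(entries):
    """Normalize attribute values, then drop attributes that never vary.

    An attribute with the same value in every entry of a user carries no
    information for that user and is removed like a stop word.
    """
    cleaned = [{k: normalize(v) for k, v in e.items()} for e in entries]
    if not cleaned:
        return cleaned
    constant = {
        k for k in cleaned[0]
        if len({e.get(k) for e in cleaned}) == 1
    }
    return [{k: v for k, v in e.items() if k not in constant} for e in cleaned]


# Hypothetical entries of one user who always stayed in Spain.
raw = [
    {"app": "Google Maps", "country": "Spain", "daytime": "morning"},
    {"app": "Angry Birds", "country": "Spain", "daytime": "evening"},
]
processed = preprocess(raw)
```

Here the "country" attribute is dropped because it never changes, and the app
names become single lower-cased tokens.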
We present our evaluation based on the presented dataset in Sect. 5.3.
Our goal in this study is to develop models that, given a specific time and
situation, can predict which app a person will use at that time and rearrange
the apps so that those most likely to be used at that time appear on the home
screen of the smartphone. Since most current smartphones have screens big
enough to show 5 apps on the home screen, we evaluate our models based on
whether or not the app that a user used at a specific point in time was shown
among the 5 apps on the home screen.
Therefore, our evaluation metric computes the accuracy (i.e. correctness) of
the recommended apps at a certain point in time such that:
\[ \mathrm{Accuracy} = \frac{\#\,\text{of CorrectlyPredictedApps}}{\#\,\text{of AllExaminedApps}} \tag{3} \]
170 S.A. Bahrainian and F. Crestani
where # of CorrectlyPredictedApps is the number of test entries for which the
app actually used was among the 5 apps that the model placed on the home
screen, and # of AllExaminedApps is the total number of app usage entries in
the test data. In the next section we present the results of our experiment.
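The metric in Eq. (3) can be sketched as a top-k hit rate with k = 5. The
function below is an illustration under the assumption that a model returns,
for each test query, a list of app names ordered from most to least likely;
the function and variable names are hypothetical.

```python
def top_k_accuracy(ranked_lists, used_apps, k=5):
    """Fraction of test entries whose used app is among the top-k ranked apps.

    ranked_lists[i] is the model's ranking for test query i (best first) and
    used_apps[i] is the app the user actually opened for that query.
    """
    if len(ranked_lists) != len(used_apps):
        raise ValueError("one ranking is required per test entry")
    if not used_apps:
        return 0.0
    hits = sum(
        1 for ranking, app in zip(ranked_lists, used_apps)
        if app in ranking[:k]
    )
    return hits / len(used_apps)


# Two toy test queries: the first is a hit, the second a miss.
rankings = [
    ["mail", "maps", "chat", "news", "music", "game"],
    ["game", "news", "mail", "chat", "maps", "music"],
]
used = ["chat", "music"]
accuracy = top_k_accuracy(rankings, used)
```

In the second query the app used is ranked sixth, so it would not appear among
the 5 apps on the home screen and counts as an incorrect prediction.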
5.3 Evaluation
In this section, we present the results of our evaluation of the models presented.
We first train the models for each user and test them using the evaluation metric
described in Subsect. 5.2. For this purpose, we set the number of topics across
all models to 10 so that the models are comparable. Also, we set the number of
time slices n of our novel time-aware model to 4.
Then we compute the accuracy of the three presented models and compare
them. Table 1 shows the results of our experiment.
Table 1. Results of comparison between our time-aware model, the Author Topic
Model (ATM) and the Singular Value Decomposition (SVD/LSI) based on accuracy.
As can be seen in the table, our model outperforms both the ATM and the
SVD/LSI models in terms of accuracy. Our intuition from observing these results
is that our model not only finds a generalized pattern based on the frequency
of usage of certain apps at certain times, but also finds a pattern which is
generalizable over time. In other words, our model removes noisy observations
which are not persistent over time.
That may be the reason why our method shows better results than ATM and
SVD/LSI. Our proposed models could be used to manage and organize the apps that
a user has installed on her mobile phone by rearranging the apps according to
the behavior of the user at each specific time.
In a second experiment, we assess the impact of the sampling rate of the
training data. As described in Sect. 5.1, we took a 20% sample of the
sequential app usage as test data. In this experiment we analyze the effect of
increasing the sampling rate, which reduces the amount of training data: out of
every 3 app usage entries we hold the last entry out as test data. We have
therefore increased the sampling rate to slightly more than 33% (Table 2).
Table 2. Results of the comparison between all models with an increased sampling
rate of 33% for the test data.
As we see in the table, the accuracy of all three models drops. However, our
model still outperforms the two baseline models despite the reduced amount of
training data.
6 Conclusion
With the rapid proliferation of smartphones and the variety of apps developed
for them, there is a need for managing and organizing such content. In this
paper, we tackled the problem of predicting users’ app usage behavior and
time-aware organization of apps in a personalized way. We first presented two
baseline methods which could be used for solving such a problem. Then, we
proposed a time-aware model for the same task. Our results demonstrate that our
model is superior to the baseline methods in terms of accuracy.
In the future we plan to extend the current work to analyzing all interactions
between users and smartphones. Additionally, we would like to assess in greater
depth the content that users interact with on smartphones, in order to design
models that can proactively retrieve the information users need just in time.
The retrieved information could be shown to the user on the smartphone home
screen, or in the form of notifications and reminders.
References
1. Aliannejadi, M., Mele, I., Crestani, F.: Personalized ranking for context-aware
venue suggestion. In: SAC, pp. 960–962 (2017)
2. Aliannejadi, M., Rafailidis, D., Crestani, F.: Personalized keyword boosting for
venue suggestion based on multiple LBSNs. In: Jose, J.M., Hauff, C., Altıngovde,
I.S., Song, D., Albakour, D., Watt, S., Tait, J. (eds.) ECIR 2017. LNCS, vol. 10193,
pp. 291–303. Springer, Cham (2017). doi:10.1007/978-3-319-56608-5_23
3. Baeza-Yates, R., Jiang, D., Silvestri, F., Harrison, B.: Predicting the next app that
you are going to use. In: Proceedings of the Eighth ACM International Conference
on Web Search and Data Mining, WSDM 2015, pp. 285–294 (2015)
4. Bahrainian, S.A., Bahrainian, S.M., Salarinasab, M., Dengel, A.: Implementation
of an intelligent product recommender system in an e-store. In: An, A., Lingras,
P., Petty, S., Huang, R. (eds.) AMT 2010. LNCS, vol. 6335, pp. 174–182. Springer,
Heidelberg (2010). doi:10.1007/978-3-642-15470-6_19
5. Bahrainian, S.A., Crestani, F.: Cued retrieval of personal memories of social
interactions. In: Proceedings of the First Workshop on Lifelogging Tools and
Applications, LTA 2016, pp. 3–12 (2016)
6. Bahrainian, S.A., Crestani, F.: Towards the next generation of personal assistants:
systems that know when you forget. In: Proceedings of the 2017 ACM International
Conference on the Theory of Information Retrieval, ICTIR 2017 (2017)
7. Bahrainian, S.A., Mele, I., Crestani, F.: Modeling discrete dynamic topics. In:
Proceedings of the Symposium on Applied Computing, SAC 2017, pp. 858–865
(2017)
8. Baltrunas, L., Church, K., Karatzoglou, A., Oliver, N.: Frappe: Understanding the
usage and perception of mobile app recommendations in-the-wild. arXiv preprint
arXiv:1505.03014 (2015)
9. Böhmer, M., Ganev, L., Krüger, A.: Appfunnel: a framework for usage-centric
evaluation of recommender systems that suggest mobile applications. In:
Proceedings of the 2013 International Conference on Intelligent User Interfaces,
IUI 2013, pp. 267–276 (2013)
10. Böhmer, M., Hecht, B., Schöning, J., Krüger, A., Bauer, G.: Falling asleep with
angry birds, facebook and kindle: a large scale study on mobile application usage.
In: Proceedings of the 13th International Conference on Human Computer
Interaction with Mobile Devices and Services, MobileHCI 2011, pp. 47–56 (2011)
11. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.:
Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391 (1990)
12. Do, T.-M.-T., Gatica-Perez, D.: By their apps you shall understand them: mining
large-scale patterns of mobile phone usage. In: Proceedings of the 9th International
Conference on Mobile and Ubiquitous Multimedia, MUM 2010, pp. 27:1–27:10
(2010)
13. Ebbinghaus, H.: Memory: A contribution to experimental psychology (1885).
Translated in 1913
14. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd
Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, SIGIR 1999, pp. 50–57 (1999)
15. Laurila, J.K., Gatica-Perez, D., Aad, I., Bornet, B.J.O., Do, T.M.T., Dousse, O.,
Eberle, J., Miettinen, M.: The mobile data challenge: big data for mobile computing
research. In: Pervasive Computing (2012)
16. Li, W., Eickhoff, C., de Vries, A.P.: Probabilistic local expert retrieval. In:
Advances in Information Retrieval, pp. 227–239 (2016)
17. Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The author-topic model for
authors and documents. In: Proceedings of the 20th Conference on Uncertainty in
Artificial Intelligence, pp. 487–494. AUAI Press (2004)
18. Wang, C., Blei, D., Heckerman, D.: Continuous time dynamic topic models. In:
Proceedings of Uncertainty in Artificial Intelligence (2008)
19. Zhou, Y., Wilkinson, D., Schreiber, R., Pan, R.: Large-scale parallel collaborative
filtering for the netflix prize, pp. 337–348 (2008)
Use of Mobile Apps for Teaching and Research –
Implications for Digital Literacy
Abstract. This paper reports on the results of an online survey about mobile
application (app) use for academic purposes, i.e. teaching and research, by
Higher Degree Research (HDR) students and academic staff at one of the eight
New Zealand universities. Two thirds of the 138 respondents reported they used
apps for academic purposes. In teaching, apps were reported to be used as a
means to push information to students. In research, apps appeared to be used to
self-organise, collaborate with colleagues, store information, and to stay current
with research. This paper presents the survey results and discusses implications
for personal information management in an education context and opportunities for
university library services.
1 Introduction
Mobile learning has been claimed as the Future of Learning (Bowen and Pistilli 2012).
Mobile apps are a fundamental feature of mobile devices and can be valuable in higher
education for such activities as gathering and using information, accessing content,
promoting communication, collaboration and reflection (Bowen and Pistilli 2012;
Beddall-Hill et al. 2011). They also offer extended capacity to undertake research across
a wider range of locations than traditionally possible and enable the collection,
manipulation and sharing of data in real time (Hahn 2014). The intervention of
technology has the potential to prompt new practices in research, both expanding
and constraining relationships with the research process and methodological
approaches (Goble et al. 2012). This is not necessarily a smooth path. According
to Makori and Mauti (2016), usage of digital technologies is negatively affected
by a range of crucial factors, including inadequate social computing facilities,
insufficient information infrastructure coupled with weak institutional and
physical structures, insufficient information resources, and inadequate
knowledge, skills and competencies. Digital
literacy is increasingly on the agenda of higher education organisations as they commit
The research literature on using mobile apps for education and research purposes is
extremely sparse and significant potential for research in this area is evident in the
related work that we are able to present here. Discussion of digital tools for research has
focused on opportunities and challenges, ranging from technical issues to complex
concerns involving implications for future research processes (Carter et al. 2015;
Davidson et al. 2016; Garcia et al. 2016; Raento et al. 2009). Several studies have been
conducted on the selection, use or development of mobile apps by or for libraries (Wong
2012; Hennig 2014; van Arnhem 2015), mainly focusing on delivery of information or
data about the library services. These authors often feature mobile apps for
libraries, for example apps for ethnographic field research (van Arnhem 2015).
One of the pitfalls of writing about apps with respect to education is the tendency to
merely describe app functionalities. The University of Chester observed the ready
adoption of mobile note-taking software by undergraduate students (Schepman et al.
2012). The previously held concern that not all students have access to a smartphone is
not supported by recent data (Anderson 2015). However, McGeeney (2015) observed a
number of logistical and technical constraints for using mobile apps, compared to Web
browsers, for surveys, including lower response rates, increased costs applied by some
survey apps vendors and more design constraints which can involve limiting options
such as navigation buttons and check boxes. Due to the time and effort required to
learn how to use an app effectively, using apps resulted in lower response rates than
web-based data collection (Pew Research Center 2015). Carlos (2012) identified the advent of
mobile research tools as a useful supplement to the desktop computer. Within the
academic environment, provision of technical infrastructure is an accepted service for
both research and teaching/learning. Adopting an analogous view of mobile technology
may assist in exploring its potential. MacNeill (2015) suggests that academic staff make
use of apps for teaching and research purposes, with initial focus on keystone apps
around which to build the body of supporting apps (MacNeill 2015, p. 241).
3 Methodology
An online survey was conducted to investigate how mobile apps were being used for
teaching, research and learning purposes across the university.
Data Collection. Data were collected through an online, self-administered survey
intended as a snapshot of the situation across all faculties of a single university.
The university’s research office forwarded invitations to all departmental
administrators, who distributed the survey invitation to all the university’s
academics and researchers via email. For the higher-degree students, the School of
Graduate Research emailed their student body and posted the invitation on the
School’s Facebook page. The potential sample size was about 1400 participants
(including 820 students and 580 academics). Responses were anonymous and external
participation was excluded through the use of location-restriction in the Qualtrics
Survey Software.
Survey Questions. The survey comprised 24 items using Likert scales, radio buttons,
and free-text questions; for details see Hinze et al. (2017). The first section
comprised four demographic questions, followed by a short section on whether mobile
apps had been used. The third section focused on the device and operating system
used. The following, main section, depending on role and type of academic purpose
(teaching or research), sought reflection on aspects of mobile app use and on whether
such use had influenced research or teaching practice. Respondents who had not used,
and were not intending to use, mobile apps were asked for the reason.
Data Analysis. The results were analysed within Qualtrics using a variety of
reports, both default reports and cross-tabulations for measuring association. A
basic descriptive statistical analysis was applied to the data.
Use of Mobile Apps. Sixty-five percent of respondents (90 of 138) had used mobile
apps for academic purposes (71% of academic and 67% of student respondents; 73% of
male and 60% of female respondents). Of those who had used mobile apps for academic
purposes, most were in the Faculty of Computing and Mathematical Sciences, followed
by the Faculty of Education. Respondents showed a clear preference for smartphones
(twice as likely as the second preference, the iPad); further options were Android
tablets, cellphones and wearable devices. Most were using Android devices (>60%),
followed by iOS (48%); Mac, Windows and others made up about 26% (multiple
selections were possible).
Non-users. Thirty-five percent of respondents (48 of 138) had not used mobile apps
for academic purposes; half of these indicated they were not planning to do so either.
When asked what was stopping them, 23 people responded, some noting more than one
impediment. Nearly half considered their own lack of knowledge about how apps
might be used as the leading factor. Approximately one third of the responses indicated
that the responder was uninterested in apps and/or viewed them as irrelevant to their
teaching or research. Other responses included the opinion that computers offer better
options than mobile devices, with a lack of support also being stated as a reason for
future non-use. The 50% of non-users who might use apps in future named a range of
potential uses, such as document sharing (64%), communication (45%), note taking
(42%), storage (36%) and access to course information and data collection (both 32%).
These respondents were also asked to rate factors in increasing app usage; the question
was answered by 21 participants (see Fig. 1).
Fig. 1. Factors to encourage apps use: from very helpful (dark) to very unhelpful (blue) (Color
figure online)
be reliable enough that researchers can be confident that they will not suffer data losses
if they use just apps”), pedagogical usefulness (“[…] we have gone into more and more
web based teaching […] use of white board and limited amount of notes uploaded will
work well, with lot of laboratory type hands-on elements. I strongly believe that if we
[lose] the “human touch” in classroom setting, it will gradually and negatively affect
the quality of the graduates we produce”), and being concerned that “one can only
move as fast as students are able […] you have built a learning task on a particular
resource and then find that half the class cannot even access it”. Some respondents
expressed reservations about institutional support and felt “it would also be great if
there was some sort of online resource on the uni website that lists and briefly explains
some of the apps that might be useful when conducting research”. Some respondents
found apps inconvenient (“I despise having to download and constantly update several
apps, plus they come with intrusive permissions”) or they felt, at the present time, apps
were “Only useful where use of a real computer is impossible”. Several participants
noted that “it is challenging to find the most appropriate app to meet a specific teaching
purpose” or “to modify existing apps to suit the purpose of the user and the context of
the user”. Some of the comments by participants reveal concerns that seem born out of
a lack of practical experience with apps (e.g., having to constantly update apps and
students not being willing or able to engage with apps).
Purpose of App Usage. All 90 people who had used apps responded to a question about
the purpose (multiple selections possible): 36 (40%) had used them for
teaching/supervision and 80 (89%) for research purposes. Figure 2 shows the
distribution of roles of the users of mobile apps.
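Because multiple selections were possible, the two shares add up to more than
100%. The short sketch below re-derives the reported percentages from the counts
given in the text; the variable names are hypothetical, and the lower bound on
respondents who selected both purposes assumes that each of the 90 respondents
selected at least one of the two options.

```python
def share(count, total):
    """Percentage of `total` represented by `count`, rounded to an integer."""
    return round(100 * count / total)


respondents = 138   # all survey respondents
app_users = 90      # respondents who had used apps for academic purposes
teaching = 36       # app users who selected teaching/supervision
research = 80       # app users who selected research

user_share = share(app_users, respondents)
teaching_share = share(teaching, app_users)
research_share = share(research, app_users)

# Inclusion-exclusion lower bound on respondents who selected both purposes,
# assuming every app user selected at least one of the two.
min_both = max(0, teaching + research - app_users)
```

Under that assumption, at least 26 of the 90 app users had used apps for both
teaching/supervision and research.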
Apps for Teaching/Supervision. The 36 respondents using apps for teaching and
supervision were asked to select which apps they used from a list. They were also
asked to indicate if the app was for their own use or if they had asked students to use
the app (see Fig. 3). There were 19 other options named, not shown in the figure:
Skype (2), Facebook (2), Feedly (1), Viber (1), Kahootz (1), Trello (1), Kindle (2), and
Google apps (9). The same respondents were asked about the specific aspects of their
teaching practice the apps were used for (see Fig. 4). Twenty-five of 36 had also asked
their students to use mobile apps (for purposes see Fig. 5).
178 A. Hinze et al.
[Figure: apps used for teaching/supervision (Dropbox, Zotero App, Evernote, EndNote App, Notability, OneDrive, RefMe, Browzine), each shown for "Own use of app" and "Asked students to use app"]
Apps for Research. Eighty of the 90 app-using respondents did so for research
purposes. They were asked what mobile apps they had used for research, with results
summarized in Fig. 6. The 40 others include Mendeley (3), ToDo (1), Keynote (1),
iBook (2), Spotify (1), Facebook (1), Skype (4), Compass (1), Trello (1), Mindmeister
(1), NoteIt (1), and Google apps (17). The research purposes are summarized in Fig. 7.
[Figure: mobile apps used for research (Dropbox, Evernote, OneDrive, Zotero App, EndNote App, Notability, RefMe, Browzine, Other); axis from 0 to 60]
[Figure: research purposes (Storage, Sharing documents with others, Searching for information, Note taking, Research planning, Data collection, Referencing, Communication, Presentation of information, Data analysis, Publishing, Other); axis from 0 to 50]
Impact of Apps on Academic Experience. The users of apps for academic purposes
rated the impact of the app usage (see Fig. 8). Nearly 80% felt their academic
activity had benefitted from mobile apps. Half the users believed their academic
activity had been conducted differently as a consequence of using apps. Eighteen
percent had experienced difficulties.
Fig. 8. Impact of app use: from strongly agree (dark) to strongly disagree (blue) (Color
figure online). Statement shown: “I know where to go to get help with apps” (21%, 26%,
20%, 24%, 9%).
Additional Factors. Thirty-eight responses were received covering instructional
support, (in)convenience, technical aspects, pedagogical and contextual viewpoints.
Several respondents were neutral regarding the inclusion of mobile apps into their
academic practice (“I just used the camera. No big deal”). Five respondents mentioned
the need or benefit of training (“Would be great to get some training on this”, or “It
would be great if there was some sort of online resource on the uni website that lists
and briefly explains some of the apps that might be useful when conducting research”).
These respondents indicated that their ability to place context or pedagogical potential
around the use of apps was dependent upon their understanding of the app
functionality, for example, “I can see that the use of apps will increase in line with
predictions of increased usage of web-connected devices. The challenge will be to
develop apps or modify existing apps to suit the purpose of the user and the context of
the user”. Four respondents wished for an app to gain access to Library resources. Some
respondents were very positive about the potential of apps in the academic environment
(“We are moving into the new generation of Apps is the tool to connect with the
students. Let’s not hesitate. We need to be engaging successfully to create a sense of
new age”).
The main findings indicate limited use of mobile apps with a stated preference for
organisational provision of information, training and support to enable greater
engagement. This has implications for support areas of the university, including library
service planning and delivery, and their involvement in academic information
behaviour and information management, including the use of mobile apps.
core of mobile app activity is occurring across the university which may be built upon
and which would benefit from a platform of co-ordinated support. Further research in
this area is required to strengthen the recommendations possible from the snapshot
results. Here we list the main findings:
Apps for Research. Where mobile apps were used, most participants had used apps
for research, with the majority being post-graduate students. The main purposes were
storage, document sharing, searching, referencing and note taking. While nearly 30%
had used mobile apps for data collection, only eight percent had moved beyond this to
analyse their data in this manner.
From this study the reasons for this lack of use of apps during the research planning
and research analysis phases are unclear. However, comparison between the results of
app user and non-user respondents reveals that both groups preferred apps enabling
document sharing, communicating and note taking. It is interesting to note this
mirroring of preference for app functionality. Additionally, neither app users nor
non-users expressed a strong preference for data analysis, referencing, or
presentation apps. This coincidence of preference may reflect the identified lack of
support and training available across the university campus.
Apps for Supervision/Teaching. For teaching/supervision purposes, there was a clear
preference for apps for communication or document and data sharing with colleagues,
and for storage. Some of the apps were used for both teaching and research purposes.
Academic staff used apps for teaching/supervision (26%) to almost the same degree as
for their research activities (30%). Teachers/supervisors asked their students to use apps
mainly for the purposes of communicating and sharing information. Apps for planning
were barely used, as were apps for research tasks such as reviewing literature, data
collection or analysis. Responses indicate that use of apps in both teaching and research
practices focused upon the purposes of sharing documents, storage and communication
with colleagues. It is, therefore, unsurprising that teachers/supervisors requested their
students to engage in app usage for similar purposes, rather than venturing into areas of
app use with which they, themselves, were unfamiliar. This indicates that students
collecting field data for course work were expected to do so using traditional tools and
techniques.
More Support Requested by Non-users and Users. Among those not considering
apps, lack of knowledge was the primary stumbling block, followed by a lack of
interest. They also challenged the university to determine the most useful apps and how
best to use them effectively. Potential users were nearly all interested in having more
appropriate or easier to use apps available, indicating that this group of respondents had
attempted to access or use apps in the past but had been discouraged. Potential app
users also wanted more practical support for finding and using apps. It appears that
non-users could move to mobile app use if they had access to information and support
on technical specifications and purpose or application. There remains a need to convince
them of the overall usefulness of mobile apps “to suit the [academic] purpose and the
context of the user”.
Those respondents using apps for academic purposes had a positive attitude: nearly
80% perceived a benefit from app use. The majority did not encounter difficulties;
however, fewer than half the users knew where to go to get sufficient help. Only half
had found the experience of locating a suitable app for their teaching or research to be
problem-free. One participant observed that “many of the apps I now use would have
been extremely useful had I known about them when I began this degree.”
Impact Needs Further Study. Fifty percent perceived a change in research conduct
and almost as many felt their teaching was impacted. This is an area that would benefit
from further study to gather empirical evidence on the application of technology to
traditional pedagogies or research methodologies and processes.
5.2 Implications
This study provides a small snapshot of the current state of mobile app use across a
university. The following implications arise from this study and are offered for
consideration:
• The data indicates that academic staff and students involved in using mobile apps
are personally driven and motivated rather than supported by clearly-planned,
identified and integrated infrastructure across the institution.
• While some aspects of using apps for communication were reported, the majority of
usage was related to management of documents, text, and data. This indicates an
opportunity to frame and explore academic app use as an issue of personal
information management. It may also indicate a need to explore scholarly workflows
and the role apps could play if their use were embraced and supported by the
academic institution.
• An introduction to the possibilities and limitations of mobile apps, provided by the
institution for non-users, may serve to increase the uptake of tools during teaching
and research.
• There are implications for the way in which support areas, such as libraries, are
keeping abreast of initiatives and developing trends across the institution. To ensure
teaching and learning is occurring effectively, identified information management
and digital literacy support needs to be interwoven from the earliest stages of
planning.
• It is institutional strategy to invest in innovative applications of digital technology
in research and teaching. Apps are currently underutilised for academic endeavour.
A coordinated approach is needed to enable digital technology acceptance to
transform digital innovation in education.
6 Conclusion
Some indicators were drawn from the survey as outlined above and they serve the useful
purpose of guiding future work in this area. Mobile apps are being used by teachers and
researchers to a limited degree, both in staff numbers and in range of mobile apps, and
there is a clearly identified need for a strong platform of support for staff and students.
It appears that non-users would consider using mobile apps if there were suitable apps
available and if training or support were offered. Similarly, app users expressed that they
would welcome more information and guidance. We propose that libraries, particularly
academic libraries, are in a position to address this particular problem. Today, libraries
and librarians are uniquely placed to provide patrons with the means to acquire,
manage and develop relationships with information in digital form, such as via mobile
apps. Investigation into best practices around the provision of this next generation of
support is required. Mobile apps were more likely to be used for research than teaching
purposes, but for both practices the ability to communicate, collaborate and share with
others was the primary motivator for use. Users were able to perceive the benefit of
including mobile apps in their teaching or research practice but were uncertain as to the
impact of the apps upon the conduct or outcomes of their practice.
The present snapshot indicates a tertiary education environment experimenting with
technology within teaching and research practices. The use of mobile apps is an
essential component of digital literacy and has huge potential for changing teaching and
research practice. The responses of our participants indicate that both individual and
shared workflows in the field, the classroom, and the office may be enhanced by these
mobile apps, should appropriate digital literacy programmes be present to enable
effective use within teaching and research. However, the survey highlights that
addressing the needs of users and potential users of mobile apps for academic purposes
is an area yet to be fully explored. A larger study of academic use of mobile apps is
currently underway, with additional universities to be invited in future.
References
Anderson, M.: Technology device ownership: 2015. Report by Pew Research Center (2015).
http://www.pewinternet.org/2015/10/29/technology-device-ownership-2015/
Beddall-Hill, N.L., Jabbar, A., Al Shehri, S.: Social mobile devices as tools for qualitative
research in education: iPhones and iPads in ethnography, interviewing, and design-based
research. J. Res. Cent. Educ. Technol. 7(1), 67–90 (2011)
Bowen, K., Pistilli, M.D.: Student preferences for mobile app usage. Educause Research Bulletin
(2012)
Carlos, A.: Research on the go: Mobile tools for conducting research. Ref. Librarian 53(4), 433–
440 (2012)
Carter, A., Liddle, J., Hall, W., Chenery, H.: Mobile phones in research and treatment: ethical
guidelines and future directions. JMIR Mhealth UHealth 3(4), e95 (2015)
Davidson, J., Paulus, T., Jackson, K.: Speculating on the future of digital tools for qualitative
research. Qual. Inq. 22(7), 606–610 (2016)
Garcia, B., Welford, J., Smith, B.: Using a smartphone app in qualitative research: the good, the
bad and the ugly. Qual. Res. 16(5), 508–525 (2016)
Goble, E., Austin, W., Larsen, D., Kreitzer, L., Brintnell, E.: Habits of mind and the split-mind
effect: when computer-assisted qualitative data analysis software is used in phenomenological
research. Forum: Qual. Soc. Res. 13(2) (2012)
Hahn, J.: Undergraduate research support with optical character recognition apps. Ref. Serv. Rev.
42(2), 336–350 (2014)
Hennig, N.: Apps for librarians: using the best mobile technology to educate, create and engage.
Libraries Unlimited (2014)
Hinze, A., Vanderschantz, N., Timpany, C., Cunningham, S.J., Saravani, S.-J., Wilkinson, C.:
Use of mobile Apps for Teaching and Research, Working paper 01/2017, University of
Waikato (2017)
MacNeill, F.: Approaching apps for learning, teaching and research. In: Middleton, A. (ed.)
Smart Learning: Teaching and Learning with Smartphones and Tablets in Post Compulsory
Education, pp. 238–264 (2015)
Makori, E.O., Mauti, N.O.: Digital technology acceptance in transformation of university
libraries and higher education institutions in Kenya. Library Philos. Pract. 0_1, 1–20 (2016)
McGeeney, K.: What we learned about surveying with mobile apps. Report by Pew Research
Center, April 2015. http://www.pewresearch.org/fact-tank/2015/04/02/what-we-learned-
about-surveying-with-mobile-apps/
Pew Research Center: App vs. web for surveys of smartphone users: experimenting with mobile
apps for signal contingent experience sampling method surveys (2015). http://www.
pewresearch.org/2015/04/01/app-vs-web-for-surveys-of-smartphone-users/
Raento, M., Oulasvirta, A., Eagle, N.: Smartphones: an emerging tool for social scientists. Sociol.
Meth. Res. 37(3), 426–454 (2009)
Schepman, A., Rodway, P., Beattie, C., Lambert, J.: An observational study of undergraduate
students’ adoption of (mobile) note-taking software. Comput. Hum. Behav. 28, 308–317
(2012)
van Arnhem, J.-P.: Apps and gear for ethnographic field research. Charleston Advisor 17(2), 58–
64 (2015)
Wong, S.H.R.: Which platform do our users prefer: website or mobile app? Ref. Serv. Rev.
40(1), 103–115 (2012)
Motivational Difference Across Gameplay
Mechanics: An Investigation in Crowdsourcing
Mobile Content
Abstract. The convergence of crowdsourcing and gaming has led to the rise of
new game genres that leverage the collective intelligence of online players.
These are called crowdsourcing games, and they have become a viable option
for garnering georeferenced metadata for digital library projects. Understanding
the phenomenon of these games requires consideration of gameplay mechanics
and their effects on players’ motivations. Given the scarcity of research in this
area, this study investigates how gameplay mechanics—collaboration and
competition—influence motivations for playing and sharing mobile content. We
conducted a between-subjects experiment using a non-game app and two
virtual-pet-themed games with the collaborative and competitive mechanics
respectively. Results indicate that crowdsourcing games lead to a higher level of
enjoyment, immersion, and socializing. Moreover, the collaborative and com-
petitive games were found to differ with respect to achievement, relaxation, task
efficiency, and skills development.
1 Introduction
With the prevalence of mobile devices providing location awareness and Internet
connectivity, people can now perform crowdsourced tasks anytime and anywhere, and
accordingly, crowdsourcing in the mobile or location-based context is likely to
accelerate. Additionally, being able to carry mobile phones gives people the oppor-
tunity to perform crowdsourced tasks that require physical presence. For instance,
people can instantly document their experience at/about their current locations. Such
documentation serves not only as a personal record but also as real-time information
and history about the associated locations that other people can retrieve and learn from [17].
Although this approach is promising, a more engaging experience would help attract users and
encourage sustained usage.
The game-based approach to crowdsourcing could be a viable alternative for two
main reasons. First, games are no longer an entertainment medium solely for young
males and have finally achieved a critical mass of players. This is exemplified in recent
statistics published by the Entertainment Software Association, which suggest that the
average game player is 35 years old and that 41% of players are female [7]. Second, it is
believed that tasks that require human perceptual capability and creativity can be better
accomplished by collaboration between humans and computers than performed
alone by either party. Accordingly, games are used in the crowdsourcing context in
which players perform a given crowdsourced task through enjoyable gameplay, known
as crowdsourcing games, human computation games or games with a purpose [6, 9, 21].
These games are used in various application areas to harness human intelligence,
including the creation of metadata for online images and videos, sentiments of given
statements, shapes of proteins, annotations of real-world locations, and many more (e.g.,
[1, 3, 10, 17]). Hence, beyond the traditional approach of recruiting users, crowd-
sourcing enables the digital libraries community to reach out to potential participants to
tackle library-related problems on a larger scale.
One important feature of games is their gameplay mechanics, which refer to the set
of rules driving how players behave in a game [20]. Collaboration and competition are
two commonly used gameplay mechanics [24], and the specific behavior induced by
such mechanics may affect players’ motivations. Hence, understanding players’
motivations is an important first step toward developing games that cater to the needs
of potential players. Furthermore, developing crowdsourcing games that attract users
and sustain their usage imposes another challenge because such games serve dual
purposes—generating output and entertaining [9]. More specifically, players’ motiva-
tions may arise from their desire to fulfill either or both purposes.
This study argues that research on gameplay mechanics and players’
motivations is important for two primary reasons. First, the empirical evidence on the
effect of such mechanics is limited. It is necessary to investigate whether incorporating
games into crowdsourced tasks can better motivate players and, if so, how different
gameplay mechanics vary in influencing motivations derived from the dual purposes of
crowdsourcing games. Second, findings from prior research on games for pure
entertainment (e.g., [18, 23]) may not be readily applicable to crowdsourcing games
because of the contextual difference. These games are unique in that they blend gaming
with content creation; hence, this dual purpose might interplay in influencing players’
motivations.
Motivational Difference Across Gameplay Mechanics 187
2 Background
segment by placing markers for their chosen categories, such as food, café, and so on.
The player who creates the most categories wins the game.
The effects of collaboration and competition have been examined in several con-
texts including learning and entertainment (e.g., [20, 22]). Collaboration is believed to
promote positive behaviors, which in turn influence enjoyment and performance in the
task performed, whereas competition is said to promote negative behaviors and out-
comes [24]. In entertainment-oriented games with competition as their central element,
players were found to be motivated primarily by achievement [8]. In contrast, col-
laborative situations may promote mutual affiliation and social interaction among
players [23]. Hence, depending on the gameplay mechanics used, crowdsourcing
games may differ in affording motivations. Understanding the effects of gameplay is
important because its misuse may hinder players’ motivation to participate, thereby
diminishing performance and content quality. This study therefore investigates the
potential differences in motivations afforded by collaborative and competitive game-
play mechanics in crowdsourcing mobile content.
2.2 Motivations
Motivation refers to a psychological state that directs individuals’ actions toward a
desired goal [10]. Prior research (e.g., [16, 18]) has regarded several basic human needs
as motivators for playing games. These include the need for autonomy, competence,
relatedness, achievement, power, and affiliation. As these needs are innate to human
beings, it can be argued that games that fulfill psychological needs are enjoyable for
players. Based on a study of players of a massively multiplayer online game, Yee [23]
identified three categories of motivations: achievement, socializing, and immersion.
In the context of content sharing, motivation pertains to the desire or willingness of
an individual to contribute content while he/she is on the move [9]. Several motivations
for sharing content were reported in prior studies, including entertainment, relationship
maintenance, information discovery, relaxation, socializing, task performance, com-
petition, and self-presentation (e.g., [4, 8, 10, 14]). Furthermore, altruism, self-efficacy,
sense of community, and causal importance were identified as significant motivators for
participation in crowdsourcing [11].
Taking an integrative perspective, this study considers motives relevant to both
playing and sharing content in games for crowdsourcing mobile content. Research in
this area is important for the design of game-based apps that better motivate desired
behaviors. Therefore, a key contribution of this study is to shed light on the relation-
ships between two gameplay mechanics—collaboration and competition—and moti-
vation for playing and sharing in crowdsourcing mobile content.
3 Method
similar purpose of collecting location-based data, which are known as comments in our
apps. Each comment comprises title, tags, descriptions, media elements, and ratings.
Using custom-developed apps enables us to have better control over the look and feel
of the interfaces and the accessibility of the collected crowdsourced data.
Our apps are designed for people to contribute content about real-world locations
while being entertained by the gameplay. Each app is built around a core design that
uses Google Maps and the GPS functionality on an Android phone. As people move
around, they can browse the map, which is overlaid with mushroom houses, indicating
places in their vicinity (Fig. 1). People can tap on houses to see several units, each of
which holds comments that have been created. In the game-based apps, players create
comments by means of feeding the virtual pet (i.e., in-game character), which lives in
each unit, whereas users submit comments in the non-game app.
People can rate each other’s comments in the apps. This rating feature of our apps
serves as a quality control mechanism, where highly rated comments are socially
acceptable. Further, our apps use a visualization technique to facilitate the quality
judgment of crowdsourced data. The appearances of the pet or houses are dependent on
four attributes of data—quantity, quality, sentiment, and recency. Specifically, the sizes
of pets and houses are determined by the amount of content while their color is
dependent on ratings. The mood of pets and weather around the houses depend on the
number of positive or negative comments, and the age of the pets is determined by the
creation dates of comments. Comments and ratings generated by people are updated in
real time to reflect participants’ ongoing activities in the apps.
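The mapping from data attributes to pet appearance described above can be illustrated with a minimal sketch. The paper names the four attributes (quantity, quality, sentiment, recency) and the features they drive, but it gives no thresholds. Every cut-off, label, and name below (`Comment`, `pet_appearance`, the "young"/"old" reading of recency) is therefore a hypothetical choice made for illustration only, not the authors' implementation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Comment:
    rating: float   # mean star rating received, 1 to 5
    sentiment: int  # +1 positive, -1 negative, 0 neutral
    created: date


def pet_appearance(comments, today):
    """Map quantity, quality, sentiment and recency to pet features.

    All cut-offs are hypothetical; the paper does not report them.
    """
    if not comments:
        return {"size": "small", "colour": "pale", "mood": "neutral", "age": "old"}
    quantity = len(comments)
    quality = sum(c.rating for c in comments) / quantity
    sentiment = sum(c.sentiment for c in comments)
    days_since_newest = (today - max(c.created for c in comments)).days
    return {
        "size": "large" if quantity >= 10 else "medium" if quantity >= 3 else "small",
        "colour": "bright" if quality >= 4 else "normal" if quality >= 2.5 else "pale",
        "mood": "happy" if sentiment > 0 else "sad" if sentiment < 0 else "neutral",
        "age": "young" if days_since_newest <= 7 else "old",
    }


today = date(2017, 11, 13)
sample = [
    Comment(4.0, 1, date(2017, 11, 12)),
    Comment(5.0, 1, date(2017, 11, 10)),
    Comment(4.5, -1, date(2017, 10, 1)),
]
look = pet_appearance(sample, today)
```

Keeping the mapping in one pure function means the same rules could drive both pets and houses, with only the rendering differing.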
Share serves as a control, and it uses all the functionality described above except
that it does not have virtual pets. Moreover, participants are not awarded any game
points or rewards for their activities. Instead, they can view statistics, such as the
number of comments and ratings. Share, therefore, serves as a representative app for
crowdsourcing mobile content against which the motivations afforded by the
game-based apps can be compared. Figure 2 shows a list of comments created in Share.
Collabo uses all the functionality described above, but unlike Share, it also
incorporates collaborative mechanics and a points-based system. This game asks
players to search for starving pets in their vicinity and team up with other players to
rescue these pets. The starving pets appear sad and have a darker tone compared to
healthier pets (Fig. 3). The pets become starved if their strength is lower than 50. To
save the pets, players need to feed them with comments or rate those comments created
by others on a five-star scale. Every new comment created increases the pet’s strength
by five, and the rating value (i.e., 1 to 5) is directly added to the strength. Bonus points
are awarded every time a new member joins the team. Once a pet is rescued, the game
allocates an equal amount of points to the team members, and at the same time, a
winning message is displayed (Fig. 4).
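The strength rules of Collabo can be sketched as follows. The starvation threshold of 50, the five points per comment, and the direct addition of the rating value come from the description above. The rescue condition (strength no longer below 50), the size of the reward pool, and all function names are assumptions introduced here for illustration.

```python
STARVING_BELOW = 50    # from the text: pets starve when strength is lower than 50
COMMENT_STRENGTH = 5   # from the text: each new comment adds five to the strength


def is_starving(strength):
    return strength < STARVING_BELOW


def feed_comment(strength):
    return strength + COMMENT_STRENGTH


def feed_rating(strength, stars):
    # The rating value (1 to 5) is added directly to the strength.
    if stars not in (1, 2, 3, 4, 5):
        raise ValueError("rating must be between 1 and 5 stars")
    return strength + stars


def split_points(total_points, team):
    """Give every team member an equal share (the pool size is hypothetical)."""
    share = total_points // len(team)
    return {member: share for member in team}


strength = 38
strength = feed_comment(strength)    # 38 + 5
strength = feed_comment(strength)    # 43 + 5
strength = feed_rating(strength, 3)  # 48 + 3
rescued = not is_starving(strength)  # assumed rescue condition
rewards = split_points(90, ["ann", "bo", "cy"])
```

In this sketch two comments and one three-star rating are enough to lift a pet from 38 to 51, at which point the (assumed) reward pool is divided equally among the three hypothetical team members.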
In Clash, players compete with others for pet ownership. The current pet owner’s
name is shown underneath the pet (Fig. 5). During gameplay, players build up their
strength by creating and rating comments. They can then challenge the pet owner to a
duel (Fig. 6). The challenger wins if the sum of his/her strength and daily luck (i.e., a
random number generated at the first login of each day) is greater than that of the
challenged player. The game also considers the rating value and recency of comments
in calculating players’ strengths, so that the pet is winnable by new players. The game
190 E.P.P. Pe-Than et al.
allows owners to retain the ownership status securely for a 15-minute period. This
feature was included based on the results of the pilot testing in which players felt
frustrated about losing the pet immediately after a win.
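A minimal sketch of the duel rule and the ownership protection window follows. The comparison of strength plus daily luck and the 15-minute period are taken from the description above. The treatment of ties (the owner keeps the pet), the inclusive boundary of the protection window, and the function names are assumptions. How rating value and recency feed into strength is not specified in the text and is left out.

```python
from datetime import datetime, timedelta

PROTECTION = timedelta(minutes=15)  # from the text: ownership is secure for 15 minutes


def challenger_wins(challenger_strength, challenger_luck, owner_strength, owner_luck):
    # The challenger must score strictly more than the owner; a tie keeps the owner.
    return challenger_strength + challenger_luck > owner_strength + owner_luck


def can_be_challenged(owned_since, now):
    # Assumption: the pet becomes challengeable exactly when the window has elapsed.
    return now - owned_since >= PROTECTION


won_at = datetime(2017, 11, 13, 10, 0)
too_early = can_be_challenged(won_at, won_at + timedelta(minutes=10))
allowed = can_be_challenged(won_at, won_at + timedelta(minutes=15))
tie = challenger_wins(40, 7, 42, 5)
win = challenger_wins(40, 8, 42, 5)
```

Because daily luck is random, a weaker challenger can still win on a lucky day, which matches the stated aim of keeping the pet winnable by new players.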
Fig. 1. Places on the map overlaid with houses.
Fig. 2. A list of comments in Share.
Fig. 3. A list of pets residing in a location.
3.2 Participants
Seventy-three participants (42 males and 31 females) were recruited from two local
universities. Their ages ranged from 21 to 35 years. Among our participants, 19.18%
had a background in computer science or information technology, 49.32% were
from engineering disciplines, and the remainder were from disciplines such as arts,
humanities, social sciences, education, and business. Further, 63.01% had used
social network apps, and 45.21% had used the location check-in feature of such apps.
The majority of participants (86.30%) used mobile phones to share multimedia
information. Additionally, slightly above half of the participants (54.79%) had
experience with online games, and 17.81% reported being regular players.
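As a reading aid, the reported percentages can be converted back into head counts for the 73 participants. The counts below are inferred from the percentages and are not reported in the paper. The labels are shorthand introduced here.

```python
N = 73  # total participants reported in the text

reported_pct = {
    "computer science or IT background": 19.18,
    "engineering disciplines": 49.32,
    "used social network apps": 63.01,
    "used location check-in": 45.21,
    "shared multimedia by phone": 86.30,
    "experience with online games": 54.79,
    "regular players": 17.81,
}

# Nearest whole number of participants implied by each percentage.
inferred_counts = {label: round(pct / 100 * N) for label, pct in reported_pct.items()}

# Each inferred count should reproduce the reported percentage to two decimals.
consistent = all(
    abs(100 * inferred_counts[label] / N - pct) < 0.005
    for label, pct in reported_pct.items()
)
```

Every percentage corresponds to a whole number of participants (for example, 19.18% of 73 is 14 people), so the figures are internally consistent with a sample of 73.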
3.4 Measures
The dependent variables of this study were motivations for playing and sharing content,
and the independent variable was the app type. All question items were drawn from
previous studies (e.g., [9, 14, 16]) and adapted to suit the study’s context, and they
were all measured on a 5-point Likert scale ranging from 1 (strongly disagree) to 5
(strongly agree). A total of 15 question items were used to measure motivations for
playing. These are: achievement–individuals’ desire to win and make progress;
4 Results
Table 1 shows the means and standard deviations of the study’s dependent variables.
One-way analysis of variance (ANOVA) was performed on these variables. The results
indicated that there were significant differences with respect to the following con-
structs: achievement [F(2,70) = 13.67, p < 0.001], socializing [F(2,70) = 7.55,
p < 0.001], relaxation [F(2,70) = 3.91, p < 0.05], enjoyment [F(2,70) = 5.13,
p < 0.01], immersion [F(2,70) = 6.73, p < 0.01], task efficiency [F(2,70) = 4.26,
p < 0.05], competitive play [F(2,70) = 7.40, p < 0.001], and improving skills
[F(2,70) = 5.36, p < 0.001]. There were, however, no statistically significant differences
among the three apps for altruism [F(2,70) = 0.002, p = 0.99] and social influence
[F(2,70) = 0.21, p = 0.81].
Table 1. Means and standard deviations for participants’ motivations for playing and sharing.
Constructs           Collabo (N = 22)   Clash (N = 27)     Share (N = 24)
                     M       SD         M       SD         M       SD
Achievement**        2.71    0.87       3.74    0.74       2.72    0.83
Socializing*         3.57    0.95       3.48    0.97       2.58    1.01
Relaxation**         3.07    1.01       3.69    0.77       3.07    0.98
Enjoyment*           3.43    1.18       3.35    0.94       2.57    0.98
Immersion*           3.67    1.10       3.46    0.67       2.72    1.01
Altruism             3.74    0.86       3.75    0.74       3.75    0.62
Task efficiency**    4.10    0.71       3.56    0.81       3.57    0.66
Competitive play**   2.90    1.08       3.60    0.65       2.74    0.83
Improving skills**   3.01    0.94       3.74    0.81       2.93    1.16
Social influence     3.25    1.22       3.35    1.01       3.44    0.88
Note. ** Statistically significant difference between games; * statistically significant
difference between games and the non-game app.
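The one-way ANOVA F statistic can be approximated from the group sizes, means, and standard deviations in Table 1 alone. The sketch below does this for achievement; the function name is introduced here, and because the table values are rounded to two decimals the result can only approximate the reported F(2,70) = 13.67.

```python
def anova_from_summary(groups):
    """One-way ANOVA F from (n, mean, sd) triples, one triple per group."""
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / total_n
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    # Within-group sum of squares recovered from each group's sample variance.
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Achievement row of Table 1: Collabo, Clash, Share.
achievement = [(22, 2.71, 0.87), (27, 3.74, 0.74), (24, 2.72, 0.83)]
f_value, df_between, df_within = anova_from_summary(achievement)
```

With three groups and 73 participants the degrees of freedom are 2 and 70, matching the notation used in the text, and the recomputed F for achievement comes out at roughly 13.6.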
Post-hoc comparisons using Tukey’s test were then conducted, which uncovered
the following results.
• Achievement. Participants who played Clash (M = 3.74) were more satisfied with
their achievement in the game, compared to those who used either Collabo
(M = 2.71) or Share (M = 2.72).
• Socializing. Participants reported that they were better able to socialize with others
when playing Collabo (M = 3.57) and Clash (M = 3.48) than the non-game app,
Share (M = 2.58). No significant difference was found between games.
• Relaxation. In contrast to socializing, participants stated that they were more likely to
play Clash (M = 3.69) for relaxation than Collabo or Share.
• Enjoyment. Participants who played Collabo (M = 3.43) and Clash (M = 3.35)
reported experiencing a higher level of enjoyment than those who used Share (M = 2.57).
• Immersion. Similar to enjoyment, participants were more likely to get immersed in
Collabo (M = 3.67) and Clash (M = 3.46) than in Share (M = 2.72).
• Task efficiency. With regard to information seeking and retrieval, Collabo
(M = 4.10) was more likely to be used by participants than Clash and Share.
• Competitive play. As expected, Clash (M = 3.60) outperformed Collabo and
Share in fostering a sense of competition among players.
• Improving skills. Again, participants reported that they were more likely to use
Clash (M = 3.74) to improve their ability in commenting about locations.
• Social influence. There were no statistically significant differences in perceived
social influence across the three apps: Collabo (M = 3.25), Clash (M = 3.35), and
Share (M = 3.44).
5 Discussion
Our results suggest that collaborative and competitive gameplay mechanics differ in
affording motivations with respect to achievement, socializing, relaxation, enjoyment,
and immersion. In particular, the competitive game, Clash, was better able to foster a
sense of accomplishment or achievement among players. This finding may imply that
players place greater value on a reward obtained by edging out other players
because such rewards are unique to them. In Clash, participants had to rely on their
own efforts to outperform the current pet owners to win the game, whereas in Collabo, the
rewards were equally distributed across players, perhaps diluting the sense of
achievement. Next, participants of both Collabo and Clash were more motivated to
socialize with others than those who used the non-game app. Games for crowdsourcing
mobile content can use either collaborative or competitive mechanics to garner
meaningful outputs as a by-product of interaction among players.
Participants also reported that they were more likely to use Clash as a means to
escape from real-life stress compared to Collabo and Share. This is interesting because
competition is a win-or-lose dichotomy that can create tensions and stressful
situations [22]. Perhaps because competition demands attention and involvement,
participants experienced a greater sense of satisfaction when they were close to
winning Clash or after a win, thereby leading to a more relaxing experience. Finally,
6 Conclusion
This study examined the motivational differences between gameplay mechanics in the
context of crowdsourcing mobile content. Findings of this study also provide several
implications for research and practice. First, this study adds evidence to the differential
effects of collaboration and competition on motivations in the crowdsourcing context.
Depending on gameplay mechanics, games were found to perform differently in
motivating players to play and share content. This finding informs
researchers of the need to investigate potential factors that influence motivations
as well as the interplay between these factors. Second, while this study was conducted
in games for crowdsourcing mobile content, our results may be generalizable to other
contexts that use collaboration and competition to drive engagement and participation.
For instance, to motivate people to engage in a task (e.g., learning, physical activity),
they can be given a goal to compete or collaborate with others. Based on our findings,
people in the competitive setting may try to improve their skills to achieve the goal. In
contrast, people in the collaborative setting may be more likely to accomplish the task
through interaction with others. Therefore, designers need to be aware of the tradeoff
between the choice of collaboration and competition, and consider how to best utilize
them.
Although our work yields important findings, it is not without limitations that offer
opportunities for future research. First, this study relied on two commonly used
gameplay mechanics. Alternative classifications of games exist, which include
game genres, such as adventure and simulation as well as hybrid games that use a
combination of collaborative and competitive mechanics [16]. Therefore, future
research may investigate the differential effects of a larger set of gameplay mechanics
on motivation. Second, given the motivational differences between games, this study
calls for future research to examine whether different mechanics attract different content
types. Third, the characteristics of the sample pose further limitations. Participants in
this study were primarily undergraduate and graduate students from two local
universities. A more diverse sample would help validate the study’s findings. Finally, this study
was conducted in a single domain: mobile content creation. Different tasks may
demand varying levels of cognitive abilities; hence, they may yield different
perceptions. Further studies of other domains are needed to verify the generalizability of our
findings. Nevertheless, our findings provide insight into how collaboration and
competition affect players’ motivations in games for crowdsourcing mobile content.
Given the growing popularity of apps that blend games with content creation, this
study’s findings augur well for digital libraries that wish to use game-based crowd-
sourcing to tackle library-related problems.
References
1. Casey, S., Kirman, B., Rowland, D.: The gopher game: a social, mobile, locative game with
user generated content and peer review. In: Proceedings of the International Conference on
Advances in Computer Entertainment Technology, ACE 2007, Salzburg, pp. 9–16. ACM
(2007)
2. Chen, Y., Ghosh, A., Kearns, M., Roughgarden, T., Vaughan, J.W.: Mathematical
foundations for social computing. Commun. ACM 59, 102–108 (2016)
3. Cooper, S., Khatib, F., Treuille, A., Barbero, J., Lee, J., Beenen, M., Leaver-Fay, A., Baker,
D., Popović, Z.: Predicting protein structures with a multiplayer online game. Nature
466(7307), 756–760 (2010)
4. Cramer, H., Rost, M., Holmquist, L.E.: Performing a check-in: emerging practices, norms
and ‘conflicts’ in location-sharing using foursquare. In: Proceedings of the 13th International
Conference on Human Computer Interaction with Mobile Devices and Services, MobileHCI
2011, Stockholm, pp. 57–66. ACM (2011)
5. Djaouti, D., Alvarez, J., Jessel, J.P., Methel, G., Molinier, P.: A gameplay definition through
videogame classification. Int. J. Comput. Games Technol. 2008 (2008). Article No. 4
6. Doan, A., Ramakrishnan, R., Halevy, A.Y.: Crowdsourcing systems on the World-Wide
Web. Commun. ACM 54, 86–96 (2011)
7. ESA: Essential facts about the computer and video game industry (2016). Retrieved
5 February 2017. http://www.theesa.com
8. Espinoza, F., Persson, P., Sandin, A., Nyström, H., Cacciatore, E., Bylund, M.: GeoNotes:
social and navigational aspects of location-based information systems. In: Abowd, G.D.,
Brumitt, B., Shafer, S. (eds.) UbiComp 2001. LNCS, vol. 2201, pp. 2–17. Springer,
Heidelberg (2001). doi:10.1007/3-540-45427-6_2
9. Goh, D.H.L., Ang, R.P., Lee, C.S., Chua, A.Y.: Fight or Unite: investigating game genres
for image tagging. J. Am. Soc. Inf. Sci. 62, 1311–1324 (2011)
10. Goh, D.H.-L., Lee, C.S., Low, G.: “I played games as there was nothing else to do”
Understanding motivations for using mobile content sharing games. Online Inf. Rev.
36, 784–806 (2012)
11. Goncalves, J., Hosio, S., Rogstadius, J., Karapanos, E., Kostakos, V.: Motivating
participation and improving quality of contribution in ubiquitous crowdsourcing. Comput.
Netw. 90, 34–48 (2015)
12. Ho, C.J., Chang, T.H., Lee, J.C., Hsu, J.Y.J., Chen, K.T.: KissKissBan: a competitive human
computation game for image annotation. In: Proceedings of the ACM SIGKDD Workshop
on Human Computation, HCOMP 2009, Paris, pp. 11–14. ACM (2009)
13. Holley, R.: Crowdsourcing: How and why should libraries do it? D-Lib Mag. 16(3/4)
(2010)
14. Lee, C.S., Goh, D.H.L., Chua, A.Y., Ang, R.P.: Indagator: investigating perceived
gratifications of an application that blends mobile content sharing with gameplay. J. Am.
Soc. Inf. Sci. 61, 1244–1257 (2010)
15. Matyas, S., Matyas, C., Schlieder, C., Kiefer, P., Mitarai, H., Kamata, M.: Designing
location-based mobile games with a purpose: collecting geospatial data with CityExplorer.
In: Proceedings of the 2008 International Conference on Advances in Computer
Entertainment Technology, ACE 2008, Yokohama, pp. 244–247. ACM (2008)
16. Pe-Than, E.P.P., Goh, D.H.L., Lee, C.S.: Making work fun: investigating antecedents of
perceived enjoyment in human computation games for information sharing. Comput. Hum.
Behav. 39, 88–99 (2014)
17. Procyk, J., Neustaedter, C.: GEMS: the design and evaluation of a location-based
storytelling game. In: Proceedings of the 17th ACM Conference on Computer Supported
Cooperative Work and Social Computing, CSCW 2014, Baltimore, pp. 1156–1166. ACM
(2014)
18. Przybylski, A.K., Rigby, C.S., Ryan, R.M.: A motivational model of video game
engagement. Rev. Gen. Psychol. 14, 154 (2010)
19. Quinn, A.J., Bederson, B.B.: Human computation: a survey and taxonomy of a growing
field. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems,
CHI 2011, pp. 1403–1412. ACM, Vancouver, May 2011
20. Siu, K., Zook, A., Riedl, M.O.: Collaboration versus competition: design and evaluation of
mechanics for games with a purpose. In: Proceedings of the 9th International Conference on
the Foundations of Digital Games, FDG 2014 (2014)
21. von Ahn, L., Dabbish, L.: Designing games with a purpose. Commun. ACM 51, 58–67
(2008)
22. Waddell, J.C., Peng, W.: Does it matter with whom you slay? The effects of competition,
cooperation and relationship type among video game players. Comput. Hum. Behav.
38, 331–338 (2014)
23. Yee, N.: Motivations for play in online games. Cyberpsychology Behav. 9, 772–775 (2006)
24. Zagal, J.P., Rick, J., Hsi, I.: Collaborative games: lessons learned from board games. Simul.
Gaming 37, 24–40 (2006)
Search Results Presentation
and Visualization
Interactive Displays for the Next Generation
of Entity-Centric Bibliographic Models
1 Introduction
Libraries worldwide are in the process of adopting the next generation of biblio-
graphic information models to meet the expectations of modern end users, sup-
port new ways of search and exploration as well as increase the long-term value of
the data. The E-R model of bibliographic entities defined in the IFLA Functional
Requirements for Bibliographic Records (FRBR) [1] – soon to be superseded by
IFLA Library Reference Model (IFLA LRM) [22] – represents a major transition
from the record-oriented digital card catalog to entity-centric catalogs with rich
and semantically well-defined structures of entities and relationships. The core
entities introduced in FRBR (work, expression, manifestation and item) have
slowly made their way into the common understanding of the bibliographic uni-
verse and are now aligned with current cataloguing practice (Resource Descrip-
tion and Access – RDA) [12]. Additional interesting new developments include
BIBFRAME [16], a project exploring new formats for bibliographic data, and
FRBRoo [18], which is the result of the harmonisation of FRBR with CIDOC
CRM [17]. However, the modernization of library catalogs worldwide has been
surprisingly slow and the question of how to best display FRBR or other entity-
centric data in search results remains a challenge [4,14].
Entity-centric bibliographic data, describing intellectual and artistic products
as entities at different levels of abstraction, inherently complicates the process
© Springer International Publishing AG 2017
S. Choemprayong et al. (Eds.): ICADL 2017, LNCS 10647, pp. 199–211, 2017.
https://doi.org/10.1007/978-3-319-70232-2_17
200 T. Aalberg et al.
2 Background
Improved search experience was recognized early as the key contribution of
FRBR [2] and this has been the main motivation for research and experimental
prototypes applying the model [3,6,8,11,20]. Unfortunately, most systems developed
so far are based on existing data, automatically transformed from MARC
records into a FRBR-based representation [15]. Due to missing and inconsistent
information, frbrization is incomplete, resulting in simple pragmatic systems,
focusing only on works and manifestations (such as OCLC Fiction Finder
or data.bnf.fr). Even locating works throughout multiple records in current cat-
alogs is a major challenge as addressed by Carlyle [7]. The actual effect of the
FRBR model on the user experience, or the fundamental design issues that need
to be addressed are thus hard to study in these implementations. The need for
more user studies was recognized by Salaba and Zhang [14], who performed
(1) user evaluation of three FRBR-based catalogs, (2) user participatory design of
a prototype FRBR-based catalog, and (3) user evaluation of the resulting catalog.
The lack of research on how to adapt the display of FRBR-based information
to different contexts is the main motivation for the research presented in this
paper. Our previous work on display of FRBR-based information resulted in the
development of the FRBRVis prototype and extensive user testing [20,21]. There
the focus was on supporting browsing and exploration and choosing the best
visualization technique. The results show that the visualized displays were in general
rated better than the baseline traditional faceted display in all elements
of usability, i.e. efficiency, effectiveness and user experience. The limitation of
that study was that it did not include searching and was focused on graphical
visualizations. What is presented here is a logical continuation and has a focus
on the search experience and result lists presented using UI features that are
commonly found in search interfaces.
Interactive Displays for Bibliographic Models 201
3 Design
FRBR is often presented as a model with a hierarchical structure, but is in reality
a network consisting of typed nodes for the bibliographic entities and typed
links for bibliographic relationships. Each bibliographic entity is described using
attributes – which in a graph context can be defined as typed node values. The
main challenges when implementing search and result display for such data
are (a) how to index and query the data so that a user can retrieve information
relevant to a query, and (b) how to display what is found in order to enable the
user to understand and explore the results.
The BIBSURF system utilizes an indexing strategy based on dividing the
graph into indexing units that loosely correspond to dynamically created meta-
data records which can be indexed using a text search engine. Works, expres-
sions, manifestations, and even agents represent different perspectives of the
same graph and are possible main (or root) entities for such dynamic metadata
records (see Fig. 1). Each created metadata record needs to include the attribute
values from the main entity as well as the attribute values from related entities
that are needed to support querying and retrieval. A dynamic metadata record
for a specific work will e.g. include the attributes of the work such as title and
type of work, as well as the attributes of all related agents such as names. A
search using specific keywords will then return all units for which these key-
words appear in any of the attribute values. Determining the boundaries of an
indexing unit is a question of tuning for precision and recall in the context of
an application. Expanding the graph will add more terms to the indexing unit
with increased recall but possibly reduced precision because of more irrelevant
terms.
[Figure 1: source graph with agents (A1, A2), works (W1, W2), expressions (E1–E3) and manifestations (M1–M3), followed by the work, expression and manifestation indexing units derived from it.]
Fig. 1. Transforming a bibliographic graph into indexing units. The source graph to
the left followed by the subsets used in the indexing of works, expressions and
manifestations, with the main (root) entity in each unit highlighted.
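The indexing strategy described above can be sketched in a few lines of Python. This is a toy illustration only, not the BIBSURF implementation: the graph, the attribute values, the link names and the `max_hops` parameter are all assumptions introduced here, and a plain term-to-entity dictionary stands in for the text search engine.

```python
from collections import defaultdict

# Toy graph: typed nodes with attribute values, and typed links between them.
# Every identifier, attribute value and link name below is made up for illustration.
NODES = {
    "A1": {"type": "agent", "name": "Agatha Christie"},
    "W1": {"type": "work", "title": "The Example Mystery"},
    "E1": {"type": "expression", "language": "English"},
    "E2": {"type": "expression", "language": "Norwegian"},
    "M1": {"type": "manifestation", "publisher": "Sample Press"},
}
LINKS = [
    ("A1", "creator_of", "W1"),
    ("W1", "realized_in", "E1"),
    ("W1", "realized_in", "E2"),
    ("E1", "embodied_in", "M1"),
]


def neighbours(node_id):
    """Yield entities directly linked to node_id, ignoring link direction."""
    for source, _link_type, target in LINKS:
        if source == node_id:
            yield target
        elif target == node_id:
            yield source


def unit_terms(root, max_hops):
    """Collect attribute terms of the root and of entities within max_hops links."""
    seen, frontier = {root}, [root]
    for _ in range(max_hops):
        frontier = [n for f in frontier for n in neighbours(f) if n not in seen]
        seen.update(frontier)
    terms = set()
    for node_id in seen:
        for key, value in NODES[node_id].items():
            if key != "type":
                terms.update(value.lower().split())
    return terms


def build_index(entity_type, max_hops):
    """Map each term to the root entities whose indexing unit contains it."""
    index = defaultdict(set)
    for node_id, attrs in NODES.items():
        if attrs["type"] == entity_type:
            for term in unit_terms(node_id, max_hops):
                index[term].add(node_id)
    return index


def search(index, query):
    """Return root entities whose unit contains every query term."""
    hits = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*hits) if hits else set()
```

In this toy graph, widening the expression units from one hop to two lets a search for the author's name reach both expressions (higher recall), but it also makes the English expression E1 match the term "norwegian" through its sibling expression (lower precision), which is the trade-off noted above.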
A search performed on the index will find the set of units matching the query
and return the identifier of the main entity of the index unit, which then can
be used to construct display units for the result listing. Each display unit is
essentially a subgraph selected for a presentation of the main entity. The choice
of entities to include in the integrated display unit will impact the understanding
202 T. Aalberg et al.
4 Implementation
The BIBSURF system is designed as a generic keyword-based bibliographic
search web interface where a user can enter terms or a phrase in a single field,
and retrieve a ranked listing of units found. The main elements of the user
interface are the search box and the result display. A filtering feature is added
to enable users to refine the listing based on names or categorical values in the
result set. Additional elements in the search interface are oriented towards the
researchers, such as an option to choose between display views, select a ranking
mechanism, and examine the underlying data. The user interface is developed
using the component-based React framework and the React Bootstrap UI-widget
library to create an interactive and responsive front-end.
On the backend side, the system uses the eXistdb1 open source native XML
database utilizing XQuery to produce the search results. The eXist database has
built-in support for full text indexing using the Lucene search engine. Search
is based on an intermediary index of RDF-fragments for each of the index unit
types, mainly because dynamic support for this would add an expensive process-
ing overhead. The technology is chosen to enable rapid development and easy
management, but the same solution can in theory be based on a triple store with
flexible support for full text indexing of RDF such as described in [9].
Our test collections have been created by enhancing and transforming exist-
ing MARC 21 records into rich and well-structured FRBR data coded in RDF
using the RDA vocabularies2 . Records have been retrieved from different library
catalogs using Z39.50, and have been manually enhanced to make the inher-
ent structure more explicit, based on the techniques identified in [10]; e.g. by
adding missing uniform title and relator codes, or coding information in note
fields or responsibility statements using explicit fields. Afterwards, the data has
1 http://exist-db.org/exist/apps/homepage/index.html.
2 http://www.rdaregistry.info.
As the aim of this experiment was to test only the displays, the participants
were only presented with the scenario for each task and a list of results that were
retrieved for the predefined search. Although the interface and the bibliographic
data were predominantly in English, all students who participated in
the study had a high level of English comprehension and were not distracted by
the foreign language. The results in this paper do not include eye tracking data
or participants’ perceptions of the task difficulty.
For each scenario, researchers assigned the following measures to evaluate
how well the display supported users in discovering the correct answer and to
assess participants’ understanding of the displayed entities for the given scenario:
Success score noted whether the participant found the correct answer where
5 = complete success, 3 = partial success and 1 = no success
Description score reflected the quality of the participant’s description regarding
the retrieved set of results: 5 = complete description, 4 = one element of
description missing, 3 = two elements missing; 2 = three elements missing;
1 = no relevant description
Time needed to complete the task (the overall mean task time for each scenario
was afterwards calculated only for those tasks with success score 5 or 3).
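As a worked illustration of the time measure, the following Python sketch computes a mean task time over only those tasks with a success score of 5 or 3. The observations are invented for illustration and are not results from the study; expressing time in seconds is also an assumption.

```python
from statistics import mean

# Hypothetical observations for one scenario: (success score, task time in seconds).
# These values are illustrative only and do not come from the study.
observations = [(5, 42.0), (3, 65.0), (1, 120.0), (5, 38.0)]


def mean_time_successful(obs):
    """Mean task time over tasks with complete (5) or partial (3) success."""
    times = [time for score, time in obs if score in (5, 3)]
    return mean(times) if times else None
```

Excluding unsuccessful attempts keeps abandoned or failed tasks from distorting the average time for a scenario.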
To give context and explain the content of each result listing, we have implemented
a simple notation to show the core W-E-M chains with numbers indicating
how many of each entity type are found in a result set (as shown in
Table 2). The rightmost E is for expressions in the contents listing and the corre-
sponding work count is redundant due to the 1:1 relationship from an expression
to its parent work.
6 Results
Our main research question is focused on the usefulness of each of the displays.
The collected data provide insight into, and support some conclusions about, the
FRBR entity displays. Table 1 shows the final results of our test with end users for
individual scenarios. As our main research question was focused on the usefulness
of the three displays for different user tasks, the results were not analysed from
the viewpoint of overall score per display, but individually for each scenario.
For scenario 1, the success score was highest for the manifestation display,
while the descriptions of the retrieved results were most comprehensive using
the expression display. The low score for the work display and higher scores for
manifestation and expression display reflect the use case scenario where the main
emphasis was to identify manifestations that embody expressions of the work in
a particular language.
Scenario 2 also asked the user to identify manifestations that embody a specific
expression. As shown in Table 1, this scenario is also well reflected in the high
success and description scores that were the same for the manifestation and
expression display. In the work view, some participants had difficulty locating
the sought information as it was displayed among other expressions of the work.
Scenario 3 required the user to focus on the different versions of a chosen
work and the results indicate that the expression view was the most appropriate
for this task.
In contrast to all other use cases, the scores for scenario 4, where the user
was primarily interested in the works of an author, reveal a clear advantage of
the work display, particularly in comparison to the manifestation display. In
manifestation view, participants not only spent more time to identify individual
works, but also made more errors, viewing some expressions (translations) and
manifestations (collections) as new works written by Agatha Christie. A smaller
difference in the scores appeared in scenario 5, but again the results from the
user test, where the highest scores were achieved using expression view, coincide
well with the scenario.
In some scenarios (for example scenario 4 and 2) low or high scores also
correlate with the mean time needed to complete the task, but not in others
(scenario 3). Overall however, it seems that participants needed more time using
the expression display, which might be connected to the fact that such a display is
quite novel to the users (in contrast to manifestation display), but at the same
time gives a longer list of results than the work display.
of works that have parts, manifestations that have parts, or aggregates (e.g.
collections of short murder stories by different authors or text augmented by
illustrations). This is a topic that has been discussed in theory, but real world
experiments are needed to establish best-practice representations and to determine
which entities need to be included and managed in the database to offer specific
functionality, and which do not.
References
1. Standing Committee and IFLA Study Group: Functional Requirements for Bibli-
ographic Records: final report, vol. 19, K.G. Saur (1998)
2. Hegna, K., Murtomaa, E.: Data Mining MARC to Find: FRBR? (2002). http://
folk.uio.no/knuthe/dok/frbr/datamining.pdf
3. Kilner, K.: The AustLit gateway and scholarly bibliography: a specialist imple-
mentation of the FRBR. Cataloging Classif. Q. 39, 87–102 (2005)
4. Yee, M.: FRBRization: a method for turning online public finding lists into online
public catalogs. Inf. Technol. Libr. 24(3), 77–95 (2005)
5. Aalberg, T.: A process and tool for the conversion of MARC records to a normalized
FRBR implementation. In: Sugimoto, S., Hunter, J., Rauber, A., Morishima, A.
(eds.) ICADL 2006. LNCS, vol. 4312, pp. 283–292. Springer, Heidelberg (2006).
https://doi.org/10.1007/11931584_31
6. Ercegovac, Z.: Multiple-version resources in digital libraries: towards user-centered
displays. JASIST 57(8), 1023–1032 (2006)
7. Carlyle, A., Ranger, S., Summerlin, J.: Making the pieces fit: little women, works,
and the pursuit of quality. Cataloging Classif. Q. 46(1), 35–63 (2008)
8. Dickey, T.J.: FRBRization of a library catalog: better collocation of records, leading
to enhanced search, retrieval, and display. Inf. Technol. Libr. 27, 23–32 (2008).
ISSN 07309295
9. Minack, E., et al.: The Sesame LuceneSail: RDF Queries with Full-text Search.
NEPOMUK Technical report (2008)
10. Aalberg, T., Merčun, T., Žumer, M.: Coding FRBR-structured bibliographic infor-
mation in MARC. In: Xing, C., Crestani, F., Rauber, A. (eds.) ICADL 2011.
LNCS, vol. 7008, pp. 128–137. Springer, Heidelberg (2011).
https://doi.org/10.1007/978-3-642-24826-9_18
11. Notess, M., Dunn, J.W., Hardesty, J.L.: Scherzo: a FRBR-based music discovery
system. In: International Conference on Dublin Core and Metadata Applications,
pp. 182–183 (2011)
12. Riva, P., Oliver, C.: Evaluation of RDA as an implementation of FRBR and FRAD.
Cataloging Classif. Q. 50(5–7), 564–586 (2012).
https://doi.org/10.1080/01639374.2012.680848
13. Sielski, K., Walkowska, J., Werla, M.: Methodology for dynamic extraction of
highly relevant information describing particular object from semantic web knowl-
edge base. In: Aalberg, T., Papatheodorou, C., Dobreva, M., Tsakonas, G., Farru-
gia, C.J. (eds.) TPDL 2013. LNCS, vol. 8092, pp. 260–271. Springer, Heidelberg
(2013). https://doi.org/10.1007/978-3-642-40501-3_26
14. Zhang, Y., Salaba, A.: What do users tell us about FRBR-based catalogs? Cata-
loging Classif. Q. 50(5–7), 705–723 (2012)
15. Aalberg, T., Žumer, M.: The value of MARC data, or, challenges of frbrisation. J.
Documentation 69(6), 851–872 (2013)
16. Kroeger, A.: The road to BIBFRAME: the evolution of the idea of bibliographic
transition into a Post-MARC future. Cataloging Classif. Q. 51(8), 873–890 (2013)
17. Ore, C.E., et al.: Definition of the CIDOC Conceptual Reference Model (2015).
http://www.cidoc-crm.org/Version/version-6.2
18. Working Group on FRBR/CRM Dialogue: Definition of FRBRoo: A Concep-
tual Model for Bibliographic Information in Object-Oriented Formalism (2015).
https://www.ifla.org/publications/node/11240
19. Aalberg, T., Merčun, T., Žumer, M.: BIBSURF: discover bibliographic entities by
searching for units of interest, ranking and filtering. In: Proceedings of the 16th
ACM/IEEE-CS on Joint Conference on Digital Libraries, pp. 207–208. ACM, New
York (2016)
20. Merčun, T., Žumer, M., Aalberg, T.: Presenting bibliographic families: designing an
FRBR-based prototype using information visualization. J. Documentation 72(3),
490–526 (2016)
21. Merčun, T., Žumer, M., Aalberg, T.: Presenting bibliographic families using infor-
mation visualization: evaluation of FRBR-based prototype and hierarchical visu-
alizations. J. Assoc. Inf. Sci. Technol. 68(2), 392–411 (2016)
22. Riva, P., Le Bœuf, P., Žumer, M.: IFLA Library Reference Model (LRM) (2017).
https://www.ifla.org/publications/node/11412
Writers of the Lost Paper: A Case Study
on Barriers to (Re-) Finding Publications
1 Introduction
This paper was inspired by the experiences of the authors as we met to plan an
extension to our previous research on semantic search (the Capisco project). An
obvious first step was to review our previously published work, but we quickly realized
that none of us maintain a personal archive; instead, we rely on access to online digital
libraries for the final, published version of our research. In this present paper, we detail
the unexpected difficulties encountered when we attempted the seemingly straightfor-
ward task of locating copies of four papers detailing our own Capisco system, all
known to be published in the ACM Digital Library (references [1–4], hereafter labeled
[JCDL16A, JCDL16B, JCDL15, SIGWEB]).
These difficulties are primarily based on metadata errors that have crept into, and
propagated across, scholarly document collections; related work investigating the scope
and extent of these issues is summarized in Sect. 2. Section 3 presents a case study of
problems that can be encountered when conducting known-item searches (here, while
hunting for full text copies of our own papers), beginning with interface issues in the
ACM DL that led to confusion as to whether or not full-text searching was being used
(Sect. 3.1). We then report on the outcomes of searching through Google Scholar,
where full-text search is the default, but the results returned could still miss returning a
© Springer International Publishing AG 2017
S. Choemprayong et al. (Eds.): ICADL 2017, LNCS 10647, pp. 212–224, 2017.
https://doi.org/10.1007/978-3-319-70232-2_18
Writers of the Lost Paper 213
matching document, even when that lexical term appears in the full-text (Sect. 3.2).
These exploratory searches turned up additional search issues caused by errors in
document metadata (Sect. 3.3), by special characters (Sects. 3.4 and 3.5), and as a
by-product of stemming in indexing (Sect. 3.6). Section 4 outlines a potential solution:
a proxy-based approach that uses JavaScript for manipulation of the Document Object
Model (DOM) to modify a user’s queries so as to overcome the identified issues.
Section 5 presents our conclusions.
2 Related Work
To place our findings in context, we briefly discuss work on problems in metadata
creation and correction.
Gladney [6] discusses different approaches to creating metadata and makes a strong
case for using author-generated metadata, which could have resolved the problems
detailed in Sect. 3.3. A lack of formal investigation into the metadata creation process
has also been noted before (e.g., [9–11]). These studies point to issues similar to those observed
in our case study, such as inaccurate data entry.
Currier [9] especially describes the issues that are faced in commercial resource
discovery as a consequence of metadata errors, referring to the bargains to be had when
searching for “Plam Pilots” on eBay. The problem of metadata quality has also been
previously acknowledged by Beall [7], who observed the main types of data quality
errors in digital libraries and particularly highlighted the problem of blocked access.
Bui and Park [8] evaluated metadata quality at the American National Science
Digital Library (NSDL), analyzing more than one million Dublin Core metadata
records. They found that for about 17% of the data the creator (i.e., author) of the
resource was not specified at all and that there are whole collections without specified
creator metadata.
Park & Tosaka [12] further acknowledged the challenges in creating metadata,
especially for rapidly developing large-scale digital repositories. They identify as one
of the issues that existing semi-automatic metadata tools often target only selected
metadata elements, leading to the necessity for interoperability between tools and their
output. An alternative, or addition, to automatically created metadata is the user-driven
correction of available metadata: [5] describes a prototype system that allows users to
correct disambiguation and collocation errors, while [13] argues for a combination of
author-provided metadata with automatic data to improve findability.
3 Problems Encountered
This section details the problems that we encountered when attempting to locate copies
of four papers ([JCDL16A, JCDL16B, JCDL15, SIGWEB]) summarizing progress to
date on our Capisco project. All four were known to be archived in the ACM Digital
Library and so are also included—through an information sharing agreement—in
Google Scholar.
214 D. Bainbridge et al.
A. Simple Search box. In our initial attempt to locate the four papers, we searched the
ACM DL for the term Capisco (a common strategy for a known-item search is to use
what is believed to be a relatively uncommon term in the document as a query term).
We entered this search in the simple search box on the home page; given the similarity
of this search box to that of Google and Google Scholar, together with the statement on
that webpage that the underlying collection is, “The Full-Text Collection of all ACM
publications” (emphasis in the original), we assumed that this query would match to all
documents in the DL containing the term Capisco. This search uncovered [SIGWEB]
and [JCDL15] papers but not [JCDL16A] and [JCDL16B]. Inspection of the two
returned papers determined that they both include Capisco in the abstract and in the
case of [SIGWEB], in the title also. The other two papers only used the term Capisco in
the main text of the articles. After further experimentation with different searches we
determined that the most readily encountered search box to the ACM Digital Library
does not in fact search on the full text of the paper, but rather on the text of the
metadata (title, keywords, abstract, etc.).
B. Advanced Search: Any. We next attempted to use the Advanced Search facility, which
defaults to the Full-Text Collection and ‘Any Field’. This search again yielded the same
two papers: [SIGWEB] and [JCDL15]. On closer inspection of the Advanced Search options
(Fig. 1), we noticed that ‘Full-text’ is one of the options provided. Even though
‘Full-text’ is listed under ‘Common Fields’ (our emphasis), the observed result from
searching was that it is not included in an ‘Any field’ search.
Fig. 1. Advanced Search options of the ACM DL.
C. Advanced Search: Full text. On searching for Capisco with Advanced Search → Full-text,
all four papers were retrieved. Through further experimentation we determined that
‘Any field’ refers to metadata fields but not the text of the documents, while a
‘Full-text’ search in some way combines both the metadata and the document
text. We say ‘in some way combines’ because it is not clear whether this is actually a
union operation of text and metadata, or if it just so happens that the processed full text
contains title, abstract, keywords, etc. Certainly it is the case that for a search for the
term Matamua (the name of an author for the [JCDL15] paper) and then using the
‘result highlights’ option on the results list (which displays and highlights the matches
for the query), we see that the term Matamua is highlighted in both the document’s
extracted text and in the metadata (Fig. 2).
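The behaviour we observed can be summarized with a small toy model in Python. It is a sketch of the observed outcomes only, not of how the ACM DL is implemented: the `Paper` record, the two matching functions and the placeholder field contents are assumptions made for illustration, and modelling ‘Full-text’ as a union of metadata and body text is just one of the two possibilities discussed above.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    label: str
    title: str
    abstract: str
    body: str


def any_field_match(paper, term):
    """Metadata-only search: title and abstract, but not the body text."""
    needle = term.lower()
    return needle in paper.title.lower() or needle in paper.abstract.lower()


def full_text_match(paper, term):
    """Modelled here as a union of metadata and body text (an assumption)."""
    return any_field_match(paper, term) or term.lower() in paper.body.lower()


# Placeholder field contents; only the presence or absence of the term matters.
PAPERS = [
    Paper("SIGWEB", "Introducing Capisco", "(abstract mentioning Capisco)", "(body)"),
    Paper("JCDL15", "(title)", "(abstract mentioning Capisco)", "(body)"),
    Paper("JCDL16A", "(title)", "(abstract)", "(body mentioning Capisco)"),
    Paper("JCDL16B", "(title)", "(abstract)", "(body mentioning Capisco)"),
]

any_field_hits = [p.label for p in PAPERS if any_field_match(p, "Capisco")]
full_text_hits = [p.label for p in PAPERS if full_text_match(p, "Capisco")]
```

Under this model the ‘Any field’ search finds only the two papers whose metadata contains the term, while the ‘Full-text’ search finds all four, matching the results reported in A to C.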
Fig. 2. Matches for search term Matamua – one of the co-authors of [3].
Fig. 3. First three hits from a Google Scholar search for Capisco, English documents only,
results sorted by relevance.
the Google Scholar search engine conflates CP (Complementizer Phrase, a syntax tree
structure) with Capisco—though it is not clear why it would occur for this paper and
not for other linguistics papers.
We note that this metadata error is propagated across some, but not all, digital
libraries, repositories, and databases (e.g., the error is present in the University of
Illinois1 archive and Scopus2 but not in SemanticScholar3), and will have an unpre-
dictable impact on searches for the paper. Google Scholar, for example, is more for-
giving than the ACM DL; Google Scholar returns this paper as the sole result for all
permutations of a query on the title (the correct title both with and without quotes; the
title with Through and Semantic concatenated both with and without quotes).
Where could this error have come from? We note that the title of this paper runs to
two lines on the printed page, with the break occurring between the two words con-
catenated in the ACM DL metadata (Fig. 4a). We conjecture that the error is related to
line breaks on the printed page. To explore this hypothesis we searched for “Disam-
biguationAnnika” (the next potential concatenation error) and, consulting the ‘result
highlights’, see further erroneous concatenations across both line and column breaks
(Fig. 4b). We are at a loss as to why this issue occurs for the [JCDL15] paper and not
for the other three.
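Our conjecture can be illustrated with a short Python sketch. We do not know how the ACM DL extracts titles; the two joining functions below are hypothetical, and they only show how dropping the line break without inserting a space would produce the concatenation we observed.

```python
def join_lines_naive(lines):
    """Join physical lines with no separator (the suspected faulty behaviour)."""
    return "".join(line.strip() for line in lines)


def join_lines_safe(lines):
    """Join physical lines with a single space between them."""
    return " ".join(line.strip() for line in lines)


# The [JCDL15] title as it breaks across two lines on the printed page.
title_lines = [
    "Improving Access to Large-scale Digital Libraries Through",
    "Semantic-enhanced Search and Disambiguation",
]

naive_title = join_lines_naive(title_lines)
safe_title = join_lines_safe(title_lines)
```

The naive join yields a title containing ThroughSemantic, so a phrase query on the correct title can no longer match the stored metadata.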
Fig. 5a. Advanced search for “David Copperfield” in the ACM DL.
1 https://experts.illinois.edu/en/publications/improving-access-to-large-scale-digital-libraries-throughsemantic.
2 http://bit.ly/2kTTxd3.
3 https://pdfs.semanticscholar.org/0936/fece67ba70cf263dcf8fdef6e2aa77ea1145.pdf.
Fig. 5b. Search result for query inadvertently including the fi ligature.
We know from our experiences detailed in Sect. 3.1 to use the Advanced Search,
Full-Text option, to pick up on mentions of the novel in the document text (since its
title is unlikely to appear in the metadata for technical research papers). This search
yields a single result—and it is not our [JCDL16B] paper (Fig. 5b). Inspecting the
matched text, we realize that we were unlucky in our choice of source for the text to
copy/paste; of the two mentions in our paper, we chose the italicized version that
included a ligature between the f and the i (fi).
A search for “David Copperfield” without the ligature yields a more plausible 41
hits (including [JCDL16B], but not the paper in Fig. 5b). We note that searching for
“David Copperfield” (ligature included) in Google Scholar returns documents both
with and without the ligature.
Fig. 6a. Reference [5] title in the print version of the paper.
Fig. 7b. Search result demonstrating that punctuation is stripped and query terms are stemmed.
57 documents were returned. An examination of the results shows that the hyphen
in the phrase was stripped out, and that the terms in the phrase were stemmed (Fig. 7b).
That hyphens are stripped is not surprising, as many search engines do not index
4 http://dl.acm.org/documentation/Types.htm#phrases.
punctuation. It is, however, misleading for the transformed query to include the
hyphen. It was completely unexpected to see that terms within a phrase are stemmed,
contrary to convention and to the provided documentation. We do, however, note that
in practice this implementation of phrase searching can be useful (in this case, the
search retrieved both [SIGWEB] and [JCDL16]).
Figure 8a shows a snapshot of the user visiting the ACM Digital Library through
MEDDLE. The MEDDLE information box at the top of the page indicates that the
quick search box has been modified to perform full text searching and to help find
search terms that are accidentally concatenated due to line wrap issues (Sect. 3.3).
Figure 8b shows the result of a search for Capisco where all four relevant documents
are returned, as was originally expected by the authors (that is, MEDDLE has con-
ducted a full-text search via the quick search box). Not shown in the figure, had the user
5 https://bedrock.resnet.cms.waikato.ac.nz/meddle/.
searched using the intended title of the first document returned (with Through Semantic
rather than ThroughSemantic) with MEDDLE then this too would result in a successful
query. What MEDDLE does in this situation is deliberately add in additional query
terms that are the concatenation of adjacent pairs of terms in the query that the user has
entered.
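The query expansion just described can be sketched as follows. MEDDLE itself is implemented in JavaScript injected into the page; the Python function below, including its name, is a hypothetical illustration of the idea only, and how the extra terms are combined with the original query in the digital library's syntax is left open.

```python
def expand_with_concatenations(query):
    """Return the query terms plus the concatenation of each adjacent pair."""
    terms = query.split()
    joined = [first + second for first, second in zip(terms, terms[1:])]
    return terms + joined
```

For the query Through Semantic this yields the terms Through, Semantic and ThroughSemantic, so a record whose metadata contains the erroneous concatenation can still be matched.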
A further feature of MEDDLE addresses issues connected with ligatures and
accents. The bottom line of the MEDDLE information box (Fig. 8a) contains check
boxes for ligature expansion and accent folding. If either or both of these check boxes
are selected when a query is submitted then the injected JavaScript in the page checks
the query terms for instances of these types of characters and substitutes suitable
replacements (e.g., a query including José Borbinha will be changed to Jose Borbinha).
Even if the user has not activated these options, MEDDLE still monitors and alerts a
user if an accent or ligature is present in the search terms—giving the user the
opportunity to enable the option for that query.
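A minimal sketch of ligature expansion and accent folding is given below in Python, using the standard `unicodedata` module. It is not the MEDDLE code, which runs as JavaScript in the page; the function names and the small ligature table are assumptions made for illustration.

```python
import unicodedata

# Common Latin typographic ligatures and their plain-letter expansions.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def expand_ligatures(text):
    """Replace each ligature character with its component letters."""
    return "".join(LIGATURES.get(ch, ch) for ch in text)


def fold_accents(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def has_special_characters(text):
    """True if the query contains a ligature or accent worth alerting on."""
    return text != fold_accents(expand_ligatures(text))
```

The last function corresponds to the alerting behaviour: the query is left unchanged, but the user can be told that a ligature or accent is present.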
Changes in the way that a digital library handles indexing, text extraction, and
search features may necessitate changes to its MEDDLE extension. These changes
might make MEDDLE superfluous (a problem that the authors would welcome!) or
necessitate modifications to the MEDDLE extension to achieve the same result. They
may also introduce new issues for MEDDLE to address. The challenge then is to
include monitoring functionality to help identify when a MEDDLE-ed digital
library has changed. One approach would be a form of digital library ‘unit testing’,
where the monitoring software periodically conducts pre-determined queries and
compares their expected results to the hits returned by that digital library.
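Such a check could look like the following Python sketch. It assumes a search function that takes a query string and returns a list of document identifiers; the two stub search functions, the `EXPECTED` table and the function names are hypothetical stand-ins, since a real monitor would have to query the digital library itself.

```python
from typing import Callable, Dict, List, Set

# Pre-determined queries and the documents each one is expected to return.
EXPECTED: Dict[str, Set[str]] = {
    "Capisco": {"SIGWEB", "JCDL15", "JCDL16A", "JCDL16B"},
}


def check_expectations(
    search: Callable[[str], List[str]],
    expected: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Return, per query, the expected documents missing from the results."""
    missing = {}
    for query, wanted in expected.items():
        lost = wanted - set(search(query))
        if lost:
            missing[query] = lost
    return missing


def metadata_only_search(query: str) -> List[str]:
    """Stub standing in for a search that ignores the document text."""
    return ["SIGWEB", "JCDL15"]


def full_text_search(query: str) -> List[str]:
    """Stub standing in for a search that covers the document text."""
    return ["SIGWEB", "JCDL15", "JCDL16A", "JCDL16B"]
```

A non-empty result signals that the digital library no longer behaves as the extension assumes and that the extension should be reviewed.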
And, of course, it might not be possible to develop a MEDDLE-based work-around
for every issue identified. For example, the anomalies in full text searching in Google
Scholar appear to be based in the implementation of the index and the matching
algorithm, and as such are not amenable to correction by manipulating the query string.
When researchers publish their research, they hope that others will find it and build
upon it. This paper illustrates that making a paper available in online document col-
lections does not necessarily make it findable, despite the best efforts of authors and
system developers. For example, in 2014 we chose a project title—Capisco—believing
that it was distinctive and would allow others to easily isolate the project’s research
papers. However, in 2017 we discovered that this strategy is not domain-agnostic
(Sect. 3.2): while our Capisco project is a highly distinctive ‘brand’ across IT publi-
cations, we see that in a broader collection such as Google Scholar that uniqueness is
more difficult to achieve (Sect. 3.3).
Moreover, the usefulness of a distinctive term for searchers is impacted by the
search engine algorithms: a search for Capisco retrieves [JCDL16B] in the ACM DL
under full text search but not under an Any Field search (the Advanced Search default),
because the term appears in the document text but not in the metadata (Sect. 3.1); and
the search does not retrieve [JCDL16B] from Google Scholar for less clear reasons. The
important lesson here appears to be that researchers should be much more careful in
creating metadata—particularly author-specified keywords and abstract terms—and not
rely on the presence of a term in the document body to support findability. Similarly,
glitches in automated metadata extractions can also severely impact the findability of
individual papers (Sect. 3.3). Text extraction errors are difficult to predict and can have
erratic effects on search results, even—or especially—when the search terms are
copy/pasted from the document or reference (Sects. 3.4 and 3.5).
Since metadata-related errors and general findability problems appear to resist
simple automated solutions, we believe that involving human quality checks is warranted.
We suggest that, as a condition of inclusion in the ACM DL (or similar),
authors be required to verify the findability of their papers by reviewing metadata as
seen in the DL prior to the paper being made available publicly. Adding this simple
additional step to the publication cycle would provide low-cost improvements to DLs
that would greatly enhance their value as research communication tools.
We also call for more transparency and clarity on the part of digital libraries, as to
why a given document is returned as a match to a query and how the query is
understood by the search engine. The ACM DL’s ‘results highlights’ facility is a step in
the right direction, as is its query summary, neither of which is provided by Google
Scholar. However, our experiments indicate that, in their present implementation, these
facilities do not provide sufficient detail for the searcher to fully understand the results
displayed—and so can lead unwary searchers astray, thinking that they have a com-
prehensive set of papers on a given topic when it is in actuality incomplete.
As the authors have personally experienced, it can take quite an unusual search
activity to realize that a digital library is not functioning as the user expects—but the
MEDDLE approach is premised on knowledge of such issues. While Sect. 3 makes
some inroads on generalizing the types of problems to look for, it cannot be claimed to
be exhaustive; there is scope for a principled and comprehensive review of the major
research digital libraries to identify further issues and techniques for addressing them.
Key to that approach would be the compilation of queries that trigger these issues
together with a set of ideal results for each query. Through this, as we have done in
Sect. 3, one can reason about the difference between one’s understanding of the
digital library’s search capability and its actual implementation.
The MEDDLE approach described here is a refinement of the technique first
developed in our collaboration with the HathiTrust Digital Library (HTDL): we created
a web browser add-in to create a mashup of three websites—the HTDL and two
web-based offerings operated independently by HTRC [14]. The user interacts with the
HTDL as usual, and at strategic locations in the interface functionality drawn from the
research systems—which take account of the user’s current context—is seamlessly
blended in. The advantage of the MEDDLE approach is that the user does not need to
install extensions to their browser.
Additionally, we note that the identified issues and their suggested work-arounds
could be offered to the digital library provider to suggest further refinements for that
system. A prime example would be the MEDDLE identification and handling of
accents and ligatures, which could be easily incorporated into a digital library.
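As a rough illustration of what such handling could look like on the library side, the following Python sketch folds accented characters and compatibility ligatures before matching. It is not the MEDDLE implementation: the function name fold_for_matching is hypothetical, and the approach (Unicode compatibility decomposition from the Python standard library) is an assumption made for this example.

```python
import unicodedata


def fold_for_matching(text: str) -> str:
    """Hypothetical helper: reduce accents and ligatures for lenient matching."""
    # NFKD splits accented letters into base letter + combining mark and
    # expands compatibility ligatures such as U+FB01 (fi) into plain letters.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks so that only the base letters remain.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() gives a case-insensitive form suitable for comparison.
    return stripped.casefold()
```

Applying the same folding to both the query and the indexed metadata would let an accented author name match its unaccented spelling. Letters without a Unicode decomposition, such as 'æ' or 'ø', would need an explicit mapping table, which this sketch omits.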
References
1. JCDL16A: Hinze, A., Bainbridge, D., Cunningham, S.J., Downie, J.S.: Low-cost semantic
enhancement to digital library metadata and indexing: simple yet effective strategies. In:
JCDL 2016, pp. 93–102 (2016)
2. JCDL16B: Hinze, A., Coleman, M., Cunningham, S.J., Bainbridge, D.: Semantic
bookworm: mining literary resources revisited. In: JCDL 2016, pp. 227–228 (2016)
3. JCDL15: Hinze, A., Taube-Schock, C., Bainbridge, D., Matamua, R., Downie, J.S.:
Improving access to large-scale digital libraries through semantic-enhanced search and
disambiguation. In: JCDL 2015, pp. 147–156 (2015)
4. SIGWEB: Hinze, A., Taube-Schock, C., Bainbridge, D., Cunningham, S.J., Downie, J.S.:
Introducing capisco: a semantically-enhanced search and discovery system for large-scale
text corpora. SIGWEB Newslett. 14, November 2015. Autumn 2015, Article 4
224 D. Bainbridge et al.
5. Bainbridge, D., Twidale, M.B., Nichols, D.M.: That’s ‘é’ not ‘þ’ ‘?’ or ‘☐’: a user-driven
context-aware approach to erroneous metadata in digital libraries. In: Proceedings of the 11th
Annual International ACM/IEEE Joint Conference on Digital libraries (JCDL 2011), pp. 39–48.
ACM, New York, NY, USA (2011)
6. Gladney, H.: Preserving Digital Information. Springer, Heidelberg (2007). doi:10.1007/978-
3-540-37887-7
7. Beall, J.: Metadata and data quality problems in the digital library. J. Digital Inf. 6(3),
1368–7506 (2006)
8. Bui, Y., Park, J.-R.: An assessment of metadata quality: a case study of the national science
digital library metadata repository. In: Proceedings of the Annual Conference of CAIS/Actes
du Congrès Annuel de l’ACSI (2013)
9. Currier, S., Barton, J., O’Beirne, R., Ryan, B.: Quality assurance for digital learning object
repositories: issues for the metadata creation process. ALT-J Res. Learn. Technol. 12(1),
5–20 (2004)
10. Guy, M., Powell, A., Day, M.: Improving the quality of metadata in e-print archives.
Ariadne 38 (2004)
11. Park, J.-R.: Metadata quality in digital repositories: a survey of the current state of the art.
Cataloging Classif. Q. 47(3–4), 213–228 (2009)
12. Park, J.-R., Tosaka, Y.: Metadata creation practices in digital repositories and collections:
schemata, selection criteria, and interoperability. Inf. Technol. Libr. 29(3), 104–116 (2010)
13. Maurer, M.B., McCutcheon, S., Schwing, T.: Who’s doing what? Findability and
author-supplied etd metadata in the library catalog. Cataloging Classif. Q. 49(4), 277–310
(2011)
14. Bainbridge, D., Downie, J.S.: All for one and one for all: reconciling research and
production values at the hathitrust through user-scripting. In: Joint Conference on Digital
Libraries JCDL 2017, pp. 283–284 (2017)
Result Set Diversification in Digital Libraries
Through the Use of Paper’s Claims
1 Introduction
type of information need no doubt that Digital Libraries have better quality content than
the Web. For instance, a user who wants to discover whether a drug is beneficial for a
specific disease currently has to conduct an exploratory search, submitting several
queries. For each query, the user will try to get a “consensus” of what the research
community has found. Is there a better alternative? In this
work, we explore the idea of diversification of the returned set of a given query to help
the user in such a task. In particular, we focus on a key aspect of research papers to help
the user in her quest: claims. By claims in scientific papers we mean statements that
express associations between entities. This is of particular relevance in the medical
domain where the consumption of a drug, a substance, a fruit, etc., has an effect on a
disease. One of the challenges of considering the claims of papers is that the association
between two entities can be subject to different interpretations. Thus, in this paper, we
model a particular case that can arise when interpreting some of the associations
between the entities: controversy. One instance of several controversial claims was
first reported in [2]. By submitting several queries to PubMed and manually analyzing
the result sets relating 50 substances to cancer, the authors discovered that most of
the substances were reported both to increase and to decrease the risk of cancer! Such
cases motivate our work. Herein, we propose a mechanism that diversifies the result set
of a query to help the user discover entities that may be involved in a controversy.
In this work, we aim at modeling the claims of research to perform a re-ranking of
the result set of a query represented by two entities. Our approach consists of three
basic steps given a pair <entity, disease>: firstly, extract associations between the
pair from research papers; secondly, represent the associations using a neural
embedding representation of documents; and thirdly, deliver a re-ranking of the result
set to ease the discovery of controversial claims.
Our proposed approach brings several benefits. For the information provider, it
adds value to current retrieval mechanisms. For the user, it offers the possibility of
making an informed decision that can potentially save her life. Moreover, researchers
in the medical domain who seek to solve complex problems can also benefit from our
approach: they will be able to find controversial claims that need further
investigation.
Aiming at this challenge, in this paper, we focus on the design and implementation
of a technique that can re-rank documents based on a fundamental aspect of research
papers: claims. The remainder of this paper is organized as follows. Section 2 provides
definitions and the problem we aim at solving in this paper. Section 3 overviews related
work. Sections 4 and 5 describe the experimental setup and the evaluation of our
proposed approach. Lastly, Sect. 6 presents our concluding remarks.
In this section, we provide definitions and the problem we aim at solving in this paper.
Let’s first define what a claim is:
3 Related Work
Our research is related to efforts found in the Web search community towards allevi-
ating biases. Indeed, biases have been a constant problem on the Web and have
received considerable attention from different aspects. For instance, in [6] domain bias
was investigated in Web search. Domain bias is defined as the user’s propensity to
believe that a page is more relevant just because it comes from a particular domain. In
[1] it was found that users show biases by favoring information that confirms their
beliefs when conducting a search. Through a series of experiments, the researchers
showed that search engines urgently need to cope with what they called the bias and
accuracy problem in the result set of a query. To deal with the problem of bias, several approaches to
deliver result diversification have been proposed. These approaches could be catego-
rized as either implicit or explicit [7]. Basically, they differ in how they account for the
different query aspects that can help to diversify the result set for a given query.
228 J.M. González Pinto and W.-T. Balke
Implicit approaches make the assumption that similar documents will cover similar
aspects of the query and should therefore be in the final ranking. The challenge for
these methods is to discover the possible different aspects in an unsupervised fashion.
A pioneering example presented in [8] introduces a method that basically combines
query-relevance with information-novelty in the context of retrieval and summariza-
tion. In a similar line of thought in [9] a method was introduced that exploits statistical
language modeling to cope with redundancy and relevance. In their work, the problem
of sub-topical retrieval is introduced. Basically, the idea is to find documents that cover
different sub-topics (aspects) of a query. In [10] the use of clustering was introduced to
improve the effectiveness of the diversification of the results of a query. Basically, the
idea is to first cluster the candidate documents and then restrict the diversified approach
to documents associated with clusters that potentially contain many relevant docu-
ments. A study comparing implicit diversification techniques with cluster based
approaches that select cluster centroids as the representative documents in the final
result list is given in [11]. The authors concluded that clustering is usually a better
approach for single sub-topics of a given query, whereas implicit diversification
methods turned out to be better for quick coverage of distinct sub-topics. Another line
of research, the explicit approaches, treats diversification from a different
perspective. These efforts explicitly model the query aspects considered relevant for a
specific domain. Usually, some type of external knowledge is exploited to account for
these aspects. For instance, the authors of [3] address the problem of diversification
by assuming that a taxonomy exists. With this assumption,
diversification is achieved by favoring documents from different categories and
penalizing those that fall into already covered categories. A similar approach is used for
product search in [12] where in addition to the categories of products, attributes within
each category were considered. In [13] the query aspects were taken from the query log
of a commercial search engine. Then, they proposed a ranking to satisfy each aspect of
the original query. Another approach that exploits the idea of automatic query refor-
mulations using TREC subtopics is the work of [7]. The researchers introduced a
probabilistic approach that explicitly considers the aspects of the query as given by the
sub-topics track in the TREC diversification task. The presented approach favors
documents that cover those aspects that are not yet covered in the current results set of
the generated candidate list. Our work is related to the explicit category of
diversification. We promote claims to first-class citizens and examine how
controversial claims, in particular, can arise in health-related queries.
4 Methodology
In this section, we introduce our methods to solve our novel problem of Claim
Diversification, to explicitly rank the result set of a query represented as the pair
<entity, disease>.
4.1 Dataset
To rely on high quality content, we used PubMed as our main source of documents. For
each pair <entity, disease>, we submitted a query represented as the following query
pattern in PubMed:
(help AND prevent) OR (lower AND risk) OR (increase OR increment AND risk)
OR (decrease OR diminish AND risk) OR (factor AND risk) OR (associated AND risk)
AND (entity AND disease).
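For illustration, the pattern above can be instantiated for a concrete pair with a few lines of Python. This is only a sketch: the names RISK_PATTERN and build_query are hypothetical, the pattern is copied verbatim from the text, and how the unparenthesized AND/OR operators are grouped is left to PubMed.

```python
# Query pattern copied verbatim from the text; grouping of the bare
# AND/OR operators is deliberately left untouched.
RISK_PATTERN = (
    "(help AND prevent) OR (lower AND risk) OR "
    "(increase OR increment AND risk) OR "
    "(decrease OR diminish AND risk) OR "
    "(factor AND risk) OR (associated AND risk)"
)


def build_query(entity: str, disease: str) -> str:
    """Hypothetical helper: fill the pattern for one <entity, disease> pair."""
    return f"{RISK_PATTERN} AND ({entity} AND {disease})"


if __name__ == "__main__":
    print(build_query("tea", "cancer"))
```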
The ranking provided by PubMed’s retrieval system is our initial set of ranked
documents. However, not all the documents retrieved by the query were used in our
experiments. The main reason was that we wanted to be sure that a claim corresponds
to the main contribution of a paper. Thus, we proceeded as follows: firstly, we filtered
out documents with no conclusions metadata. Secondly, we split each document into
sentences. And thirdly, for each document, we selected as its claim the sentence that
contained <entity, disease>. This preprocessing step had a positive impact on the
quality of the documents that we used.
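The three preprocessing steps above can be sketched as follows. Several details are assumptions made for this example and are not stated in the text: documents are dictionaries with a hypothetical "conclusions" field, sentences are split with a naive regular expression, only the conclusions text is searched, and the first sentence mentioning both terms is taken as the claim.

```python
import re
from typing import Dict, List, Optional


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_claim(doc: Dict[str, str], entity: str, disease: str) -> Optional[str]:
    """Return the claim sentence of a document, or None if it is filtered out."""
    conclusions = doc.get("conclusions")  # hypothetical metadata field
    if not conclusions:
        return None  # step 1: drop documents without conclusions metadata
    for sentence in split_sentences(conclusions):  # step 2: split into sentences
        lowered = sentence.lower()
        # step 3: keep the first sentence that mentions both members of the pair
        if entity.lower() in lowered and disease.lower() in lowered:
            return sentence
    return None
```

Documents for which extract_claim returns None would simply be left out of the set that is later re-ranked.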
as the set of internal documents, i.e., those which lie inside the cluster (c, r), and
as the set of external documents. Clustering is applied recursively to the external set.
The function d in our case is the Word Mover’s Distance (WMD) [18]. The algorithm ends
when all documents have been assigned to a cluster. Afterwards, the centers are promoted
to the top of the ranking. Furthermore, a center chosen first has preference over the
subsequent ones. After that, the remaining documents are returned in the order given by
their internal membership with respect to their corresponding centers. More formally,
Algorithm 1 shows how to compute the List of Clusters Diversification (LCD).
Let’s clarify two important aspects of the algorithm. Firstly, the selection of cluster
centers. The algorithm uses a ranked list of results and takes the top result as the
first cluster center. The selection of the subsequent cluster centers (line 11 of the
algorithm) was extensively investigated in [4] using different heuristics.
Experimentally, it was shown that the best strategy is to choose as the next center the
object that maximizes the sum of distances to the previous centers. We used the same
heuristic in our work. Secondly, the
parameter k of the algorithm is used to set the size of the clusters. Empirically, it has
been shown that when working with high dimensional metric spaces, the value of k can
be dynamically increased as many documents may have the same distance to the center.
This is helpful because the number of computations required to select the centers of the
clusters can be dramatically reduced. Using a large value of k can help alleviate the cost
of distance computations. In our experiments, we set k to six after evaluating a range of
values and manually assessing the tradeoffs of the computational cost versus diversity
of the result set.
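To make the description concrete, the following Python sketch gives one possible reading of the List of Clusters Diversification. It is not a transcription of Algorithm 1: the function names are hypothetical, the cluster size k is kept fixed rather than increased dynamically on ties, documents are assumed to be distinct, and a simple Jaccard distance over word sets stands in for WMD so that the example has no external dependencies.

```python
from typing import Callable, List, Sequence, TypeVar

Doc = TypeVar("Doc")


def jaccard_distance(a: str, b: str) -> float:
    """Toy stand-in for WMD: 1 - |intersection| / |union| of the word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return 1.0 - len(wa & wb) / len(union) if union else 0.0


def lcd_rerank(ranked: Sequence[Doc],
               dist: Callable[[Doc, Doc], float],
               k: int) -> List[Doc]:
    """Re-rank a result list: cluster centers first, then cluster members."""
    remaining = list(ranked)
    centers: List[Doc] = []
    clusters: List[List[Doc]] = []
    while remaining:
        if not centers:
            center = remaining[0]  # the top-ranked result is the first center
        else:
            # heuristic from the text: maximize the sum of distances
            # to all previously chosen centers
            center = max(remaining,
                         key=lambda u: sum(dist(u, c) for c in centers))
        remaining.remove(center)
        # internal set: the k documents closest to the center
        members = sorted(remaining, key=lambda u: dist(center, u))[:k]
        # external set: everything else, clustered in the next iteration
        remaining = [u for u in remaining if u not in members]
        centers.append(center)
        clusters.append(members)
    # centers are promoted to the top in the order they were chosen; members
    # follow cluster by cluster, nearest to their center first
    return centers + [u for members in clusters for u in members]
```

With claim sentences as documents, a call such as lcd_rerank(claims, jaccard_distance, 6) would mirror the setting k = 6 reported above, with the caveat that the distance function differs from the one used in the paper.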
5 Experiments
We are aware that the TREC09 and TREC10 collections [19] provide data samples and
queries related to the diversification problem in Information Retrieval. Unfortunately,
no such data is available for the novel problem presented in this paper where claims are
first class citizens. Thus, to evaluate our results, we conducted a series of experiments
by querying PubMed as indicated in Sect. 4.1. Moreover, we propose to use as a metric
of our evaluation the Entropy at the top t documents to measure the amount of
information expressed in the documents at each t. Basically, the idea is that if we
achieve higher diversification than the initial result set delivered by PubMed, then we
should have a higher entropy. In other words, our proposed method should more evenly
divide its probability mass across the documents. Thus, a lower entropy would imply
narrow focus of the result set (bias). More formally, entropy is defined as:
\[ H(X) = -\sum_{i=1}^{m} p(x_i) \log p(x_i) \tag{4} \]
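As a small worked illustration of Eq. (4), the sketch below computes the entropy of the top t results. The text does not spell out what the events x_i are, so this example assumes, purely for illustration, that they are the distinct words of the top t claim sentences and that p(x_i) is the relative word frequency. The natural logarithm and the function names are likewise assumptions.

```python
import math
from collections import Counter
from typing import Iterable, Sequence


def entropy(probabilities: Iterable[float]) -> float:
    """Shannon entropy, natural log; zero-probability events contribute nothing."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def entropy_at_top(claims: Sequence[str], t: int) -> float:
    """Entropy of the word distribution in the top-t claims (assumed event space)."""
    counts = Counter(word for claim in claims[:t] for word in claim.lower().split())
    total = sum(counts.values())
    return entropy(c / total for c in counts.values())
```

Under this assumption, a top-t list whose claims repeat the same vocabulary yields a lower value than a list whose claims use varied vocabulary, which matches the intuition that lower entropy indicates a narrower focus.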
We performed experiments with 16 entities related to cancer: wine, tea, sugar, salt,
potato, pork, onion, olive, milk, lycopene, lemon, egg, coffee, cigar, beef and bacon.
We took these entities from the cases studied in [2]. We
begin in the following paragraphs with a brief discussion of the three main cases found
among the 16 entities we analyzed. More specifically, we explain three entities that
reflect our main findings: tea, wine and coffee.
In Figs. 1, 2 and 3 we plot the entropies of the top 5, 10, 15 and 20 results for
three queries representing three different entities related to cancer: tea, wine and coffee.
The label “no diversification” in the plots denotes the retrieved list of documents to
which our approach is not applied. The label “with diversification” corresponds to
our proposed approach.
The first case, tea and cancer, is shown in Fig. 1. We can observe that when
diversification is applied there is a constant positive difference with respect to the
default result set. According to our hypothesis, when diversification is applied up to the
top 20 results the user could be better informed.
The second case, shown in Fig. 2, corresponds to wine and cancer. As can be
observed, the situation is different: up to the top 10 results our approach could
potentially help the user to be aware of a broader set of associations between the
entities. However, from the top 15 onwards the differences are negligible.
In Fig. 3 we have the case of coffee and cancer. It seems that our approach is able to
diversify the result set. In this particular case, the differences between our approach and
the default result set remain constant.
In summary, what we learned from these preliminary experiments is that up to the
top 10 results diversification makes a difference for this type of data. Even though the
differences look small, note that our preprocessing step removed a large share of the
data. Because of this preprocessing, the differences are less pronounced than might
have been expected.
Comparison with MMR. To further validate our proposed solution, we also considered
the diversity-based re-ranking method called Maximal Marginal Relevance (MMR) [8]. We
proceeded as follows: we used two metrics to evaluate the differences between the two
methods using the top 10 results. Firstly, we used entropy as before. Secondly, we
computed the correlation of word frequencies between each method and the first 10
results with “no diversification”. The idea behind this metric is simple but powerful:
the more correlated a method’s output is with the “no diversification” set, the worse
its performance. In this work, we used Pearson’s correlation with 95% confidence
intervals.
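For readers unfamiliar with the baseline, the sketch below shows the greedy selection rule of MMR [8]: at each step it picks the candidate that maximizes lam * relevance minus (1 - lam) * the maximum similarity to the documents already selected. The settings used in the experiments are not given in the text, so the trade-off value lam = 0.5, the relevance scores passed as a dictionary, and the function name are assumptions for illustration only.

```python
from typing import Callable, Dict, Hashable, List, Sequence


def mmr_rerank(candidates: Sequence[Hashable],
               relevance: Dict[Hashable, float],
               similarity: Callable[[Hashable, Hashable], float],
               lam: float = 0.5,
               top_n: int = 10) -> List[Hashable]:
    """Greedy MMR: trade relevance off against redundancy with earlier picks."""
    selected: List[Hashable] = []
    pool = list(candidates)
    while pool and len(selected) < top_n:
        def score(doc: Hashable) -> float:
            # redundancy is the highest similarity to anything already chosen;
            # it is 0 for the very first pick
            redundancy = max((similarity(doc, s) for s in selected), default=0.0)
            return lam * relevance[doc] - (1.0 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```

Setting lam to 1 reproduces a pure relevance ranking, while smaller values push near-duplicates of earlier results further down the list.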
To our surprise, the differences between the two methods when using entropy as
our main metric are not statistically significant. In Fig. 4 we observe the comparisons
with each entity and there is no clear winner: in some cases, MMR is better but in half
of them LCD does a better job.
However, when we computed the correlation of word frequencies between each
method and the top 10 results with no diversification, LCD turned out to be slightly
better. In particular, it outperformed MMR in 10 out of the 16 entities.
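The word-frequency correlation can be sketched as below. This is an illustrative reading, not the authors' code: the function names are hypothetical, tokenization is a plain whitespace split, word counts are aligned over the union of both vocabularies, and the confidence intervals mentioned above are not computed.

```python
import math
from collections import Counter
from typing import List, Sequence


def word_frequencies(texts: Sequence[str]) -> Counter:
    """Count lower-cased, whitespace-separated words over a list of texts."""
    return Counter(word for text in texts for word in text.lower().split())


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def frequency_correlation(reranked: Sequence[str], baseline: Sequence[str]) -> float:
    """Correlate word counts of a re-ranked top-10 with the undiversified top-10."""
    freq_a, freq_b = word_frequencies(reranked), word_frequencies(baseline)
    vocabulary = sorted(set(freq_a) | set(freq_b))
    return pearson([freq_a[w] for w in vocabulary],
                   [freq_b[w] for w in vocabulary])
```

A value close to 1 means the re-ranked list uses almost the same vocabulary as the undiversified list, which under the metric described above counts against the method.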
Discussion. One limitation of our current analysis is that we cannot yet evaluate our
approach qualitatively. We can only observe some differences, using entropy as
our metric, in favor of the idea of allowing a user to get a better overview of a result set.
Fig. 4. Entropies of MMR (left bar in each pair) and the LCD model (right bar in each pair)
Nevertheless, this is a rather complicated and interesting query type and further work is
needed to overcome our current limitations. On the other hand, we could manually
observe examples where our approach seems promising. Consider for instance the
following top 5 results of our approach for the pair <tea, cancer>:
1. “over consumption of fish sauce, pickled food, moldy cereals, irregularly taking
meals and familial history of malignancy may be the local risk factors for high
occurrence of gastric cancer, and fresh vegetables and fruits, green tea may have
protective effects on it”
2. “our results did not show a protective role of tea in five major cancers”
3. “tea consumption protects against oral cancer in non-smokers or non-alcohol
drinkers, but this effect may be obscured in smokers or alcohol drinkers”
4. “drinking hot tea, a habit common in golestan province, was strongly associated with
a higher risk of esophageal cancer”
5. “we observed evidence to support a potential beneficial influence for breast cancer
associated with moderate levels of tea consumption (three or more cups per day)
among younger women”.
We motivated and presented the novel Claim Diversification Problem for Digital
Libraries. In particular, we addressed queries in the medical domain where one entity
(a substance, a drug, a medicine, a product, etc.) has some influence on a disease.
We build on previous work on Web search, where diversification was introduced to deal
with bias in the result set of complex, ambiguous queries. In our case, we model
specifically one key aspect of scientific papers: claims. Claims in this work are the
sentences used in medical research papers to assess the association between two
entities.
Our results look promising, and we envision future work to specifically assess the
value of promoting claims as the text snippets to present to users from real world
queries. Furthermore, we would like to validate the diversification approach that we
proposed in this paper with user feedback. Moreover, we would like to improve our
current approach to account for more complex cases where the claims involve more
than two entities. Currently, we do not support this type of query. To accomplish such
a task, we will investigate more sophisticated models from the natural language
processing community to extract and semantically represent these cases.
We also believe that “time” in the medical domain should be considered as a
relevant factor in the diversification process. Therefore, we will incorporate this
important factor in our work.
References
1. White, R.: Beliefs and biases in web search. In: Proceedings of 36th International ACM
SIGIR conference on research and development in Information Retrieval - SIGIR 2013, p. 3
(2013)
2. Schoenfeld, J.D.: Is everything we eat associated with cancer? A systematic. Am. J. Clin.
Nutr. 97, 127–134 (2013)
3. Agrawal, R., Gollapudi, S., Halverson, A., Ieong, S.: Diversifying search results. In:
Proceedings of the Second ACM International Conference on Web Search and Data Mining -
WSDM 2009, p. 5 (2009)
4. Chávez, E., Navarro, G.: A compact space decomposition for effective metric indexing.
Pattern Recognit. Lett. 26, 1363–1376 (2005)
5. Gil-Costa, V., Santos, R.L.T., MacDonald, C., Ounis, I.: Modelling efficient novelty-based
search result diversification in metric spaces. J. Discret. Algorithms 18, 75–88 (2013)
6. Ieong, S., Mishra, N., Sadikov, E., Zhang, L.: Domain bias in web search. In: WSDM 2012
Proceedings of Fifth ACM International Conference on Web Search and Data Mining,
pp. 413–422 (2012)
7. Santos, R.L.T.T., Macdonald, C., Ounis, I.: Exploiting query reformulations for web search
result diversification. In: Proceedings of 19th International Conference on World Wide Web,
pp. 881–890 (2010)
8. Carbonell, J., Goldstein, J.: The use of MMR, diversity-based reranking for reordering
documents and producing summaries. In: Proceedings of the 21st Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval - SIGIR 1998,
pp. 335–336 (1998)
9. Zhai, C.X., Cohen, W.W., Lafferty, J.: Beyond independent relevance: methods and
evaluation metrics for subtopic retrieval. In: Proceedings of the 26th Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval,
pp. 10–17 (2003)
10. He, J., Meij, E., De Rijke, M.: Result diversification based on query-specific cluster ranking.
J. Am. Soc. Inf. Sci. Technol. 62, 550–571 (2011)
11. Carpineto, C., D’Amico, M., Romano, G.: Evaluating subtopic retrieval methods: clustering
versus diversification of search results. Inf. Process. Manag. 48, 358–373 (2012)
12. Chen, X., Wang, H., Sun, X., Pan, J., Yu, Y.: Diversifying product search results. In: SIGIR,
pp. 1093–1094 (2011)
13. Radlinski, F., Dumais, S.: Improving personalized web search using result diversification. In:
Proc. 29th Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, SIGIR 2006, p. 691 (2006)
14. Mikolov, T., Corrado, G., Chen, K., Dean, J.: Efficient estimation of word representations in
vector space. In: Proceedings of International Conference on Learning Representation (ICLR
2013), pp. 1–12 (2013)
15. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In:
International Conference on Machine Learning - ICML 2014, vol. 32, pp. 1188–1196 (2014)
16. Pyysalo, S., Ginter, F., Moen, H., Salakoski, T., Ananiadou, S.: Distributional semantics
resources for biomedical text processing. In: Proceedings of LBM 2013 (2013)
17. Pennington, J., Socher, R., Manning, C.: Glove: global vectors for word representation. In:
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
(EMNLP), pp. 1532–1543 (2014)
18. Kusner, M.J., Sun, Y., Kolkin, N.I., Weinberger, K.Q.: From word embeddings to document
distances. In: Proceedings of 32nd International Conference on Machine Learning, vol. 37,
pp. 957–966 (2015)
19. Hawking, D.: Overview of the TREC-9 web track. In: NIST Special Publication 500-249:
The Ninth Text REtrieval Conference (TREC-9), pp. 87–102 (2001)
20. Manning, C.D., Raghavan, P.: An introduction to information retrieval (2009). http://dspace.
cusat.ac.in/dspace/handle/123456789/2538
Identifying Key Elements of Search Results
for Document Selection in the Digital Age:
An Observational Study
Abstract. Academic database systems are vitally important tools for enabling
researchers to find relevant, useful articles. Identifying how researchers select
documents from search results is an extremely useful measure for improving the
functions or interfaces of academic retrieval systems. This study aims to reveal
which elements are checked, and in what order, when researchers select from
among search results. It consists of two steps: an observational study of search
sessions performed by researchers who volunteered, and a questionnaire to
confirm whether extracted elements and patterns are used. This article reports
findings from the observational study and introduces questions we developed
based on the study. In the observational study we obtained data on nine par-
ticipants who were asked to search for documents using information retrieval
systems. The search sessions were recorded using a voice recorder and by
capturing screen images. The participants were also asked to state which ele-
ments they checked in selecting documents, along with the reasons for their
selections. Three patterns of order of checking were found. In pattern 1, seven
researchers used titles and abstracts as the primary elements. In pattern 2, the
others used titles and then accessed the full text before making a decision on
their selection. In pattern 3, one participant searched for images and accessed the
full text from the link in those pictures. We also found participants used novel
elements for selecting. We subsequently developed items for a questionnaire
reflecting the findings.
1 Introduction
Search engines are tools widely used by the public. Yet while academic databases
remain an important information resource for researchers [1], the spread of search
engines may affect personal preferences for search interfaces. Pajić [2] evaluated a
2 Related Research
Three similar published studies exist on clarifying which elements are used for doc-
ument selection. Wang et al. [3] proposed six document selection components: doc-
ument information elements, user criteria, document values, personal knowledge,
decision rules, and decision. Macedo-Rouet et al. [4] observed researchers’ search
strategies when using the PubMed database. That study showed that 84% of the
participants checked the abstracts after reviewing the titles. Xie et al. [5] conducted an
empirical study and likewise found that most participants read the abstract prior to
selecting a document. Nicholas et al. [6] used a questionnaire and found that abstracts
were used in the present-day environment wherein users have complete full-text access.
These results indicate that reading the title and abstract is a common step in the
process of document selection. Such studies have typically used observation and/or
questionnaires in their methodology; therefore, we opted to use both.
3 Methodology
databases of preference. After performing their searches, they selected the documents
they wished to read in full based on the search results. During the search session, the
participants were asked to describe which elements they were checking in the search
results, which sentences they noted in abstracts, and their reasons for selecting par-
ticular documents. Audio of these statements was recorded, and a number of screen
images were captured. During the session and in the follow-up interviews, we occa-
sionally asked the participants to confirm which elements and sentences they checked.
The study was approved by the research ethics committee of the Department of
Informatics, Graduate School of Information Science and Electrical Engineering,
Kyushu University.
In the search results, we analyzed which elements were checked and in what order to
select documents. Two main patterns of order of checking elements were identified in
the nine participants, as shown in two representative samples in Fig. 1. Though the
participants conducted multiple searches, we summarized them to focus on elements
240 Y. Hagiwara et al.
be noted that researchers in this field are generally inclined to seek fundamental ref-
erences. Furthermore, participant C mentioned he sought to source a few of the fun-
damental references because he had recently started his research and his knowledge of
the topic was still quite limited. Participant G searched for documents regarding
devices for experiments on his research topic. He mentioned he would determine which
documents he wished to read by checking the figures showing each paper’s device
tests. Based on this approach, it appears the research topic, research phase, and par-
ticular aspects of the research field may affect some researchers’ decisions on which
elements are checked.
Participant F followed a process distinct from those of the other participants and
therefore cannot be categorized into pattern 1 or 2. He first searched for images of a
device, specifically for chemical storage, on Google Images. He then clicked on images
of devices that seemed related to his research. After that, he accessed the full texts or
abstracts available on the publishers’ websites linked to the pictures presented in the
Google Images results. Finally, he made his selection by reading the abstract or full
text. He explained that by viewing images of the devices discussed in the documents he
could judge the documents’ relevance. Additionally, he noted his weekly search habits
for gathering information in his research field were browsing contents of specific
journals and glancing at search results using predetermined keywords.
5 Discussion
This observational study showed that titles and abstracts are highly important and
among the primary elements in decision making when selecting relevant research
documents. We also found some participants examined references, figures, and/or
images from full-text versions. Decisions were also based on the online availability of
the full text. From our analysis of participants’ statements and from related research on
document triage [9, 10], we can assume phases of research, i.e. topic familiarization
and objectives of browsing, influence which elements are checked. However, nine
participants constitute a small sample, which may restrict generalizability. We plan to
conduct a questionnaire to confirm whether these findings on information retrieval are
general or idiosyncratic to these participants. We have developed two
items for questionnaire inclusion in this regard.
(1) Which elements—including new elements such as full text, images, and online
availability of full text—are checked when selecting documents from among search
results? To confirm the relation between research phases and elements, we will ask
respondents to select elements by using three objectives: (a) find previous research in
an unfamiliar field, (b) learn current research trends in their fields, and (c) identify
methods for adapting to their research problems. Respondents who select full text in the
above question 1 will also be asked which part of the text they check (e.g., intro-
duction, references, tables, figures, formulas).
(2) What types of patterns of checked elements and order are found? We will show
orders of elements from the study results and ask respondents to select the closest ones
that they normally employ. Options for example orders contain traditional patterns
(e.g., title > abstract, title only), patterns including new elements found in the study
242 Y. Hagiwara et al.
(e.g., title > full text, image search > full text, title > abstract > online availability of
the full text), and unique patterns (e.g., document type > title > abstract, checking
language only) to confirm generality.
We hope to fully uncover the entire process of researchers’ information-seeking
behavior. In the questionnaire items, we will also address areas such as which databases
are used and from which forms of media documents are accessed. We plan to carry out
such a questionnaire among researchers and clarify the current status of document
selection using the most contemporary information retrieval system environments.
References
1. Liyana, S., Noorhidawati, A.: How graduate students seek for information: convenience or
guaranteed result? Malays. J. Libr. Inf. Sci. 19(2), 1–15 (2014)
2. Pajić, D.: Browse to search, visualize to explore: who needs an alternative information
retrieving model? Comput. Hum. Behav. 39, 145–153 (2014)
3. Wang, P., Soergel, D.: A cognitive model of document use during a research project.
Study I. Document selection. J. Am. Soc. Inf. Sci. 49(2), 115–133 (1998)
4. Macedo-Rouet, M., Rouet, J.-F., Ros, C., et al.: How do scientists select articles in the
PubMed database? An empirical study of criteria and strategies. Eur. Rev. Appl. Psychol. 62
(2), 63–72 (2012)
5. Xie, I., Benoit III, E.: Search result list evaluation versus document evaluation: similarities
and differences. J. Documentation 69(1), 49–80 (2013)
6. Nicholas, D., Huntington, P., Jamali, H.R.: The use, users, and role of abstracts in the digital
scholarly environment. J. Acad. Librarianship 33(4), 446–453 (2007)
7. Hagiwara, Y., Wu, M., Mizutani, E., et al.: An experiment to identify how researchers select
documents from search results. In: Proceedings of CiSAP Workshop 2015, pp. 8–11 (2015)
8. Hagiwara, Y., Ishita, E., Mizutani, E., et al.: A preliminary study and analysis to identify key
elements in document selection. In: Information Seeking in Context, ISIC 2016, Zadar
(2016)
9. Bae, S., Marshall, C.C., Meintanis, K., et al.: Patterns of reading and organizing information
in document triage. In: ASIST Proceedings, vol. 43, no. 1, pp. 1–27 (2007)
10. Stelmaszewska, H., Blandford, A.: From physical to digital: a case study of computer
scientists’ behavior in physical libraries. Int. J. Digit. Libr. 4(2), 82–92 (2004)
Social Media
Information Seeking Behaviour of Aspiring
Undergraduates on Social Media:
Who Are They Interacting with?
1 Introduction
Information behaviour and HCI research have for decades focused on understanding how people
access, interact with, use and share information in different contexts, especially
in the digital world. In addition to the traditional channels of scholarly communication,
new platforms and media of communication have emerged in the recent past,
which on the one hand have immensely enlarged the scope and opportunities for information
creation, access and sharing but, on the other, have opened new challenges and
opportunities for research to understand the changing or emerging information behaviour
of people. Consequently, numerous research projects and activities have been
undertaken over the past few years that have aimed to understand how people behave –
communicate, access and share information – on the web and social media (e.g. [1]).
The research reported in this paper is part of an ongoing project that aims to understand
2.1 Context
The period of transition that prospective students go through as they leave
school/college requires them to make potentially life-changing decisions, repeatedly
(detailed below), and in a time sensitive environment. In 2013 Ofsted, the British
educational standards inspector and regulator, concluded that only 20% of learners
aged 17 to 18 were receiving adequate levels of careers advice/support. However,
contrary to what might then be suspected, the numbers of aspiring undergraduates
applying for, and ultimately attending, university have not dropped [6]. This raises
intriguing questions. If the prospective undergraduates of the future are not getting their
information through traditional in-house channels, how are they navigating this key
period of progression? Perhaps more and more students are using social media to
acquire and share the relevant information? These questions triggered a PhD
research project, part of which is reported in this paper.
experiences of others [7]. In a wider sense it also provides multiple channels for
interpersonal feedback, peer acceptance and reinforcement of group norms [8].
Conversely, the fact that people are central to many interactions on Twitter and
behave like hubs that join up information [9] is potentially a double-edged sword.
Critically, whilst most tweets are truthful, they can also carry rumours and misinformation,
albeit often unintentionally [10]. However, there is a risk in millennials
adopting a default position of trust as it takes more effort to be proactively critical than
trusting [11]. A study conducted by Flanagin and Metzger [12] also found that people
rarely verified web-based information and considered it to be as credible as television,
radio and magazines.
The nature of Twitter arguably facilitates some types and/or topics of conversation
better than others. Users tend to communicate with like-minded people and are quicker
to rebroadcast rather than address information and/or enter into a debate [13]. In
addition, students have been found to be reluctant to engage with educational organisations
via social media, as the two are seen as belonging to different worlds: work and
education versus leisure and play [3]. However, there is an opportunity here. Lovejoy
and Saxton [14] demonstrated Twitter’s apt capacity for stakeholder engagement and
showed it to be more effective than mass communication and information that is
already available on websites. Ultimately, little is known to date about how
prospective undergraduates are making use of online social resources, whether this is
beneficial and, critically, with whom they are engaging. Prior work has demonstrated
that these information ‘hubs’ [16] not only exist but are a critical component of online
information behaviour, and so we seek to investigate this.
3 Methodology
their application in Period 1, because of their not meeting the condition in terms of
exam grades, may check and seek admission to other universities if places are still
available. Up to this point university offers are typically conditional, so the grades
received at this stage will affect the options available. Depending on the outcome of
their results, the prospective undergraduates must then decide, based on the offers
available, which they wish to pursue (if any).
Period 3 - After. From the beginning of September until the end of December. The
last data collection stage covers enrolment at university, their first week (known as
freshers’ week in the UK) and their first semester.
We located and captured relevant posts on Twitter in the following manner:
1. We started with the specific term “UCAS”, which is uniquely specific to those in the
UK (all university applications go through UCAS’s online system). Stemmed
variants of this were then also captured (e.g. #UCAS).
2. Queries were expanded to capture terms such as application or applying and uni-
versity that might suggest someone was considering or talking about university
applications.
3. Query results were sampled and checked in order to locate other words, hashtags
(terms preceded by hash signs, which indicate the subject of a tweet and can therefore be a
useful tool for identifying relevant content) or phrases that might also be relevant.
It was found to be prudent to conduct manual checks of the results and alter or
remove queries which were obviously not relevant. For example, terms such as university,
which is used in many countries, required a geographic filter. Similar care had to
be taken with abbreviations such as uni, as this is also a type of sushi.
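As a rough illustration of the filtering described above, the following Python sketch accepts tweets that mention UCAS outright, and accepts tweets containing only ambiguous terms (e.g. university, uni) when a location string suggests the UK. The function name, the regular expressions and the location list are illustrative assumptions; they are not the queries actually used in the study, and a crude filter of this kind would still need the manual checks mentioned above.

```python
import re

# Unambiguous term from the text: "UCAS", optionally written as #UCAS or @ucas.
CORE_TERM = re.compile(r"(?<!\w)[#@]?ucas\b", re.IGNORECASE)

# Ambiguous terms from the text that need extra context before acceptance.
AMBIGUOUS_TERMS = re.compile(
    r"\b(university|uni|application|applying)\b", re.IGNORECASE
)

# Assumed, deliberately short list of UK location markers (hypothetical).
UK_LOCATION = re.compile(
    r"\b(uk|united kingdom|england|scotland)\b", re.IGNORECASE
)


def is_relevant(text, user_location=""):
    """Return True if a tweet looks relevant to UK university applications."""
    if CORE_TERM.search(text):
        return True  # UCAS is specific to the UK, so no geographic filter
    if AMBIGUOUS_TERMS.search(text):
        # Ambiguous terms are accepted only with a UK location marker.
        return bool(UK_LOCATION.search(user_location))
    return False
```

For example, `is_relevant("applying to university next year", "Leeds, UK")` is accepted, whereas the same text with the location "Milwaukee, USA" is rejected.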
In total, 494,180 tweets were retrieved across all three periods of progression.
The figures, broken down by period, are shown in Table 1.
4 Classification of Stakeholders
2. The evidence itself was used to identify stakeholder terms. Term frequency was
used to identify agents. The cut-off point for identifying terms was set at 1,000
references per data collection period, below which the stakeholder in question is
referred to in less than 1% of the tweets during that time.
Whilst naturally some overlap occurred by employing both of the methods detailed
above, the approach proved to be prudent. As the evidence goes on to demonstrate,
some key stakeholders (e.g. the National Careers Council) were not present or referenced
at all; however, this absence is in itself interesting and might otherwise have been
missed if we had relied solely on term frequency to locate and identify agents.
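The term-frequency step with its cut-off can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually implemented in the study: the tokenisation rule, the function name and the decision to lower-case tokens are hypothetical. Only the 1,000-reference cut-off and the idea of keeping prefixed forms (e.g. UCAS versus @ucas) distinct come from the text.

```python
import re
from collections import Counter

# Assumed tokenisation: a word with an optional leading # or @, so that
# "ucas", "#ucas" and "@ucas" are counted as three separate terms.
TOKEN = re.compile(r"[#@]?\w+")


def stakeholder_counts(tweets, candidate_terms, cutoff=1000):
    """Count references to candidate terms and keep those at or above cutoff."""
    wanted = {term.lower() for term in candidate_terms}
    counts = Counter()
    for tweet in tweets:
        for token in TOKEN.findall(tweet.lower()):
            if token in wanted:
                counts[token] += 1
    # Terms that fall below the cut-off are dropped from the result.
    return {term: n for term, n in counts.items() if n >= cutoff}
```

Terms absent from the result are not necessarily uninteresting; as noted above, a stakeholder that is never referenced can only be detected if it was placed on the candidate list beforehand.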
Table 2. Stakeholders with the highest number of references during the application process
Stakeholders
Universities – 11,745 references UCAS – 9,897 references
@ucas – 9,462 references #ucas – 6,336 references
Students – 4,932 references Colleges – 4,257 references
Schools – 4,843 references @gapyear – 3,054 references
People – 2,466 references Families – 2,298 references
Table 3. Stakeholders with the highest number of references during results day/clearing
Stakeholders
Universities – 19,109 references Students – 16,073 references
Freshers – 1,773 references Everyone – 9,997 references
UCAS – 9,972 references @ucas – 9,986 references
Colleges – 7,479 references Schools – 7,768 references
#university – 5,479 references People – 4,176 references
Table 4. Stakeholders with the highest number of references during students’ first semester
Stakeholders
Universities – 19,609 references Students – 17,471 references
Freshers – 14,101 references #freshers – 9,983 references
Instagram – 7,251 references @freshers – 6,964 references
freshershome – 6,613 references Colleges – 6,412 references
dlvr – 5,722 references neuvoo – 5,973 references
We have initially considered only the ten most prominent actors in each period
here, as the total number of stakeholders recorded in each period is considered
separately below (Fig. 4).
The stakeholders identified here (Tables 2, 3 and 4) reflect the environments and
provide insight into the key online actors present on Twitter during each data collection
period. Stakeholder terms (e.g. UCAS) have been kept verbatim and not grouped
together as stakeholder tokens here, as the differences (e.g. UCAS and @ucas) can
indicate, for example, whether actors are being talked about or talked to. The spelling
here (e.g. singular versus plural) is also indicative of the nature of the references; for
example, ‘universities’ rather than ‘university’ reflects a more casual reference to Higher
Education institutions as a group rather than a specific reference to a particular organisation.
There are subtle shifts between the most prominent stakeholders at each stage; for
example, four of the top ten terms during students’ first semester refer to
peers/other students. This reflects a shift to a more social, peer-oriented information
environment, where references to family and institutions such as UCAS and schools
have all but disappeared.
Fig. 4. Total number of different stakeholder tokens during each data collection period
The total number of different stakeholder tokens increased notably between each
period as Fig. 4 illustrates. There were a total of 34 stakeholders identified during the
application process compared with 59 during the exam results/start of clearing, which
252 L. Dodd et al.
Fig. 5. Stakeholders present (shown by number of tweets; y axis) during all three stages
As Fig. 5 shows, there were only five stakeholder tokens present during all three
periods of progression. Of these, references to universities and students increased in frequency (as
shown on the y axis in Fig. 5), whilst references to colleges and schools decreased.
There were no overlaps (for exceptions see Figs. 5 and 6), which, given the number
of stakeholders identified, shows just how much the online social environment had
changed from beginning to end. Given the limited overlap, it is worth highlighting the
stakeholders, or patterns of stakeholders, that are only present in conversations during
certain stages of a student’s progression. In particular:
• References to families only occur to any significant degree (more than 1%) during the
application phase.
• The nature of the commercial individual users that were prevalent during each stage
changed and was specific to the decisions being made at that point. For example, in
order, relative to each data collection period: @gapyear, @alevelresults, and
@jobsplane.
Table 5. Example of stakeholders during the application process with few or no references
Tokens with the fewest references
National Careers Council – 0 references Jobcenter – 0 references
Children’s Trusts – references Ofsted – 1 reference
Local Authorities – 1 reference National Careers Service – 5 references
Department of Education – 6 references Careers Advisers – 80 references
In some cases, for example for the tokens ‘brothers’ and ‘sisters’ (189 and 89
references respectively), low figures are unsurprising. Even if an actor is actively
communicating with their sibling online, it is unlikely that they will
use a term to clearly identify their relationship every time. However, several official
organisations that have been identified as being key sources of support are not present
to any significant degree.
5 Conclusion
The research reported here is novel in that it employs an atypical approach to provide
new knowledge and insight into the information behaviour of young people in the
specific context of the university admission process. In response to the research questions
originally posed, the key findings can be summarised as follows.
In regard to ‘typical’ patterns of communication and considering how these change,
we can see that contextual factors such as time factor considerably in patterns in the
volume of communication during each period of progression for aspiring/new undergraduates.
As such, each period of progression is unique and accurately reflects patterns
and events as they happen. When the aspiring undergraduates, the study population, are
at school, the highest level of communication takes place during the week (Fig. 1).
However, this pattern does not show up in Period 3, when the subjects are at
university; moreover, the volume of communication surges within the first few
weeks and then drops (Fig. 3). Of course, the volume of communication is very high
during the clearing week (Fig. 2). These findings show a clear relationship between
people’s lifestyle and the nature of their communications on Twitter in reference to a
particular subject, in this case university admission.
If we consider the second research question, which seeks to identify the importance
and roles of stakeholders, we can see that the data suggest distinctly different online
environments during each stage of progression. For the first time, this study identifies
the key stakeholders referenced in Twitter communications by aspiring undergraduates
(Tables 2, 3 and 4). Indeed, we can see that, as prospective students progress, more
actors join the conversation and the environment becomes increasingly diverse.
Comparatively, very few stakeholders are actively present during all three stages of
progression. Most stakeholders are active for only one, possibly two, periods of the
progression.
Despite students’ known reluctance to engage with educational institutions online
[3], three of the five stakeholder tokens that were continually being referenced during
all three datasets were universities, schools and colleges. Of course, there is nothing to
suggest here that users were talking to these institutions, merely that they were being
referenced. It would therefore make an interesting line of investigation going forward
to consider a deeper form of discourse analysis that might address why Twitter is such
a suitable medium for users to talk about institutions rather than directly to them. As a
reflection of this, and as a wider consideration, UCAS would appear to have had some
success in breaking this convention and stands in stark contrast to other central agencies
that were referenced little, if at all.
There are wider lessons to be learnt here, not least because the methodology could be
easily adapted and employed in other contexts and would also facilitate additional
qualitative lines of investigation (e.g. sampling). These findings may prove insightful
for wider audiences, given that they not only identify positive exchanges of communication
(e.g. UCAS) but can also identify information black holes, where there are
notable absences of key information providers.
References
1. Wakefield, R., Wakefield, K.: Social media network behavior: a study of user passion and
affect. J. Strateg. Inf. Syst. 25(2), 140–156 (2016)
2. Department for Education: Youth Matters (2005). http://webarchive.nationalarchives.gov.uk/
20130401151715/, https://www.education.gov.uk/publications/standard/publicationDetail/
Page1/Cm6629#downloadableparts. Accessed 15 June 2015
3. Jones, M., Harvey, M.: Library 2.0: the effectiveness of social media as a marketing tool for
libraries in educational institutions. J. Librarianship Inf. Sci. (JOLIS) (2016). doi:10.1177/
0961000616668959. Sage
4. Gil de Zúñiga, H., Jung, N., Valenzuela, S.: Social media use for news and individuals’
social capital, civic engagement and political participation. J. Comput. Mediated Commun.
17(3), 319–336 (2012)
5. Macskassy, S.A.: On the study of social interactions in twitter. In: ICWSM (2012)
6. UCAS: Four per cent rise in UK and EU students starting university and college courses (2014).
http://www.ucas.com/news-events/news/2014/four-cent-rise-uk-and-eu-students-starting-
university-and-college-courses. Accessed 21 Oct 2014
7. Hurlock, J., Wilson, M.L.: Searching Twitter: separating the tweet from the chaff. In:
ICWSM, pp. 161–168 (2011)
8. Papacharissi, Z. (ed.): A Networked Self: Identity, Community, and Culture on Social
Network Sites. Routledge, London (2010)
9. Elsweiler, D., Harvey, M.: Engaging and maintaining a sense of being informed:
understanding the tasks motivating twitter search. J. Assoc. Inf. Sci. Technol. 66(2),
264–281 (2015)
10. Castillo, C., Mendoza, M., Poblete, B.: Information credibility on twitter. In: Proceedings of
the 20th International Conference on World Wide Web, pp. 675–684. ACM, March 2011
11. Lewandowsky, S., et al.: Misinformation and its correction, continued influence and
successful debiasing. Psychol. Sci. Public Interest 13(3), 106–131 (2012)
12. Flanagin, A.J., Metzger, M.J.: Perceptions of internet information credibility. J. Mass
Commun. Q. 77(3), 515–540 (2000)
13. Smith, L.M., Zhu, L., Lerman, K., Kozareva, Z.: The role of social media in the discussion
of controversial topics. In: 2013 International Conference on Social Computing
(SocialCom), pp. 236–243. IEEE (2013)
14. Lovejoy, K., Saxton, G.D.: Information, community, and action: how nonprofit organiza-
tions use social media. J. Comput. Mediated Commun. 17(3), 337–353 (2012)
15. Ofsted: Going in the right direction? Careers guidance in schools from September 2012
(2013). http://www.ofsted.gov.uk/resources/going-right-direction-careers-guidance-schools-
september-2012. Accessed 9 Nov 2014
16. Khoo, C.: Issues in information behaviour on social media. Libres 24(2), 75–96 (2014)
An Analysis of Rumor and Counter-Rumor
Messages in Social Media
Abstract. Social media platforms are one of the fastest ways to disseminate
information but they have also been used as a means to spread rumors. If left
unchecked, rumors have serious consequences. Counter-rumors, messages used
to refute rumors, are an important means of rumor curtailment. The objective of
this paper is to examine the types of rumor and counter-rumor messages gen-
erated in Twitter in response to the falsely reported death of a politician, Lee
Kuan Yew, who was Singapore’s first Prime Minister. Our content analysis of
4321 Twitter tweets about Lee’s death revealed six categories of rumor messages,
four categories of counter-rumor messages and two categories belonging
to neither type. Interestingly, there were more counter-rumor messages than
rumor messages. Our results thus suggest that, at least in the context of our
study, online users do make an attempt to stop the spread of false rumors
through counter-rumors.
1 Introduction
Social media platforms such as Twitter are one of the fastest ways to disseminate
information. Unfortunately, they have also been used as a means to spread rumors and
other forms of misinformation. For example, following the June 2017 terrorist attacks
in London, rumors began circulating online that London mayor Sadiq Khan defended
September 11th terrorists. Such a claim was of course false, and originated from an
unrelated video of the mayor. In Asia, rumors swirled in social media that the ill-fated
Malaysia Airlines flight MH370 from Kuala Lumpur to Beijing actually made a safe
emergency landing somewhere in China, bringing false hope to families and loved ones.
Online rumors, if left unchecked, can have serious consequences, especially if they turn
out to be false. They may negatively impact social media platforms in terms of dis-
seminating accurate information. They may damage the reputations of individuals and
organizations. Finally, they may harm social cohesion. Rumor correction is hence of
utmost importance to control the negative effects from the spread of misinformation.
One way to do this is through counter-rumors. In this paper, counter-rumors refer to
messages used to refute rumors and spread the truth. Prior work suggests that
counter-rumors are effective in combating rumors on the Internet [1]. This is because
exposure to such messages reduces people’s belief in the rumor in question, hence
lowering their propensity to share that rumor [2].
Traditionally, rumors have been tackled by governments, affected organizations
and mainstream news media [3]. However, on social media, the community of users
plays this role as well, although results have been mixed. On the one hand, some work
has suggested that online communities are capable of self-correction and self-policing
when presented with dubious information [4, 5], and that counter-rumors may be
effective [2]. On the other hand, some research suggests that counter-rumors could
reinforce misperceptions [6, 7].
One gap that motivates the current research is the relative lack of attention paid to
the content generated by the online community in response to a rumor. For example,
what types of messages does the community spread in a rumor situation? Importantly,
what types of counter-rumor messages does the community create in response? Such
questions are not addressed in existing work. We argue that understanding the nature of
such messages created by online communities would translate into useful insights that
will not only advance research but also benefit individuals and organizations in
rebutting rumors.
Hence, the objective of this study is to examine the types of rumor and
counter-rumor messages generated in Twitter in response to the falsely reported death
of a politician, Lee Kuan Yew, who was Singapore’s first Prime Minister. The rest of
the paper is organized as follows. Literature on rumor and rumor correction is
reviewed. Data collection and analysis methods are next described, and the types of
messages created are then presented. Thereafter the findings are discussed, together
with implications of the work.
2 Related Work
life of their own and may spread uncontrollably. Thus, deliberate correction mecha-
nisms, also known as counter-rumors, may be required [10, 11]. Rumors often carry
some truth and counter-rumors confirming that part of the rumor that is true may be
sufficient to neutralize its impact. Denial is a popular counter-rumor used to refute
rumors [12] but its effectiveness has been questioned [13]. Other rumor coping tactics
include providing the information that is in demand and enhancing trust and credibility
by engaging in public relations [14, 15].
The increased use of social media and other online platforms to share information
means that as an unfortunate side-effect, people have also used them to spread rumors
and other forms of misinformation. This phenomenon has correspondingly attracted
research attention. One stream of work deals with identifying rumors in online mes-
sages. Here, [16] developed and compared classifiers to predict whether images on
Twitter about Hurricane Sandy were real or doctored. In so doing, they demonstrated
that machine learning techniques could be used to identify fake images that may fuel
rumors. Likewise, [17] investigated factors in online social networks that influenced
judgments of information credibility. Using these results, they developed an automated
method to identify and rank credible information sources and users for any given topic.
Another stream of work concerns the effectiveness of counter-rumors to curtail the
dissemination of rumors. For example, [18] examined the effect of exposure to
counter-rumors on people’s decision to spread rumors in social media. They found that
when people were exposed to counter-rumors before rumors, they were more likely to
stop the spread of rumors than when the converse was true. Next, [2] showed that
appropriate message design could reduce the spread of health-related rumors on social
media. This included the use of warnings that the content has appeared in rumor
websites and presenting counter-rumors generated by other users and sources.
While such research advances knowledge, one gap is the relative lack of
work analyzing the actual content of rumors and counter-rumors. We argue that
understanding the nature of such content would lead to better ways of curtailing the
spread of false information.
3 Methodology
3.1 Background: The Rumored Death of Lee Kuan Yew
The death of an important political leader can significantly impact a country’s social
fabric and its economy. Unsurprisingly, there have been many instances where false
rumors of the deaths of leaders have spread quickly, including those of Barack Obama and Kim
Jong-Un. If left uncorrected, such rumors may have negative effects.
In this paper, we study the rumored death of Singapore’s first Prime Minister, Lee
Kuan Yew. In February 2015, Lee was admitted to Singapore General Hospital for
treatment for severe pneumonia. Rumors of his passing began circulating on social
media as his condition worsened. Things came to a head on 18 March 2015, when a
doctored screen capture of an official announcement of his death, purportedly issued
from the Prime Minister’s Office (PMO), went viral on social media. The fake
announcement stated that Lee, aged 91, had passed away at the Singapore General
Hospital at 5.30 pm that day. As the screen capture resembled official press releases
from the PMO, it misled many, including the foreign news media, who prematurely
reported Lee’s passing. Soon after this incident, the PMO responded that the press
release was fake. Subsequent police investigations revealed the creator of the doctored
screen capture to be a 16-year-old student.
The rationale for selecting the four days is as follows: As news of Lee’s worsening
health condition was publicized in the news, people began sharing their concerns,
well-wishes, and rumors on Twitter. This online expression reached its peak on 18
March 2015 [20]. On that same day, the fake announcement of Lee’s death was
released at 2000 h, which led to further spikes in tweets. Soon, a local news channel
(ChannelNewsAsia) announced that it had verified that the image was fake and
debunked the rumor. Other correction tweets sent out by the local newspaper (The
Straits Times) were retweeted widely too. The rumor messages began
subsiding around 2300 h on the same day, and eventually tailed off a few days after.
We hence selected 18 March to collect the tweets, as well as 17 March and 19–20
March, which were the days before and after the main rumor event respectively.
recorded in a codebook where they were fully explained to coders. The final set of
categories and their definitions are presented in Table 2.
In the present study, three coders were independently involved in the content
analysis procedure, and the final intercoder reliability using Cohen’s kappa was found
to be 0.96. This value is above the recommended average [21].
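For readers unfamiliar with the statistic, the following Python sketch computes Cohen's kappa for one pair of coders, using kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. Cohen's kappa is defined for two coders; how the agreement among the three coders was combined is not stated here, so this sketch makes no assumption about it. The function name and the toy labels are illustrative only, although the label names reuse categories from this study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: proportion of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each coder's marginal label proportions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy example with made-up codings (not data from the study).
coder_1 = ["Belief", "Belief", "Refutation", "Refutation"]
coder_2 = ["Belief", "Belief", "Refutation", "Belief"]
kappa = cohens_kappa(coder_1, coder_2)
```

In the toy example the coders agree on three of four items (p_o = 0.75) while chance agreement is 0.5, so kappa is 0.5, illustrating that kappa is lower than raw agreement.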
4 Results
Table 2 divides the categories uncovered into three groups: those that fuelled the
rumor, those that attempted to counter the rumor, and those that did not belong to the
former two. In addition, Table 3 shows the distribution of categories within the rumor
group while Table 4 shows the distribution for the counter-rumor group. A description
of these categories is presented in the following paragraphs, together with excerpts
from relevant tweets.
Among the rumor-oriented categories, it was unsurprising that
the largest number of tweets belonged to the Belief category. This comprised 20.1% of
all tweets in our analyzed dataset as well as 63% among all rumor tweets. Essentially,
these tweets indicated the person’s belief that the rumor was true, that indeed, Lee
Kuan Yew had passed away. It would appear therefore that those who generated such
tweets believed that the doctored image was from the PMO. These tweets contained
prayers, well-wishes or hope for Lee. Examples of tweets include “Wishes from [name
removed] (-: #LeeKuanYew”, “praying really hard for #LeeKuanYew am really worried.
hear that his condition has worsened”, “Our thoughts go out to #LeeKuanYew
and his family. #LKY. #GetWellSoonMrLee”, and “May you RIP, and you will be
missed. #LeeKuanYew”.
The next two largest categories in this group were Providing Information (5.9% of
all tweets; 15.2% of rumor tweets) and Personal Involvement (4.8%; 15.2%). The
former refers to tweets that include information relevant to or in support of the rumor.
Here, the majority of tweets quoted from various sources including traditional media
outlets and non-traditional ones such as blogs and other online platforms. In particular,
to support the notion that Lee had passed away, the tweets focused on verified infor-
mation that he had been ill preceding the death announcement. Examples include a
retweet from another user “MM Lee’s condition has deteriorated further” and a retweet
from a news source “Former prime minister #LeeKuanYew is critically ill, condition has
deteriorated”. The Personal Involvement category refers to tweets that describe the
person’s involvement with the rumor. Unlike Providing Information, this category
contained information from an individual’s perspective, leading to a more personal
touch. For example, a user tweeted a photo of people keeping vigil at the hospital
(Singapore General Hospital - SGH) where Lee was, “The surreal scene at SGH
tonight. Eating. Drinking. Waiting. Repeat. #LeeKuanYew [link removed]”.
The remaining categories in this group of rumor oriented statements were small in
number, with each comprising about 1% or less of the entire analyzed dataset:
• Apprehensive tweets (1.2%; 3.9%) expressed a range of negative emotions such as
fear, dread and anxiety over the death of Lee. In particular, concerns were about the
future of Singapore, as Lee had been instrumental in building the country (“without
him, I’m scared for our future”).
• Prudent tweets (0.5%; 1.5%) were those that expressed caution while providing
information related to the rumor. This sense of hesitancy was probably appropriate
given the momentous event in the country’s history. For example, a user reported
an announcement about Lee’s death, purportedly from the PMO, but was
unsure whether it was genuine: “There is a photo being circulated on the PMO website
about #LKY. Until I see it up on the site, I’m unable to verify if photo is real”.
In terms of counter-rumor oriented statements, the largest category was
Refutation, which was also the largest among all our uncovered categories at
23.3% of the dataset as well as 49.5% of all counter-rumor tweets. Essentially, these
tweets attempted to debunk the rumor of Lee’s death by providing various forms of
evidence, such as retweeting content from various traditional and new media sources.
Examples include “RT @STcom: PMO lodging police report about fake website
announcing death of Mr Lee Kuan Yew [link removed] #LeeKuanYew” and “#Lee-
KuanYew is dead according to this #PMO website screengrab sent to Redwire. Hoax?
Yes says the PMO. Cops notified. [link removed]”. Closely related to Refutation was
the Disbelief category which comprised tweets expressing skepticism about the rumor.
This was the second largest counter-rumor category at 14.2% of the entire dataset and
30% of counter-rumors. However, unlike the former category, the tweets here did not
provide evidence from other sources but were more personal in terms of expression.
One example would be: “1. LKY is not dead yet. 2. Stop saying he is dead. 3. If you
have nothing better to say about him, don’t say. #LeeKuanYew”.
Next, the Guide category (6.2%; 13.1%) referred to tweets which provided
instructions or advice to others about refuting the rumor of Lee’s death. Put differently,
such tweets went beyond providing evidence of the false rumor and included a call to
action for stopping its dissemination. An example of this category is a plea from a user
“Kindly do not spread rumours about Mr #LeeKuanYew. The image that is spreading
is edited from that of Mrs #LKY. [link removed]” while another tweeted “He’s a
person. The media does not pronounce him dead, a doctor does. Until then, stop
jumping the gun. #LKY”.
The Sarcastic category (3.2%; 6.9%) contained tweets that ridiculed other users and
tweets that supported the rumor of Lee’s death. Perhaps users were frustrated or
concerned about the spread of the false rumor and poured scorn on those that believed
it. Examples include “Fail. @[name removed] falls for a hoax. #LeeKuanYew” and
“This is how rumors get around. Blind leading the blind. Ugh.”. Finally, Interrogatory
tweets (0.2%; 0.4%) were questions seeking more information about the rumor.
A typical example included “Serious, did #LeeKuanYew die?” Given the uncertainty
surrounding Lee’s death, the number of questions asked was surprisingly small.
Two further categories uncovered during our analysis belonged to neither the rumor
nor the counter-rumor group. First, the Appreciation category comprised tweets that
were thankful for Lee’s sacrifices and contributions
towards nation-building such as “Thankful for Mr #LeeKuanYew. Some people devote
a specific period to doing something, this man devoted his life” and even a simple
hashtag “#ThankYouLKY”. It should be noted that these tweets neither supported nor
refuted the claim that Lee had passed away; rather, the rumor reminded users of his
work for the country.
Second, the Uncodable category (10.5%) consisted of tweets that were spam, not
meaningful, or not related to the rumor. Examples include a context-less “#LKY”,
punctuation/special characters or links to irrelevant websites.
5 Discussion
The primary objective of the present study was to uncover the types of content generated
by the online community arising from a rumor. We used the rumored death of a
Singapore politician, Lee Kuan Yew, as the context of our work and analyzed 4321
tweets harvested from Twitter. Our results yielded the following insights.
First, our analysis showed that there were more counter-rumor messages than rumor
messages. The former comprised 47.14% of the dataset while the latter totaled 31.7%.
This corroborates prior work showing that online communities have the potential to correct
misinformation [5] through counter-rumors. Our dataset indicates that as rumor ori-
ented messages started circulating on Twitter in response to the fake announcement of
Lee’s death, other users began posting tweets to stop the rumor. These counter-rumor
messages were predominantly of the Refutation category where evidence from local
news reports was quoted to dissuade those who wrongly believed in Lee’s death. At
the same time, users also posted tweets belonging to the Guide category, telling others
that the rumor was false and that they should not circulate such content further (e.g.
“What’s this fake news being circulated about Mr #LeeKuanYew passing away?
Pls DONT post anything unless you’re V V sure.”). There were also other users who
were frustrated with the rumor-mongering despite the evidence and resorted to posting
tweets in the Sarcastic category to insult those who perpetuated the rumor (e.g. “So
many dumb people that believe he’s dead. #LeeKuanYew”). In sum, the fact that there
were Twitter users who actively posted various types of messages to debunk and stop the
false rumor of Lee’s death bodes well for the use of social media to disseminate
counter-rumors.
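As a quick arithmetic check on the shares quoted in this section, a category's share within its group equals its share of the whole dataset divided by the group's share of the whole dataset. The sketch below applies this to the percentages reported in the text. The function name is illustrative, and small differences from the reported within-group figures are to be expected because the published percentages are rounded.

```python
def share_within_group(category_pct: float, group_pct: float) -> float:
    """Convert a share of the whole dataset into a share of one group (both in percent)."""
    return 100.0 * category_pct / group_pct

# Percentages of the whole dataset, as quoted in the text.
COUNTER_RUMOR_TOTAL = 47.14
RUMOR_TOTAL = 31.7

refutation = share_within_group(23.3, COUNTER_RUMOR_TOTAL)  # reported as 49.5%
disbelief = share_within_group(14.2, COUNTER_RUMOR_TOTAL)   # reported as 30%
belief = share_within_group(20.1, RUMOR_TOTAL)              # reported as 63%

print(f"Refutation: {refutation:.1f}% of counter-rumor tweets")
print(f"Disbelief:  {disbelief:.1f}% of counter-rumor tweets")
print(f"Belief:     {belief:.1f}% of rumor tweets")
```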
Next and on a related note, our study highlights the importance of source credibility
in the use of counter-rumors [25]. In particular, Twitter users who posted messages to
debunk Lee’s rumored death extensively retweeted from local news outlets such as the
Straits Times (newspaper) and ChannelNewsAsia (TV news channel), which are
considered authoritative and credible in the Singapore context. It would seem that by
doing so, the hope was that people’s perceptions could be shaped to achieve corrective
behavior, that is, the curtailment of the rumor. Ironically, it was the foreign news
outlets that wrongly believed in the fake announcement and prematurely reported Lee’s
demise. Unsurprisingly, a number of tweets belonging to the Sarcastic category were
directed at them (e.g. “Can’t believe [news outlet name removed] is so dumb not to
verify the source #Singapore #LKY”). This finding also suggests that online users are
able to distinguish between real and fake information even if the sources appear
credible.
Lastly, our analysis reveals an interesting observation that counter-rumor messages
were largely evidence-based while rumor messages were mostly personal opinions.
This is seen in Table 2 where Refutation was the biggest counter-rumor category, while
Belief was the biggest rumor category. As mentioned previously, Refutation messages
provided evidence (e.g. “RT @[name removed]: China’s CCTV official weibo apolo-
gises for unverified news update on #LeeKuanYew. [link removed]”) from credible
sources while Belief messages contained expressions that indicated that the rumor was
true without any evidence (e.g. “RIP…. You will be dearly missed. #LeeKuanYew”).
Put differently, counter-rumor messages were factually driven while rumor messages
were emotionally driven. This finding lends support to prior work [26, 27] showing that
emotions such as anxiety fuel rumor transmission, and extends such work by showing that
counter-rumor transmission is primarily evidence-based.
6 Conclusion
Acknowledgement. This work was supported by the Ministry of Education Research Grant
AcRF Tier 2 (MOE2014-T2-2-020).
References
1. Bordia, P., DiFonzo, N., Haines, R., Chaseling, E.: Rumors denials as persuasive messages:
effects of personal relevance, source, and message characteristics. J. Appl. Soc. Psychol. 35,
1301–1331 (2005)
2. Ozturk, P., Li, H., Sakamoto, Y.: Combating rumor spread on social media: the effectiveness
of refutation and warning. In: Proceedings of the Hawaii International Conference on System
Sciences, pp. 2406–2414. IEEE Press (2015)
3. Donovan, P.: How idle is idle talk? One hundred years of rumor research. Diogenes 54,
59–82 (2007)
4. Shklovski, I., Palen, L., Sutton, J.: Finding community through information and
communication technology in disaster response. In: Proceedings of the 2008 ACM
Conference on Computer Supported Cooperative Work, pp. 127–136. ACM Press (2008)
5. Starbird, K., Maddock, J., Orand, M., Achterman, P., Mason, R.M.: Rumors, false flags, and
digital vigilantes: misinformation on Twitter after the 2013 Boston Marathon Bombing. In:
Proceedings of iConference 2014 (2014)
6. Nyhan, B., Reifler, J.: When corrections fail: the persistence of political misperceptions.
Polit. Behav. 32, 303–330 (2010)
7. Schwarz, N., Sanna, L.J., Skurnik, I., Yoon, C.: Metacognitive experiences and the
intricacies of setting people straight: implications for debiasing and public information
campaigns. Adv. Exp. Soc. Psychol. 39, 127–161 (2007)
8. DiFonzo, N., Bordia, P.: Rumor Psychology: Social and Organizational Approaches.
American Psychological Association, Washington, DC (2007)
9. Oh, O., Agrawal, M., Rao, H.R.: Community intelligence and social media services: a rumor
theoretic analysis of tweets during social crises. MIS Q. 37, 407–426 (2013)
10. Bernard, S., Bouza, G., Piétrus, A.: An optimal control approach for E-rumor. Revista
Investigacion Operacional 36, 108–114 (2014)
11. Tripathy, R.M., Bagchi, A., Mehta, S.: A study of rumor control strategies on social
networks. In: Proceedings of the 19th ACM International Conference on Information and
Knowledge Management. ACM Press (2010)
12. Rosnow, R.L.: Communications as cultural science. J. Commun. 24, 26–38 (1974)
13. DiFonzo, N., Bordia, P., Rosnow, R.L.: Reining in rumors. Org. Dyn. 23, 47–62 (1994)
14. Kimmel, A.J.: Rumors and Rumor Control. Lawrence Erlbaum Associates, Mahwah (2004)
15. Kimmel, A.J., Audrain-Pontevia, A.: Analysis of commercial rumors from the perspective of
marketing managers: rumor prevalence, effects, and control tactics. J. Mark. Commun.
16, 239–253 (2010)
16. Gupta, A., Lamba, H., Kumaraguru, P., Joshi, A.: Faking sandy: characterizing and
identifying fake images on Twitter during hurricane sandy. In: Proceedings of the 22nd
International Conference on World Wide Web, pp. 729–736. ACM Press (2013)
17. Canini, K.R., Suh, B., Pirolli, P.L.: Finding credible information sources in social networks
based on content and social structure. In: 2011 IEEE Third International Conference on
Social Computing, pp. 1–8. IEEE Press (2011)
18. Tanaka, Y., Sakamoto, Y., Matsuka, T.: Toward a social-technological system that
inactivates false rumors through the critical thinking of crowds. In: Proceedings of the 46th
Hawaii International Conference on System Sciences, pp. 649–658. IEEE Press (2013)
19. Kwak, H., Lee, C., Park, H., Moon, S.: What is Twitter, a social network or a news media?
In: Proceedings of the 19th International Conference on World Wide Web. ACM Press
(2010)
20. Lin, Z., Pazos, R., Benites, Y.: 1923–2015 Lee Kuan Yew: How the Twittersphere reacted to
the news (2015). http://leekuanyew.straitstimes.com/ST/recap/index.html. Accessed 5 Aug
2015
21. Neuendorf, K.A.: The Content Analysis Guidebook. Sage Publications, Thousand Oaks
(2002)
22. Bordia, P., DiFonzo, N.: Problem solving in social interactions on the internet: rumor as
social cognition. Soc. Psychol. Q. 67, 33–49 (2004)
23. Pendleton, S.C.: Rumor research revisited and expanded. Lang. Commun. 18, 69–86 (1998)
24. Heit, E.: Properties of inductive reasoning. Psychon. Bull. Rev. 7, 569–592 (2000)
25. Oh, O., Kwon, K.H., Rao, H.R.: An exploration of social media in extreme events: rumour
theory and Twitter during the Haiti earthquake 2010. In: Proceedings of the International
Conference in Information Systems (2010)
26. Allport, F., Lepkin, M.: Wartime rumors of waste and special privilege: why some people
believe them. J. Abnorm. Soc. Psychol. 40, 3–36 (1945)
27. Liu, F., Burton-Jones, A., Xu, D.: Rumors on social media in disasters: extending
transmission to retransmission. In: Proceedings of the 18th Pacific Asia Conference on
Information Systems (2014)
Automatic Discovery of Abusive Thai Language
Usages in Social Networks
1 Introduction
Social networks such as Twitter, Facebook, and Google+ have become a norm
for colloquial communication when face-to-face interaction is unavailable. A wide
variety of social media services are currently and publicly available with diverse
purposes and target users. In 2017, 20.4 million Facebook users in Thailand
were reported active¹. This number is expected to grow to 21.6 million by 2018.
Being colloquial in nature, social networks house various dimensions of
communication, ranging from organizations’ official channels to groups of
cyberbullies. Oftentimes, language usage in social networks not only deviates from
standard language usage (e.g. emoticons, undefined terms, broken grammar,
incomplete sentences, etc.), but is also ill-mannered and inappropriate.
We call such usage abusive language.
In Thailand, existence of abusive languages in social networks is considered
normal. Messages like “ ” (equivalent to ‘‘Shit! I’m about
1 https://www.statista.com/statistics/490467/number-of-thailand-facebook-users/.
© Springer International Publishing AG 2017
S. Choemprayong et al. (Eds.): ICADL 2017, LNCS 10647, pp. 267–278, 2017.
https://doi.org/10.1007/978-3-319-70232-2_23
268 S. Tuarob and J.L. Mitrpanont
With the rapid growth of social media availability and user base, concerns have
arisen about policies for controlling inappropriate content, especially when
such content is a few clicks away from being exposed to children. In this section, we
review existing work related to ours.
SVM classifier to classify a newswire article whether it is satirical or not [6]. The
features extracted from an article include those from the headline and verbal
quotes, which are assumed to be available in all articles. Their method is specifi-
cally crafted for newswire articles whose lengths and features are not compatible
with social media messages (social media messages do not have headlines and
verbal quotes), hence would not be applicable to our problem.
Another dimension of approaches to detect abuse in language usages would
be to employ the knowledge from linguistic features. Chen et al. proposed the
Lexical Syntactic Feature (LSF) architecture to detect offensive content and
identify potential offensive users in social networks [7]. Warner and Hirschberg
used an SVM classifier to learn the characteristics of hate speech from
sentence structures [23]. Nobata et al. also presented a machine learning strategy
that detects hate speech, derogatory language, and profanity in online user-generated
content [16]. Besides n-grams features, they introduced English based linguistic
and syntactic features to the feature space. Recently, Xu and Zhu proposed a
method that analyzes each sentence in a textual document using English based
parse trees, to identify sentences that are offensive [16]. While showing promising
results, these methods were designed for English text only, and hence would not
be applicable to our problem, where candidate messages are composed in Thai.
3 Methodology
Abusive Thai language detection in social media is cast as a classification
problem where a social media message is classified as positive if its
content is offensive, and negative otherwise. A message is represented as a
vector of feature values, each of which is the weight of the corresponding
term. All the distinct terms are collected from all the messages and stored in
a dictionary. In this work, we experiment with multiple term weighting schemes,
namely binary, term frequency (TF), inverse document frequency (IDF), and
term frequency-inverse document frequency (TF-IDF). The featurized training data
is then used to train machine learning based classifiers drawn from diverse
families of classification algorithms.
message, and 0 otherwise. Despite its simplicity, the binary term weighting
scheme disregards the length of the message when terms are duplicated.
For example, the message ‘‘Well, well, well... There there. lol
lol lol’’ produces the same binary term vector as the message ‘‘Well,
there. lol’’. Mathematically,
$$f_i^{bin} = \begin{cases} 1 & \text{if } v_i \in t \text{ and } v_i \in V \\ 0 & \text{otherwise} \end{cases}$$
Term Frequency (TF). The term frequency weighting scheme counts the
occurrences of each term in the message, hence taking the length of the message
into account even when terms are duplicated. Mathematically,
$$f_i^{freq} = TF(v_i, t)$$
where $f_i^{freq}$ is the TF value of the term $v_i$, and $TF(v_i, t)$ is the number of
occurrences of term $v_i$ in message $t$.
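The difference between the binary and TF weighting schemes described above can be seen on the two example messages from the text. The following is a minimal sketch: the regular-expression tokenizer is a hypothetical stand-in that only handles the English example (Thai text would need a proper word tokenizer), and all function names are illustrative.

```python
from collections import Counter
import re

def tokenize(message: str) -> list[str]:
    """Hypothetical stand-in tokenizer: lowercase alphabetic words only."""
    return re.findall(r"[a-z]+", message.lower())

def build_vocabulary(messages: list[str]) -> list[str]:
    """Collect all distinct terms across the messages into a sorted dictionary V."""
    return sorted({term for m in messages for term in tokenize(m)})

def binary_vector(message: str, vocab: list[str]) -> list[int]:
    """1 if the vocabulary term occurs in the message, 0 otherwise."""
    present = set(tokenize(message))
    return [1 if v in present else 0 for v in vocab]

def tf_vector(message: str, vocab: list[str]) -> list[int]:
    """Number of occurrences of each vocabulary term in the message."""
    counts = Counter(tokenize(message))
    return [counts[v] for v in vocab]

long_msg = "Well, well, well... There there. lol lol lol"
short_msg = "Well, there. lol"
vocab = build_vocabulary([long_msg, short_msg])

print(vocab)
print(binary_vector(long_msg, vocab), binary_vector(short_msg, vocab))
print(tf_vector(long_msg, vocab), tf_vector(short_msg, vocab))
```

The two messages share one binary vector, while their TF vectors differ because the longer message repeats each term.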
Where $f_i^{tfidf}$ is the TF-IDF value of the term $v_i$. Note that the term frequency
part is normalized by the maximum number of distinct terms so that this portion
ranges over [0, 1] and is consistent when combined with the IDF
portion.
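For readers unfamiliar with IDF weighting, the sketch below uses a common textbook formulation, idf(v) = log(N / df(v)), where N is the number of messages and df(v) is the number of messages containing term v. This formulation is an assumption made for illustration and may differ from the exact variant and normalization used in this work; the toy corpus and function names are hypothetical.

```python
import math

def idf_weights(tokenized_messages: list[list[str]]) -> dict[str, float]:
    """Textbook IDF (assumed variant): log(N / df) for every term in the corpus."""
    n_messages = len(tokenized_messages)
    doc_freq: dict[str, int] = {}
    for tokens in tokenized_messages:
        for term in set(tokens):  # count each term at most once per message
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_messages / df) for term, df in doc_freq.items()}

def idf_vector(tokens: list[str], vocab: list[str], idf: dict[str, float]) -> list[float]:
    """IDF weight for terms present in the message, 0.0 for absent terms."""
    present = set(tokens)
    return [idf[v] if v in present else 0.0 for v in vocab]

# Hypothetical, already tokenized toy corpus.
corpus = [["well", "there", "lol"], ["well", "lol"], ["well", "rude"]]
idf = idf_weights(corpus)
vocab = sorted(idf)

print(vocab)
print(idf_vector(corpus[2], vocab, idf))
```

Under this formulation a term that occurs in every message receives weight 0, while rarer terms receive larger weights.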
Nine classification algorithms are considered, drawn from different families
of supervised learning algorithms: Bernoulli Naive Bayes (NB)
[13], Discriminative Multinomial Naive Bayes (DMNB) [19], Maximum Entropy
(MaxEnt) [14], Support Vector Machine (SVM) [4], k-Nearest Neighbor (kNN)
[1], Decision Table/Naive Bayes Hybrid (DTNB) [11], Random Forest (RF) [5],
Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [8], and
C4.5 Decision Tree [17]. We use the LibSVM² implementation for SVM, and the Weka³
implementation for the other classifiers.
4 Empirical Study
4.1 Dataset
We use the Facebook Graph API⁵ to crawl recent comments from selected Facebook
pages, some of which house communities that use abusive language on a
regular basis (see footnotes 6, 7, and 8). Such pages
often expose content that ignites mass colloquial debates, hence creating an
atmosphere that facilitates inappropriate usage of the Thai language.
Each collected social media message is retrieved in JSON format with
additional metadata such as user ID of the poster, message content, message ID,
timestamp, and Like information. However, since we make a minimal assumption
about the information that comes with a message, only the message content (as Thai
text) is extracted and stored. Each message is tokenized by the Classifier-based Thai
Word Tokenizer (CTWT)⁹, which reported an F-measure of 93% when evaluated
on the BEST2010 corpus¹⁰, and is the best practical open-source Thai text tokenizer
we have tested on social media text in addition to LexTo¹¹ and BreakIterator¹².
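The step of keeping only the message content can be sketched as follows. This is an illustrative example that does not call the Graph API: the sample payload, its field names (`data`, `message`, `from`, `created_time`, `like_count`), and the function name are assumptions about what a crawled comment record might look like.

```python
import json

# Hypothetical payload shaped like a list of comment records; field names are assumptions.
raw_response = """
{
  "data": [
    {"id": "c1", "from": {"id": "u1"}, "created_time": "2017-01-01T00:00:00+0000",
     "like_count": 3, "message": "first comment"},
    {"id": "c2", "from": {"id": "u2"}, "created_time": "2017-01-01T00:05:00+0000",
     "like_count": 0, "message": ""},
    {"id": "c3", "from": {"id": "u3"}, "created_time": "2017-01-01T00:09:00+0000",
     "like_count": 1}
  ]
}
"""

def extract_messages(response_text: str) -> list[str]:
    """Return the non-empty message texts, discarding every other metadata field."""
    payload = json.loads(response_text)
    messages = []
    for record in payload.get("data", []):
        text = record.get("message", "").strip()
        if text:  # skip records with an empty or missing message
            messages.append(text)
    return messages

print(extract_messages(raw_response))
```

Discarding the poster ID, timestamp, and Like information at this stage keeps the later feature extraction dependent on the text alone.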
2 http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
3 http://www.cs.waikato.ac.nz/ml/weka/.
4 https://www.facebook.com/.
5 https://developers.facebook.com/docs/graph-api.
6 https://www.facebook.com/ejeab/.
7 https://www.facebook.com/nongngneverdie/.
8 https://www.facebook.com/sudlokomteen/.
9 https://github.com/wittawatj/ctwt.
10 http://thailang.nectec.or.th/best/?q=node/21.
11 http://www.sansarn.com/lexto/.
12 https://docs.oracle.com/javase/7/docs/api/java/text/BreakIterator.html.
Table 1. Each message is tagged with relevant labels. Avg. Words is the average
number of words in a message. SD Words is the standard deviation of the message
lengths. *Note: the total number of abusive messages is not the sum of the numbers
of messages tagged with abusive labels, since a message can fall into multiple abusive
categories.
We prepare four sets of featurized data, each of which uses a different term
weighting technique. Standard classification evaluation metrics are used, namely
precision, recall, and F-measure, computed on the positive (abusive) class.
10-fold cross validation is used to validate the efficacy of each
pair of a feature set and a classification model. For each fold, 10% of the
training data (hold-out) is set aside for tuning the classification probability
threshold to maximize the F-measure.
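The threshold tuning step can be sketched as a simple search over candidate thresholds on the hold-out data. This is a minimal illustration under stated assumptions: the candidate grid (0.01 to 0.99), the toy labels and scores, and the function names are hypothetical, and the classifier that produces the probability scores is left out.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F-measure for the positive (abusive) class, labeled 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def tune_threshold(y_true, scores, candidates=None):
    """Pick the probability threshold that maximizes F-measure on hold-out data."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # assumed search grid
    best_threshold, best_f1 = 0.5, -1.0
    for threshold in candidates:
        y_pred = [1 if s >= threshold else 0 for s in scores]
        _, _, f1 = precision_recall_f1(y_true, y_pred)
        if f1 > best_f1:  # keep the first threshold reaching the best F-measure
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1

# Hypothetical hold-out labels and predicted probabilities of the abusive class.
holdout_labels = [1, 1, 1, 0, 0, 0]
holdout_scores = [0.9, 0.6, 0.35, 0.4, 0.2, 0.1]
threshold, f1 = tune_threshold(holdout_labels, holdout_scores)
print(threshold, round(f1, 3))
```

The tuned threshold would then be applied to the test portion of the same fold, so the threshold is never chosen on the data used to report the F-measure.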
To see which term weighting technique would best represent the dataset, all the
four feature sets derived from different term weighting techniques (i.e. Binary,
TF, IDF, and TFIDF) are tested on each classifier.
[Figure 1: F-measure of each classifier under the Binary, TF, IDF, and TF-IDF term weighting schemes]
by the different term weighting schemes. For NB, DMNB, and DTNB classifiers,
the underlying algorithm is Bernoulli Naive Bayes, which is only applicable for
binary attributes (and hence would convert non-zero attribute value to 1, result-
ing in the same feature values as those of binary weighting scheme). For kNN
and C4.5 classifiers, since each social media message is quite short (16 words on
average), it is likely that a message contains only one occurrence of each word that
appears in it. If such an assumption holds, the binary word vector of
such a message would be exactly the same as its TF word vector, and would
not affect the location of the neighbors (kNN) or the splitting criteria (C4.5)
much. Figure 1 illustrates the information in Table 2, and clearly shows that the
average F-measure of the IDF features is highest. Hence, we will use the IDF
features for further analysis.
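The observation that a Bernoulli Naive Bayes learner is insensitive to the weighting scheme can be illustrated with a small sketch. The implementation below is a simplified, hypothetical one (Laplace smoothing, two classes, both of which must appear in the training data) and is not the Weka implementation; it only shows that binarizing the inputs makes TF-weighted and binary features yield the same model.

```python
import math

def binarize(vec):
    """Convert any non-zero attribute value to 1, as a Bernoulli model requires."""
    return [1 if x != 0 else 0 for x in vec]

def train_bernoulli_nb(X, y):
    """Fit class priors and Laplace-smoothed per-feature presence probabilities."""
    model = {}
    n_features = len(X[0])
    for label in (0, 1):
        rows = [binarize(x) for x, lab in zip(X, y) if lab == label]
        prior = len(rows) / len(X)
        probs = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                 for j in range(n_features)]
        model[label] = (prior, probs)
    return model

def predict(model, x):
    """Return the label with the highest log-posterior for feature vector x."""
    x = binarize(x)
    scores = {}
    for label, (prior, probs) in model.items():
        score = math.log(prior)
        for xj, p in zip(x, probs):
            score += math.log(p if xj else 1 - p)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy data: TF-weighted features and their binary counterparts.
X_tf = [[3, 0, 1], [0, 2, 0], [1, 1, 0], [0, 0, 2]]
X_bin = [binarize(x) for x in X_tf]
y = [1, 0, 1, 0]

model_tf = train_bernoulli_nb(X_tf, y)
model_bin = train_bernoulli_nb(X_bin, y)
print(model_tf == model_bin)
print(predict(model_tf, [5, 0, 0]), predict(model_tf, [0, 1, 0]))
```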
5 Conclusions
Abusive language usage has been a major problem in Thailand. This problem
has escalated with the growth of social media services, which allow a
massive pool of users to access online content and interact with each
other. The work presented in this paper is an attempt to automatically discover
abusive language usage in large-scale social networks. We proposed a machine
learning based methodology that automatically classifies whether a social media
message contains abusive content (i.e. rude, figurative, offensive, and dirty).
The classifier that implements the Discriminative Multinomial Naive Bayes
algorithm, trained with inverse document frequency term features, performs the
best with an F-measure of 86%, using a dataset of 3,497 Thai Facebook social
media messages. To the best of our knowledge, we are the first to investigate the
problem of abusive language detection in the Thai language domain, whose natural
language processing technology is still in its infancy. As future work, we could
test our algorithms on different datasets collected from different social media
services. We could also explore the use of the grammatical structure of the Thai
language to improve the efficacy of the classification.
References
1. Aha, D.W., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Mach.
Learn. 6(1), 37–66 (1991)
2. Aref, A., Tran, T.: Using ensemble of Bayesian classifying algorithms for medical
systematic reviews. In: Sokolova, M., van Beek, P. (eds.) AI 2014. LNCS (LNAI),
vol. 8436, pp. 263–268. Springer, Cham (2014). doi:10.1007/978-3-319-06483-3_23
21. Tuarob, S., Mitra, P., Giles, C.L.: A hybrid approach to discover semantic hierar-
chical sections in scholarly documents. In: 2015 13th International Conference on
Document Analysis and Recognition (ICDAR), pp. 1081–1085. IEEE (2015)
22. Tuarob, S., Tucker, C.S., Kumara, S., Giles, C.L., Pincus, A.L., Conroy, D.E.,
Ram, N.: How are you feeling? A personalized methodology for predicting mental
states from temporally observable physical and behavioral information. J. Biomed.
Inform. 68, 1–19 (2017)
23. Warner, W., Hirschberg, J.: Detecting hate speech on the world wide web. In:
Proceedings of the Second Workshop on Language in Social Media, pp. 19–26.
Association for Computational Linguistics (2012)
User Behaviors
Effects of Search Tactic on Affective Transition
While Using Google: A Quasi-Experimental
Study of Undergraduate Students
1 Introduction
Emotion is a part of human cognition and behavior. It has helped humans survive
and express personalities. For instance, someone who is curious feels satisfaction
when he or she discovers the answer to a critical question, perhaps leading to
a sustained quest for knowledge and a thirst to understand his/her surroundings.
Anger may contribute to a drive to overcome obstacles and reach a certain goal,
while fear may be associated with cautiousness and thoroughness. Emotions play
essential roles in social interactions [33].
Emotion also plays an important role in interaction with information and
related tools. Numerous studies found that human interactions with informa-
tion, from various aspects and viewpoints, are affected by and associated with
emotions (e.g., [3,5,15,38]). An increasing number of studies are exploring the
roles of emotion in human-information interaction, particularly in the context of
© Springer International Publishing AG 2017
S. Choemprayong et al. (Eds.): ICADL 2017, LNCS 10647, pp. 281–294, 2017.
https://doi.org/10.1007/978-3-319-70232-2_24
282 S. Choemprayong and T. Atikij
2 Literature Review
To understand the relationship between affective state and search tactic, this
literature review covers three areas: (1) emotional
state and its classification, (2) search tactic, and (3) the relationship between
emotion and search behavior.
Emotion has been given various definitions. For example, the Oxford Advanced
American Dictionary [26] defines emotion as “a strong feeling” and explains that
a decision is dependent on emotion, rather than rational thought (pp. 486–487).
Random House Webster’s College Dictionary [30] identifies emotion as an affective
state of awareness (p. 403). Emotion can thus be considered an experience
that emerges from self-awareness, including feelings and outbursts generated from
experience.
Emotion can be categorized into multiple states and levels. The classifica-
tion can be viewed from multiple perspectives. Lopatovska [19] distinguishes
between emotion, mood, and feeling and argues that the literature regarding
affective information behavior uses various definitions of emotion. This
may be because these three constructs are complex and therefore difficult
to operationalize.
In a more concrete approach, Mehrabian [22] identifies two forms of emotion:
emotional temperament and emotional state (p. 262). An emotional state can
change quickly depending upon different situations. An emotional temperament
is a static personality trait that is exhibited over a longer period. It is a fun-
damental state of a human being that is responding to different situations in
his/her life.
Instead of attempting to cover static emotions, this study focuses on emotions
as fleeting emotional states. Operationally, an emotional state is defined as a mental
state and physical reaction sensitive to external stimulation at a particular moment.
It can be considered an outcome of human experience expressed through
behaviors or actions, whether or not the individual is aware of it.
To identify an exact expression of emotion is a difficult task [8]. Shiota and
Kalat [34] compare the classification of emotion to the classification of color
where mixtures of emotion can occur in diverse ways. So far, the measurement
of emotions can be achieved via either a discrete or a dimensional approach [20].
The most primitive classification of emotion divides emotion into positive
and negative groups. Positive and Negative Affect Scale (PANAS), one of the
most popular emotion measures, is a measure developed corresponding to this
approach [37]. There are other classification systems manifesting emotion in a
more multidimensional perspective.
For instance, Plutchik [28] categorizes emotion into eight primary categories:
fear, surprise, sadness, disgust, anger, anticipation, joy, and trust (p. 56). This
classification scheme also takes emotional intensity, similarity,
and polarity into account. Ekman [7] divides emotion into six primary categories:
fear, anger, disgust, happiness, sadness, and surprise. Ekman recognizes other
emotions as sub-level categories. In addition, Russell [31] explains emotion as
a mixture of emotional constructs within a complex circle of three polarities:
pleasure vs. displeasure (P), arousal vs. non-arousal (A), and dominance vs.
submissiveness (D).
One of the most popular classification systems of emotion is the Inventory of
Emotion by Parrott [27]. Derived from the results of cluster analysis, Parrott’s
inventory uses emotion terms to divide emotional expressions into three hierarchical
levels (i.e., primary, secondary, and tertiary emotion). The inventory
comprises 135 terms representing emotional expressions. The primary emotion
expressions are: (1) love (with affection, lust/sexual desire, and longing
as secondary expressions); (2) joy (with cheerfulness, zest, contentment,
pride, optimism, enthrallment, and relief as secondary expressions); (3) surprise
(with only surprise as a secondary expression); (4) anger (with irritation,
exasperation, rage, disgust, envy, and torment as secondary expressions);
(5) sadness (with suffering, sadness, disappointment, shame, neglect, and
sympathy as secondary expressions); and (6) fear (with horror and nervousness
as secondary expressions). The rest of the terms are classified at the
tertiary level under the secondary expressions. The hierarchical
tree structure of Parrott’s inventory facilitates the operationalization of emotion
classification where a self-report technique is used. Therefore, this study
applies Parrott’s inventory to classify expressions of emotional state.
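The primary and secondary levels listed above can be represented as a small lookup structure. This is a minimal sketch for illustration: the dictionary holds only the primary and secondary terms named in this section (the tertiary terms are omitted), and the names `PARROTT_SECONDARY` and `primary_of` are hypothetical.

```python
from typing import Optional

# Primary emotions mapped to their secondary expressions, as listed in the text.
PARROTT_SECONDARY = {
    "love": ["affection", "lust/sexual desire", "longing"],
    "joy": ["cheerfulness", "zest", "contentment", "pride",
            "optimism", "enthrallment", "relief"],
    "surprise": ["surprise"],
    "anger": ["irritation", "exasperation", "rage", "disgust", "envy", "torment"],
    "sadness": ["suffering", "sadness", "disappointment", "shame",
                "neglect", "sympathy"],
    "fear": ["horror", "nervousness"],
}

def primary_of(term: str) -> Optional[str]:
    """Return the primary emotion for a primary or secondary term, or None if unknown."""
    term = term.strip().lower()
    if term in PARROTT_SECONDARY:  # the term is itself a primary emotion
        return term
    for primary, secondaries in PARROTT_SECONDARY.items():
        if term in secondaries:
            return primary
    return None

for expression in ["Relief", "disgust", "fear", "boredom"]:
    print(expression, "->", primary_of(expression))
```

A lookup of this kind lets a self-reported term be rolled up to its primary category regardless of the level at which it was expressed.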
page may be negatively associated with fear, while left mouse double-click may
be associated with disgust and sadness in a positive direction.
As discussed above, while cause-effect relationships between emotion in
general and search behavior have been widely observed, a micro-level perspective on
such associations has yet to be well established. Exploring it would extend
the understanding of the relationship between emotional transition and search
behavior in more specific detail, which can be useful for practical and
theoretical development.
While a number of search constructs have been observed regarding their
influences on emotion, other related factors have been found to interact with the
relationship between online search and emotion. These factors include perceptions
of the search task (i.e., difficulty, familiarity, interest, relevance, and uncertainty)
(e.g., [1,13,14,21]), and search task completion and performance (e.g., [4,36]).
In addition, Internet self-efficacy also influences search success and the searcher’s
emotion while completing search tasks [25].
3 Methods
This study applies a quasi-experimental approach, which was approved by
Chulalongkorn University’s Research Ethics Review Committee. The data were
collected from January to February 2017. Participants in this study included 38
currently enrolled undergraduate students from two large public universities
in the Bangkok metropolitan area. Applying a convenience sampling technique,
recruitment ads were posted in numerous spots throughout both campuses. Before
data collection began, the participants were asked to provide informed consent.
The participants were asked to perform three search tasks, one at a time. A
pre-test evaluation of their emotional state was administered prior to the
beginning of the first search task to provide a benchmark. As they performed
each search tactic, the participants were asked to think aloud, voicing their
thoughts and emotions. Even though the participants were informed about Parrott’s
Inventory of Emotion [27] at the beginning of the study, they were allowed to
express emotions in their own terms. One of the researchers mapped each
participant’s expression to Parrott’s Inventory of Emotion while the think-aloud
protocol was being performed. The researcher matched the participant’s
expression to the closest term in the Inventory, regardless of its level in the
emotion hierarchy. For example, emotions such as liking and fondness are
considered tertiary emotions of love, while satisfaction, delight, and happiness
are tertiary emotions of joy. These tertiary emotions were eventually recoded to
their primary emotions. To validate the mapping process, the researcher probed
the participant to confirm each chosen emotional state before continuing to the
next search action.
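The recoding of an expressed term to its primary emotion can be sketched as a
simple lookup. The sketch below is illustrative only: the table contains just
the secondary and tertiary terms named in this paper (not the full 135-term
inventory), the name to_primary is hypothetical, and the "neutral" entry
reflects the extra state added by this study.

```python
# Illustrative subset of Parrott's hierarchy: only terms named in this paper.
# The full inventory has 135 terms; this table is NOT complete.
TERM_TO_PRIMARY = {
    # love
    "affection": "love", "lust": "love", "longing": "love",
    "liking": "love", "fondness": "love",
    # joy
    "cheerfulness": "joy", "zest": "joy", "contentment": "joy",
    "pride": "joy", "optimism": "joy", "enthrallment": "joy",
    "relief": "joy", "satisfaction": "joy", "delight": "joy",
    "happiness": "joy",
    # surprise
    "surprise": "surprise",
    # anger
    "irritation": "anger", "exasperation": "anger", "rage": "anger",
    "disgust": "anger", "envy": "anger", "torment": "anger",
    # sadness
    "suffering": "sadness", "sadness": "sadness",
    "disappointment": "sadness", "shame": "sadness",
    "neglect": "sadness", "sympathy": "sadness",
    # fear
    "horror": "fear", "nervousness": "fear",
}

PRIMARY = {"love", "joy", "surprise", "anger", "sadness", "fear"}


def to_primary(term: str) -> str:
    """Recode an inventory term to its primary emotion (hypothetical helper)."""
    key = term.strip().lower()
    if key == "neutral":          # state added by this study, outside Parrott
        return "neutral"
    if key in PRIMARY:            # already a primary emotion
        return key
    return TERM_TO_PRIMARY[key]   # raises KeyError for unmapped terms


print(to_primary("Fondness"), to_primary("delight"))
```

Raising an error for unmapped terms, rather than guessing, mirrors the step in
which the researcher confirmed each chosen state with the participant.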
In addition to the emotions listed in Parrott’s Inventory, a neutral state was
added to the instrument. The results of a pilot study showed that neutral is
among the most common emotional expressions that occur during search.
Effects of Search Tactic on Affective Transition While Using Google 287
For each task, participants were given no more than 20 min to complete
their search. The search history was reset after the answers from each task were
submitted. There were short breaks in between the three tasks to minimize
the possibility of a carry-over effect. This study also used Screenpresso for
Windows Version 1.5.6, a screen capture program, to monitor on-screen
activities, and the Sound Recorder program on Windows to record the subjects’
verbal responses.
Since search task is a controlled variable in this study, we developed three
search tasks based on Kim’s classification of search tasks [12] in conjunction
with Ramdeen and Hemminger’s tasks [29]. The tasks were tested and adjusted
to improve face validity during a pilot study. Each search task asks participants
to use Google to find appropriate answers. For factual search, the participants
were asked to provide one correct answer for a closed question (i.e., provide an
author’s name). For analytical search, the participants needed to analyze by
evaluating and selecting appropriate answers (i.e., select the best weight control
technique for oneself). For exploratory search, the participants were provided
with an open-ended question (i.e., explore recipes to cook and prepare for a
party with a specific theme). Critical evaluation is not required for exploratory
search, while it is necessary for analytical search.
The study also collected other search task variables, including participants’
perceptions toward the search task (i.e., task difficulty, complexity, interest,
relevance, and uncertainty), collected at the completion of each of the three
tasks; search success (as a Laplace ratio, indicating how likely the next search
action will succeed) (see [16,17]); time used for each task; and number of search
tactics performed. Participant variables included gender, field of study, and
level of Internet self-efficacy (ISE). ISE was measured using a translated
version of Eastin and LaRose’s Internet Self-Efficacy Scale [6] (consisting of
8 items on a 7-level Likert scale, collected prior to the first task).
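The paper does not spell out how the Laplace ratio is computed. Assuming it is
the Laplace estimate discussed in [17], i.e., (x + 1) / (n + 2) for x successes
in n attempts, a minimal sketch looks as follows. The function name and the
example counts are hypothetical.

```python
def laplace_ratio(successes: int, attempts: int) -> float:
    """Laplace estimate of the chance that the next attempt succeeds.

    Assumes the rule-of-succession form (x + 1) / (n + 2); the paper itself
    only names the measure and cites [16,17].
    """
    if attempts < 0 or not 0 <= successes <= attempts:
        raise ValueError("need 0 <= successes <= attempts")
    return (successes + 1) / (attempts + 2)


# Hypothetical participant: 7 successful search actions out of 10.
print(round(laplace_ratio(7, 10), 3))
```

Unlike the raw proportion x / n, this estimate never reaches exactly 0 or 1,
which is the property that makes it attractive for small samples.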
In addition, we also observed at what stage a search tactic was performed in
reference to the whole search task (i.e., indicating whether a search tactic was
performed near the beginning or the end of each search task). These variables
were reported as associated factors of either search tactic or emotional state
elsewhere (e.g., [1,12,18,32]). All measurement scales for latent variables (e.g.,
search tactics, Internet self-efficacy, emotional scale) were tested for reliability
using Cronbach’s Alpha yielding acceptable results (α > 0.7).
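The reliability check can be illustrated with the standard formula for
Cronbach's Alpha, α = k/(k − 1) · (1 − Σ item variances / variance of total
scores), where k is the number of items. This is a sketch under that
assumption; the authors ran the test in SPSS, and the function name and toy
responses below are hypothetical.

```python
from statistics import variance


def cronbach_alpha(items: list[list[float]]) -> float:
    """Cronbach's alpha for k items, each a list of scores per respondent."""
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    n = len(items[0])
    if n < 2 or any(len(col) != n for col in items):
        raise ValueError("every item needs the same number (>= 2) of scores")
    totals = [sum(scores) for scores in zip(*items)]   # per-respondent sums
    item_variance_sum = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Toy data: 3 Likert items answered by 5 respondents (invented values).
responses = [
    [5, 6, 4, 7, 3],
    [5, 7, 4, 6, 3],
    [6, 6, 5, 7, 2],
]
alpha = cronbach_alpha(responses)
print(alpha > 0.7)  # compare against the acceptability threshold used here
```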
Since the emotional transition in this study considers only the change of
emotional expressions at the primary level, emotion expressions were recoded by
comparing the primary-level emotions expressed before and after performing
a search tactic, yielding only two values for emotional transition (i.e., changed
and unchanged). Emotional transition is the dependent variable in this
study.
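The binary recoding can be sketched as follows. The sketch assumes that each
task is stored as a sequence of primary-level states: the state before the
first tactic, followed by the state reported after each tactic. That storage
format and the function names are assumptions for illustration.

```python
def emotional_transition(before: str, after: str) -> str:
    """Binary dependent variable: did the primary emotion change?"""
    return "unchanged" if before == after else "changed"


def code_transitions(states: list[str]) -> list[str]:
    """Code one transition per search tactic from consecutive states."""
    return [emotional_transition(a, b) for a, b in zip(states, states[1:])]


# Hypothetical task: baseline state, then the state after each of 3 tactics.
print(code_transitions(["neutral", "joy", "joy", "fear"]))
```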
Data were analyzed using SPSS for Windows Version 24 and Microsoft
Excel 2010. Descriptive statistics (frequency, percentage, and mean) and a
measure of association (i.e., Chi-square) were applied. Since the analysis in
this study has two units of analysis (participant and search task), a
288 S. Choemprayong and T. Atikij
multilevel hierarchical logistic regression (mixed effect) was applied to assess
the effect of search tactics on emotional transition, controlling for other
independent variables. Pearson correlation analysis was performed among all
independent variables. The results indicate no strong relationship among the
independent variables.
4 Results
For emotional transition, we found that the ratios of primary emotional
transition per 100 search actions are not much different between changed and
unchanged emotional expressions (Table 2). There are 607 search actions
(approximately 47%) that result in changes of emotional states, while 688 search
actions (about 53%) yield unchanged emotional states. The analytical search
yielded more changed than unchanged emotional states, while the distributions
are in the opposite direction for the other two search tasks. However, an ad hoc
Chi-square analysis found no significant relationship between search tasks and
emotional transition.
Emotional    Factual search   Analytical search   Exploratory search   Total (N = 38)
transition   f      %         f      %            f      %             f      %
Unchanged    227    55        181    48           280    56            688    53
Changed      189    45        196    52           222    44            607    47
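The Chi-square test of independence between search task and emotional
transition can be re-computed by hand from the frequencies in the table above.
The sketch below is not the authors' SPSS procedure; it uses the textbook
Pearson statistic, the standard 5% critical value for 2 degrees of freedom
(5.991), and the fact that for 2 degrees of freedom the p-value equals
exp(−χ²/2). The function name is hypothetical.

```python
import math


def chi_square(observed: list[list[int]]) -> tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for a table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Rows: unchanged, changed. Columns: factual, analytical, exploratory.
observed = [[227, 181, 280],
            [189, 196, 222]]
stat, dof = chi_square(observed)

CRITICAL_05_DF2 = 5.991           # chi-square critical value, alpha = .05, df = 2
p_value = math.exp(-stat / 2)     # closed form valid only for df = 2
print(f"chi2 = {stat:.2f}, df = {dof}, p = {p_value:.3f}")
print("significant at .05:", stat > CRITICAL_05_DF2)
```

Under these assumptions the statistic comes out at roughly 5.7, just below the
critical value, which agrees with the non-significant result reported in the
text.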
Table 4. Top 10 pre/post emotional states for File Structure and Evaluation tactics
For File Structure tactics, Joy→Joy is by far the most common emotional
transition that emerged. Other unchanged emotional states are also well
represented, including Neutral→Neutral and Fear→Fear. Additionally, it
is interesting to note that all the remaining pre/post emotional states in the
Top 10 are changes from negative (e.g., Sadness, Fear, Anger) to positive
expressions (i.e., Joy and Love).
For Evaluation tactics, unlike File Structure tactics, changed emotional
states are more strongly represented in the table. We found only two
unchanged emotional states in the Top 10 (i.e., Joy→Joy and Neutral→Neutral).
The changed emotional states in the Top 10 include Joy→Sadness, Sadness→Joy,
Joy→Love, Joy→Fear, Neutral→Joy, Joy→Anger, and Love→Joy.
Based on this study, there are numerous opportunities for future studies to
extend understanding of affective elements in information search and retrieval.
For instance, since this study relies heavily on a self-report technique for observ-
ing emotional states, different observation techniques (e.g., facial expression,
voice expression, or electrocardiogram) could help validate the results. Other
classification systems of emotion may be applied in order to facilitate catego-
rizing the fluid nature of human emotion. In addition, the pool of participants
could also be extended to include the general population.
Most importantly, understanding the stimulus of emotional transition is quite
complex, since various unobserved factors, such as room temperature and content
displayed on screen, may influence a user’s emotional state. It is very hard to
control these external variables even in an experimental setting.
References
1. Arapakis, I., Jose, J., Gray, P.: Affective feedback: an investigation into the role
of emotions in the information seeking process. In: Proceedings of the 31st Annual
International ACM SIGIR Conference on Research and Development in Informa-
tion Retrieval, pp. 395–402. ACM (2008). doi:10.1145/1390334.1390403
2. Bates, M.: Information search tactics. J. Am. Soc. Inform. Sci. 30(4), 205–214
(1979). doi:10.1002/asi.4630300406
3. Bates, M.: The design of browsing and berrypicking techniques for the online search
interface. Online Rev. 13(5), 407–424 (1989). doi:10.1108/eb024320
4. Bilal, D., Kirby, J.: Differences and similarities in information seeking: children
and adults as web users. Inf. Process Manag. 38(5), 649–670 (2002). doi:10.1016/
S0306-4573(01)00057-7
5. Dervin, B.: An overview of sense-making: concepts, methods, and results to
date. In: ICA Annual Meeting, Dallas, Texas (1983). http://communication.sbs.
ohio-state.edu/sense-making/art/artdervin83.html
6. Eastin, M., LaRose, R.: Internet self-efficacy and the psychology of the digital
divide. J. Comput. Mediated Commun. 6(1) (2000). doi:10.1111/j.1083-6101.2000.
tb00110.x
7. Ekman, P.: Expression and the nature of emotion. In: Scherer, K., Ekman, P. (eds.)
Approaches to Emotion, Chap. 3, pp. 319–344. Psychology Press, New York (1984)
8. Ellis, D., Tucker, I.: Social Psychology of Emotion. Sage, Los Angeles (2015)
9. Fidel, R.: Theoretical constructs and models in information-seeking behavior. In:
Human Information Interaction: An Ecological Approach to Information Behavior,
pp. 49–82. MIT Press, Cambridge (2012), doi:10.7551/mitpress/9780262017008.
001.0001
10. Fourie, I., Julien, H.: Ending the dance: a research agenda for affect and emotion
in studies of information behaviour. In: Proceedings of ISIC: The Information
Behaviour Conference. Leeds, 2–5 September 2014. http://www.informationr.net/
ir/19-4/isic/isic09.html
11. Hearst, M.: Search User Interfaces. Cambridge University Press, Cambridge (2009).
http://searchuserinterfaces.com/
12. Kim, K.S.: Effects of emotion control and task on web searching behavior. Inf.
Process. Manag. 44(1), 373–385 (2008). doi:10.1016/j.ipm.2006.11.008
13. Kracker, J.: Research anxiety and students’ perceptions of research: an experiment.
Part I. Effect of teaching Kuhlthau’s ISP model. J. Am. Soc. Inf. Sci. Technol.
53(4), 282–294 (2002). doi:10.1002/asi.10040
14. Kracker, J., Wang, P.: Research anxiety and students’ perceptions of research: an
experiment. Part II. Content analysis of their writings on two experiences. J. Am.
Soc. Inf. Sci. Technol. 53(4), 295–307 (2002). doi:10.1002/asi.10041
15. Kuhlthau, C.: Inside the search process: information seeking from the user’s
perspective. J. Am. Soc. Inform. Sci. 42(5), 361–371 (1991). https://doi.org/
10.1002/(SICI)1097-4571(199106)42:5<361::AID-ASI6>3.0.CO;2-#
16. Laplace, P.S.: Théorie Analytique des Probabilités. Courcier, Paris (1820)
17. Lewis, J.R., Sauro, J.: When 100% really isn’t 100%: improving the accuracy of
small-sample estimates of completion rates. J. Usab. Stud. 1(3), 136–150 (2006)
18. Liu, J., Kim, C.: Information seeking tasks: why do searchers feel them difficult?
In: Proceedings of HCIR 2012, Cambridge MA, 4–5 October 2012. doi:10.1002/
meet.14505001125
19. Lopatovska, I.: Toward a model of emotions and mood in the online information
search process. J. Am. Soc. Inf. Sci. Technol. 65(9), 1775–1793 (2014). doi:10.1002/
asi.23078
20. Lopatovska, I., Arapakis, I.: Theories, methods and current research on emotions in
library and information science, information retrieval and human-computer inter-
action. Inf. Process. Manag. 47(4), 575–592 (2011). doi:10.1016/j.ipm.2010.09.001
21. Lopatovska, I., Mokros, H.: Willingness to pay and experienced utility as mea-
sures of affective value of information objects: users’ accounts. Inf. Process. Manag.
44(1), 92–104 (2008). doi:10.1016/j.ipm.2007.01.020
22. Mehrabian, A.: Pleasure-arousal-dominance: a general framework for describing
and measuring individual differences in temperament. Curr. Psychol. 14(4), 261–
292 (1996). doi:10.1007/BF02686918
23. Mooney, C., Scully, M., Jones, G.J.F., Smeaton, A.F.: Investigating biometric
response for information retrieval applications. In: Lalmas, M., MacFarlane, A.,
Rüger, S., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds.) ECIR 2006. LNCS, vol.
3936, pp. 570–574. Springer, Heidelberg (2006). doi:10.1007/11735106_67
24. Nahl, D.: Learning the internet and the structure of information behavior. J. Am.
Soc. Inform. Sci. 49(11), 1017–1023 (1998). doi:10.1002/(SICI)1097-4571(1998)49:
11<1017::AID-ASI8>3.0.CO;2-Z
25. Nahl, D., Tenopir, C.: Affective and cognitive searching behavior of novice
end-users of a full-text database. J. Am. Soc. Inform. Sci. 47(4), 276–286 (1996).
doi:10.1002/(SICI)1097-4571(199604)47:4<276::AID-ASI3>3.0.CO;2-U
26. Oxford Advanced American Dictionary for Learners of English. Oxford University
Press, New York (2011)
27. Parrott, W.: Emotions in Social Psychology: Essential Readings. Psychology Press,
Philadelphia (2001)
28. Plutchik, R.: A general psycho evolutionary theory of emotion. In: Plutchik, R.,
Kellerman, H. (eds.) Theories of Emotion, vol. 1, pp. 3–33. Academic Press, New
York (1980)
29. Ramdeen, S., Hemminger, B.: A tale of two interfaces: how facets affect the library
catalog search. J. Am. Soc. Inf. Sci. Technol. 63(4), 702–715 (2012). doi:10.1002/
asi.21689
30. House, R.: Random House Webster’s College Dictionary. Random House Reference,
New York (2005)
31. Russell, J.: Is there universal recognition of emotion from facial expression? A
review of the cross-cultural studies. Psychol. Bull. 115(1), 102–141 (1994). doi:10.
1037/0033-2909.115.1.102
32. Savolainen, R.: Information seeking and searching strategies as plans and patterns
of action: a conceptual analysis. J. Doc. 72(6), 1154–1180 (2016). doi:10.1108/
JD-03-2016-0033
33. Scherer, K.: What are emotions? And how can they be measured? Soc. Sci. Inform.
44(4), 693–727 (2005). doi:10.1177/0539018405058216
34. Shiota, M., Kalat, J.: Emotion. Wadsworth Cengage Learning, Belmont (2012)
35. Smith, A.: Internet search tactics. Online Inform. Rev. 36(1), 7–20 (2012). doi:10.
1108/14684521211219481
36. Wang, P., Hawk, W., Tenopir, C.: Users’ interaction with world wide web resources:
an exploratory study using a holistic approach. Inf. Process. Manag. 36(2), 229–251
(2000). doi:10.1016/S0306-4573(99)00059-X
37. Watson, D., Clark, L.: The PANAS-X: manual for the positive and negative affect
schedule-expanded form (1994). http://ir.uiowa.edu/psychology_pubs/11/
38. Wilson, T.: Models in information behaviour research. J. Doc. 55(3), 249–270
(1999). doi:10.1108/EUM0000000007145
Exploring Personal Music Collection Behavior
1 Introduction
In this paper we explore how users develop, organize and access their personal music
collections. We do so through an ethnographic study. More specifically, we present here
our findings from collating and analyzing over 150 pages of auto-ethnographies,
resulting from a user study with 28 participants designed to solicit information on their
personal use of music. A key theme that emerged from this process was a user’s notion of
ownership of music, which we frame in the paper as ‘what makes a song mine?’ This in
turn builds on the forms of storage and modes of access to music a user engages with—
additional aspects that rose to prominence through the analysis of the study. We detail
these aspects in the paper and discuss their implications.
In Sect. 2 we review related work, before describing our methodology (Sect. 3)
which includes details about how the data was collected, and a summary of the
demographics of the users that participated in the study. Section 4 presents the results
of our analysis, which is broken down into four parts: collection storage; collection
organization; patterns of sharing (or not); and ownership. We conclude with a summary
of our findings and a discussion on practical ways this can be utilized by users.
2 Related Work
Reviews of the Music Information Retrieval literature have noted that there are
relatively few empirical studies of human music behavior, and that the existing
studies are mainly usability analyses of research prototypes or lab-based
experiments [4, 5, 10, 12].
Studies outside MIR tend to focus on summarizing behavior over large populations
rather than examining the variation in individual activities (for example, the economic
effects of music piracy [7, 13]). The few naturalistic studies of authentic information
behavior ‘in the wild’—exploring music acquisition and collection organization [2, 3,
5, 9] and use (e.g., playlist and mix construction [1])—have dated. The set of readily
available, commercial music services and the potential music activities that they sup-
port have changed rapidly over the past few years. This present paper is one attempt to
capture the music information behaviors of real users as they navigate this current (as
of 2016) music environment.
3 Methodology
4 Results
collections that the user has purchased and/or put significant effort into organizing
(even when, as in the quote from P14 above, that collection is less frequently accessed);
and, surprisingly given the massive size of the streaming service archives, an inability
to source all music of interest from a single service.
This last point results from a limitation in the large-scale licensing approach taken
by music streaming companies. Students interested in music from “lesser known
groups” [P13]—for example non-Western music, the work of local or regional artists,
or newly released music from non-commercial labels—look to YouTube when
streaming services fail them. YouTube also provides a video accompaniment that is
sometimes preferred over the static cover art image available from streaming services
(“It’s nice to watch the accompanied music video sometimes while listening. … I like
singing along to music I listen to. This is why lyric videos on YouTube are my pref-
erence..” [P21]). This associated video can add to the personal music experience
(though the individual does not necessarily always wish to view video when listening
to a song; “However, after viewing this a few times, I tend to just listen to the
recording.” [P3]) or can serve to enhance social gatherings (P1 appreciates “…a nice
background video when listening to songs at a party or other setting”). Participants
report incorporating songs and video found on YouTube into their collection directly
(by adding the downloaded file to their music management application) or indirectly
(via channel subscriptions and favoriting inside YouTube).
the day’ [P17]; ‘Christmas’ [P21]); by genre (‘anime, k-pop, etc’ [P10]), country of
origin (‘Original Philippine Music’ [P21]) or artist (‘Bill Evans’ [P24]); by the mood
that the songs would reflect or influence (‘sleepy music’ [P15]; ‘low mood playlist’
[P1]); by current enjoyment of the individual songs (e.g., P10 maintains a ‘favs’
playlist); and general playlists to fill in time (“I have many playlist [sic] to organize
some of the music that I wouldn’t mind listening to if I just needed to listen to music in
general.” [P26]).
The ordering of songs within a playlist is only significant for playlists organized by
artist—in which case there is a preference for ordering the songs by album and then by
track within the album. P24, for example, prefers “to listen to songs as part of a
collection of songs intended to be played together such as an album. … [I prefer] a
sense of harmony and unity during a music listening session. [I find] the change in
style and tone from one artist to another jarring and distracting.” Otherwise, the
participants are generally satisfied with accepting the service’s default ordering or by
listening to the songs on shuffle. Only P25 reported a personal strategy for playlist
ordering, which cleverly took advantage of the system’s record of her collection
management activities and the shuffle facility:
Spotify automatically puts the songs into a playlist from oldest to newest so any new songs are
put at the end of a playlist, this means that whenever I go to listen to my playlist I always scroll
to the bottom to begin playing from there. Although I always begin my playlist with one of the
newer songs added I always listen to my music on shuffle… I believe I do this because I
genuinely enjoy being surprised by which song is played, the anticipation you get from listening
to the radio and not knowing if a recent hit or country throwback is going to be played is
replicated by this system of listening to music.
album. All albums should have album artwork.” Ad hoc playlists can assist in iden-
tifying songs that do not meet these standards: for example, “I use iTunes “Smart
Playlists” feature to filter songs that are missing artwork for example so I can add
artwork to these item…” [P7]
P6 describes an elaborate, playlist-based screening process for adding songs to his
collection on Spotify:
1. I will identify it using the shazam app, which is linked to my Spotify account and
will automatically add it to a Spotify playlist. 2. I will then place it in a maybe playlist
which contains songs that show potential for my music collection. 3. I will then listen to
the maybe songs over a period of a week. 4. If I am no longer happy with the song, I
will remove it. 5. If I like the song, but only would like to listen to it occasionally – it
goes into my Sometimes playlist. 6. If I am still happy with the song, I will officially add
it to my music collection.
This strategy is interesting as it illustrates the porous boundaries of many collec-
tions—here a song is only officially ‘in’ P6’s collection after passing through this series
of hurdles, yet all the while the song is ‘in’ P6’s Spotify account. We return to this
question of when a song is part of a collection—when it is ‘mine’—in Sect. 4.3.
Users of Spotify also found value in its automatically generated playlists
(particularly Spotify Discover, which once a week presents a user with a
customized playlist based on their current collection and listening habits) and
in the specially curated playlists prepared by other users, Spotify curators, and
artists. A key benefit is that these playlists reduce the effort required to
identify new artists and genres of interest, making it easier to expand a
collection by adopting their contents as a whole or in part:
“They … are an excellent way to easily find new music, and enjoy all music based on a
particular genre. I find it is the best option for expanding my listening tastes, with very
little hassle or research. … Spotify has changed the way I listen to music. When
previously I would stick to the music I had always listened to due to the high level of
work required to source new music that I like, I now enjoy large varieties of music and
get bored quickly of the same music over and over.” [P2].
4.3 Sharing
Spotify has been used by most of the study participants for sharing music as well as
creating their own collections. The students appreciated the fact that using Spotify for
streaming music meant less time having to deal with other options of more dubious
legality [P6] and felt that it took little effort to follow playlists “that have been made by
other people” [P21]. One student had even created a joint playlist with a family
member in which they both had the right to veto songs. This same student was also
quite happy to use Spotify’s social filter “which broadcasts to other friends using
Spotify what you are listening to” [P2]. This resulted in a wide circle of friends having
access to music options recommended from within their social group.
The other main method of music sharing noted by study participants was through
YouTube and music blogs (P15, P13, P12). This can be through methods as simple as
sending a popular YouTube music video link to a friend through social media [P12].
For some students, YouTube as a medium was preferred over Spotify because “Spotify
does not offer any way for me to view music videos” [P13]. YouTube’s “suggested
The act of streaming alone was not considered by participants to constitute
ownership of a song, nor was moving it to a playlist necessarily sufficient (see,
for example,
P6’s tiers of temporary playlists in Sect. 4.2). Downloading a song from a streaming
service to access offline (possible only for premium Spotify users) was seen as a solid
commitment to the song (P7). A song appearing on auto-generated playlists such as
Spotify’s “Recently played” was seen as being in “a bit of a grey area”, as it was “in a
stage where I haven’t full [sic] committed to wanting it, but it’s nice to have available
as a song that grows on me.” [P13] Continuous access—whether via streaming or
offline—is seen as a primary requirement for ‘mineness’ by some (“I would consider a
song as ‘mine’ if I am able to repeatedly play it when I want.” [P17]) and irrelevant by
others (“I classify music as mine if I have it in my collection and listen to it or if there is
a song I really enjoy listening to but haven’t got around to adding to my collection yet.
Or, a song I have really enjoyed in the past and no longer have in my collection but
still recognize it as mine or “My jam”.” [P19]).
Small wonder, then, that some participants moved to more personal definitions of
‘their’ music, based on their personal engagement with a song—where the decision that
a song was ‘mine’ precedes its actual acquisition: “… it’s [mine] when I’ve heard a
song a couple of times or enough times that I consciously think “I like this
song/album/playlist and I will want to listen to this again”, and only then will I save
the music to my Spotify or my hard disk, thereby calling the music ‘mine’.” [P11]
Given the diversity of media, devices, and music services involved in a given
collection, as well as these diverse and changing views of what constitutes a
collection, it is not surprising that the students often struggled to estimate
the size of theirs. What in the past was a straightforward question (how many
CDs do you have? How many albums?) is now difficult to find a metric for. For
file-based collections, students could more easily cite the number of gigabytes
than the number of songs. For collections based in streaming services, what does
one count? Where with albums one could confidently estimate the total number of
songs in the collection, a playlist has no lower or upper limit, and songs can
be (and frequently are) repeated across playlists within a
collection. Where physical media collections tended to grow monotonically (“I don’t
throw away the physical CDs … because I am a bit of a collector and because I bought
them” [P8]), digital file collections and playlists are frequently in a state of flux: songs
are added and removed, playlists are created and discarded. And, of course, there is
often overlap between the holdings across different physical media (five participants
stored their collection across more than four devices) and the different collection
management services (seven participants used more than one music management
application).
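The counting problem described above can be made concrete with a small sketch.
Because the same song may sit in several playlists, the number of playlist
entries and the number of distinct songs diverge. The playlist contents and the
function name below are invented for illustration; only the playlist labels
echo names mentioned by participants.

```python
def collection_size(playlists: dict[str, list[str]]) -> tuple[int, int]:
    """Return (total playlist entries, distinct songs) for a collection."""
    entries = sum(len(songs) for songs in playlists.values())
    distinct = len(set().union(*playlists.values()))
    return entries, distinct


# Hypothetical streaming collection with overlapping playlists.
playlists = {
    "favs": ["song_a", "song_b", "song_c"],
    "sleepy music": ["song_b", "song_d"],
    "maybe": ["song_c", "song_d", "song_e"],
}
entries, distinct = collection_size(playlists)
print(entries, distinct)
```

Neither number is obviously "the" size of the collection, which is the
difficulty the participants reported.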
For users making increasing use of music streaming services and cloud storage, our
ethnographic study leads us to recommend that (in terms of what is mine) it is a user’s
list of songs that is best to identify with as their own. This is what captures the
intellectual thought processes they have gone through in forming their personal music
References
1. Cunningham, S.J., Bainbridge, D., Falconer, A.: More of an art than a science, supporting
the creation of playlists and mixes. In: ISMIR, pp. 240–245 (2006)
2. Cunningham, S.J., Jones, M., Jones, S.: Organizing digital music for use: an examination of
personal music collections. In: ISMIR (2004)
3. Cunningham, S.J., Masoodian, M.: Management and usage of large personal music and
photo collections. In: Proceedings of IADIS, pp. 163–168 (2007)
4. Lee, J.H., Cunningham, S.J.: Toward an understanding of the history and impact of user
studies in music information retrieval. J. Intell. Inf. Syst. 41(3), 499–521 (2013)
5. Brinegar, J., Capra, R.: Understanding personal digital music collections. Proc. Am. Soc. Inf.
Sci. Technol. 47(1), 1–2 (2010)
6. CNET. https://www.cnet.com/how-to/best-music-streaming-service/. Accessed 28 Apr 2017
7. Glaser, B., Strauss, A.: The Discovery of Grounded Theory: Strategies for Qualitative
Research. Weidenfeld and Nicolson, London (1967)
8. Kibby, M.: Collect yourself: negotiating personal music archives. Inf. Commun. Soc. 12(3),
428–443 (2009)
9. Mörchen, F., Ultsch, A., Nöcker, M., Stamm, C.: Visual mining in music collections. In:
From Data and Information Analysis to Knowledge Engineering, pp. 724–731 (2006)
10. Pampalk, E., Rauber, A., Merkl, D.: Content-based organization and visualization of music
archives. In: Proceedings of the Tenth ACM International Conference on Multimedia,
pp. 570–579. ACM, New York (2002)
306 S.J. Cunningham et al.
11. Resch, A., Berk, J., Akers, L.: Recognizing and Conducting Opportunistic Experiments in
Education: A Guide for Policymakers and Researchers. National Center for Education
Evaluation and Regional Assistance, USA, REL 2014-037 (2014)
12. Weigl, D., Guastavino, C.: User studies in the music information retrieval literature. In:
ISMIR, pp. 335–340 (2011)
13. Rob, R., Waldfogel, J.: Piracy on the high C’s: music downloading, sales displacement, and
social welfare in a sample of college students. J. Law Econ. 49(1), 29–62 (2006)
Doctor-Patient Communication of Health
Information Found Online: Preliminary
Results from South East Asia
Anushia Inthiran
Abstract. Citizens in the South East Asia (SEA) region are active on the
Internet. Some information on general online health information searching
behaviour in the SEA region is known. However, not much is known about the
doctor-patient communication of health information found online. In this study,
50 participants who have performed an online health search were interviewed.
Participants were asked to describe the doctor-patient communication process of
having looked up information online. Preliminary results indicate participants
who spoke to the doctor about information found online fell into the elementary
or guarded category. Participants who did not talk about information found took
an unreceptive approach. Results of this study provide theoretical information to
advance the field of information science in the SEA region.
1 Introduction
An activity that is slowly gaining popularity within the SEA region is online health
information searching [1–3]. The availability of free and publicly available online
resources, coupled with fast and affordable Internet access, has encouraged citizens in
this region to use online means to obtain health information. Existing studies conducted in
SEA provide information on general online search behaviour of health consumers. For
example, results of a study conducted in Singapore indicate Singaporean youths search
for information online pertaining to diseases such as diabetes, cancer, information on
sexually transmitted diseases, pregnancy, birth control and HIV/AIDS [1]. In an urban
city in Malaysia, health searchers predominantly use Google. However, sites like
MedlinePlus, Medline, The Mayo Clinic, the National Institutes of Health (NIH) website,
the Johns Hopkins University website and WebMD were also utilised [2]. When
searching for health information for their child, parents in SEA performed a doctor and
non-doctor type visit search [3]. On the other hand, undergraduate students in Thailand
use online sources to obtain information on general health, disease, treatment, and
nutrition [4].
2 Related Work
In SEA, there is a large gap in communication between doctor and patient due to
patients’ unpreparedness for a participatory consultation style and adherence to
paternalistic communication styles [9]. Patients expect doctors to sort out their concerns,
confusion and hesitance within the context of polite communication [9]. On the other
hand, doctors predominantly use biomedical utterances and adhere to their own medical
agenda. This method of communication conflicts with patients’ need for the use of
social-emotional utterances [10]. In some cases, whilst doctors encouraged
communication, patients were reluctant to ask for clarification and preferred that the doctor
communicate in a manner that is easily understood by the patient [12].
Strong cultural hierarchy and social norms within SEA such as respect towards
people of higher status (the doctor) add to the burden of having to communicate. These
cultural and social norms include the importance of maintaining harmony and not
wanting to disagree with the doctor [12]. This indicates that patients would rather take on
a passive role when communicating with the doctor. Doctor–patient communication
also appears to be affected by other cultural characteristics such as social distance and
closeness of relationships [10]. Patients felt that, in order to preserve social distance and
closeness of relationship, they needed to demonstrate ‘respect’ to the doctor by
minimising conversation or by not appearing to be ‘difficult’.
Whilst doctors and patients indicate they would prefer a partnership-oriented style
of communication, the one-way communication method is mostly practiced [10]. The
two main reasons for this are the setup of healthcare systems in SEA and time
limitations during patient consultation [10]. Interestingly, results of a study conducted in
Malaysia indicate patients were dissatisfied with time spent communicating during the
consultation [11]. Thus, whilst patients wanted to communicate, barriers that existed
within the communication setting prevented them from doing so.
Low health literacy rates in SEA also limited the possibility of communicating with
the doctor [13]. Patients with low health literacy may feel uncomfortable with having a
conversation with the doctor. Due to the lack of medical knowledge, patients may take
on a passive role by merely agreeing with the doctor.
3 Methodology
A purposeful sampling technique was used. The criterion was that patients must have
performed an online health search in the past. Participants were recruited via call for
participation notices placed in universities and on bulletin boards at community centres.
A questionnaire and a semi-structured interview were used to collect data. The
questionnaire was designed to collect socio-demographic and health search experience
details. The interview questions were mostly designed to be open ended and were adapted
from previous studies [3, 14]. Specifically in this study, participants were asked
(i) whether they had spoken to the doctor about information found, (ii) why they
spoke or did not speak to the doctor about information found, (iii) to
describe their experience of having to initiate the conversation and (iv) to describe the
conversation process. The interviews were conducted in English. A pilot test with
5 participants was conducted prior to the main experiment.
As a result of the pilot test, the interview questions were fine-tuned.
The interviews were audio recorded and transcribed verbatim. This technique was
selected to allow close links to be created between the data and the researcher [15].
Open coding was used and coding categories were derived inductively from the audio
recordings to fit the grounded theory approach [15].
The conventional qualitative content analysis technique [16] was used.
A master list of codes was first created based on induction. These codes were revisited
after every third participant. These codes were then reduced to themes using the
constant comparative method. Responses from participants were first categorised based
on whether they had spoken to the doctor about information found. Thereafter, responses
from each category were coded into themes.
4 Results
5 Discussion
There were two ways in which participants who always spoke to the doctor initiated the
conversation: elementary and guarded. In the elementary category, the conversation
had little depth and the aim was to obtain the doctor’s confirmation about information
read. Participants appeared quite comfortable in initiating the conversation and there were
no reservations. It is noted that the doctor had encouraged participants to search and yet
cautioned them about believing information on the Internet. This behaviour is to be
lauded as it empowers patients, keeps communication lines open and creates awareness
on the authenticity of health information online. In the guarded category, participants
indicated that they cautiously informed the doctor about information found and in some cases
only implied that a search had been conducted. The conversation had more depth
and participants took ownership in considering alternatives. In this category, elements
of patient-centred healthcare and shared decision making were exhibited [9].
Participants in the guarded category could possibly be health literate and therefore were
keen to know more of the doctor’s thought and decision-making process. However, it is
noted that participants ensured communication took place in a respectful manner, hence
observing matters pertaining to cultural norms and hierarchy [9, 10].
Participants in the external motivator category added new information to the
domain knowledge. Results of previous studies indicate communication style [10, 11,
13], cultural and social norms [11, 13], communication setting [11, 12] and health
literacy levels [13] hindered doctor-patient communication. Results of this study
indicate seriousness of illness and patient category (child) were motivators in initiating
a conversation with the doctor. Conversation took place openly without any reservation
for communication style [10, 11, 13] or adherence to cultural and social norms [11, 13].
It is postulated that perhaps the seriousness of the illness and the patient in question
superseded any inhibitions. Participants who did not speak to the doctor about health
information were unreceptive. There could be several possible reasons for this. Par-
ticipants may feel that the doctor knows best and therefore choose not to have a
conversation. In the same vein, participants may feel that the doctor is not interested in
hearing from them. This aspect requires further investigation.
Practical contributions include the need for patient and doctor training and
education programmes to encourage and foster continued communication and discourse.
For example, doctors need to be taught the importance of communicating with patients to
understand their health beliefs, as well as the need to encourage patients to
communicate. It is pertinent that doctors take on a collaborative role rather than a consultative
role during the communication process. Such an initiative was heralded in Indonesia
successfully [17]. Community awareness campaigns should advise patients on
acceptable methods in which communication should take place as well as the benefits
of having a discussion with the doctor. Online health portals could provide suggested
questions that patients should talk to the doctor about. This will help patients initiate a
conversation confidently. In future work, phase two of the experiment will be
conducted with a larger group of participants. It is also the intention to conduct a similar
study from the perspective of the doctor. For example, do doctors encourage
communication? It is acknowledged that the results of this study are preliminary; however, they
do provide rich information on doctor-patient communication initiation and details of
the conversation from the viewpoint of the patient.
References
1. Rao, P., Theng, Y.L.: Assessing young adults’ web searching for health information: an
exploratory study in Singapore, Medicine 2.0. In: World Congress on Social Media, Mobile
Apps, Internet, Web 2.0 (2014). http://www.medicine20congress.com/ocs/index.php/med/
med2012/paper/view/1039?trendmd-shared=0. Accessed 16th Jan 2017
2. Inthiran, A., Alhashmi, S.M., Ahmed, P.K.: Online consumer health: a Malaysian
perspective. In: International Federation Information Processing (IFIP) WG9.4 Newsletter
(2013) - Information Technology in Developing Countries. http://www.iimahd.ernet.in/
egov/ifip/jun2013/inthiran.htm
3. Inthiran, A., Soyiri, I.: Searching for health information online for my child: a perspective
from South East Asia. In: International Conference on Asian Digital Libraries, pp. 76–81
(2015)
4. Kitikannakorn, N., Sitthiworanan, C.: Searching for health information on the Internet by
undergraduate students in Phitsanulok, Thailand, vol. 3, pp. 313–318 (2009)
5. Boston Children’s Hospital: Communication breakdown: how can we get patients and
doctors talking again. http://www.childrenshospital.org/clinician-resources/resources/
getting-patients-and-doctors-talking. Accessed 15th June 2017
1 Introduction
An information gap exists in every individual and information seeking to fill such gaps
is an essential human information behavior. Information retrieval is the process of
presenting this knowledge gap or information need [1] to an information system that
will match this query against its collection of information, and present the results to the
information seeker. Information seeking behavior (ISB) models are often used by
researchers to study the dynamics of interaction between humans and systems.
Internet search engines, such as Google, have shaped the way people seek infor-
mation on the Internet. Such changes and developments have led researchers to revisit
information seeking models [2]. Different information seeking models have been used in
different contexts to present alternative information seeking perspectives. Students today
have grown up immersed in the Internet. There is no doubt that
students have turned to the Internet to seek resources for entertainment as well as
learning [3]. The technology integration into classrooms and proliferation of the Internet
have brought changes to teaching approaches and practices [4, 5], influencing how
learning might take place. Learning approaches such as ambient learning and
cyberlearning [6] have been embraced by students. This suggests that students have
adopted self-directed learning (SDL) activities. While they are seemingly technologically savvy, studies
have revealed that people growing up in the digital age may not have the information literacy
to discern the quality of resources they find and use more experimental or trial-and-error
strategies to attain their resources [7, 8]. Coupled with the advent of socially generated
resources and the lack of gatekeepers on the quality of information found on the Internet,
it is challenging to find quality learning resources effectively.
Post-secondary students in Singapore are adolescents, between 16 and
22 years of age, who are continuing education beyond secondary school. According to a report by
the Ministry of Education, Singapore [9], a total of 109,439 students were enrolled in
2015 in vocational institutes (such as the Institute of Technical Education (ITE)), art
schools (such as Nanyang Academy of Fine Arts), and polytechnics (such as Singapore
Polytechnic). Studies showed that secondary school students in Singapore have poor
information literacy skills [10, 11], despite being relatively IT literate. The video
presents itself as a resource to support learning for students, especially of vocational
nature where there is an emphasis on practical learning. Videos can often satisfy a
learner’s information needs when textual information sources may not be adequate,
such as demonstration of a technique or skill.
Adolescent learners are developing cognitively and often satisfice when finding
information. The transition from formal classroom learning to self-directed learning calls
for independence in finding resources to support learning. Although many interface
features have been developed to support better searches, they are not widely used by
these searchers. Many adolescent learners prefer to use convenient and easy-to-use
features, such as the keyword search. This often results in poor search results,
especially for complex tasks, such as exploratory search. There is a need to know which
interface features these adolescent learners desire and to explore how these
interface features help them in performing better video searches. An examination of
their video seeking behavior can provide insights into how and what interface features
they use to locate and select videos.
This paper presents part of a study that explores the desirable video retrieval
interfaces and features when young learners find videos to support their
vocational learning. This study used the think-aloud protocol to understand how
adolescents, in the age range of 18–22, locate and select videos to support their SDL.
This can offer interface designers insights in the development of user-centered video
retrieval interfaces and services that can better support the video seeking process.
2 Related Literature
pixels. These make the indexing of video semantics challenging. Metadata can express
the semantic information of videos. Being multidimensional, richer forms of metadata
are extracted from a video as compared to textual resources. This can afford a wide
array of search features, suggesting that video searching can be rendered differently as
compared to textual resources.
In establishing an interface design framework for a digital video library, Lee and
Smeaton [15] analyzed several abstract information seeking behavior models.
A user-centric approach was used to identify salient interface features and functionalities.
The consideration of user interactions and search strategies is pertinent in defining such
interface frameworks. Five stages of video seeking identified are browsing/selecting
video, querying within the video, browsing within the video, playing the video, and re-
querying. The framework proposed a mixed approach in supporting the different stages.
Combining content-based (visual searching) and semantic (keyword searching)
approaches [14], an interaction framework was presented to guide the development of
video retrieval systems. The researchers posit that semantic approach produces the best
results for querying while content-based information is suitable for browsing and
selection. Hence, a combination of both approaches is required for the different stages in
video searching. By understanding the video seeking behavior in a situated context,
effective video interfaces that best fit the seeking behavior can be developed.
In SDL, the process of finding and evaluating resources is integral as learners have
autonomy over their learning [16]. Finding resources to aid their learning can involve
two types of search tasks, lookup fact finding and exploratory search [17]. People can
change their searching behavior as search tasks become more complex [18]. This is due
to the increasing level of uncertainty [19]. The challenge to find the right resource for
learning is compounded by topics that are new as the search task becomes complex. As
a search task becomes more complex, finding appropriate resources can become more
cognitively loaded. Research has shown that many learners faced difficulties in finding
the appropriate resources when learning autonomously [20]. Hence, in the search for
unstructured data on the Internet, exploratory searches present challenges for the
self-directed learners. As self-directed learning requires quality information, the
learners need to be able to search for information effectively and the younger
self-directed learners may find such exploratory video searches even more challenging.
This study follows the premise that, by understanding the video seeking behavior of
post-secondary students, desirable and effective video retrieval interfaces can be
developed to make video search for SDL more efficacious. The richness of the video
retrieval interface features influences the amount of interaction with the retrieval system
as well as having implications on the decision-making process [21]. In turn, this affects
how interface designers develop effective interfaces. This paper describes part of a
doctoral study that explores the video seeking behavior of post-secondary students.
3 Methodology
This study adopted an inductive research approach to explore the video seeking
behavior and to uncover salient interface features and services used by young learners
during video seeking for learning and to identify search techniques and strategies used.
Video Seeking Behavior of Young Adults for Self Directed Learning 317
The participants performed two exploratory search tasks. A combination of the think-aloud
technique, video capture of the screen actions, and interviews was used to collect
data, providing triangulation of data collection methods and data sources. A pilot run was
conducted to validate the procedure and the questions used in the interview sessions.
3.1 Participants
A total of 14 post-secondary school students (male: M1 to M10, female: F1 to F4) were
recruited to perform two exploratory tasks each for the study. These students have
completed their secondary school education and are currently enrolled in post-secondary
institutions such as polytechnics or ITE. The participants were recruited through
advertisements placed in their institutions, word-of-mouth and lightning talks given to
students when access to lectures was granted.
The participants are enrolled in a variety of courses, ranging from a foundational
program that introduces various diploma pathways to specialized vocational certificates
such as Aerospace Engineering and Chemical Process Engineering. These different
courses that the participants are enrolled in provided more insights to the video seeking
behavior as compared to having participants from a single institution and from only one
type of course. The inclusion criterion of having experience in video searching in public
video repositories was indicated in the advertisement as well as in the lightning talks
conducted. This purposeful inclusion was to ensure that the participants would have
sufficient experience to exhibit their video seeking behavior when applied to their
learning needs. Participation in the study was voluntary and the data were kept
confidential. Upon completion of the search tasks and interviews, the participants were
each given SGD10 as an incentive for their participation.
several search iterations for decision-making. An example of this task would be asking
the participant to scope a capstone project related to optical lens and eye diseases. The
participants were free to perform their video search using any public video repositories
that they wanted and have encountered. This resulted in 28 search sessions from 14
participants.
transcription and kept within parenthesis. The inclusion of actions into the transcripts
was to provide a complete account of the video seeking behavior as well as to confirm
the verbal accounts. An example of the inclusion of screen actions in the transcription
is as follows:
“This video is too static. (Mouse over the progress bar of the video and view the
thumbnails along the way. Stopped when participant spotted something that he is
familiar with). At least they put more images rather than text.” (M2)
Having the screen actions embedded in the transcripts allowed verification of
verbalisations and offered an additional point of analysis. Through the transcription
process, the researcher began to become familiar with the data.
A codebook was developed inductively through repeated examination of the data.
The codebook structure contained the code label, a brief description, a full description
that explained the inclusion and exclusion criteria, and an example. The codebook
provided consistency in operationalization of the codes.
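As a rough illustration only, the codebook structure described above could be held in a small data structure such as the following Python sketch. The class names, field names and the sample label are hypothetical and are not taken from the study; only the four parts of an entry (label, brief description, full description with inclusion and exclusion criteria, and example) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodebookEntry:
    """One code, with the four parts described in the text."""
    label: str               # short code label
    brief_description: str   # one-line summary of the code
    full_description: str    # inclusion and exclusion criteria
    example: str             # a transcript excerpt that illustrates the code


class Codebook:
    """Hypothetical container that keeps code labels unique."""

    def __init__(self):
        self._entries = {}

    def add(self, entry):
        # Rejecting duplicate labels keeps each code defined exactly once.
        if entry.label in self._entries:
            raise ValueError(f"duplicate code label: {entry.label}")
        self._entries[entry.label] = entry

    def lookup(self, label):
        return self._entries[label]


# Illustrative entry; the label and descriptions are invented for this sketch.
codebook = Codebook()
codebook.add(CodebookEntry(
    label="preview_skimming",
    brief_description="Participant skims the video during preview",
    full_description="Include: scrubbing the progress bar or viewing thumbnails. "
                     "Exclude: watching the video from start to end.",
    example="I will fast forward a bit while looking at the thumbnail "
            "on the progress bar.",
))
```

Keeping the inclusion and exclusion criteria next to each label is one way such a structure could support consistent application of codes across coders.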
The first cycle of coding used structural and emotion coding [25]. Structural coding
provides a basic and focused filter of the raw data while emotion coding allows sub-
jective video seeking experience of the participants to be explicitly identified. The
transcriptions were examined in a sentence-by-sentence manner and labelled with a
conceptual phrase that represents the sentence. This revealed the various interactions
with the retrieval system as well as the salient interface features that the participants
liked, desired, or found frustrating. The first cycle coding produced 26 codes related to the
video seeking behavior and the retrieval interfaces used by the young adolescent
learners when performing exploratory search for learning videos. The second cycle of
coding used thematic analysis. The thematic analysis identified themes that emerged
from the data, capturing patterns in relation to the phenomena of interest. Initial the-
matic analysis revealed 10 candidate themes. The final thematic analysis produced five
main themes that relate to the video seeking behavior of the participants and interface
features. Candidate themes emerged from the initial examination of the first cycle codes
and the final themes were produced through iterations of code examination. An
independent coder was recruited and trained using the codebook to code
two randomly selected transcripts. A Cohen’s kappa score of 0.82 was obtained.
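For readers unfamiliar with the statistic, Cohen's kappa corrects the raw agreement between two coders for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the chance agreement computed from each coder's label frequencies. The Python sketch below shows this standard calculation; the function name and the toy labels are assumptions made for illustration and are not the study's data or tooling.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same sequence of units."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of units")
    n = len(coder_a)
    # Observed agreement: share of units given the same code by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of the coders' marginal proportions per code.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in labels) / (n * n)
    if expected == 1.0:
        # Both coders used a single identical code throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example with invented labels (not data from the study).
researcher = ["skim", "skim", "query", "query"]
second_coder = ["skim", "skim", "query", "skim"]
kappa = cohens_kappa(researcher, second_coder)
```

In the toy example the coders agree on three of four units (p_o = 0.75) and chance agreement is 0.5, so kappa is 0.5.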
The complexity of a video and the advancement of retrieval technology have changed
how people search for videos for different purposes. The participants turned to You-
Tube as the de facto video repository. The data analysis yielded five themes related to
video seeking behavior of the post-secondary students in a learning context. The five
themes relating to video seeking behavior are: (1) the selection of video resources;
(2) query formulation/reformulation; (3) selecting the video(s) for preview; (4) pre-
viewing the video, and (5) decision for search task. The last two themes are discussed
in this paper.
In video seeking, the participants exhibited at least two levels of assessment to
determine the video(s) that satisfy the search task. The first level of assessment was
performed on the result list after the search query. This assessment shortlisted the video(s)
for preview. The cues were processed heuristically in this level of assessment. As
the result list from the query could contain a large number of videos, it is possible that
heuristic-based assessment can allow the shortlisting of video with minimal effort.
‘I will fast forward a bit while looking at the thumbnail on the progress bar. (mouse over the
progress bar). This allows me to give an idea what the video is about.’ (M2)
‘I feel as if browsing the video to get a sense of the video. And I do not think that I will be using
this video as the way they presented this is like unusual to me. I am not used to this kind of style
of presentation. Looking through just some of the content of this video, I feel that it is not going
to be relevant to me.’ (M4)
Participants formed an impression on what the video may offer in the preview
stage. Previewing also allowed the participants to affirm their initial assumptions
formed on the video. Participants M3 and M4 performed video skimming through the
content to find scenes that match what they had anticipated from prior assessment
during the selection for preview. The video seekers used various techniques to skim the
content and match it to the requirements that they have set implicitly. Automatic
video-skim techniques that can communicate the essential content of the video with
less time [28] could improve the understanding of the video’s content.
The video content can also be summarized by using video metadata such as
descriptions. The implicit concern over ‘click-baits’ could undermine
video seekers’ confidence when referring to author-input metadata. The use of socially generated
metadata could potentially be more neutral. However, participants appeared to want to
examine the content of the video rather than basing their evaluation of the video solely on
metadata. More interactive video exploration techniques like elastic skimming [29]
may help these video seekers form a better impression on the video. Sometimes, the
video seeking process may be deemed as completed when the video previewed strongly
matches what the video seeker is expecting. In some other cases, the video seeker may
seek more cues to affirm their selection.
the number of comments exceeds a threshold. This could reduce the cognitive load on
the video seeker when such opinions are summarized using visual representations. For
less popular videos, the number of comments might not be sufficient to give an indi-
cation on the quality of the video.
When the video resource is deemed to be positive, video seekers might be
encouraged to leverage that video resource to seek out other similar resources. The
“Up-Next” feature found on YouTube is a common feature used by the video
seekers when they wanted to continue the search for more related videos. As noted by
participant M1, ‘So because I can’t find any other videos from the search list, what I
intend to do is to look at the right side of the video, where related videos are
recommended.’ Video seekers use features that are conveniently located and easy to use.
Features such as the “Up-Next” listing offer both ease of use and convenience.
However, the labeling of the feature should be self-explanatory to avoid ambiguity that
could lead to false assumptions by video seekers.
5 Conclusions
Exploratory search tasks that often occur during SDL present fuzziness and ambiguity
in the search process as the self-directed learners venture into a learning domain that is
new and with which they are not fully conversant. Searching for information on the Internet to
support learning is common. Videos are fundamentally used as support for learning,
hence effort and cognitive load should be minimal in video seeking. In video seeking,
the study revealed that two levels of assessment took place to determine the video(s)
that satisfy the exploratory search task. The first assessment was performed using
heuristic cues on the video results list to shortlist video(s) for preview. The next
assessment was performed after the preview of the shortlisted video(s) to decide
whether to accept the video or to continue searching or browsing for more videos. Both
levels of assessment suggest that the video seekers looked for cues and video metadata
that can acquaint them with the video’s content with as little effort as possible.
Techniques such as elastic skimming [29] may help in reducing the effort in previewing
the video.
Video resources can offer more metadata so that the content of the video can be
searched more explicitly. Socially generated cues are known to influence decision
making, especially in more subjective context. This pooling of socially generated
information has been consulted heuristically to make quick evaluation of resources
online [31]. Such metadata can be developed into useful decision-making and
evaluation features. More investigation is needed to understand which types of heuristic
metadata (such as date of upload) and socially generated cues (such as comments), and
which ways of presenting them, better help video seekers shortlist videos and affirm
their video selection.
As part of future studies, the list of desirable video retrieval interface and features
will be identified and used as a basis to build a mock-up that would be used to validate
the interface and features. The number of self-directed learners using public video
repositories to support their learning is increasing. Young self-directed learners, such as
post-secondary students, will find video retrieval more effective with the
References
1. Belkin, N.J., Oddy, R.N., Brooks, H.M.: ASK for information retrieval: part I. background
and theory. J. Doc. 38(2), 61–71 (1982). doi:10.1108/eb026722
2. Meho, L.I., Tibbo, H.R.: Modeling the information-seeking behavior of social scientists:
Ellis’s study revisited. J. Am. Soc. Inf. Sci. Technol. 54(6), 570–587 (2003). doi:10.1002/
asi.10244
3. Goerke, V., Oliver, B.: Australian undergraduates’ use and ownership of emerging
technologies: Implications and opportunities for creating engaging learning experiences for
the net generation. Australas. J. Educ. Technol. 23(2), 171–186 (2007). doi:10.14742/ajet.
1263
4. Hicks, S.D.: Technology in today’s classroom: are you a tech-savvy teacher? Clearing
House J. Educ. Strat. Issues Ideas 84(5), 188–191 (2011). doi:10.1080/00098655.2011.
557406
5. Inan, F.A., Lowther, D.L.: Factors affecting technology integration in K-12 classrooms: a
path model. Educ. Technol. Res. Dev. 58(2), 137–154 (2010). doi:10.1007/s11423-009-
9132-y
6. Arnone, M.P., Small, R.V., Chauncey, S.A., Mckenna, H.P.: Curiosity, interest and
engagement in technology-pervasive learning environments: a new research agenda. Educ.
Technol. Res. Dev. 59(2), 181–198 (2011). doi:10.1007/s11423-011-9190-9
7. Geck, C.: The generation Z connection: teaching information literacy to the newest net
generation. Teach. Libr. 33(3), 19–23 (2006)
8. Ng, W.: Can we teach digital natives digital literacy? Comput. Educ. 59(3), 1065–1078
(2012). doi:10.1016/j.compedu.2012.04.016
9. Ministry of Education, Singapore: Education Statistics Digest. Singapore (2016)
10. Foo, S., Majid, S., Mokhtar, A., Zhang, X., Chang, Y.K., Luyt, B., Theng, Y.L.: Information
literacy skills of secondary school students in Singapore. Aslib J. Inf. Manag. 66(1), 54–76
(2014). doi:10.1108/AJIM-08-2012-0066
11. Chang, Y.K., Zhang, X., Mokhtar, A., Foo, S., Majid, S., Luyt, B., Theng, Y.L.: Assessing
students’ information literacy skills in two secondary schools in Singapore. J. Inf. Lit. 6(2),
19–34 (2012). doi:10.11645/6.2.1694
12. Little, J.J., Gu, Z.: Video retrieval by spatial and temporal structure of trajectories. In:
Yeung, M.M., Li, C.-S., Lienhart, R.W. (eds.) Proceedings of SPIE, vol. 4315,
pp. 545–552
13. Snoek, C.G., Huurnink, B., Hollink, L., de Rijke, M., Schreiber, G., Worring, M.: Adding
semantics to detectors for video retrieval. IEEE Trans. Multimedia 9(5), 975–986 (2007).
doi:10.1109/tmm.2007.900156
14. Amir, A., Srinivasan, S., Efrat, A.: Search the audio, browse the video—a generic paradigm
for video collections. J. Adv. Sig. Process. 2, 209–222 (2003).
doi:10.1155/s111086570321012x
15. Lee, H., Smeaton, A.F.: Designing the user interface for the Físchlár digital video library.
J. Dig. Inf. 2(4), 251–262 (2006)
16. Butcher, K.R., Sumner, T.: Self-directed learning and the sensemaking paradox.
Hum.-Comput. Interact. 26(1), 123–159 (2011). doi:10.1080/07370024.2011.556552
17. Marchionini, G.: Exploratory search: from finding to understanding. Commun. ACM 49(4),
41–46 (2006). doi:10.1145/1121949.1121979
18. Aula, A., Khan, R.M., Guan, Z.: How does search behavior change as search becomes more
difficult? In: Proceedings of the 28th International Conference on Human Factors in
Computing Systems, pp. 35–44. ACM, New York (2010). doi:10.1145/1753326.1753333
19. White, R.W., Roth, R.A.: Exploratory search: beyond the query-response paradigm. Synth.
Lect. Inf. Concepts Retriev. Serv. 1(1), 1–98 (2009).
doi:10.2200/s00174ed1v01y200901icr003
20. Bouchard, P.: Pedagogy without a teacher: what are the limits? Int. J. Self-Directed Learn.
6(2), 13–22 (2009)
21. Burgoon, J.K., Bonito, J.A., Bengtsson, B., Cederberg, C., Lundeberg, M., Allspach, L.:
Interactivity in human–computer interaction: a study of credibility, understanding, and
influence. Comput. Hum. Behav. 16(6), 553–574 (2000).
doi:10.1016/s0747-5632(00)00029-7
22. Wildemuth, B.M., Freund, L.: Assigning search tasks designed to elicit exploratory search
behaviors. In: Proceedings of the Symposium on Human-Computer Interaction and
Information Retrieval - HCIR 2012. ACM, Cambridge (2012).
doi:10.1145/2391224.2391228
23. Kules, B., Capra, R.: Creating exploratory tasks for a faceted search interface. In: Second
Workshop on Human-Computer Interaction, HCIR 2008 (2008)
24. Sharp, H., Rogers, Y., Preece, J.: Interaction Design: Beyond Human-Computer Interaction.
Wiley, New Jersey (2007)
25. Saldaña, J.: The Coding Manual for Qualitative Researchers. Sage Publications, London
(2009)
26. Mergenthaler, E., Stinson, C.: Psychotherapy transcription standards. Psychother. Res. 2(2),
125–142 (1992). doi:10.1080/10503309212331332904
27. Hirsh, S.G.: Children’s relevance criteria and information seeking on electronic resources.
J. Assoc. Inf. Sci. Technol. 50(14), 1265–1283 (1999).
doi:10.1002/(sici)1097-4571(1999)50:14%3C1265:aid-asi2%3E3.3.co;2-5
28. Christel, M.G., Smith, M.A., Taylor, C.R., Winkler, D.B.: Evolving video skims into useful
multimedia abstractions. In: Proceedings of the SIGCHI Conference on Human Factors in
Computing Systems, pp. 171–178. ACM Press, New York (1998).
doi:10.1145/274644.274670
29. Haesen, M., Meskens, J., Luyten, K., Coninx, K., Becker, J.H., Tuytelaars, T., Poulisse, G.,
Pham, P.T., Moens, M.: Finding a needle in a haystack: an interactive video archive explorer
for professional video searchers. Multimedia Tools Appl. 63(2), 331–356 (2011).
doi:10.1007/s11042-011-0809-y
30. de Vries, L., Gensler, S., Leeflang, P.S.: Popularity of brand posts on brand fan pages: an
investigation of the effects of social media marketing. J. Interact. Market. 26(2), 83–91
(2012). doi:10.1016/j.intmar.2012.01.003
31. Metzger, M.J., Flanagin, A.J., Medders, R.B.: Social and heuristic approaches to credibility
evaluation online. J. Commun. 60(3), 413–439 (2010).
doi:10.1111/j.1460-2466.2010.01488.x
Author Index