A Survey of Code-switched Speech and Language Processing
Sunayana Sitaram
Microsoft Research India
Abstract
1. Introduction
Linguistic code choice refers to the use of a language for a specific com-
municative purpose and code-switching denotes a shift from one language to
another within a single utterance. Not only is there a plethora of different lan-
guages across the world, but speakers also often mix these languages within the
same utterance. In fact, some form of code-switching is expected to occur in
almost every scenario that involves multilinguals [1]. This can go beyond mere
insertion of borrowed words, fillers and phrases, and include morphological and
grammatical mixing. Such shifts not only convey group identity [2], embody
societal patterning [3] and signal cultural discourse strategies [4] but also have
been shown to reduce the social and interpersonal distance [5] in both formal
[6, 7] and informal settings.
In this paper we refer to this phenomenon as code-switching, though the
term code-mixing is also used. While such switching is typically considered
informal - and is more likely to be found in speech and in casual text such as
that found on social media - it is also found in semi-formal and formal settings
such as newspaper headlines and teaching. Therefore, we argue that code-switching
should not be looked down upon or ignored but be acknowledged as a genuine
form of communication that deserves analysis and development of tools and
techniques to be handled appropriately. As language technologies improve and
permeate more and more applications that involve interactions with humans
[8, 9], it is imperative that they take phenomena such as code-switching into
account for any consumer-facing technology.
Code-switching is most common among peers who have similar fluency in
each language. For example, fluent Spanish-English bilinguals may often float
between their languages, in a form of communication called Spanglish. Residents
of the Indian subcontinent, who often have substantial fluency in English,
will often mix their speech with their regional languages in Hinglish (Hindi),
Tenglish (Telugu), Tamlish (Tamil) and others. But it is not just English that
code-switching occurs with. Residents of southern mainland China who, for
example, speak Cantonese or Shanghainese, may switch with Putonghua (standard
Mandarin). Arabic dialects are often mixed with Modern Standard Arabic.
The distinction between languages and dialects is hard to define, but we see that
code-switching appears with dialects too. African American Vernacular English
(AAVE) speakers will commonly switch between AAVE and Standard American
English; Scottish people may switch between Scots and Standard English. At
an extreme, code-switching could also be used to describe register shifting in
monolingual speech. Formal speech versus slang or swearing may follow similar
functions and patterns as those in code-switching among two distinct languages.
Unlike pidgins or creoles [11, 12, 13], where speakers may not have full
fluency in the language of influence, we are primarily interested in situations
where participants have fluency in each of the languages but are choosing not to
stay within one language. Code-switching is not a simple linguistic phenomenon:
depending on the languages involved and the type of code-switching, the
interaction between the component languages may be quite different. It is easy
to identify at least linguistic sharing, cross-lingual transfer and lexical
borrowing, as well as speech errors with restarts, within code-switched data.
Likewise, although there may be language technology tasks that can be achieved
with straightforward techniques, it is clear that some tasks, such as semantic
role labeling, will require complex cross-lingual analysis.
Many have identified the notion of a matrix language in code-switching [14],
that there is an underlying language choice which mostly defines the grammar
and morphological aspects of the utterance. From a language technologies point
of view, especially when considering code-switched data generation using any
form of language modeling, it is possible to identify ‘bad’ code-switching or even
‘wrong’ code-switching. Although it is obviously not a binary decision, there
are extremes that will almost always be wrong. We cannot in general randomly
choose which language a word would be realized in, or simply state that we will
choose alternate languages for each word. That is, there are constraints, there
is an underlying grammar and there are multiple linguistic theories that have
been proposed for code-switching. Modeling the grammar is challenging: even if
there may eventually be a standardized Hinglish that everyone in Northern India
speaks, at present such code-switched languages are very dynamic and have very
diverse idiolects across speakers. This is reminiscent of pidgins and creoles,
which develop over time but, especially as they are not normally written
languages, are also diverse.
But we should not give up: there is underlying structure, there are
constraints, and we have good machine learning modeling techniques that can
deal with uncertainty. Recently, there has been quite a lot of interest in the
speech and NLP community in processing code-switched speech and text, and this
paper aims at describing progress made in the field and discussing open
problems.
This survey is organized as follows. First, we introduce why code-switching
is a challenging and important problem for speech and NLP. Next, in Section
2, we briefly describe linguistic studies on code-switching with other theoreti-
cal aspects. In Section 3 we describe speech and NLP corpora and resources
that have been created for code-switched language pairs. Section 4 describes
techniques for building models for code-switching in specific speech and NLP
applications. Section 5 describes various shared tasks and challenges that have
been conducted to evaluate code-switching, and introduces benchmarks that
evaluate models across tasks and languages. We conclude in Section 6 with a
description of the challenges that remain to be addressed and future directions.
2. Background
The extent and type of code-switching can vary across language pairs. [18]
used word-level Language Identification to estimate which language pairs were
code-switched on Twitter. They found that around 3.5% of tweets were code-
switched, with the most common pairs being English-Spanish, English-French
and English-Portuguese. English-German tweets typically had only one switch
point, implying that the tweets usually contained translations of the same con-
tent in English and German, while English-Turkish tweets had the most switch
points, implying fluid switching between the two languages. Code-switching can
also vary within a language pair. For example, casual conversational Hinglish
may be different from Hinglish used in Bollywood movies, which may be differ-
ent from Hinglish seen on Twitter.
equivalence order in constituents. The above described linguistic theories are
also used in [23] to identify governing relationships between constituents. [24]
have demonstrated evidence that a constrained Universal Grammar needs refine-
ment of f-selection in code-switching as compared to monolingual speech. [25]
have proposed four categories for any switch point, comprising harmonization,
neutralization, compromise, and blocking.
[26] have a rather interesting approach towards analyzing grammatical vari-
ants in code-switching based on pre-conceptualized assumptions. They claim
that grammar in this context is subject to poly-idiolectal repertoires of bilingual
speakers and sociolinguistic factors take precedence over grammatical factors.
Hence they propose accounting for variability among the bilingual speakers.
This same work was extended later to examine intra-sentential switching focus-
ing on bilingual compound verbs and using grammatical knowledge.
The linguistic theories mentioned above were put to use in computational
frameworks by [27]. They address several issues such as the absence of
literal-level translation pairs, sensitivity to minor alignment errors and the
underspecification of the original models. The human evaluation of generated
sentences reveals that the acceptability of code-switching patterns depends not
only on sociolinguistic factors but also on cognitive factors. This work was later extended
in [28] to perform language modeling by leveraging the theories discussed above
to generate synthetic code-switched text.
While on one hand, there are studies of formally constructing grammatical
representations to understand the nature of code-switching, there is also work
that focuses on understanding the psycho-linguistic aspect of this subject per-
taining to how and when this occurs. There are studies pertaining to socially
determined and pragmatic choices in the developmental perspective of switching
in bilingual infants [29].
Another stream of work talks about the factors triggering code-switching
that are attributed to ‘cognate’ or trigger words including proper nouns, cog-
nate content words with good and moderate form overlap, and cognate func-
tion words. [30] have studied attested contact-induced changes based on prior
linguistic theories regarding the types of structural changes in calques, distribu-
tions, frequencies, inventory and stability. [31] explored three different
hypotheses and presented empirical evidence for each: the relationships between
(i) cognate stimuli and code-switching, (ii) syntactic information and
code-switching, and (iii) entrainment in a code-switched conversation among
bilinguals. The empirical evidence demonstrates a strong correlation between
the precedence of cognates and code-switching, a relationship between POS tags
and code-switching, and convergence of the rate of entrainment in
code-switching.
[32] present an integrated representation of inter and intra-sentential phe-
nomena as well as spoken and written modalities of code-switching. This is done
in order to make better reuse of the minimally available code-switched data by
analyzing various global dimensions such as modality, discourse, granularity, so-
cial familiarity and social hierarchy. These properties pave the way to
potentially footprint corpora and functional derivations. [33] present a
systematic approach
to analyze code-switching in conversations. Patterns of switching are analyzed
in multi-party conversations from Hindi movie scripts to establish identity and
social contexts.
[37] and [38] also propose the following metrics: Language Entropy, Span
Entropy, Burstiness and Memory. Language Entropy is the number of bits needed
to represent the distribution of language tokens, and Span Entropy the
corresponding quantity for the distribution of language spans. Bursti-
ness quantifies whether the switching has periodic character or occurs in bursts.
Memory captures the tendency of consecutive language spans to be positively
or negatively autocorrelated.
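To make these quantities concrete, the sketch below computes them from a word-level language-tag sequence. This is a minimal approximation: the exact definitions in [37, 38] may differ in detail, and the tag names and example sequence are hypothetical.

```python
from collections import Counter
from itertools import groupby
from math import log2
from statistics import mean, stdev

def spans(tags):
    """Lengths of maximal monolingual spans, e.g. ['hi','hi','en'] -> [2, 1]."""
    return [len(list(g)) for _, g in groupby(tags)]

def entropy(counts):
    """Bits needed to represent a distribution given its raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def language_entropy(tags):
    return entropy(list(Counter(tags).values()))        # over language tokens

def span_entropy(tags):
    return entropy(list(Counter(spans(tags)).values()))  # over span lengths

def burstiness(tags):
    # (sigma - mu) / (sigma + mu) over span lengths: -1 = periodic, +1 = bursty.
    s = spans(tags)
    m, sd = mean(s), stdev(s)
    return (sd - m) / (sd + m)

def memory(tags):
    # First-order autocorrelation of consecutive span lengths.
    s = spans(tags)
    a, b = s[:-1], s[1:]
    ma, mb, sa, sb = mean(a), mean(b), stdev(a), stdev(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / ((len(a) - 1) * sa * sb)

tags = "hi hi hi en en hi en en en en hi hi".split()
print(language_entropy(tags), span_entropy(tags), burstiness(tags), memory(tags))
```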
[39] propose techniques to automatically determine the matrix language of a
code-switched utterance. Although the notion of the matrix language is based
on the underlying grammar of the sentence, [39] show that the matrix language
can be determined by word-count alone as an approximation. [40] character-
ize languages as being asymmetric or symmetric depending on whether code
switching is insertional or alternating, and show that the same grammatical
constraints hold in both cases.
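A minimal sketch of the word-count approximation of [39], assuming word-level language tags are available (the tag names are hypothetical and the grammatical definition of the matrix language is deliberately ignored):

```python
from collections import Counter

def matrix_language(tags, ignore=("univ", "ne", "other")):
    """Approximate the matrix language as the language contributing the most
    word tokens, skipping language-universal tokens such as punctuation or
    named entities (tag names here are hypothetical)."""
    counts = Counter(t for t in tags if t not in ignore)
    return counts.most_common(1)[0][0] if counts else None

print(matrix_language(["hi", "hi", "en", "hi", "univ"]))  # -> "hi"
```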
Over the last few years, significant progress has been made in the fields of
Speech Processing and Natural Language Processing mainly owing to the use
of large and powerful Machine Learning models such as Deep Neural Networks
(DNNs). DNNs typically require large labeled corpora for training, which can be
found for a few languages such as US English, Mandarin and Modern Standard
Arabic, which are commonly termed high-resource languages. In the presence
of large datasets, models can be trained to achieve high accuracies on tasks such
as Automatic Speech Recognition, Machine Translation and Parsing.
However, most languages in the world do not have the necessary data and
resources to create models with high enough accuracies to be used in real-world
systems. The situation is even more stark for code-switched languages, since
considerable care is taken to leave out foreign words while building monolin-
gual resources. So, even if monolingual resources exist for one or more of the
languages being mixed, code-switched speech and language resources are very
scarce.
However, owing to the recent interest in code-switched speech and language
processing, there are some speech and text data sets available for a few language
pairs, which we describe next.
Data used for building Automatic Speech Recognition (ASR) and Text to
Speech (TTS) systems typically consists of recorded speech and the correspond-
ing transcripts. For ASR systems, the speech may be spontaneous or read, and
typically needs to be at least a few thousand hours to build systems that are
usable. For TTS systems, a few hours of clean, well recorded speech from a
single speaker is typically enough. Below is a list of code-switched data sets
available for speech processing.
• [48] present an artificially generated Japanese-English code-switched cor-
pus using a Japanese and English Text-to-speech system from a bilingual
speaker. The corpus consists of 280k speech utterances.
• [56] collected 1000 hours of Malay-English speech from 208 Chinese, Malay
and Indian speakers.
• An Egyptian Arabic-English speech corpus is described in [57]. It consists
of 5.3 hours of speech from interviews with 12 participants, of which 4.5
hours of speech has been transcribed.
• Although no code-switched speech databases exist for Speech Synthesis,
bilingual TTS databases are available from the same speakers in a number
of Indian languages and English [67].
In this section, we describe various resources that exist for processing code-
switched text. Since the type of data and resources vary greatly with the task
at hand, we describe them separately for each task.
• Another very large-scale dataset that is not explicitly targeted at
code-switching but contains it is [73], which addresses curating socially
representative text by taking into account geographic, social, topical and multi-
lingual diversity. This corpus consists of Tweets from 197 countries in 53
languages.
• A shared task was organized to address NER for code-switched texts using
around 50k Spanish-English and around 10k Arabic-English annotated
tweets [75].
• Public pages from Facebook pages of three celebrities and the BBC Hindi
news page are used to gather 6,983 posts and comments and annotated
with POS tags in addition to matrix language information [77].
• ICON 2015 conducted a shared task on POS tagging for which they re-
leased data in Hindi-English, Bengali-English, Tamil-English [78]. The
dataset contains 1k-3k annotated utterances for each language pair.
• Code-switched Turkish-German tweets were annotated based on Univer-
sal Dependencies POS tags and the authors proposed guidelines for the
Turkish parts to adopt language-general heuristics to gather a corpus of
1029 tweets [79].
• [81] gathered 1106 messages (552 Facebook posts and 554 tweets) in Hindi-
English and annotated them with a Twitter specific tagset. [82] describe
an English-Bengali corpus consisting of Twitter messages and two English-
Hindi corpora consisting of Twitter and Facebook messages tagged with
coarse and fine grained POS tags.
• [83] crowd-sourced POS tags using the Universal POS tagset to annotate
the BANGOR-MIAMI corpus which is a conversational speech dataset
with Spanish-English code-switching.
3.2.4. Parsing
Datasets for parsing contain code-switched sentences with dependency parses
and chunking tags.
tences, the test set comprises 80 Komi-Russian multilingual sentences
and 25 Komi spoken sentences.
• Another line of work, which moves towards using monolingual English data
together with weakly supervised and imperfect bilingual embeddings, provides a
test set of 250 Hindi-English code-switched questions mapped between the
SimpleQuestions dataset and Freebase tuples [92].
• One of the early efforts leveraged around 300 messages from social media
platforms like Twitter and blogs to collect 506 questions from the domains of
sports and tourism [93].
[94] present the first dataset for code-switched NLI, in which premises are
taken from Bollywood (Hindi) movie scripts and annotators create hypotheses
that entail or contradict the premises. The dataset contains 400 premises and
around 2k hypotheses.
Various datasets from social media such as Facebook and Twitter have been
collected for different NLP tasks, which we describe in this section.
[95] collected 1959 Hindi-English tweets and asked annotators to rank tweets
according to relevance for specific queries.
[96] collect a Twitter corpus of around 4k tweets and annotate it for Hate
Speech. [97] create a corpus of around 3k tweets for automated irony detection.
[98] present a brief survey of code-switching studies in NLP. [99] describe the
challenges in computational processing of core NLP tasks as well as downstream
applications. They highlight issues caused due to combining two languages at
the lexical and syntactic level, using examples from several tasks and language
pairs.
In this paper, we provide a comprehensive description of work done in code-
switched speech and NLP. Various approaches have been taken to build speech
and NLP systems for code-switched languages depending on the availability of
monolingual, bilingual and code-switched data. When there is a complete lack
of code-switched data and resources, a few attempts have been made to build
models using only monolingual resources from the two languages being mixed.
Domain adaptation or transfer learning techniques can be used, wherein
models are built on monolingual data and resources in the two languages and a
small amount of ‘in-domain’ code-switched data can be used to tune the models.
Word embeddings have been used recently for a wide variety of NLP tasks.
Code-switched embeddings can be created using code-switched corpora [100],
however, in practice such resources are not available and other techniques such
as synthesizing code-switched data for training such embeddings can be used.
Massive multilingual models such as multilingual BERT [101] have also been
explored in code-switched NLP.
duced are re-scored to get the final code-switched recognition result. However,
a disadvantage of multi-pass approaches is that errors made by the LID
system cannot be recovered from. [47] suggest a single-pass approach with
soft decisions on LID and language boundary detection for Mandarin-Taiwanese
ASR.
The choice of phone set is important in building ASR systems and for code-
switched language pairs, the choice of phoneset is not always obvious, since one
language can have an influence on the pronunciation of the other language. [104]
develop a cross-lingual phonetic Acoustic Model for Cantonese-English speech,
with the phone set designed based on linguistic knowledge. [105] present three
approaches for Mandarin-English ASR - combining the two phone inventories,
using IPA mappings to construct a bilingual phone set and clustering phones
by using the Bhattacharyya distance and acoustic likelihood. The clustering
approach outperforms the IPA-based mapping and is comparable to the com-
bination of the phone inventories. [52] describe approaches to combine phone
sets, merge phones manually using knowledge and iterative merging using ASR
errors on Hindi-English speech. Although the automatic approach is promis-
ing, manual merging using expert knowledge from a bilingual speaker performs
best. [106] use IPA, Bhattacharya distance and discriminative training to com-
bine phone sets for Mandarin-English. When code-switching occurs between
closely related languages, the phone set of one language can be extended to
cover the other, as is suggested in [107] for Ukrainian-Russian ASR. In this
work, the Ukrainian phone set and lexicon are extended to cover Russian words
using phonetic knowledge about both languages.
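To illustrate the distance-based merging idea, the sketch below computes the Bhattacharyya distance between Gaussian models of phones and greedily maps each phone of one language to its closest counterpart in the other. This is only a simplified stand-in for the procedures in [105, 106], which operate on HMM state distributions and also use acoustic likelihoods; the phone symbols and toy models here are hypothetical.

```python
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussian phone models."""
    cov = (cov1 + cov2) / 2.0
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

def closest_pairs(phones_a, phones_b):
    """Greedily map each phone of language A to its closest phone in language B.
    phones_* are dicts: phone symbol -> (mean vector, covariance matrix)."""
    mapping = {}
    for pa, (mu_a, cov_a) in phones_a.items():
        best = min(phones_b, key=lambda pb: bhattacharyya(mu_a, cov_a, *phones_b[pb]))
        mapping[pa] = best
    return mapping

# Toy 2-dimensional "acoustic" models for a few hypothetical phones.
rng = np.random.default_rng(0)
toy = lambda: (rng.normal(size=2), np.eye(2) * rng.uniform(0.5, 1.5))
mandarin = {p: toy() for p in ["a", "i", "sh"]}
english = {p: toy() for p in ["aa", "iy", "sh", "th"]}
print(closest_pairs(mandarin, english))
```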
[65] describe an ASR system for Sepedi-English in which a single Sepedi
lexicon is used for decoding. English pronunciations in terms of the Sepedi
phone set are obtained by phone-decoding English words with the Sepedi ASR.
[54] use a common Wx-based phone set for Hindi-English ASR built using a
large amount of monolingual Hindi data with a small amount of code-switched
Hindi-English data. [108] use cross-lingual data sharing to tackle the problem of
highly imbalanced Mandarin-English code-switching, where the speakers speak
primarily in Mandarin. In [109], the authors attempt to alleviate the problem
of L2 word pronunciation by creating linguistically motivated pairwise mappings.
When data from both languages is available but there is no or very lit-
tle data in code-switched form, bilingual models can be built. [106] train the
Acoustic Model on bilingual data, while [110] and [111] use existing
monolingual models with a phone-mapped lexicon and a modified Language Model
for Hindi-English ASR. In [112], the authors create synthetic code-mixed speech
by concatenating segments from different monolingual utterances and employ
this to improve Hindi-English code-mixed ASR performance. In [113], the authors
first detect real, untranscribed code-mixed segments from online archives.
They then employ semi-supervised and active learning techniques to obtain
transcriptions and use them as augmented data to train code-switched models. In
[114] the authors follow semi-supervised training and show that incorporating
language and speaker information is helpful while building bilingual acoustic
models. In [115] the authors combine monolingual and bilingual graphs together
with a unified acoustic model.
[116] propose a technique known as meta transfer learning to select the best
monolingual data for transfer that can improve code-switched models. [117]
describe the importance of data selection between subsets of English, Mandarin
and code-switched datasets for improving Mandarin-English ASR and show that
simply pooling all the data leads to worse results.
[118] build a bilingual DNN-based ASR system for Frisian-Dutch broadcast
speech using both language-dependent and independent phones. The language
dependent approach, where each phone is tagged with the language and mod-
eled separately performs better. [119] decode untranscribed data with this ASR
system and add the decoded speech to ASR training data after rescoring using
Language Models. In [120], this ASR is significantly improved with augmented
textual and acoustic data by adding more monolingual data in Dutch, auto-
matically transcribing untranscribed data, generating code-switched data using
Recurrent LMs and machine translation.
[64, 121] build a unified ASR system for five South African languages, by us-
ing interpolated language models from English-isiZulu, English-isiXhosa, English-
Setswana and English-Sesotho. This system is capable of recognizing code-
switched speech in any of the five language combinations.
[122] use semi-supervised techniques to improve the lexicon, acoustic model
and language model of English-Mandarin code-switched ASR. They modify the
lexicon to deal with accents and treat utterances that the ASR system performs
poorly on as unsupervised data. In [123] the authors utilise ASR and TTS in a
semi-supervised fashion to learn code-switching. They further show that
integrating language embeddings allows the framework to address even language
pairs not seen during training [124].
In [125] authors jointly train two Mandarin-English acoustic models that
differ in the choice of acoustic units describing the salient acoustic and phonetic
information. In [126], they observe that sharing parameters between the primary
and auxiliary tasks helps capture language switching information.
Recent studies have explored end-to-end ASR for code-switching. Tradi-
tional end-to-end ASR models require a large amount of training data, which is
difficult to find for code-switched speech. [127] propose a CTC-based model for
Mandarin-English speech, in which the model is first trained using monolingual
data and then fine-tuned on code-switched data. [128] use transfer learning from
monolingual models, wordpieces as opposed to graphemes and multitask learn-
ing with language identification as an additional task for Mandarin-English end-
to-end ASR. In [129], the authors address the scenario where monolingual
speakers attempt to comprehend code-switched speech in the context of a dialog.
To address this, they build a system to recognize code-mixed speech and
translate it to monolingual text. In [130], the authors present the hypothesis
that the discrepancy between distributions of token representations for
different languages restricts end-to-end models. To alleviate this, they
constrain the token representations using Shannon divergence and cosine
distance. In [131], the authors perform frame-level language detection and
adjust the posterior distribution with CTC conditioned on the language
detection. [132] present an RNN-T model with language
bias that can improve upon an RNN-T model without any LID information,
without needing an explicit LID system.
In [133], the authors explore two types of units: characters for both Mandarin
and English, or characters for Mandarin and subword units for English. In [134]
the authors employ BPE subword units, and in [135] the authors employ a
frame-level language recognition system to seed a CTC-based acoustic model.
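To illustrate the subword-unit idea, here is a toy byte-pair encoding routine that learns merges over a small mixed-language word list; real systems such as [134] use standard BPE toolkits trained on the full training text, and the example words below are made up.

```python
from collections import Counter

def bpe_merges(words, num_merges=10):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])  # apply the merge
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(bpe_merges(["woxiang", "woyao", "want", "wanted", "playing", "wanle"], 8))
```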
Recent work by [136] showed that speech recognition models fine-tuned on
code-switched data regress on monolingual speech. To alleviate this issue and
build robust models that can improve on both monolingual and code-switched
speech recognition, the authors propose using Learning Without Forgetting and
adversarial training. [137] extends this work by proposing a multi-task approach
to domain adversarial training that shows further improvements on both mono-
lingual and code-switched ASR.
As stated earlier, switching/mixing and borrowing are not always clearly
distinguishable. Due to this, the transcription of code-switched and borrowed
words is often not standardized, and can lead to the presence of words be-
ing cross-transcribed in both languages. [138] automatically identify and dis-
ambiguate homophones in code-switched data to improve recognition of code-
switched Hindi-English speech.
Language models (LMs) are used in a variety of Speech and NLP systems,
most notably in ASR and Machine Translation. Although there is significantly
more code-switched text data compared to speech data in the form of informal
conversational data such as on Twitter, Facebook and Internet forums, robust
language models typically require millions of sentences to build. Code-switched
text data found on the Internet may not follow exactly the same patterns as
code-switched speech. This makes building LMs for code-switched languages
challenging.
Monolingual data in the languages being mixed may be available, and some
approaches use only monolingual data in the languages being mixed [139] while
others use large amounts of monolingual data with a small amount of code-
switched data.
Other approaches have used grammatical constraints imposed by theories of
code-switching to constrain search paths in language models built using artifi-
cially generated data. [140] use inversion constraints to predict CS points and
integrate this prediction into the ASR decoding process. [141] integrate Func-
tional Head constraints (FHC) for code-switching into the Language Model for
Mandarin-English speech recognition. This work uses parsing techniques to re-
strict the lattice paths during decoding of speech to those permissible under the
FHC theory. [142] assign weights to parallel sentences to build a code-switched
translation model that is used with a language model for decoding code-switched
Mandarin-English speech.
[143] show that a training curriculum where a Recurrent Neural Network
(RNN) LM is trained first on interleaved monolingual data in both languages,
followed by code-switched data, gives the best results for an English-Spanish LM.
[100] extend this work by using grammatical models of code-switching to gener-
ate artificial code-switched data and using a small amount of real code-switched
data to sample from the artificially generated data to build Language Models.
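As a rough illustration of synthetic code-switched data generation, the sketch below switches aligned words of a parallel sentence pair at random points. This is essentially the unconstrained baseline that the grammatical models used in [100] improve on by restricting where switches may occur; the example sentences and word alignment are made up.

```python
import random

def synth_code_switch(src_words, tgt_words, alignment, p_switch=0.3, seed=0):
    """Very simplified synthetic code-switching: replace aligned source words
    with their target-language translations at random points. The grammatical
    models of [100] impose real constraints on switch points; here they are
    chosen at random purely for illustration.

    alignment: dict mapping source positions to target positions (1-to-1 here).
    """
    rng = random.Random(seed)
    out = []
    for i, w in enumerate(src_words):
        if i in alignment and rng.random() < p_switch:
            out.append(tgt_words[alignment[i]])
        else:
            out.append(w)
    return out

# Toy Hindi-English example with a hypothetical word alignment.
hi = "mujhe yah film bahut pasand aayi".split()
en = "i liked this film very much".split()
align = {1: 2, 2: 3, 3: 4}          # yah->this, film->film, bahut->very
print(" ".join(synth_code_switch(hi, en, align)))
```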
[144] uses Factored Language Models for rescoring n-best lists during ASR
decoding. The factors used include POS tags, code-switching point probability
and LID. In [145], [146] and [147], RNNLMs are combined with n-gram based
models, or converted to backoff models, giving improvements in perplexity and
mixed error rate.
[148, 149] investigate the importance of syntactic information such as
Part-of-Speech (POS) tags in predicting the switching point. They observe that
the switching attitude is speaker-dependent [150].
[151] synthesize isiZulu-English bigrams using word embeddings and use
them to augment training data for LMs, which leads to a reduction in per-
plexity when tested on a corpus of soap opera speech.
In [152] the authors employ dual RNNs for language model training, while [153,
154, 155] investigate the applicability of artificially generated code-mixed
data for data augmentation.
In [156], the authors show that encoding language information improves the
language model by learning code-switch points. In [157] the authors present a
discriminative-training-based approach for modeling code-mixed text.
Alternatively, the authors of [158] propose to manipulate n-gram based language
models by employing clustering for infrequent words. In [159] the authors
present a multi-task learning approach that jointly learns language modeling
and POS tagging.
[160] use a bilingual attention language model that learns cross-lingual prob-
abilities by using parallel data simultaneously along with the language modeling
objective and achieves high reductions in perplexity over the SEAME corpus.
In [161] the authors show that humans exploit prosodic cues to detect
code-mixing. They also show that humans can anticipate switch points even in
noisy speech.
As mentioned earlier, some ASR systems first try to detect the language
being spoken and then use the appropriate model to decode speech. In case
of intra-sentential switching, it may be useful to be able to detect the code-
switching style of a particular utterance, and be able to adapt to that style
through specialized language models or other adaptation techniques.
[162] look at the problem of language detection from code-switched speech
and classify code-switched corpora by code-switching style and show that fea-
tures extracted from acoustics alone can distinguish between different kinds of
code-switching in a single language.
In [163, 164] the authors investigate the effectiveness of using retrained
multilingual DNNs and data augmentation for detecting the language. In
[165, 166] the authors employ word-based lexical information, while [167] build
an HMM-based acoustic model followed by an SVM-based decision classifier to
identify code-mixing between Northern Sotho and English.
4.4. Speech Synthesis
Most Text to Speech (TTS) systems assume that the input is in a single
language and that it is written in native script. However, due to the rise in
globalization, phenomena such as code-switching are now seen in various types
of text ranging from news articles through comments/posts on social media,
leading to co-existence of multiple languages in the same sentence. Incidentally,
these typically are the scenarios where TTS systems are widely deployed as
speech interfaces and therefore these systems should be able to handle such
input. Even though independent monolingual synthesizers today are of very
high quality, they are not fully capable of effectively handling such mixed content
that they encounter when deployed. These synthesizers in such cases speak out
the wrong/accented version at best or completely leave the words from the other
language out at worst. Considering that the words from other language(s) used
in such contexts are often the most important content in the message, these
systems need to be able to handle this scenario better.
Current approaches handling code-switching fall into three broad categories:
phone mapping, multilingual or polyglot synthesis. In phone mapping, the
phones of the foreign language are substituted with the closest sounding phones
of the primary language, often resulting in strongly accented speech. In a mul-
tilingual setting, each text portion in a different language is synthesised by a
corresponding monolingual TTS system. This typically means that the differ-
ent languages will have different voices unless each of the voices is trained
on the voice of the same multilingual speaker. Even if we have access to bilingual
databases, care needs to be taken to ensure that the recording conditions of
the two databases are very similar. The polyglot solution refers to the case
where a single system is trained using data from a multilingual speaker. Similar
approaches to dealing with code-switching have been focused on assimilation
at the linguistic level, and advocate applying a foreign linguistic model to a
monolingual TTS system. The linguistic model might include text analysis and
normalisation, a G2P module and a mapping between the phone set of the for-
eign language and the primary language of the TTS system [168, 169, 170].
Other approaches utilise cross-language voice conversion techniques [171] and
adaptation on a combination of data from multiple languages [172]. Assimilation
at the linguistic level is fairly successful for phonetically similar languages [170],
and the resulting foreign synthesized speech was found to be more intelligible
compared to an unmodified non-native monolingual system but still retains a
degree of accent of the primary language. This might in part be attributed to
the non-exact correspondence between individual phone sets.
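A minimal sketch of the phone-mapping category, with a hypothetical, hand-written mapping from a few English phones to Hindi phones; a real table would be built from linguistic knowledge or learned from data, and would cover the full phone inventories.

```python
# Hypothetical mapping from a few English (foreign) phones to the closest
# Hindi (primary-language) phones; the symbols below are illustrative only.
EN_TO_HI = {"AE": "ae", "TH": "th", "DH": "d", "W": "v", "Z": "j"}

def map_foreign_pronunciation(en_phones, mapping=EN_TO_HI):
    """Substitute each foreign phone with its closest primary-language phone,
    which is what typically produces the 'accented' rendering of English words."""
    return [mapping.get(p, p.lower()) for p in en_phones]

print(map_foreign_pronunciation(["TH", "AE", "NG", "K", "S"]))  # e.g. "thanks"
```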
[173] find from subjective experiments that listeners have a strong preference
for cross-lingual systems with Hindi as the target language. However, in prac-
tice, this method results in a strong foreign accent while synthesizing the English
words. [174, 175] propose a method to use a word to phone mapping instead,
where an English word is statistically mapped to Indian language phones.
[176] train speech synthesizers for Hindi-English, Tamil-English and Hindi-
Tamil by randomizing the order of bilingual training data which are then used
to synthesize monolingual and code-switched text. This leads to improvements
in subjective metrics for the code-switched speech and marginal degradation in
monolingual speech.
[177] present an end-to-end code-switched TTS for Mandarin English, in
which they use bilingual data with a shared encoder that contains language
information and separate decoders. [178] extend this approach to use a bilingual
phonetic posteriorgram (PPG) to synthesize code-switched speech using only
monolingual data. [179] also use a language-specific encoder along with a
multi-head attention mechanism in the decoder, resulting in large improvements
on the SEAME corpus.
The task of lexical level language identification (LID) is one of the skeletal
tasks for the lexical level modeling of downstream NLP tasks. Most research
has focused on word-level LID, although some work on utterance-level LID
also exists. [180] build tools for web-scale analysis of code-switching, using
an utterance-level language identification system based on the language ratio of
the two languages involved. A large amount of research in this area has been
conducted due to shared tasks on word-level LID ([68], [69]).
Social media data, especially posts from Facebook, was used to collect data
for the task of LID of Bengali, Hindi and English code-switching [70]. Techniques
include dictionary based lookup, supervised techniques applied at word level
along with ablation studies of contextual cues and CRF based sequence labeling
approaches. Character level n-gram features and contextual information are
found to be useful as features.
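A minimal sketch of word-level LID with character n-gram features, using a tiny made-up training set and a plain classifier in place of the contextual and CRF-based sequence models used in the work above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training set of (romanized) words with language labels; real systems are
# trained on thousands of annotated tokens and add contextual features or a
# CRF layer on top of per-word predictions.
words = ["yaar", "kya", "bahut", "movie", "awesome", "tha", "weekend", "dekha"]
labels = ["hi", "hi", "hi", "en", "en", "hi", "en", "hi"]

clf = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),  # character n-grams
    LogisticRegression(max_iter=1000),
)
clf.fit(words, labels)
print(list(zip("kya awesome movie dekha yaar".split(),
               clf.predict("kya awesome movie dekha yaar".split()))))
```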
[181] is among the first computational approaches towards determining intra-
word switching by segmenting the words into smaller meaningful units through
morphological segmentation and then performing language identification prob-
abilistically. This was followed by intra-word approaches [182, 183] and ap-
proaches that incorporate information beyond word level [184, 185, 186, 187,
188, 189, 190, 191, 192, 193]. In addition to features, model based variants have
been proposed by [194, 195, 196, 197, 198].
[199] make use of patterns in language usage of Hinglish along with the con-
secutive POS tags for LID. [71] have also experimented with n-gram modeling
with pruning and SVM-based models with feature ablations for Hindi-English and
Bengali-English LID. [72] have worked on re-defining and re-annotating language
tags from social media cues based on cultural, core and therapeutic borrowings.
[73] have introduced a socially equitable LID system known as EQUILID by
explicitly modeling switching with character level sequence to sequence models
to encompass dialectal variability in addition to code-switching. [200] present a
weakly supervised approach with a CRF based on a generalization expectation
criteria that outperformed HMM, Maximum Entropy and Naive Bayes methods
by considering this a sequence labeling task.
Recently, POS tagging has also been examined as a means to perform lan-
guage identification in code-switched scenarios [201]. To this end, they have col-
lected a Devanagari corpus and annotated it with POS tags followed by translit-
erating it into Roman text. The complementary English data is annotated with
POS tags as well. Several classical approaches including SVM, Decision Trees,
Logistic Regression and Random Forests have been experimented with. The
feature set that included POS tags along with the word length and the word
itself with a random forest resulted in the highest performance. Hence mono-
lingual data with corresponding POS tags seem useful in performing language
identification of code-switched text.
tity Recognition in Arabish from three different sources: Twitter, transcribed
conversational speech and translating a standard NER dataset. The dataset
comprises 6k sentences with 130k tokens. The baseline model is a BiLSTM-CRF,
one of the most heavily investigated architectures for NER. On top of this,
they adopt the FLAIR framework [209] to investigate different types of
embeddings along with pooled datasets. They also experiment with both
traditional word embeddings and more recent contextual embeddings in their
architecture and find that a combination of the two performs better. [210]
extend the LSTM architecture to combat the high percentage of out-of-vocabulary
words in code-switched data using transfer learning with bilingual character
representations. Additionally, they also remove noise by normalizing spellings.
As an alternative to the fusion approach seen above, [211] utilize a
self-attention mechanism over character-based embeddings. The final embedding
representation is obtained by feeding these word- and character-based
embeddings through a stacked BiLSTM with residual connections. Inspired by this,
[212] proposed multilingual meta embeddings that extend the scope to other
related and similar languages. They circumvent the problem of lexical level
language identification using the same self attention mechanism on pre-trained
word embeddings. [213] propose the use of hierarchical meta-embeddings that
combine word and sub-word level embeddings to achieve SOTA performance on
English-Spanish NER.
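As a rough sketch of the meta-embedding idea, the code below combines the embeddings of one word from several (already projected) embedding spaces with attention weights. The dimensions, random vectors and attention parameter are stand-ins for trained components, and the projection and training steps used in [212, 213] are omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def meta_embedding(vectors, w_att):
    """Combine per-language embeddings of one word into a single vector with
    attention weights (w_att is a random stand-in for a trained parameter).
    vectors: (n_spaces, dim) after projection to a common dimensionality."""
    scores = vectors @ w_att          # one scalar score per embedding space
    alphas = softmax(scores)          # attention over embedding spaces
    return alphas @ vectors           # weighted sum -> (dim,)

dim, rng = 8, np.random.default_rng(1)
english_vec, spanish_vec = rng.normal(size=dim), rng.normal(size=dim)
w = rng.normal(size=dim)
print(meta_embedding(np.stack([english_vec, spanish_vec]), w).shape)  # (8,)
```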
tion for Turkish-German tweets that align with existing language identification
based on POS tags from Universal Dependencies. [80] explored the exploita-
tion of monolingual resources such as taggers (for Spanish and English data)
and heuristic based approaches in conjunction with machine learning techniques
such as SVM, Logit Boost, Naive Bayes and J48. This work shows that many
errors occur in the presence of intra-sentential switching thus establishing the
complexity of the task.
[81] have also gathered data from social media platforms such as Facebook
and Twitter and have annotated it at coarse- and fine-grained levels. They
focus on comparing language-specific taggers with ML-based approaches including
CRFs, Sequential Minimal Optimization, Naive Bayes and Random Forests, and
observe that Random Forests perform the best, although only marginally better
than combinations of individual language taggers. [83] use crowd-sourcing for
annotating universal POS labels for Spanish-English speech data by splitting
the task into three subtasks: (1) labeling a subset of tokens automatically,
(2) disambiguating a subset of high-frequency words, and (3) crowd-sourcing
tags through questions organized as a decision tree. The choice of tagging mode
is based on a curated list of words.
[214] use a stacked modeling technique and compare it to joint modeling
and pipeline-based techniques, finding that the best stacked model, which
utilizes all features, outperforms the joint and pipeline-based models.
[215] carry out normalization of code-switched data and assess the impact
on POS tagging as a downstream task. They find that automatic normalization
leads to a performance gain in POS tagging.
4.8. Parsing
syntactic properties of the subcategorized elements irrespective of the languages
to which these words belong.
[86] leveraged a non-linear neural approach for predicting arc-eager parser
transitions, using only monolingual annotated data and including lexical
features from pre-trained word representations. [87] also worked on a pipeline
and annotated data for shallow parsing, framing it as three individual sequence
labeling tasks based on labels, boundaries and their combination, with a CRF
trained for each task.
[88] performed multilingual semantic parsing using a transfer learning ap-
proach for code-switched text utilizing cross lingual word embeddings in a se-
quence to sequence framework. [85] compared different systems for dependency
parsing and concluded that the Multilingual BIST parser is able to parse code-
switched data relatively well.
[218] present a Universal Dependencies dataset in Hindi-English and a neural
stacking model for parsing with a new decoding scheme that outperforms prior
approaches.
So far, we have seen individual speech and NLP applications which can be
used as part of other downstream applications. One very impactful downstream
application, in which casual and free mixing beyond mere borrowing meets a real
information need, is Question Answering (QA). This is especially important in the
domains of health and technology where there is a rapid change in vocabulary
thereby resulting in rapid variations of usage with mixed languages. One of the
initial efforts in eliciting code-mixed data to perform question classification was
undertaken by [89]. This work leveraged monolingual English questions from
websites for school-level science and maths, and from the Indian version of
the show ‘Who Wants to Be a Millionaire?’. Crowd-workers are asked to translate
these questions into mixed language, in terms of how they would frame the
question to a friend next to them.
Lexical-level language identification, transliteration, translation and
adjacency features are used to build an SVM-based Question Classification model
for data annotated with the coarse-grained ontology proposed by [219]. Although
this mode of data collection has the advantage of gathering a parallel corpus
of English questions with their corresponding code-switched questions, there is
a possibility of lexical bias due to entrainment. In order to combat this, [90]
discussed techniques to crowd-source code-mixed questions from two kinds of
sources: code-mixed blog articles and certain fulcrum images. They organized
the first edition of the code-mixed question answering challenge, where the
participants used techniques based on a Deep Semantic Similarity model for
retrieval and a pre-trained DrQA model fine-tuned on the training dataset. An
end-to-end web-based QA system, WebShodh, was built and hosted by [220], which
also has the additional advantage of collecting more data.
[92] trained a Triplet Siamese-Hybrid CNN to re-rank candidate answers,
training on the SimpleQuestions dataset in monolingual English as well as on
code-mixed questions loosely translated into English, thereby eliminating the
need to perform full-fledged translation to answer queries. [93] gathered
a QA dataset from Facebook messages for Bengali-English CM domain. In
addition to this line of work, there were efforts for developing a cross-lingual QA
system where questions are asked in one language (English) and the answer is
provided in English but the candidate answers are searched in Hindi newspapers
[221].
[222] presented a query oriented multi-document summarization system for
Telugu-English with a dictionary based approach for cross language query ex-
pansion using bilingual lexical resources. Cross language QA systems are ex-
plored in European languages as well [223], [224].
this work to compare code-switching in monolingual and multilingual settings
[226, 227]. The comparisons made between a multilingual model trained on a
multilingual dataset, separate monolingual models, and a monolingual model
that is triggered based on the language identification demonstrate the effective-
ness of the multilingual model to deal with code-switched scenarios. [228] use
a more classical word-probability based approach to determine the sentiment
of a tweet about a movie. Specifically, this is performed for tweets in Telugu-
English mixed data by transliterating each Roman word to the corresponding
Telugu script and computing the probability of the word in each class. [229]
conducted a shared task for sentiment analysis of social media data in two lan-
guage pairs. The dataset used for the shared task includes around 12k and 2500
tweets released for training in Hindi-English and Bengali-English respectively.
The best performing system of the shared task used word and character level
n-gram features with an SVM classifier. A similar trend is observed by [230]
while comparing the models of Naive Bayes and SVM to perform sentiment
classification on movie reviews in Bengali-English.
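A minimal sketch in the spirit of the word- and character-level n-gram SVM systems mentioned above, with made-up toy tweets standing in for the shared-task data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.svm import LinearSVC

# Toy code-mixed tweets with sentiment labels; the real shared-task data has
# thousands of annotated Hindi-English and Bengali-English tweets.
tweets = [
    "movie bahut achhi thi, loved it",
    "kya bakwas film hai, total waste",
    "awesome gaana yaar",
    "service bohot kharab thi, very disappointed",
]
labels = ["pos", "neg", "pos", "neg"]

features = make_union(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),     # word n-grams
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-grams
)
model = make_pipeline(features, LinearSVC())
model.fit(tweets, labels)
print(model.predict(["kya awesome movie thi"]))
```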
Contrary to this, [231] use CNNs to model sub-word level representations.
These are then given to a dual encoder, which both capture the sentiment at the
sentence level and at the sub-word level. Similarly, [232] approached this using
multitask learning over a CNN based encoder to classify the stance taken on a
popular issue of ‘Demonetization’. The auxiliary task is posed as a manipulation
of the primary task by combining the labels.
Extending this to emotion detection, there have been several attempts to
model this as a graph problem. [233] propose a scheme to annotate a
Chinese-English corpus with emotions. The schema is developed to address the
choice of text in which a sentiment is expressed: this can be in Chinese, in
English, in both languages, or in mixed-language text. [234] gathered a dataset
from Weibo.com with labelled emotions, which are then aligned across the
languages using a statistical machine translation paradigm. They use label
propagation over a bipartite graph constructed from bilingual and sentiment
information. They extended this work to a joint factor graph model [235] over
the two kinds of information, identifying the necessity of correlating
different emotions in addition to sentiment and languages. Along very similar
lines, [236] use belief propagation over factor graphs, which poses the problem
as a dynamic programming approach to querying a graphical model. The graph
itself is constructed as a joint factor graph model utilizing both the
bilingual word information and the emotion-related information.
In [237], the authors employ transfer learning: they first train a CNN-based
model on a large corpus of hateful tweets as the source task, followed by
fine-tuning on a transliterated set in the same language. In [238], the authors
use a combination of psycho-linguistic and basic features and perform model
averaging. In [239], the authors investigate both hierarchical models employing
phonemic units and subword-level models to detect hate speech in code-mixed
data.
[94] present the first work on code-switched NLI, where the task is to predict
whether a hypothesis entails or contradicts the given premise, which is in the
form of a conversation taken from Bollywood (Hindi) movies. They fine-tune
multilingual BERT for the task; however, the accuracy of this model is only
slightly better than chance, showing that NLI is a very challenging problem for
code-switched NLP.
4.14. Dialogue and discourse
While languages that are being mixed may share the same script (such as in
the case of English and Spanish), this is not true for many language pairs that
are frequently code-switched, particularly when the languages are not related
to each other. In such cases, users may choose to use the same script to write
both languages, or use a mixed script. This has implications not only in how
to process mixed languages, but also on how to display them. [248] presents a
study on the interaction between script-mixing and language mixing for Hindi-
English and shows that script choice may be used for emphasis, disambiguation
and marking whether a word is borrowed or not.
Much of the progress in a new field can be shaped by shared tasks in which
common datasets are released and participants compete to build systems for a
specific task. There have been several shared tasks conducted for code-switched
text processing, and a few shared tasks for code-switched speech processing over
the last few years. Shared tasks for code-switched NLP have included Language
Identification [254, 68, 255], transliterated search [256], code-mixed entity ex-
traction [257], mixed script information retrieval [258, 259], POS tagging [260],
Named Entity Recognition [75], Sentiment Analysis [229] and Question An-
swering [90, 91]. There have been fewer shared tasks for code-switched speech
processing; however, the Blizzard Challenge 2014 had a code-switched speech
synthesis task [261], and code-switched ASR challenges have been conducted
for Mandarin English [45, 262]. Recently, a spoken Language Identification
challenge was conducted for inter and intra-utterance LID [66] in three code-
switched language pairs.
Each of these shared tasks has spurred research in its respective sub-area
of code-switched speech and NLP. However, it is not clear how well these
individual models generalize across different tasks and language pairs. To
address this gap, benchmarks for evaluating code-switching across different NLP
tasks have been proposed.
The GLUECoS benchmark [263] consists of 11 datasets spanning different
tasks for code-switching across two language pairs, Spanish-English and Hindi-
English, including a new task for code-switching, Natural Language Inference
(NLI). The GLUECoS benchmark aims to add more tasks to evaluate the general
language understanding capabilities of models, including tasks such as Question
Answering, Natural Language Generation and Summarization, Machine Trans-
lation and NLI. The LINCE benchmark [264] consists of 10 datasets across
5 language pairs. The tasks include LID, NER, POS tagging and Sentiment
Analysis.
Evaluations conducted on the benchmarks described above indicate that
massively multilingual contextual language models such as multilingual BERT
[101] outperform cross-lingual models and other task-specific models. These
models can be further improved by adding synthetic code-switched data to pre-
training, as shown in [263]. While models on some word level tasks such as
Language Identification and Named Entity Recognition reach high accuracy,
the performance on harder tasks like Sentiment Analysis, Question Answering
and NLI is much worse and there is a large gap between the performance of
models on monolingual tasks compared to code-switched tasks. This indicates
that massive multilingual models do not perform as well on code-switching as
they do on monolingual or even cross-lingual tasks. However, pre-training or
fine-tuning such models on synthetic code-switched data in the absence of real
code-switched data seems to be a promising future direction.
code-switching than others. Thus we are unlikely to encounter programming
languages that use code-switching, but we are much more likely to encounter
code-switching in sentiment analysis. Likewise, analysis of parliamentary
transcripts is more likely to involve monolingual language, while
code-switching is much more likely in social media. Of course it is not just
the forum that affects the distribution; the topic too may be a factor.
These factors governing the use of code-switching should influence how we
approach the development of code-switched models. Although it may be possible
to build end-to-end systems where large amounts of code-switching data are
available, in well-defined task environments, such models will not have the
generalizations we need to
cover the whole space. For models that can make use of large amounts of un-
labeled data for training, generating synthetic code-switched data may be a
promising direction. However, most models of code-switched data generation
rely on syntactic constraints and do not take into account sociolinguistic factors
that affect code-switched language. Building models that are capable of incor-
porating these factors could lead to more realistic data generation, which could
lead to better models for code-switched speech and NLP.
It is not yet clear from the NLP point of view whether code-switching analysis
should be treated primarily as a translation problem, or as a new language in
itself. It is, however, likely that, as with many techniques in low-resource
language processing, exploiting resources from nearby languages will be
advantageous. It is common (though not universal) that one language involved in
code-switching has significant resources (e.g. English, Putonghua, Modern Stan-
dard Arabic). Thus transfer learning approaches are likely to offer short term
advantages. Also given the advancement of language technologies, particularly
due to the rise of massively multilingual models, developing techniques that can
work over multiple pairs of code-switched languages may lead to faster develop-
ment and generalization of the field.
Evaluating code-switched speech and NLP is challenging due to the lack
of standardized datasets. Although initial attempts at creating benchmarks
have been made, a comprehensive evaluation of code-switched systems across
speech and NLP tasks in many typologically different language pairs is required.
Such evaluation benchmarks are even more important due to the prevalence of
multilingual models that perform zero-shot cross-lingual transfer well, and are
also expected to perform well on code-switched languages.
Speech and language technology for code-switching is not yet a mature field.
The works referenced in this article are for the most part the beginnings of
analysis: they investigate the raw tools that are necessary for the development
of full systems. While there has been a lot of work on individual Speech and
NLP systems for code-switching, there are no end-to-end systems that can
interact in code-switched language with multilingual humans. Specifically, we
are not yet seeing full end-to-end digital assistants for code-switched
interaction, sentiment analysis for code-switched reviews, or grammar and
spelling correction for code-switched text. This is partly due to
lack of data for such end-to-end systems, however, a code-switching intelligent
agent has to be more than just the sum of parts that can handle code-switching.
To build effective systems that can code-switch, we will also have to leverage the
work done in sociolinguistics to understand how, when and why to code-switch.