Are Current Task-oriented Dialogue Systems Able to Satisfy Impolite

Are Current Task-oriented Dialogue Systems Able

to Satisfy Impolite Users?

Zhiqiang Hu

∗

, Roy Ka-Wei Lee

∗

, Nancy F. Chen

†

∗

Singapore University of Technology and Design, Singapore

zhiqiang [email protected], roy [email protected]

†

Institute of Infocomm Research (I2R), A*STAR, Singapore

nfychen@i2r.a-star.edu.sg

Abstract—Task-oriented dialogue (TOD) systems have assisted

users on many tasks, including ticket booking and service

inquiries. While existing TOD systems have shown promising per-

formance in serving customer needs, these systems mostly assume

that users would interact with the dialogue agent politely. This

assumption is unrealistic as impatient or frustrated customers

may also interact with TOD systems impolitely. This paper aims

to address this research gap by investigating impolite users’

effects on TOD systems. Speciﬁcally, we constructed an impolite

dialogue corpus and conducted extensive experiments to evaluate

the state-of-the-art TOD systems on our impolite dialogue corpus.

Our experimental results show that existing TOD systems are

unable to handle impolite user utterances. We also present

a data augmentation method to improve TOD performance

in impolite dialogues. Nevertheless, handling impolite dialogues

remains a very challenging research task. We hope by releasing

the impolite dialogue corpus and establishing the benchmark

evaluations, more researchers are encouraged to investigate this

new challenging research task.

Index Terms—Task-oriented dialogue systems, impolite users,

data augmentation.

I. INTRODUCTION

Motivation. Task-oriented dialogue (TOD) systems play a

vital role in many businesses and service operations. Specif-

ically, these systems are deployed to assist users with spe-

ciﬁc tasks such as ticket booking and restaurant reservations

through natural language conversations. TOD systems are

usually built through a pipeline architecture that consists of

four sequential modules, including natural language under-

standing (NLU), dialogue state tracking (DST), policy learning

(POL), and natural language generation (NLG) [1]–[4]. More

recently, researchers have also explored leveraging large pre-

trained language models to improve the performance of TOD

systems [5]–[7]. These TOD systems have demonstrated their

effectiveness in understanding and responding to the users’

needs through conversations.

As most TOD systems are developed to serve and assist

humans in performing speciﬁc tasks, the politeness of the

TD systems remains a key design consideration. For instance,

Gupta et al. [8] presented POLLy (Politeness for Language

Learning), a system that combines a spoken language genera-

tor with an AI Planner to model Brown and Levinson’s theory

of politeness in TOD. Bothe et al. [9] developed a dialogue-

based navigation approach incorporating politeness and so-

ciolinguistic features for robotic behavioral modeling. More

User

Bot

(PPTOD)

Impolite

User

Bot

(PPTOD)

What is the phone number and address?

The phone number for the Portuguese

restaurant is 01223 361355 and the address is

Cambridge Leisure Park Clifton Way.

phone number and address? Do I really have to

chase you for every single detail?

I'm sorry, I don't.

Fig. 1. Examples of two dialogue interactions between PPTOD and two types

of users: normal user (top) and impolite user (bottom).

recently, Mishra et al. [10] proposed a politeness adaptive

dialogue system (PADS) that can interact with users politely

and showcases empathy.

Nevertheless, the above studies have focused on generating

polite dialogues and ignored the users’ politeness (or impolite-

ness) in the conversation. Therefore it is unclear how the TOD

systems would respond when the users interact with the TOD

systems in an impolite manner, especially when the users are

in a rush to get information or frustrated when TOD systems

provide irrelevant responses. Consider the example in Figure 1,

we notice that the TOD system, PPOTD [11], is able to provide

a proper response to a user who presents the question in

a normal or polite manner. However, when encountering an

impolite user, PPOTD is not able to provide a proper and

relevant response. Ideally, the TOD systems should be robust

in handling user requests regardless of their politeness.

A straightforward approach to improving TOD systems’

ability to handle impolite users is to train the dialogue systems

with impolite user utterances. Unfortunately, most of the

existing TOD datasets [12]–[15] only capture user utterances

that are neural or polite. The lack of an impolite dialogue

dataset also limits the evaluation of TOD systems; to the

best of our knowledge, there are no existing studies on the

robustness of TOD systems in handling problematic users.

Research Objectives. To address the research gaps, we aim

to investigate the effects of impolite user utterances on TOD

systems. Working towards this goal, we collect and annotate

an impolite dialogue corpus by manually rewriting the user

utterances of the MultiWOZ 2.2 dataset [13]. Speciﬁcally,

arXiv:2210.12942v1 [cs.CL] 24 Oct 2022

human annotators are recruited to rewrite the user utterances

with role-playing scenarios that could encourage impolite user

utterances. For example, “imagine you are in a rush and

frustrated that the systems have given the wrong response for

the second time.”. In total, the human annotators have rewritten

over 10K impolite user utterances. Statistical and linguistic

analyses of the impolite user utterances are also performed to

understand the constructed dataset better.

The impolite dialogue corpus is subsequently used to eval-

uate the performance and limitations of state-of-the-art TOD

systems. Speciﬁcally, we have designed experiments to evalu-

ate the robustness of TOD systems in handling impolite users

and understand the effects of impolite user utterances on these

systems. We have also explored solutions to improve TOD

systems’ robustness in handling impolite users. A possible

solution is to train the TOD systems with more data. However,

the construction of a large-scale impolite dialogue corpus is a

laborious and expensive process. Therefore, we propose a data

augmentation method that utilizes text style transfer techniques

to improve TOD systems’ performance in impolite dialogues.

Contributions. We summarize our contributions as follows:

• We collect and annotate an impolite dialogue corpus

to support the evaluation of TOD systems performance

when handling impolite users. We hope that the impolite

dialogue dataset will encourage researchers to propose

TOD systems that are robust in handling users’ requests.

• We evaluate the performance of six state-of-the-art TOD

systems using our impolite dialogue corpus. The eval-

uation results showed that existing TOD systems have

difﬁculty handling impolite users’ requests.

• We propose a simple data augmentation method that

utilizes text style transfer techniques to improve TOD

systems’ performance in impolite dialogues.

II. RELATED WORK

1) Task Oriented Dialogue Systems: With the rapid ad-

vancement of deep learning techniques, TOD systems have

shown promising performance in handling user requests and

interactions. Recent studies have focused on end-to-end TOD

systems to train a general mapping from user utterance to the

system’s natural language response [16]–[20]. Yang et al. [5]

proposed UBAR by ﬁne-tuning the large pre-trained unidirec-

tional language model GPT-2 [21] on the entire dialog session

sequence. The dialogue session consists of user utterances,

belief states, database results, system actions, and system re-

sponses of every dialog turn. Su et al. [11] proposed PPTOD to

effectively leverage pre-trained language models with a multi-

task pre-training strategy that increases the model’s ability

with heterogeneous dialogue corpora. Lin et al. [7] proposed

Minimalist Transfer Learning (MinTL) to plug-and-play large-

scale pre-trained models for domain transfer in dialogue task

completion. Zang et al. [13] proposed the LABES model,

which treated the dialogue states as discrete latent variables to

reduce the reliance on turn-level DST labels. Kulh

anek et al.

[22] proposed AuGPT with modiﬁed training objectives for

language model ﬁne-tuning and data augmentation via back-

translation [23] to increase the diversity of the training data.

Existing studies have also leveraged knowledge bases to track

pivotal and critical information required in generating TOD

system agent’s responses [24]–[27]. For instance, Madotto

et al. [28] dynamically updated a knowledge base via ﬁne-

tuning by directly embedding it into the model parameters.

Other studies have also explored reinforcement learning to

build TOD systems [29]–[31]. For instance, Zhao et al. [29]

utilized a Deep Recurrent Q-Networks (DRQN) for building

TOD systems.

2) Modeling Politeness in Dialogue Systems: Recent stud-

ies have also attempted to improve dialogue systems to gen-

erate responses in a more empathetic manner [32]–[35]. Yu et

al. [33] proposed to include user sentiment obtained through

multimodal information (acoustic, dialogic, and textual) in

the end-to-end learning framework to make TOD systems

more user-adaptive and effective. Feng et al. [34] constructed

a corpus containing task-oriented dialogues with emotion

labels for emotion recognition in TOD systems. However, the

impoliteness of users is not modeled as too few instances

exist in the MultiWOZ dataset. The lack of impolite dialogue

data motivates us to construct an impolite dialogue corpus to

facilitate downstream analysis.

Politeness is a human virtue and a crucial aspect of com-

munication [36], [37]. Danescu-Niculescu-Mizil et al. [37]

proposed a computational framework to identify the linguistic

aspects of politeness with application to social factors. Re-

searchers have also attempted to model and include politeness

in TOD systems [38]–[40]. For instance, Golchha et al. [38]

utilized a reinforced pointer generator network to transform

a generic response into a polite response. More recently,

Madaan et al. [40] adopted a text style transfer approach

to generate polite sentences while preserving the intended

content. Nevertheless, most of these studies have focused

on generating polite responses, neglecting the handling of

impolite inputs, i.e., impolite user utterances. This study aims

to ﬁll this research gap by extensively evaluating state-of-the-

art TOD systems’ ability to handle impolite dialogues.

3) Data Augmentation in Dialogue Systems: Data augmen-

tation, which aims to enlarge training data size in machine

learning systems, is a common solution to the data scarcity

problem. Data augmentation methods has also been widely

used in dialogue systems [22], [41], [42]. For instance, Kurata

et al. [43] trained an encoder-decoder to reconstruct the utter-

ances in training data. To augment training data, the encoder’s

output hidden states are perturbed randomly to yield different

utterances. Hou et al. [41] proposed a sequence-to-sequence

generation-based data augmentation framework that models

relations between utterances of the same semantic frame in

the training data. Gritta et al. [42] proposed the Conversation

Graph (ConvGraph), which is a graph-based representation of

dialogues, to augment data volume and diversity by generating

dialogue paths. In this paper, we propose a simple data aug-

mentation method that utilizes text style transfer techniques to

generate impolite user utterances for training data to improve

TOD systems’ performance in dealing with impolite users.

III. IMPOLITE DIALOGUE CORPUS

We construct an impolite dialogue dataset to support our

evaluation of TOD systems’ ability to interpret and respond

to impolite users. Speciﬁcally, we recruited eight native En-

glish speakers to rewrite the user utterances in MultiWOZ

2.2 dataset [13] in an impolite manner. To the best of our

knowledge, this is the ﬁrst impolite task-oriented dialogue

corpus. In the subsequent sections, we will discuss the corpus

construction process and provide a preliminary analysis of the

constructed impolite dialogue corpus.

A. Corpus Construction

MultiWOZ 2.2 [13] is a large-scale multi-domain task-

oriented dialogue benchmark that contains dialogues in seven

domains, including attraction, hotel, hospital, bus, restaurant,

train, and taxi. This dataset is also popular and commonly

used to evaluate existing TOD systems [5], [7], [11], [17],

[18], [22]. We performed a preliminary analysis using the

Stanford Politeness classiﬁer trained on Wikipedia requests

data [37] to assign a politeness score to the user utterances in

MultiWOZ 2.2. We found that 99% of the user utterances are

classiﬁed as polite. Hence, we aim to rewrite the user utterance

in MultiWOZ 2.2 to create our impolite dialogue corpus.

Impolite Rewriting. The goal is to rewrite the user ut-

terance in the MultiWOZ 2.2 dataset and present the user

utterance in a rude and impolite manner. We randomly sampled

a subset of dialogues from MultiWOZ 2.2 for rewriting.

Next, we recruited eight native English speakers to rewrite

the user utterances. For each user utterance, the annotators

are presented with the entire dialogue history to have the

conversation’s overall context. The annotators are tasked to

rewrite the user utterances with three objectives: (i) the

rewritten sentences should be impolite, (ii) the content of

the rewritten sentence should be semantically close to the

original sentence, and (iii) the rewritten sentences should be

ﬂuent. To further encourage the diversity of the impolite user

utterance, we also prescribed six role-playing scenarios to aid

the annotators in the rewriting tasks. For instance, we asked

the annotators to imagine they were customers in a bad mood

or impatient customers who wanted to get the information

quickly. The details of the role-playing scenarios are shown

in Table I, and the annotation system interface is included in

the Appendix A-A.

Annotation Quality Control. Impoliteness is subjective,

and the annotators may have different interpretations of im-

politeness. Therefore, we implement iterative checkpoints to

evaluate the quality of the rewritten user utterance. Speciﬁ-

cally, we conducted peer evaluation at various checkpoints to

allow annotators to rate the quality of each other’s rewritten

sentences. The annotators are tasked to rate the rewritten user

utterance based on the following three criteria:

• Politeness. Rate the sentence politeness using a 5-point

Likert scale. 1: strongly opined that the sentence is

impolite, and 5: strongly opined that the sentence is

polite.

• Content Preservation. Compare the original user utter-

ance and rewrite sentence and rate the amount of content

TABLE I

Role-playing scenarios for impolite user annotation.

No. Scenario

1 Imagine that the customer is a sarcastic person in

a bad mood.

2 Imagine that the customer is an impatient customer

that wants to get the information fast.

3 Imagine that the customer is in a bad mood as

something bad has just happened (e.g., just had an

argument with friends or spouse).

4 Imagine that the customer is tired and hungry after

a long-haul ﬂight and need to get this information

fast.

5 Imagine that the customer is a spoilt-brat with a

lot of money.

6 Imagine that the customer is getting help from

CSA for the third time and they didn’t get the

previous information right.

preserved in the rewrite sentence using a 5-point Likert

scale. 1: the original and rewritten sentences have very

different content, and 5: the original and rewritten sen-

tences have the same content.

• Fluency. Rate the ﬂuency of rewritten sentences using

a 5-point Likert scale. 1: unreadable with too many

grammatical errors, 5: perfectly ﬂuent sentence.

Each user utterance is evaluated by two annotators. Never-

theless, we recognize that it is unnatural for all utterances in a

dialogue to be impolite. Thus, we would consider a dialogue

impolite if 50% of the user utterances in a conversation are

rated as impolite (i.e., with Politeness score 2 or less). This

exercise allows the annotators to align their understanding of

the rewriting task. The annotators will revise the unqualiﬁed

dialogues until they are rated impolite in the peer evaluation.

While the annotators are tasked to assess each other’s work,

they are unaware of their evaluation scores to mitigate any

biases. Therefore, the annotators might learn new ways to write

impolite dialogues from each other, but they are not writing

to “optimize” any assessment scores in the human evaluation.

B. Corpus Analysis

In total, the annotators rewrote 1,573 dialogues, comprising

10,667 user utterances. Table II shows the distributions of the

MultiWOZ 2.2 dataset and our impolite dialogue corpus. As

we have sampled a substantial number of dialogues from Mul-

tiWOZ 2.20, we notice that the rewritten impolite dialogues

follow similar domain distributions as the original dataset.

Table III shows the results of the ﬁnal peer evaluation

of all rewritten impolite user utterances. Speciﬁcally, the

average politeness, content preservation, and ﬂuency scores of

the rewritten impolite user utterances are reported. The high

average content preservation and ﬂuency scores suggest that

the high-quality rewritten utterances retain the original user’s

intention in the conversations. More importantly, the average

politeness score is 1.96, indicating that most of the rewrites

are impolite but not too “offensive”. We further examine and

show the politeness score distribution of the rewritten impolite

dialogue in Figure 2. Note that the politeness score of dialogue

TABLE II

Domain distribution of MultiWOZ 2.2 and our annotated impolite dialogue corpus. Numbers in () represent the percentage of dialogues in each domain.

Data Restaurant Attraction Hotel Taxi Train&Bus Hospital

MultiWOZ 2.2 3836 (45.5%) 2681 (31.8%) 3369 (39.9%) 1463 (17.3%) 2969 (35.2%) 107 (1.3%)

Impolite Dialogue Corpus 694 (44.0%) 545 (28.8%) 630 (40.0%) 295 (18.7%) 528 (33.5%) 107 (6.8%)

TABLE III

Peer evaluation results.

Metric Avg. Score

Politeness 1.96

Content Preservation 4.68

Fluency 4.67

Fig. 2. Distribution of dialogues binned according to the politeness scores.

is obtained by averaging the politeness scores of the rewritten

user utterances in the given dialogue. We observe that most

rewritten dialogues are impolite, having politeness scores of

less than 2.5.

We also empirically examine the top 20 keywords in the

original and rewritten dialogues (shown in Table IV. We notice

that in the original dialogues, users tend to express gratitude

by using terms such as “thank” and adopt courtesy terms

such “please”, “help”. Users in original dialogues also posted

their requests as questions using terms such as “would you”.

Conversely, in the rewritten dialogues, gratitude terms are

absent. Users are less courteous and issued direct commands

to the agent using terms such as “ﬁnd”, “get”, “give”. Frequent

terms “hurry” and “fast” also suggested the users’ impatience

in the dialogues.

As we aim to keep the rewritten utterances natural and

realistic, we did not limit the users to using offensive language

(neither did we encourage them). Interestingly, we have also

checked the impolite user utterances and found offensive

languages (e.g., “idiot”, “stupid”, “f*ck”, etc.) being used in

some of the rewritten utterances.

IV. MODELS

We experiment with six state-of-the-art TOD systems and

evaluate their performance in handling user utterances in our

constructed impolite dialogue dataset. These TOD systems are

not designed to respond to impolite users or trained with impo-

TABLE IV

Top 10 keywords in original and rewritten dialogues.

Top 20 keywords in Orig-

inal Dialogues

Top 20 keywords in

Rewritten Dialogues

need, please, thank, yes,

like, looking, would, num-

ber, book, also, restaurant,

thanks, help, train, hotel,

Cambridge, people, ﬁnd,

free, get

ﬁnd, get, give, time, need,

book, want, number, go,

hurry, ok, fast, one, restau-

rant,make, people, job, bet-

ter, train, Cambridge

lite dialogues. Thus, they may not perform well on our corpus.

A simple approach to improve the TOD systems’ performance

is to train them with impolite user utterances. However, there

is inadequate impolite dialogue data to train these systems.

Therefore, we propose a data augmentation strategy to enhance

the TOD systems’ ability to handle impolite users.

A. TOD Systems

Recent studies have proposed TOD systems with promising

performance on the MultiWOZ 2.2 dataset. For our study,

we select six TOD systems that have achieved state-of-the-

art performance on the task-oriented dialogue generation task.

Domain Aware Multi-Decoder (DAMD) [17] utilizes the one-

to-many property that assumes that multiple responses may

be appropriate for the same dialog context to generate diverse

dialogue responses. LABES [18] is a probabilistic dialogue

model with belief states represented as discrete latent variables

and jointly modeled with system responses. MinTL [7] is

a transfer learning framework that allows plug-and-play pre-

trained seq2seq models to jointly learn dialogue state tracking

and dialogue response generation. UBAR [5] ﬁne-tunes GPT-

2 [21] on the entire dialogue session sequence for dialogue

response generation. Similarly, AuGPT [22] ﬁne-tunes GPT-2

with modiﬁed training objectives and enhanced data through

back-translation. PPTOD [11] uniﬁes the TOD task as multi-

ple generation tasks, including intent detection, dialogue state

tracking, and response generation.

B. Data Augmentation with Text Style Transfer

Training TOD systems with impolite dialogues can im-

prove their capabilities in handling impolite users. However,

collecting impolite dialogues is a challenging task as human

annotation is a laborious and expensive process. Therefore, we

explore augmenting TOD systems by generating impolite user

utterances automatically. We formulate the impolite utterance

generation as a supervised text style transfer task [44] (i.e.,

politeness transfer). Speciﬁcally, the text style transfer algo-

rithms will be trained using a parallel dataset; we have pairs

of aligned polite (i.e., original) and the corresponding impolite

(i.e., rewritten) user utterances in our impolite dialogue corpus.

Subsequently, the trained text style transfer algorithms will

be able to take in unseen user utterances from MultiWOZ

2.2 dataset as input and generate impolite user utterances as

output. Finally, the generated impolite user utterances will be

augmented to train TOD systems.

In our implementation, we ﬁrst ﬁne-tune the pre-trained

language model T5-large [45] and BART-large [46] with polite

user utterances as inputs and impolite rewrites as targets. Then,

the ﬁne-tuned T5 and BART model is used to transfer the

politeness for user utterances in the rest of the dialogues

in MultiWOZ 2.2 dataset. We have also included a state-of-

the-art text style transfer model, DAST [47], to perform the

politeness transfer task.

We evaluate the text style transferred user utterance on

three automatic metrics, including politeness, content preser-

vation, and ﬂuency. For politeness, we train a politeness

classiﬁer based on the BERT base model [48] with the original

polite user utterances and corresponding impolite rewrites.

The trained classiﬁer predicts if a given user utterance is

correctly transferred to impolite style, and the accuracy of the

predictions is reported. For content preservation, we compute

the BLEU score [49] between the transferred sentences and

original user utterances. For ﬂuency, we use GPT-2 [21] to

calculate perplexity score (PPL) on the transferred sentences.

Finally, We compute the geometric mean (G-Mean score) of

ACC, BLEU, and 1/PPL to give an overall score of the models’

performance. We take the inverse of the calculated perplexity

score because a lower PPL score corresponds to better ﬂuency.

These evaluation metrics are commonly used in existing text

style transfer studies [50].

Table V shows the automatic evaluation results of the style

transferred user utterances. We observe that both pre-trained

language models have achieved reasonably good performance

on the politeness style transfer task. Speciﬁcally, the ﬁne-tuned

T5-large model achieves the best performance on the G-Mean

score and 85.3%. The transferred user utterances are mostly

impolite (i.e., high accuracy score) and ﬂuent (i.e., low PPL).

The content is also well-presented. Interestingly, we noted

that DAST did not perform well for the politeness transfer

task. A possible reason could be that the DAST is designed

to perform non-parallel text style transfer, and the model did

not exploit the parallel information in the training data [47].

Another reason could be the small training dataset; DAST is

trained on our impolite dialogue corpus, while the T5 and

BART are pre-trained with a larger corpus and ﬁne-tuned on

our impolite dialogue dataset.

The automatically generated impolite user utterances from

the ﬁne-tuned T5-Large are augmented to the original Multi-

WOZ 2.2 dataset to train the TOD systems. We will discuss

the effects of data augmentation in our experiment section.

V. EVALUATION

One of the primary goals of this study is to evaluate the

TOD systems’ ability to handle impolite dialogues. Therefore,

we design automatic and human evaluation experiments to

TABLE V

Automatic evaluation results of text style transfer task.

Model ACC(%) BLEU PPL G-Mean

DAST 73.6 25.4 28.5 4.03

BART-large Fine-tune 96.4 37.6 6.2 8.36

T5-large Fine-tune 85.3 43.4 6.3 8.38

benchmark the performance of TOD systems on our impolite

dialogue corpus.

A. Automatic Evaluation

Over the past decades, many different automatic evaluation

methods [51]–[53] have been proposed to evaluate TOD sys-

tems. The evaluation methodologies are inextricably linked to

the properties of the evaluated dialogue system. For instance,

Nekvinda et al. [54] proposed their standalone standardized

evaluation scripts for the MultiWOZ dataset to eliminate in-

consistencies in data preprocessing and reporting of evaluation

metrics, i.e., BLEU score and Inform & Success rates. For our

study, we employ four automatic evaluation metrics that are

commonly used in existing studies [12], [54] to benchmark

the performance of TOD systems on impolite dialogues:

• Inform: The inform rate is the proportion of dialogues in

which the system mentions a name of an entity that does

not conﬂict with the current dialogue state or the user’s

goal.

• Success: The percentage of dialogues in which the system

provides the correct entity and answers all the requested

information.

• BLEU: The BLEU score between the generated utter-

ances and the ground truth is used to approximate the

output ﬂuency.

• Combined. The combined score is computed through

(Inform+Success)×0.5+BLEU as an overall quality

measure suggested in [55].

B. Human Evaluation

There are relatively lesser studies that performed a human

evaluation on TOD systems [11], [17], [22] as such evaluations

are often expensive and laborious. In our study, we perform a

human evaluation to access the TOD systems on three criteria:

The human evaluation is conducted on a random subset of 100

dialogues with 50 of the original dialogue in the MultiWOZ

2.2 dataset and 50 corresponding dialogues from our impolite

dialogue corpus. We recruit four human evaluators, and at

least two human evaluators evaluate each dialogue. The human

evaluators are tasked to evaluate the generated dialogues on

the following criteria:

• Success: A binary indicator (yes or no) on whether

the dialogue system fulﬁlls the information requirements

dictated by the user’s goals. For instance, this includes

whether the dialogue system has found the correct type

of venue and whether the dialogue system returned all

the requested information.

• Comprehension: A 5-point Likert scale that indicates the

TOD system’s level of comprehension of the user input.

1: the TOD system did not understand the user intention;

5: TOD system understands well the user input.

• Appropriateness: A 5-point Likert scale indicates if the

TOD system’s response is appropriate and human-like. 1:

the TOD system’s response is inappropriate and does not

make sense; 5: TOD system’s response is appropriate and

human-like.

VI. EXPERIMENTS

In this section, we design experiments to evaluate the

state-of-the-art TOD systems’ ability to handle impolite users

using our impolite dialogue corpus. The rest of this section

is organized as follows: We ﬁrst present the details of the

experimental settings. Next, we perform automatic and human

evaluations of the TOD systems and analyze their performance

on our impolite dialogue corpus. Finally, we conduct further

empirical analyses to understand the TOD systems’ challenges

in handling impolite users.

A. Experimental Settings

Reproduce Models. We reproduce six state-of-the-art TOD

systems by training them on the MultiWOZ 2.2 dataset. For

TOD systems that have publicly released checkpoints, such as

AuGPT [22] and PPTOD [11], we directly use the released

checkpoints trained on MultiWOZ 2.2. For DAMD [17],

UBAR [5], MinTL [7], and LABES [18], we use their

published code implementations and optimize their hyperpa-

rameters to reproduce their reported results for MultiWOZ

2.2 dataset

. Table VI shows the reported and reproduced

performance of the TOD systems tested on the MultiWOZ

2.2 dataset. We observe that the reproduced performance

is equivalent to or slightly better than the reported results.

Our subsequent experiment will evaluate the reproduced TOD

systems on the impolite dialogue corpus.

Data Augmentation. To evaluate the effectiveness of our

proposed data augmentation solution, we train the six TOD

systems with MultiWOZ 2.2 dataset augmented with the au-

tomatically generated impolite dialogues. The data augmented

TOD systems will be evaluated against the reproduced models

on the impolite dialogue corpus.

Test Data. As we are interested in evaluating the TOD

systems’ performance on impolite dialogues, we utilized two

test sets to evaluate the reproduced and data augmented

TOD systems. We ﬁrst evaluate the models on original user

utterances, which are polite or neutral user utterances that

we have rewritten to construct the impolite dialogue corpus

discussed in Section III-A. Next, we also evaluate the models

on our impolite dialogue corpus. Intuitively, the original and

impolite tests set have user utterances that discuss similar

content but are different in politeness. Evaluating the models

on the two test sets enables us to understand the TOD systems’

ability to handle user dialogues of different politeness.

B. Automatic Evaluation Results

Table VII shows the automatic evaluation results of the

six state-of-the-art TOD models and our data augmentation

https://github.com/budzianowski/MultiWOZ

TABLE VI

Reported and reproduced results of TOD systems on MultiWOZ 2.2 dataset.

Model Test Data Inform Success BLEU

DAMD reported 76.3 60.4 16.6

DAMD reproduced 81.8 68.8 18.5

LABES reported 78.1 67.1 18.1

LABES [18] reproduced 80.2 67.2 17.6

MinTL reported 80.0 72.7 19.1

MinTL reproduced 81.8 74.0 20.5

UBAR reported 83.4 70.3 17.6

UBAR reproduced 90.0 76.8 13.4

AuGPT reported 83.1 70.1 17.2

AuGPT reproduced 83.1 70.1 17.2

PPTOD reported 83.1 72.7 18.2

PPTOD reproduced 83.4 72.9 19.6

approach. Comparing the performance of TOD models on the

original and impolite test sets, we observe all TOD systems

performed worse on the impolite dialogues. Speciﬁcally, com-

pared to the performance on original dialogues, we notice a

7-15% decrease in the combined score when tested on the

impolite user utterances. The decrease in inform and success

rates suggests that the models have difﬁculty understanding

users’ intentions from impolite user utterances and responding

correctly. The reproduced TOD system with the best perfor-

mance on the impolite test set is UBAR, achieving a combined

score of 83.7. Nevertheless, UBAR still suffers a 9% drop

in the combined score compared to its performance on the

original test set.

It is unsurprising that the reproduced models do not perform

well on the impolite test set as they are trained on the

MultiWOZ 2.2 dataset, which largely contains only polite or

neutral user utterances. To address this limitation, we augment

reproduced models with generated impolite dialogues using

our text style transfer data augmentation approach. We observe

the data augmentation strategy is able to boost the perfor-

mance of all six TOD systems. Speciﬁcally, the proposed data

augmentation approach has achieved the greatest improvement

on PPTOD’s performance, increasing the model’s combined

score to 85.2. Nevertheless, we also noted a gap between

the data augmented models’ performance on impolite user

utterances, and the reproduced models’ performance on the

original test set. Thus, while our data augmentation method

can help improve the TOD systems’ performance on impolite

user dialogues, there is still a gap to bridge before TOD

systems can effectively handle impolite users.

Interestingly, we also observe that the success rates of all

models decrease after the data augmentation, but there is

a signiﬁcant improvement in the BLEU score. The success

metric measures the percentage of dialogues in which the

system provides the correct entity and answers all requested

information. While the data augmentation method may im-

prove the dialogue in generating a response more similar

to the ground truth (i.e., a higher BLEU score), it may not

guarantee that models are able to learn well how to respond

to impolite dialogue with the request information. We have

manually examined the response to impolite user utterances

by the reproduced and data augmented models and found that

TABLE VII

Automatic evaluation results. The numbers in () represent the decrease in the percentage of Combined score on impolite user utterances compared to the

performance on original user utterances. The best performing models on the original and impolite user dialogues are underlined and bold, respectively.

Model Test Data Inform Success BLEU Combined

DAMD Original 78.2 57.6 17.9 85.8

DAMD Impolite 72.8 52.5 16.0 78.7 (↓ 8.3%)

DAMD + Data Augmentation Impolite 71.6 50.3 19.3 80.3

LABES Original 75.4 59.1 18.1 85.4

LABES Impolite 70.1 53.0 16.8 78.4 (↓ 8.2%)

LABES + Data Augmentation Impolite 72.2 49.6 19.3 80.2

MinTL Original 75.9 62.2 20.1 89.2

MinTL Impolite 70.5 54.9 17.4 80.1 (↓ 10.2%)

MinTL + Data Augmentation Impolite 72.8 51.2 21.3 83.3

UBAR Original 85.5 68.3 15.1 92.0

UBAR Impolite 81.1 61.7 12.4 83.7 (↓ 9.0%)

UBAR + Data Augmentation Impolite 80.0 61.0 13.5 84.1

AuGPT Original 75.6 56.6 17.9 84.0

AuGPT Impolite 72.2 51.3 15.9 77.7 (↓ 7.5%)

AuGPT + Data Augmentation Impolite 73.8 48.7 17.5 78.8

PPTOD Original 82.3 66.1 18.9 93.1

PPTOD Impolite 71.1 52.3 16.7 78.4 (↓ 15.8%)

PPTOD + Data Augmentation Impolite 74.8 48.4 23.6 85.2

TABLE VIII

Human evaluation results. The best performing models on the original and impolite user dialogues are underlined and bold, respectively.

Model Test Data Success Comprehension Appropriateness

UBAR Original 80% 4.68 4.43

UBAR Impolite 70% 4.47 4.38

UBAR + Data augmentation Impolite 72% 4.55 4.41

PPTOD Original 78% 4.64 4.45

PPTOD Impolite 64% 4.32 4.41

PPTOD + Data augmentation Impolite 68% 4.52 4.43

the TOD systems have difﬁculties responding to impolite user

utterances with requested information. This also suggests that

handling impolite dialogue is a challenging task that requires

more than simple data augmentation; specialized techniques

may need to be designed to handle impolite users. We hope

that our impolite dialogue corpus would encourage more

researchers to design robust TOD systems that are robust in

handling impolite users’ requests.

C. Human Evaluation Results

Automatic metrics only validate the TOD systems’ per-

formance on one single dimension at a time. In contrast,

human can provide an ultimate holistic evaluation. Therefore

we perform the human evaluation on the top-performing TOD

systems that achieved the best performance in the automatic

evaluation, namely, UBAR and PPTOD. Table VIII shows the

human evaluation results. We observe that UBAR outperforms

PPTOD model with 80% success rate and 4.68 comprehension

score on the original dialogues while 72% success rate and

4.55 comprehension score on the impolite dialogues. The

human evaluation results also concurred with the automatic

evaluation results, suggesting a gap between the models’

performance on impolite and original dialogues.

Similar to the automatic evaluation results, the data aug-

mentation method improves the performance of UBAR and

PPTOD on impolite dialogues. Nevertheless, the TOD systems

still have room for improvement compared to their responses

in the original dialogues. To check for human evaluator biases,

we compute the inter-annotator agreement between the eval-

uators. The Cohen’s kappa coefﬁcients on task success rate,

comprehension, and appropriateness are 0.78, 0.76, and 0.66,

respectively, suggesting that the evaluators have substantially

high agreement on the three human evaluation metrics.

D. Case Studies and Analysis

Besides evaluating the state-of-the-art TOD systems’ per-

formance on impolite dialogues, we also empirically analyze

the issues of handling impolite users. Speciﬁcally, we aim

to investigate the question: “What makes impolite dialogue

challenging for TOD systems?”. Working towards this goal, we

empirically examine case studies of TOD systems’ responses

to impolite dialogues. We will also analyze the correlation

between automatic performance metrics and the statistical

attributes of impolite dialogues.

Case studies. Table IX shows an example in one turn of

a conversation and the delexcalized responses generated by

the TOD systems. We noted that the rewritten impolite user

utterance is longer than the original. We also include the

ground-truth agent response from the MultiWOZ 2.2 dataset.

From the TOD systems’ responses, we observed that most

models could not respond effectively to the impolite user;

most of the TOD systems are unable to understand the user’s

TABLE IX

An example of a conversation and the delexicalized responses generated by the TOD systems.

Original User: Pick me up from the hotel.

Impolite User: From the hotel. Where else? Don’t waste my time, OK?

Groundtruth: Your taxi has been booked. It will be a [value car] and the contact number is [value phone].

PPTOD: I am sorry, but I am not ﬁnding anything else for you.

UBAR: You are welcome. Enjoy your stay. Goodbye.

MinTL: OK. Is there anything else I can help you with?

LABES: There seems to be an error. Can you please restate your request?

AuGPT: I am unable to book that for you. Would you like to try another day and time?

DAMD: I am sorry, I do not know.

PPTOD+DA: I have booked you a [value car]. The contact number is [value phone].

UBAR+DA: i have booked you a [value car], contact number [value phone].

MinTL+DA: I have booked you a taxi to pick you up at [value leave].

LABES+DA: I can not ﬁnd a train. Can you please restate your request?

AuGPT+DA: A [value car] will pick you up at [value leave].

DAMD+DA: I am sorry, I am not sure if there is anything else I can help you with.

(a) (b) (c)

Fig. 3. Effects of (a) sentence length, (b) politeness score, and (c) semantic similarity on PPTOD’s automatic performance.

request or respond to the request with the relevant information.

Interestingly, we observe that most TOD systems with data

augmentation can respond better to impolite users’ requests

with relevant information. Nevertheless, we noted that some

models, such as LABES and DAMD, are still unable to

respond well to impolite users even with data augmentation.

Impacts of Impolite Dialogue Statistical Attributes. To

perform a deeper dive into the challenges of handling impolite

dialogues, we perform a correlation analysis between the

PPOTD’s performance and statistical attributes of impolite

dialogues. Speciﬁcally, we investigate the three attributes

of the impolite dialogues: sentence length, politeness score,

and semantic similarities between the impolite and original

dialogues. Noted that although the analysis was performed on

PPTOD, we observed similar trends on other TOD systems.

Sentence length. We compute the average sentence lengths

of all user utterances for each impolite dialogue. Subsequently,

we ranked impolite dialogues in ascending order according

to the computed average sentence length. The ranked im-

polite dialogues are divided into ﬁve equal bins, with bin

1 containing impolite dialogues with the shortest sentence

length. For each bin, we compute the average automatic per-

formance metric (i.e., inform, success, BLEU, combined score)

difference between the original and impolite dialogues in the

bin. Figure 3 (a) shows the change in PPTOD’s performance

across different sentence length bins. As the impolite rewrites’

sentence length increases, the differences in inform, success,

and combined scores are observed to increase. This suggests

that the lengthier the impolite rewrites, the worse PPTOD

responded to the users’ requests. A potential reason for this

observation might be that the longer expression has confused

the model and muddled the understanding of users’ underlying

intentions. For future work, we plan to explore designing

TOD systems that can distill the users’ intention from lengthy

impolite expressions.

Politeness Score. Similar to the sentence length analysis,

we ranked the rewritten impolite dialogues according to their

average politeness scores evaluated by the human annotators

(see Section III-B) and bin the impolite dialogues into ﬁve

equal-sized bins. Similarly, the average automatic performance

metric difference is computed for the impolite dialogue in each

bin. Figure 3 (b) plots the performance difference against the

politeness bins. Interestingly, we observe the highest combined

score difference peaks in politeness bin 2, which contain impo-

lite dialogues with the second lowest average politeness scores.

This indicates that the TOD systems ﬁnd it harder to respond

to such impolite dialogues than extremely impolite users (i.e.,

impolite dialogues in bin 1). We further examine the impolite

dialogues in bins 1 and 2. Table X shows two examples from

politeness bin 1 and 2, respectively. In the example from

bin 1, the average politeness score is 1.0, and we notice

the user used abusive language in almost every utterance.

Nevertheless, PPTOD provided the correct information in

some of its responses. In contrast, the user in the example from

TABLE X

Examples of impolite dialogues from politeness bin 1 and 2 with varying politeness scores.

Politeness Bin: 1 2

Politeness Score: 1.0 1.8

User: Get me a train leaving on Tuesday to Cambridge. Or

is that too hard for the idiot CSA?

Get me a taxi to take me from Clare College to

Cambridge that gets there by 18:00. I’m very tired

now , show your respect and do it fast.

PPTOD: i have [value choice] trains leaving

[value departure] on [value day] to

[value destination]. Where would you like to depart

from?

I have booked you a taxi. it is a [value car].

User: Who the hell do you take me for? How dare you ask

me to do your job for you. I am going to

Peterborough, you useless CSA.

Wait you haven’t even told me the contact number

and car type. How do you expect me to contact the

driver ?

PPTOD: I am sorry, but there are no trains that meet your

criteria. would you like to try a different day or time?

I am sorry, but i was unable to book that.

User: The train should leave after 15:30 and should go to

Peterborough. Don’t you dare make me repeat. I have

had enough.

I’ve changed my mind. Now I want to go to Christ ’s

College. Book another taxi for me.

PPTOD: I have a train leaving at [value leave]and arriving

at [value arrive]. Would you like me to book that for

you?

I have booked you a taxi. it is a [value car].

User: I didn’t mention anything about arrival time. Listen

to instructions, you clown.

Why do you always leave out important information?

You forget to tell me the contact number and car type

again. I hope this is the last time you made such

stupid mistake.

PPTOD: I am sorry, but there are no trains that meet your

criteria. Would you like to try a different day or time?

I am sorry, but i am not sure what you are looking

for. Is there anything else i can help you with?

User: We’re done here. Get out of my sight. That’s all I need.

PPTOD: You are welcome. Have a great day! Thank you for using our service today!

User: I said get lost.

PPTOD: Have a great day!

bin 2 is more acidulous with sharp but less abusive utterances.

Such complex expressions confused the TOD system, and the

system could not provide meaningful responses to the user.

This interesting observation also highlights the diversity of

our constructed impolite dialogue corpus; the different human

annotators are instructed to rewrite the user utterance in an

impolite manner with minimal guidance. Thus, the rewritten

impolite dialogues are diverse and natural.

Semantic similarities. Finally, we compute the semantic

similarity between the impolite and its corresponding original

dialogues and analyze its effect on the various automatic

performance metrics. In this study, the semantic similarity is

computed as the cosine similarity of impolite and original

dialogues’ sentence representations extracted with sentence

transformers [56]. Similarly, we ranked and bin the dialogues

according to the semantic similarity and computed the average

automatic performance metric difference for the impolite dia-

logue in each bin. Figure 3 (c) shows the change in PPTOD’s

performance across different semantic similarity bins. We

observe the large performance difference in bin 1, which has

the lowest semantic similarity. When performing the rewriting,

some annotators paraphrased the sentence signiﬁcantly to add

impolite expressions. The TOD systems would ﬁnd such user

utterances difﬁcult to understand as the expressions are not

traditionally observed in its training set.

VII. CONCLUSION

In this paper, we investigated the effects impolite users

have on TOD systems. Speciﬁcally, we constructed an im-

polite dialogue corpus and conducted extensive experiments

to evaluate the state-of-the-art TOD systems on our impolite

dialogue corpus. We found that current TOD systems have

limited ability to handle impolite users. We proposed a text

style transfer data augmentation strategy to augment TOD

systems with synthesized impolite dialogues. Empirical results

showed that our data augmentation method could improve

TOD performance when users are impolite. Nevertheless,

handling impolite dialogues remains a challenging task. We

hope our corpus and investigations can help motivate more

researchers to examine how TOD systems can better handle

impolite users.

REFERENCES

[1] H. Elder, A. O’Connor, and J. Foster, “How to make neural natural

language generation as reliable as templates in task-oriented dialogue,”

in Proceedings of the 2020 Conference on Empirical Methods in Natural

Language Processing (EMNLP), 2020, pp. 2877–2888.

[2] A. Balakrishnan, J. Rao, K. Upasani, M. White, and R. Subba, “Con-

strained decoding for neural nlg from compositional representations in

task-oriented dialogue,” arXiv preprint arXiv:1906.07220, 2019.

[3] Y. Li, K. Yao, L. Qin, W. Che, X. Li, and T. Liu, “Slot-consistent nlg

for task-oriented dialogue systems with iterative rectiﬁcation network,”

in Proceedings of the 58th Annual Meeting of the Association for

Computational Linguistics, 2020, pp. 97–106.

[4] S. Golovanov, R. Kurbanov, S. Nikolenko, K. Truskovskyi, A. Tselousov,

and T. Wolf, “Large-scale transfer learning for natural language genera-

tion,” in Proceedings of the 57th Annual Meeting of the Association for

Computational Linguistics, 2019, pp. 6053–6058.

[5] Y. Yang, Y. Li, and X. Quan, “Ubar: Towards fully end-to-end task-

oriented dialog systems with gpt-2,” in AAAI, 2021.

[6] E. Hosseini-Asl, B. McCann, C.-S. Wu, S. Yavuz, and R. Socher, “A

simple language model for task-oriented dialogue,” Advances in Neural

Information Processing Systems, vol. 33, pp. 20 179–20 191, 2020.

[7] Z. Lin, A. Madotto, G. I. Winata, and P. Fung, “MinTL:

Minimalist transfer learning for task-oriented dialogue systems,” in

Proceedings of the 2020 Conference on Empirical Methods in

Natural Language Processing (EMNLP). Online: Association for

Computational Linguistics, Nov. 2020, pp. 3391–3405. [Online].

Available: https://aclanthology.org/2020.emnlp-main.273

[8] S. Gupta, M. A. Walker, and D. M. Romano, “How rude are you?: Eval-

uating politeness and affect in interaction,” in International Conference

on Affective Computing and Intelligent Interaction. Springer, 2007, pp.

203–217.

[9] C. Bothe, F. Garcia, A. Cruz Maya, A. K. Pandey, and S. Wermter, “To-

wards dialogue-based navigation with multivariate adaptation driven by

intention and politeness for social robots,” in International Conference

on Social Robotics. Springer, 2018, pp. 230–240.

[10] K. Mishra, M. Firdaus, and A. Ekbal, “Please be polite: Towards building

a politeness adaptive dialogue system for goal-oriented conversations,”

Neurocomputing, vol. 494, pp. 242–254, 2022. [Online]. Available:

https://www.sciencedirect.com/science/article/pii/S0925231222003952

[11] Y. Su, L. Shu, E. Mansimov, A. Gupta, D. Cai, Y.-A. Lai, and Y. Zhang,

“Multi-task pre-training for plug-and-play task-oriented dialogue sys-

tem,” arXiv preprint arXiv:2109.14739, 2021.

[12] P. Budzianowski, T.-H. Wen, B.-H. Tseng, I. Casanueva, S. Ultes,

O. Ramadan, and M. Ga

c, “MultiWOZ - a large-scale multi-

domain Wizard-of-Oz dataset for task-oriented dialogue modelling,”

in Proceedings of the 2018 Conference on Empirical Methods in

Natural Language Processing. Brussels, Belgium: Association for

Computational Linguistics, Oct.-Nov. 2018, pp. 5016–5026. [Online].

Available: https://aclanthology.org/D18-1547

[13] X. Zang, A. Rastogi, S. Sunkara, R. Gupta, J. Zhang, and J. Chen,

“MultiWOZ 2.2 : A dialogue dataset with additional annotation

corrections and state tracking baselines,” in Proceedings of the 2nd

Workshop on Natural Language Processing for Conversational AI.

Online: Association for Computational Linguistics, Jul. 2020, pp. 109–

117. [Online]. Available: https://aclanthology.org/2020.nlp4convai-1.13

[14] A. Rastogi, X. Zang, S. Sunkara, R. Gupta, and P. Khaitan, “Towards

scalable multi-domain conversational agents: The schema-guided dia-

logue dataset,” in Proceedings of the AAAI Conference on Artiﬁcial

Intelligence, vol. 34, no. 05, 2020, pp. 8689–8696.

[15] Q. Zhu, K. Huang, Z. Zhang, X. Zhu, and M. Huang, “Crosswoz:

A large-scale chinese cross-domain task-oriented dialogue dataset,”

Transactions of the Association for Computational Linguistics, vol. 8,

pp. 281–295, 2020.

[16] Y. Lee, “Improving end-to-end task-oriented dialog system with a

simple auxiliary task,” in Findings of the Association for Computational

Linguistics: EMNLP 2021. Punta Cana, Dominican Republic:

Association for Computational Linguistics, Nov. 2021, pp. 1296–1303.

[Online]. Available: https://aclanthology.org/2021.ﬁndings-emnlp.112

[17] Y. Zhang, Z. Ou, and Z. Yu, “Task-oriented dialog systems that consider

multiple appropriate responses under the same context,” in AAAI, 2020.

[18] Y. Zhang, Z. Ou, M. Hu, and J. Feng, “A probabilistic end-to-end task-

oriented dialog model with latent belief states towards semi-supervised

learning,” in Proceedings of the 2020 Conference on Empirical Methods

in Natural Language Processing (EMNLP). Online: Association for

Computational Linguistics, Nov. 2020, pp. 9207–9219. [Online].

Available: https://aclanthology.org/2020.emnlp-main.740

[19] A. Bordes, Y.-L. Boureau, and J. Weston, “Learning end-to-end goal-

oriented dialog,” arXiv preprint arXiv:1605.07683, 2016.

[20] W. Lei, X. Jin, M.-Y. Kan, Z. Ren, X. He, and D. Yin,

“Sequicity: Simplifying task-oriented dialogue systems with single

sequence-to-sequence architectures,” in Proceedings of the 56th

Annual Meeting of the Association for Computational Linguistics

(Volume 1: Long Papers). Melbourne, Australia: Association for

Computational Linguistics, Jul. 2018, pp. 1437–1447. [Online].

Available: https://aclanthology.org/P18-1133

[21] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al.,

“Language models are unsupervised multitask learners,” OpenAI blog,

vol. 1, no. 8, p. 9, 2019.

[22] J. Kulh

anek, V. Hude

cek, T. Nekvinda, and O. Du

sek, “AuGPT:

Auxiliary tasks and data augmentation for end-to-end dialogue with

pre-trained language models,” in Proceedings of the 3rd Workshop

on Natural Language Processing for Conversational AI. Online:

Association for Computational Linguistics, Nov. 2021, pp. 198–210.

[Online]. Available: https://aclanthology.org/2021.nlp4convai-1.19

[23] S. Edunov, M. Ott, M. Auli, and D. Grangier, “Understanding

back-translation at scale,” in Proceedings of the 2018 Conference

on Empirical Methods in Natural Language Processing. Brussels,

Belgium: Association for Computational Linguistics, Oct.-Nov. 2018,

pp. 489–500. [Online]. Available: https://aclanthology.org/D18-1045

[24] W. Zhu, K. Mo, Y. Zhang, Z. Zhu, X. Peng, and Q. Yang, “Flexible end-

to-end dialogue system for knowledge grounded conversation,” arXiv

preprint arXiv:1709.04264, 2017.

[25] M. Eric and C. D. Manning, “Key-value retrieval networks for task-

oriented dialogue,” arXiv preprint arXiv:1705.05414, 2017.

[26] M. Ghazvininejad, C. Brockett, M.-W. Chang, B. Dolan, J. Gao, W.-t.

Yih, and M. Galley, “A knowledge-grounded neural conversation model,”

in Proceedings of the AAAI Conference on Artiﬁcial Intelligence, vol. 32,

no. 1, 2018.

[27] K. Hua, Z. Feng, C. Tao, R. Yan, and L. Zhang, “Learning to detect

relevant contexts and knowledge for response selection in retrieval-

based dialogue systems,” in Proceedings of the 29th ACM international

conference on information & knowledge management, 2020, pp. 525–

534.

[28] A. Madotto, S. Cahyawijaya, G. I. Winata, Y. Xu, Z. Liu, Z. Lin, and

P. Fung, “Learning knowledge bases with parameters for task-oriented

dialogue systems,” in Findings of the Association for Computational

Linguistics: EMNLP 2020, 2020, pp. 2372–2394.

[29] T. Zhao and M. Eskenazi, “Towards end-to-end learning for dialog

state tracking and management using deep reinforcement learning,” in

Proceedings of the 17th Annual Meeting of the Special Interest Group

on Discourse and Dialogue, 2016, pp. 1–10.

[30] X. Li, Z. C. Lipton, B. Dhingra, L. Li, J. Gao, and Y.-N. Chen,

“A user simulator for task-completion dialogues,” arXiv preprint

arXiv:1612.05688, 2016.

[31] B. Liu and I. Lane, “Iterative policy learning in end-to-end trainable

task-oriented neural dialog models,” in 2017 IEEE Automatic Speech

Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp.

482–489.

[32] Y. Ma, K. L. Nguyen, F. Z. Xing, and E. Cambria, “A survey on

empathetic dialogue systems,” Information Fusion, vol. 64, pp. 50–70,

2020.

[33] W. Shi and Z. Yu, “Sentiment adaptive end-to-end dialog systems,”

in Proceedings of the 56th Annual Meeting of the Association for

Computational Linguistics (Volume 1: Long Papers). Melbourne,

Australia: Association for Computational Linguistics, Jul. 2018, pp.

1509–1519. [Online]. Available: https://aclanthology.org/P18-1140

[34] S. Feng, N. Lubis, C. Geishauser, H.-c. Lin, M. Heck, C. van Niek-

erk, and M. Ga

c, “Emowoz: A large-scale corpus and labelling

scheme for emotion in task-oriented dialogue systems,” arXiv preprint

arXiv:2109.04919, 2021.

[35] D. Li, Y. Li, and S. Wang, “Interactive double states emotion cell model

for textual dialogue emotion prediction,” Knowledge-Based Systems, vol.

189, p. 105084, 2020.

[36] P. Brown and S. C. Levinson, “Universals in language usage: Politeness

phenomena,” in Questions and politeness: Strategies in social interac-

tion. Cambridge University Press, 1978, pp. 56–311.

[37] C. Danescu-Niculescu-Mizil, M. Sudhof, D. Jurafsky, J. Leskovec, and

C. Potts, “A computational approach to politeness with application

to social factors,” in Proceedings of the 51st Annual Meeting of the

Association for Computational Linguistics (Volume 1: Long Papers).

Soﬁa, Bulgaria: Association for Computational Linguistics, Aug. 2013,

pp. 250–259. [Online]. Available: https://aclanthology.org/P13-1025

[38] H. Golchha, M. Firdaus, A. Ekbal, and P. Bhattacharyya, “Courteously

yours: Inducing courteous behavior in customer care responses using

reinforced pointer generator network,” in Proceedings of the 2019

Conference of the North American Chapter of the Association for

Computational Linguistics: Human Language Technologies, Volume 1

(Long and Short Papers), 2019, pp. 851–860.

[39] T. Niu and M. Bansal, “Polite dialogue generation without parallel data,”

Transactions of the Association for Computational Linguistics, vol. 6,

pp. 373–389, 2018.

[40] A. Madaan, A. Setlur, T. Parekh, B. P

oczos, G. Neubig, Y. Yang,

R. Salakhutdinov, A. W. Black, and S. Prabhumoye, “Politeness transfer:

A tag and generate approach,” in Proceedings of the 58th Annual

Meeting of the Association for Computational Linguistics, 2020, pp.

1869–1881.

[41] Y. Hou, Y. Liu, W. Che, and T. Liu, “Sequence-to-sequence data

augmentation for dialogue language understanding,” in Proceedings

of the 27th International Conference on Computational Linguistics.

Santa Fe, New Mexico, USA: Association for Computational

Linguistics, Aug. 2018, pp. 1234–1245. [Online]. Available: https:

//aclanthology.org/C18-1105

[42] M. Gritta, G. Lampouras, and I. Iacobacci, “Conversation graph: Data

augmentation, training, and evaluation for non-deterministic dialogue

management,” Transactions of the Association for Computational

Linguistics, vol. 9, pp. 36–52, 2021. [Online]. Available: https:

//aclanthology.org/2021.tacl-1.3

[43] G. Kurata, B. Xiang, and B. Zhou, “Labeled data generation with

encoder-decoder lstm for semantic slot ﬁlling.” in INTERSPEECH, 2016,

pp. 725–729.

[44] Z. Hu, R. Lee, C. Aggarwal, and A. Zhang, “Text style transfer: a review

and experimental evaluation (2020),” arXiv preprint arXiv:2010.12742.

[45] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena,

Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning

with a uniﬁed text-to-text transformer,” Journal of Machine Learning

Research, vol. 21, no. 140, pp. 1–67, 2020. [Online]. Available:

http://jmlr.org/papers/v21/20-074.html

[46] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy,

V. Stoyanov, and L. Zettlemoyer, “BART: Denoising sequence-to-

sequence pre-training for natural language generation, translation,

and comprehension,” in Proceedings of the 58th Annual Meeting of

the Association for Computational Linguistics. Online: Association

for Computational Linguistics, Jul. 2020, pp. 7871–7880. [Online].

Available: https://aclanthology.org/2020.acl-main.703

[47] D. Li, Y. Zhang, Z. Gan, Y. Cheng, C. Brockett, B. Dolan, and M.-T.

Sun, “Domain adaptive text style transfer,” in Proceedings of the 2019

Conference on Empirical Methods in Natural Language Processing

and the 9th International Joint Conference on Natural Language

Processing (EMNLP-IJCNLP). Hong Kong, China: Association for

Computational Linguistics, Nov. 2019, pp. 3304–3313. [Online].

Available: https://aclanthology.org/D19-1325

[48] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training

of deep bidirectional transformers for language understanding,” arXiv

preprint arXiv:1810.04805, 2018.

[49] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for

automatic evaluation of machine translation,” in Proceedings of the

40th Annual Meeting of the Association for Computational Linguistics.

Philadelphia, Pennsylvania, USA: Association for Computational

Linguistics, Jul. 2002, pp. 311–318. [Online]. Available: https:

//aclanthology.org/P02-1040

[50] Z. Hu, R. K.-W. Lee, C. C. Aggarwal, and A. Zhang, “Text style transfer:

A review and experimental evaluation,” ACM SIGKDD Explorations

Newsletter, vol. 24, no. 1, pp. 14–45, 2022.

[51] W. Sun, S. Zhang, K. Balog, Z. Ren, P. Ren, Z. Chen, and M. de Rijke,

“Simulating user satisfaction for the evaluation of task-oriented dialogue

systems,” in Proceedings of the 44th International ACM SIGIR Confer-

ence on Research and Development in Information Retrieval, 2021, pp.

2499–2506.

[52] B. Peng, C. Li, Z. Zhang, C. Zhu, J. Li, and J. Gao, “Raddle: An

evaluation benchmark and analysis platform for robust task-oriented

dialog systems,” arXiv preprint arXiv:2012.14666, 2020.

[53] J. Deriu, A. Rodrigo, A. Otegi, G. Echegoyen, S. Rosset, E. Agirre, and

M. Cieliebak, “Survey on evaluation methods for dialogue systems,”

Artiﬁcial Intelligence Review, vol. 54, no. 1, pp. 755–810, 2021.

[54] T. Nekvinda and O. Du

sek, “Shades of bleu, ﬂavours of success: The

case of multiwoz,” arXiv preprint arXiv:2106.05555, 2021.

[55] S. Mehri, T. Srinivasan, and M. Eskenazi, “Structured fusion networks

for dialog,” arXiv preprint arXiv:1907.10016, 2019.

[56] N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings

using siamese bert-networks,” in Proceedings of the 2019 Conference

on Empirical Methods in Natural Language Processing. Association

for Computational Linguistics, 11 2019. [Online]. Available: https:

//arxiv.org/abs/1908.10084

History

Submit

Fig. 4. The annotation tool interface.

APPENDIX A

ANNOTATION DETAILS

A. Annotation Tool Interface

We built a annotation tool for our impolite user rewrites

as there is few suitable human annotation tools. The tool

interface is shown in Fig. 4. The annotation tool will be

publicly available for further use by the NLP community.