Paralinguistics in spoken dialogue
  • What paralinguistic information can be useful to a dialogue system?
    • Male, female, child: speech recognition
    • Cognitive load: understanding when people hesitate; whether the user is addressing the system or not
    • Adaptation: simplifying and shortening prompts, reducing the speed of the ASR
    • Acquiring knowledge from linguistics: bringing human language knowledge to dialogue systems
    • End-to-End vs Modular architecture:
    • Mutual interaction between different inputs to the system can be handled by end-to-end approaches, but there is less control over what is going wrong in the system
    • Paralinguistics in text dialogue systems:
      • Emojis, punctuation, capitalization, image macros (memes): how is paralinguistics expressed in text dialogues? (linguistic information, e.g. capitalization for shouting; see the sketch at the end of this section)
      • New standardizations: convergence of paralinguistic signs, which can also be cultural
      • Should we create a persona for the system, or have it be part of the crowd (learn from what is most common)?
    • No recent publications in SIGDIAL
    • Paralinguistic challenges at Interspeech (http://emotion-research.net/sigs/speech-sig/is17-compare):
      • 2017: cold, addressee, snoring
      • 2016: deception, sincerity, native language
      • 2015: degree of nativeness, Parkinson's condition, eating condition
      • 2014: cognitive load, physical load
      • 2013: social signals (laughter or sigh), conflict, emotion, autism
      • ...
    • Deep Learning:
      • There is little control: everything works sort of like a black box
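As a toy illustration of how some paralinguistic cues could be read off text, here is a hypothetical heuristic sketch (the thresholds and regexes are assumptions, not an established feature set):

```python
import re

def textual_paralinguistic_cues(utterance: str) -> dict:
    """Heuristic paralinguistic cues from a text utterance (illustrative only)."""
    words = re.findall(r"[A-Za-z']+", utterance)
    caps = [w for w in words if len(w) > 2 and w.isupper()]        # all-caps words as a proxy for shouting
    return {
        "shouting": bool(caps),
        "emphasis": bool(re.search(r"[!?]{2,}", utterance)),       # repeated punctuation, e.g. "really??!"
        "elongation": bool(re.search(r"(\w)\1{2,}", utterance)),   # letter stretching, e.g. "sooo"
        "has_emoji": bool(re.search(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", utterance)),
    }

print(textual_paralinguistic_cues("I SAID no!!! sooo annoying 😤"))
```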
Multi-lingual dialogue systems
  • Is "multi-lingual" a system that handles multiple languages at the same time (and can switch between languages on the fly), or are we rather talking about a system that is language-independent but can, with minor effort, be adapted to a new language?
    • Language Portability for Dialogue Systems: Translating a Question-Answering System from English into Tamil (Ravi and Artstein, 2016)
  • Is it a research problem at all?
    • Technically, if we have a distributed or cloud system running several ASR and NLU engines simultaneously, how can we choose which language to use? Would it be possible to have a selection module similar to a multi-domain system? In such a case, which features should be used? (see the sketch at the end of this section)
  • Knowledge transfer works between systems that handle the same domain; how can it be done between different languages?
  • How do we deal with short utterances to detect language?
    • Prompting the user to speak longer sentences is not doable in an IVR
    • Adaptation from English to lower-resourced languages with different morphologies?
  • What is a representation that is language independent? (e.g.: greetings)
  • Language generation in under-resourced languages (Ondrej Dusek's thesis, pages 113-142)
  • Avoid annotating in a new language (it takes a lot of time); translate instead
    • Google Translate is still not good enough to be used for translating a dialogue system
    • But it could be enough if dialogue act detection works well
  • How to take cultural differences into account?
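As a rough sketch of what such a language-selection module could look like, here is a minimal example; everything in it (the training utterances, language codes, and the `nlu_engines` mapping) is made up for illustration, and a character n-gram classifier is just one possible choice for short utterances:

```python
# Hypothetical sketch: route an utterance to a language-specific NLU pipeline,
# using a character n-gram classifier for language identification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_utterances = ["hello how are you", "bonjour comment ça va", "hola qué tal"]
train_languages  = ["en", "fr", "es"]

lang_id = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),  # char n-grams help on short utterances
    LogisticRegression(max_iter=1000),
)
lang_id.fit(train_utterances, train_languages)

def route(utterance, nlu_engines):
    """Return the detected language and the result of the matching NLU engine."""
    lang = lang_id.predict([utterance])[0]
    return lang, nlu_engines[lang](utterance)

# Example with dummy per-language NLU engines:
engines = {code: (lambda u, c=code: {"lang": c, "text": u}) for code in ["en", "fr", "es"]}
print(route("comment ça marche", engines))
```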

Incremental dialogue systems
  • Rooted in Schlangen and Skantze's and DeVault's work
  • What is the relation between turn-taking and incrementality?
  • Incremental processing: understanding or generating at a level that is below the granularity level of an utterance
    • Generation, e.g. when the person is waiting: carefully planning time-buying utterances
    • Incremental ASR, similar to what humans do (see the sketch at the end of this section)
  • Are today's systems already incremental?
    • ASR: latency has been reduced, but open microphones are needed to be truly incremental
    • NLU is harder, and there is little interest since many applications do not need to be interruptible (e.g. slot filling)
  • In the future, all systems will tend to need to be incremental
    • They need to deal with interruptions and overlaps, for instance
  • What kind of systems should be incremental?
    • Chatbots: should they be? They need to infer which turn the user is referring to. Can processing in chunks be considered incremental processing?
  • End-to-end: what kind of data could be used for that? No gold-standard data is available
    • How do current systems handle barge-in? They either yield the floor to the user or continue; there is no smooth management. Not optimal?
      • What is the ideal dataset to train an incremental end-to-end system? Annotation of the purpose behind any communicative behavior; for language, what is the right way to understand a partial utterance?
  • Turn-taking: on Switchboard data, predict who is speaking next (~61% accuracy achieved; Heeman and Lunsford, 2015); distinguish between detecting the point to react and the point to act
  • What if we have solved the incrementality problem for current systems?
  • Should we care about incrementality before we have robust systems?
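To make the "below the granularity of an utterance" idea concrete, here is a minimal, hypothetical sketch (not from the discussion) that commits the stable prefix of partial ASR hypotheses to downstream processing; the stability criterion and the `on_committed` callback are assumptions:

```python
# Minimal sketch of incremental processing over partial ASR hypotheses:
# commit the longest word prefix that has stayed stable across recent updates,
# so downstream NLU can start before the utterance is complete.

class IncrementalASRConsumer:
    def __init__(self, on_committed, stable_updates=2):
        self.on_committed = on_committed      # called with newly committed words
        self.stable_updates = stable_updates  # how many updates a prefix must survive
        self.committed = []                   # words already sent downstream
        self.history = []                     # recent partial hypotheses (word lists)

    def on_partial_hypothesis(self, text):
        words = text.split()
        self.history = (self.history + [words])[-self.stable_updates:]
        if len(self.history) < self.stable_updates:
            return
        # The longest common prefix over recent hypotheses is considered stable.
        stable = []
        for position, word in enumerate(self.history[-1]):
            if all(len(h) > position and h[position] == word for h in self.history):
                stable.append(word)
            else:
                break
        new_words = stable[len(self.committed):]
        if new_words:
            self.committed.extend(new_words)
            self.on_committed(new_words)

consumer = IncrementalASRConsumer(lambda ws: print("to NLU:", ws))
for partial in ["book a", "book a table", "book a table for two"]:
    consumer.on_partial_hypothesis(partial)
```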

Learning Techniques:
Question 1: Where do you use machine learning in your dialog systems and what were the biggest difficulties? Answers from the roundtable members ->
  1. Everything I did is machine learning. I used seq2seq for generating a response given a dialog history. I am currently exploring CVAE (Conditional Variational Autoencoder) for response generation. Limitations are:

    1. Black box: cannot be controlled by a human.

    2. Hard to improve and understand

  2. Research topic: automatic generation of dialog strategy from a business description.

  3. I experiment with CNNs, RNNs, and attention for dialog act recognition.

    1. The model is tested on SWDA and MRDA

  4. I used Hierarchical RL for multi-domain dialog
  5. There is a trade-off between accuracy and generalization. I prefer hybrid systems that combine hand-crafted rules and machine learning models.

  6. I work on using machine learning to figure out the best question-asking strategy given a set of questions.

    1. RNN for state tracking. Graph-structure encoder. RL.

    1. Parameter

    2. Encoder lattice

  7. I tried SVMs, but they gave bad results on unseen data. Now I am working on using machine learning for the dialog manager.

    1. Data is not a problem in the industry.

    2. Focus on using reinforcement learning to find the best action for the DM.

  8. User stratification: HMM and SVM, compared with an RNN.

  9. Argumentation strategy: large state space; learn the structure.

    1. How to generalize quickly to unseen states

  10. I worked on the Dialog State Tracking Challenge.

    1. Found that existing word2vec or sent2vec embeddings do not work well on a specific domain. How to adapt them? (see the sketch below)
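On the adaptation question in answer 10, one common option is to continue training word2vec on in-domain text. A minimal sketch with gensim follows; the corpora are placeholders and gensim 4.x parameter names are assumed:

```python
# Hypothetical sketch of adapting word2vec to a specific domain:
# train on a general corpus, then continue training on in-domain utterances.
from gensim.models import Word2Vec

general_corpus = [["how", "are", "you"], ["what", "time", "is", "it"]]
domain_corpus  = [["book", "a", "table", "for", "two"], ["extra", "cheese", "please"]]

model = Word2Vec(sentences=general_corpus, vector_size=100, window=5, min_count=1, epochs=5)

# Extend the vocabulary with in-domain words and keep training the same weights.
model.build_vocab(domain_corpus, update=True)
model.train(domain_corpus, total_examples=len(domain_corpus), epochs=10)

print(model.wv.most_similar("table", topn=3))  # meaningless on toy data, but shows the API
```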


Question 2: What's the way to go for future dialog systems? Supervised learning or reinforcement learning?

  1. Combining SL + RL is best (see the sketch at the end of this question)

    1. Pre-train the model on SL data

    2. Use RL to fine-tune.

  2. RL is good for both non-chat and chat systems

  3. RL can be creative.

  4. How to use APIs is a key issue. RL may solve that.

  5. There are many limitations of RL because of the user simulator.

    1. Creating the simulator is hard; that is the biggest problem.

    2. Or self-play: a simulator and a goal are enough to let models talk to each other.

      1. Self-play alone will not learn language

      2. Collapse the MDP:

        1. RL learns the high-level actions

        2. SL learns the language generation
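A minimal sketch of the "pre-train with SL, fine-tune with RL" recipe from answer 1, in PyTorch; the state/action dimensions, the logged data, and the reward inside `simulate_episode` are all placeholders, not a real user simulator:

```python
# Sketch: supervised pre-training of a dialog policy, then REINFORCE fine-tuning.
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 16, 4
policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# --- Stage 1: supervised pre-training on logged (state, action) pairs ---
states = torch.randn(256, STATE_DIM)           # placeholder "expert" states
actions = torch.randint(0, N_ACTIONS, (256,))  # placeholder "expert" actions
ce = nn.CrossEntropyLoss()
for _ in range(20):
    optimizer.zero_grad()
    loss = ce(policy(states), actions)
    loss.backward()
    optimizer.step()

# --- Stage 2: REINFORCE fine-tuning against a (placeholder) user simulator ---
def simulate_episode(policy, steps=5):
    """Roll out the policy; return log-probs of chosen actions and a scalar reward."""
    log_probs, state = [], torch.randn(STATE_DIM)
    for _ in range(steps):
        dist = torch.distributions.Categorical(logits=policy(state))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state = torch.randn(STATE_DIM)          # placeholder state transition
    reward = torch.rand(()).item()              # placeholder task-success score
    return torch.stack(log_probs), reward

for _ in range(50):
    log_probs, reward = simulate_episode(policy)
    optimizer.zero_grad()
    (-log_probs.sum() * reward).backward()      # REINFORCE gradient estimate
    optimizer.step()
```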


Question 3: What's more important, NLU or DM?

  1. A good DM can fix poor SLU.

  2. SLU and DM are both important, but:

    1. Some think SLU is a prerequisite of the DM

    2. Others think SLU and DM go together.

    3. It depends on the definition.


Question 4: Modular vs E2E system

  1. Training E2E is more important.

  2. POMDPs are never used in the real world

  3. Scalability is more important than accuracy

    1. Industry likes immediate results.

    2. Product-management needs move faster than model development.

    3. OpenDial is a good trade-off between rules and ML.

  4. Modular systems suffer from error propagation.

    1. E2E systems can do multi-task learning

    2. Incremental development.

  5. Theoretical work on understanding RNNs

  6. E2E is ambitious

    1. Learning a lot of functions from a small amount of data

    2. Goal-driven systems

    3. Group similar modules



Context NLG:

Different types of context:
situational relationship context
cultural context
most commonly, people think of dialogue history context (what has been said before).

In order to resolve ambiguous terms like "how long did you live there", where "there" refers to a place in the previous utterance, we need to keep track of the entities on the table in order to do it automatically.


There is some evidence that you usually don't have to go back more than 2 or 3 turns to resolve referring terms such as pronouns; simple heuristics might get you about 80% of the way, but then it is hard to get the rest of the way (see the sketch below).
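A toy sketch of such a recency heuristic; the entity types and the compatibility rule are assumptions made for illustration:

```python
# Keep the entities mentioned in recent turns and resolve a pronoun or
# "there"-style reference to the most recently mentioned compatible one.
from collections import deque

class RecencyResolver:
    def __init__(self, max_turns=3):
        self.recent = deque(maxlen=max_turns)   # one entity list per turn

    def add_turn(self, entities):
        """entities: list of (surface_form, type) pairs mentioned in this turn."""
        self.recent.append(entities)

    def resolve(self, ref_type):
        """Return the most recently mentioned entity of the given type, if any."""
        for turn in reversed(self.recent):
            for surface, ent_type in reversed(turn):
                if ent_type == ref_type:
                    return surface
        return None

resolver = RecencyResolver()
resolver.add_turn([("Paris", "place")])
resolver.add_turn([("my sister", "person")])
# "how long did you live there" -> "there" needs a place
print(resolver.resolve("place"))   # -> "Paris"
```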


In order to scale the storing of context, one could keep track of the context associated with each dialogue act.


Is there a general framework for storing context? Frameworks don't seem to be shared between labs so much; there is no generally accepted framework yet.


Hierarchical model of context: start with general things like initiative and response, then specify what the specific initiative is; find a general level at which you can talk about these applications and different dialogue acts.


Generally, use the rules that serve your specific situation best, as dialogues are so varied.


Cultural context also needs to be taken into account: do people from different cultures phrase things differently? etc.

Frame detection, keeping track of the entities of the current frame, helps make context more tractable.


In pizza ordering, for example, keep track of all the important information given so far (in general, domain knowledge is needed to know which information attributes are important to keep track of); see the sketch below.
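A minimal sketch of that kind of frame tracking; the slot names and the upstream extraction step are assumptions:

```python
# Frame-style context tracking for the pizza example: domain knowledge lists
# which slots matter, and each turn's extracted values update the frame.
PIZZA_SLOTS = ("size", "toppings", "delivery_address", "payment")

def update_frame(frame: dict, extracted: dict) -> dict:
    """Merge newly extracted slot values into the running order frame."""
    for slot, value in extracted.items():
        if slot in PIZZA_SLOTS:
            frame[slot] = value
    return frame

frame = {}
update_frame(frame, {"size": "large"})                       # "a large pizza please"
update_frame(frame, {"toppings": ["mushrooms", "olives"]})   # "with mushrooms and olives"
missing = [s for s in PIZZA_SLOTS if s not in frame]
print(frame, "still need:", missing)
```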


Many commercial systems fail when they need information from the dialogue history, but handling the context of user attributes is an easier problem for them.


Can user attribute context be exploited in a commercial setting? Could keep track of how long user’s utterances are for example and adopt dialogue to match user speaking style…