In conversation, sometimes it’s useful to stall for time. For example, saying, “that’s a great question,” gives you a moment to consider your response.

In interactive voice systems, the same principle applies. Rather than a system pausing silently for a few seconds while completing speech recognition, it can be more natural for the system to begin responding to show that it has heard the speaker.

It is possible to have a few generic templates such as “Hmmm. Let me think.” However, I’ve found LLMs can be used to generate more specific fillers.

Read on for details.

LLM Sampling

Many LLMs can be understood as a function that takes a prompt (a list of tokens) and outputs a probability for every possible next token.

Sentences and paragraphs are created by repeatedly calling this function. A token is chosen (or sampled) from the function’s output, appended to the existing tokens, and the process repeats.

tokens = prompt
while tokens[-1] != END_OF_MESSAGE:
    logprobs = model(tokens)
    next_token = sample(logprobs)
    tokens = tokens + [next_token]
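
As a minimal sketch, the sample step could look something like the following, assuming the model output is a vector of (possibly unnormalised) log-probabilities; the numpy implementation here is only illustrative:

import numpy as np

def sample(logprobs):
    # Convert log-probabilities into a normalised probability distribution
    probs = np.exp(logprobs - np.max(logprobs))
    probs = probs / probs.sum()
    # Draw a single token index, weighted by that distribution
    return np.random.choice(len(probs), p=probs)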

A ‘greedy’ sampling strategy is perhaps the simplest and most obvious approach: it simply chooses, as the next token, the one that the model predicts as most likely:

tokens = prompt
while tokens[-1] != END_OF_MESSAGE:
    logprobs = model(tokens)
    next_token = logprobs.argmax()
    tokens = tokens + [next_token]

Sampling with Multiple Prompts

One way to stall for time is to treat the user’s speech as a set of several possible prompts. For example, if the system has asked the user for feedback, it may be uncertain about whether the user will say “You were great” or “You were terrible”.

This could be handled as two separate prompts:

  1. Generate a response to a caller who says "You were great"
  2. Generate a response to a caller who says "You were terrible"

These would be generated as hypotheticals. Alternatively, we might write two prompts based on the next steps in a dialog tree:

  1. Write a statement thanking the user for their loyalty.
  2. Write a statement asking the user for their contact details to follow up the problem they encountered.

Then, the challenge is sampling from both distributions simultaneously.

Greedy Maximin Sampling

One simple approach is to choose the token that maximizes the worst-case probability (i.e., select the argmax of the pairwise minimums). This is analogous to the minimax or maximin strategy in game playing: taking the best option among risk-averse choices.

In other words, suppose we have probabilities from prompts a and b for a choice of the next tokens ['w','x','y','z']:

probs_a = [0.1, 0.3, 0.4, 0.2]
probs_b = [0.4, 0.4, 0.1, 0.1]

Then the pairwise minimums (the ‘worst case’ probabilities) would be the minimums across corresponding values:

          [0.1, 0.3, 0.1, 0.1]

Finally, the ‘maximin’ is the best value in this list: the second item (with worst-case probability 0.3), corresponding to the token 'x'.
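
To make the arithmetic concrete, here is the same calculation in numpy (just a check of the worked example, not part of the algorithm):

import numpy as np

probs_a = np.array([0.1, 0.3, 0.4, 0.2])
probs_b = np.array([0.4, 0.4, 0.1, 0.1])

worst_cases = np.minimum(probs_a, probs_b)  # [0.1, 0.3, 0.1, 0.1]
best_index = worst_cases.argmax()           # 1, i.e. the token 'x'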

As a complete decoding strategy, this would be implemented as follows:

tokens = []
while len(tokens) == 0 or tokens[-1] != END_OF_MESSAGE:
    # Use softmax to convert logprobs into normalised probabilities,
    # so that the two distributions are directly comparable
    probs_a = softmax(model(prompt_a + tokens))
    probs_b = softmax(model(prompt_b + tokens))

    # Element-wise minimum: the worst-case probability of each token
    worst_cases = minimum(probs_a, probs_b)
    next_token = worst_cases.argmax()
    tokens = tokens + [next_token]
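
As a more concrete sketch, the same loop can be written against a Hugging Face causal language model. The model choice ("gpt2"), the prompt wording and the length cap below are placeholder assumptions for illustration, not a particular production setup:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt_a = 'Generate a response to a caller who says "You were great": '
prompt_b = 'Generate a response to a caller who says "You were terrible": '

ids_a = tokenizer(prompt_a, return_tensors="pt").input_ids
ids_b = tokenizer(prompt_b, return_tensors="pt").input_ids

tokens = []  # the shared continuation, as token ids
for _ in range(30):  # cap the length of the filler
    suffix = torch.tensor([tokens], dtype=torch.long)
    with torch.no_grad():
        logits_a = model(torch.cat([ids_a, suffix], dim=1)).logits[0, -1]
        logits_b = model(torch.cat([ids_b, suffix], dim=1)).logits[0, -1]

    # Normalise each distribution, then take the element-wise worst case
    probs_a = torch.softmax(logits_a, dim=-1)
    probs_b = torch.softmax(logits_b, dim=-1)
    worst_cases = torch.minimum(probs_a, probs_b)

    next_token = int(worst_cases.argmax())
    if next_token == tokenizer.eos_token_id:
        break
    tokens.append(next_token)

print(tokenizer.decode(tokens))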

I found this does a good job of producing a few words of coherent but content-free text that can pass a few seconds of time before the system switches over to the single correct prompt once the speech recognition result is known.

For example, in the great/terrible feedback scenario, this approach generates filler that is appropriate for both cases: “Thank you for your sincere feedback.”

In the loyalty or contact details scenario (in a banking context), this approach generates the harmless filler, “Thank you for calling our customer service. We appreciate your recent interaction with our bank.”

That’s it! That’s the trick.

Puzzle Solving

This greedy ‘maximin’ sampling has a tendency to generate harmless filler. To go a little further, it can also sometimes be used to solve puzzles.

For example, consider these pairs of riddles:

  1. What is an example of something that can be found at a baseball game?
  2. What is an example of an animal that can fly?

and

  1. What is a country in Europe?
  2. What is a food commonly eaten at Christmas?

While an LLM can solve these riddles if you ask directly in a single prompt, this trick can also be used to find a single solution that answers multiple prompts. For example, it successfully generates “A bat” in the baseball example.

However, it isn’t able to solve the second puzzle. Turkey is too rare a response when asking for a country in Europe (France and Germany are more prototypical answers), so greedy search keeps generating filler without ever reaching an answer. Instead, it’s necessary to perform a search with a goal similar to greedy maximin sampling: rather than selecting the token with the highest minimum probability at each step, compute the probability of entire sentences under each prompt, and choose the sentence that has the highest minimum probability across all prompts.

An exhaustive search of every possibility would obviously be prohibitive. However, I found it appropriate to use the following as a heuristic for choosing tokens to search across:

  1. The highest probability tokens under each prompt (argmax)
  2. The highest worst-case tokens across all prompts (argmax of minimum)
  3. The highest probability tokens as a product of probabilities across all prompts (argmax of product)

By including candidates from all of these sources, the search is more likely to discover creative responses that jointly answer both prompts. Indeed, this approach correctly finds the answer Turkey with only a few branches.
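
A rough sketch of such a search, reusing the model and tokenizer from the decoding sketch above, might look like the following; the function names, beam width and length cap are illustrative choices rather than a prescribed algorithm:

def next_token_logprobs(prompt_ids, tokens):
    # Log-probabilities of the next token, given a prompt and a partial continuation
    suffix = torch.tensor([tokens], dtype=torch.long)
    with torch.no_grad():
        logits = model(torch.cat([prompt_ids, suffix], dim=1)).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)

def maximin_search(prompt_ids_list, beam_width=5, max_len=10):
    # Each beam entry is (token ids so far, per-prompt cumulative log-probability)
    beams = [([], [0.0] * len(prompt_ids_list))]
    best = None  # (worst-case log-probability, token ids) of the best complete answer
    for _ in range(max_len):
        candidates = []
        for tokens, scores in beams:
            rows = [next_token_logprobs(p, tokens) for p in prompt_ids_list]
            stacked = torch.stack(rows)
            # Heuristic sources for tokens worth expanding:
            expansions = {int(row.argmax()) for row in rows}           # argmax under each prompt
            expansions.add(int(stacked.min(dim=0).values.argmax()))    # argmax of minimum
            expansions.add(int(stacked.sum(dim=0).argmax()))           # argmax of the product (sum of log-probs)
            for tok in expansions:
                new_scores = [s + float(row[tok]) for s, row in zip(scores, rows)]
                if tok == tokenizer.eos_token_id:
                    # A complete answer: keep it if its worst-case score is the best so far
                    if best is None or min(new_scores) > best[0]:
                        best = (min(new_scores), tokens + [tok])
                else:
                    candidates.append((tokens + [tok], new_scores))
        # Prune to the partial sentences with the best worst-case score
        candidates.sort(key=lambda c: min(c[1]), reverse=True)
        beams = candidates[:beam_width]
    # Decode the best complete answer, dropping the end-of-message token
    return tokenizer.decode(best[1][:-1]) if best else None

Here prompt_ids_list would hold the tokenised riddle prompts (one entry per riddle), and the function returns the completed sentence whose worst-case log-probability across the prompts is highest.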

Human Supervision Required

It goes without saying that in practice your mileage may vary. The simple greedy maximin sampling algorithm seems to work very well for a few words or sentences, but once it has generated a lot of filler, the models become far more prone to hallucination: an instruction-tuned model ultimately seeks to answer its prompt, so it can only ‘fill’ for so long before it starts making things up.

So, I’d suggest that this might make more sense as a tool to support designers in creating fillers for voice conversation flows, rather than something that is put directly into a live system without human oversight.

Published 7 December 2023 by Benjamin Johnston.