argmax to multinomial

2 minute read

Why use a multinomial function to get the next word prediction in a GPT (generative) model rather than outputting the id with the largest probability?

(Here we talk about generation, not training)

Answer (GPT-4o)

Using a multinomial function to get the next word prediction in a GPT (Generative Pre-trained Transformer) model introduces randomness into the selection process, which enhances the diversity and creativity of the generated text. Here’s why this approach is preferred over simply selecting the id with the largest probability:

  1. Avoiding Repetitive and Predictable Outputs:
    • If the model always selected the word with the highest probability (argmax selection), the generated text would be predictable and lack variation.
    • This strategy, called “greedy decoding,” often degenerates into repetitive loops, especially in long text generation tasks.
  2. Incorporating Uncertainty and Diversity:
    • Language naturally contains uncertainty and variability. By sampling from the probability distribution (using the multinomial function), the model can produce a wider range of plausible outputs, making the text more varied and interesting.
    • This helps the generated text read as more human-like and less mechanical.
  3. Exploring Alternative Paths:
    • Sampling allows the model to explore alternative paths in text generation. For example, in a storytelling context, this can lead to different narrative developments, making the generated stories more engaging and diverse.
  4. Balancing High-Probability and Low-Probability Words:
    • Sampling balances selecting high-probability (likely) words with occasionally including lower-probability (less likely but still plausible) words. This prevents the text from becoming too predictable while maintaining coherence. A minimal code sketch of both strategies follows this list.
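
Here is a minimal PyTorch sketch of the two strategies for a single decoding step. The `logits` tensor is a random stand-in for what a real model’s forward pass would return, and the vocabulary size is just an assumed placeholder:

```python
import torch

# One decoding step, sketched. `logits` stands in for the raw scores a
# GPT forward pass would return for the next token; here it is random,
# and the vocabulary size is an assumed placeholder.
vocab_size = 50257
logits = torch.randn(vocab_size)

# Greedy decoding: deterministic, always picks the single most likely id.
greedy_id = torch.argmax(logits).item()

# Multinomial sampling: turn the logits into a probability distribution
# with softmax, then draw one id at random in proportion to its probability.
probs = torch.softmax(logits, dim=-1)
sampled_id = torch.multinomial(probs, num_samples=1).item()

print(greedy_id, sampled_id)
```

Run repeatedly with the same logits, `greedy_id` never changes, while `sampled_id` varies from call to call.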

Example

Consider generating a sentence starting with “The cat sat on the”:

  • Greedy Decoding (argmax selection):
    • Might always predict “mat” if it has the highest probability, resulting in repetitive outputs like “The cat sat on the mat.”
  • Sampling from the Probability Distribution:
    • Might predict “mat” most of the time but occasionally generate “sofa,” “floor,” or other valid options, leading to more diverse sentences like “The cat sat on the sofa.” The toy snippet below shows this behavior.
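
To make the contrast concrete, here is a toy PyTorch snippet using a hand-made (invented) probability distribution over three candidate words; a real model would produce such probabilities from its logits:

```python
import torch

# Invented probabilities over three candidate next words, purely for
# illustration.
words = ["mat", "sofa", "floor"]
probs = torch.tensor([0.7, 0.2, 0.1])

# Greedy decoding picks "mat" every single time.
print(words[torch.argmax(probs).item()])  # -> mat

# Multinomial sampling mostly picks "mat" but sometimes the others.
samples = torch.multinomial(probs, num_samples=10, replacement=True)
print([words[i] for i in samples.tolist()])  # e.g. ['mat', 'mat', 'sofa', ...]
```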

Conclusion

Using a multinomial function for next-word prediction in GPT models introduces randomness and diversity into the generated text, making it more varied, interesting, and human-like. It avoids the repetitive, predictable outputs of argmax selection, improving the overall quality and creativity of the generated text.