
In the previous module, we investigated language models and explored both n-gram and neural approaches. We saw that the n-gram approach generally improves with higher values of N, but that this is limited by the available computational resources and by the sparsity of higher-order n-grams in the training corpus. Neural approaches, on the other hand, paired with subword tokenization methods such as byte-pair encoding and WordPiece, are able to address the shortcomings of n-gram language models and show impressive results.

We also traced the evolution of neural language models, from fixed-input-length feedforward networks built on word embeddings to recurrent neural networks, which allow variable-length input but struggle to capture long-term dependencies. We explored the concept of attention and its significance in transformer models using Jay Alammar’s excellent The Illustrated Transformer post, and learned that transformer models have the advantage of being parallelizable during training.

Building on our knowledge of transformer models, we investigated OpenAI’s GPT-2, which has 1.5 billion parameters and was trained on 40GB of text – wow! We then looked at Microsoft’s DialoGPT, which extends this language model to generate conversational responses.

All of this provides the context for this article. Here, we will play with the pretrained DialoGPT model and generate responses using different decoding strategies.

Microsoft makes pretrained DialoGPT checkpoints available both through download links in the GitHub repository and via Hugging Face’s 🤗 Transformers library. The 🤗 Transformers library provides a unified API through which many models can be downloaded, trained, saved, and shared, so we use it here.
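Loading the checkpoint and tokenizer might look like this (a minimal sketch; the article’s original snippet may differ, for instance by using the Auto* classes instead):

from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Download and cache the pretrained DialoGPT-small checkpoint and its tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained("microsoft/DialoGPT-small")
model = GPT2LMHeadModel.from_pretrained("microsoft/DialoGPT-small")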

A tqdm progress bar appears when you execute this code, as 🤗 Transformers downloads and caches the model and tokenizer checkpoints. Don’t worry, the download only happens the first time a particular model/tokenizer is requested. Once it completes, we should see a 12-layer model with an embedding dimension of 768. We can interact with this model in the same way we interact with any PyTorch model; for example, we can check the number of parameters of the DialoGPT-small model.
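Counting parameters works just as it does for any PyTorch module (an illustrative sketch, not necessarily the article’s exact code):

num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{num_params:,} trainable parameters")  # on the order of 120M for the small checkpoint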

If you want to work with the TensorFlow version of the model, simply import TFGPT2LMHeadModel instead.
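Assuming TensorFlow is installed, the equivalent load might look like this (a sketch; pass from_pt=True if only PyTorch weights are published for the checkpoint):

from transformers import TFGPT2LMHeadModel

tf_model = TFGPT2LMHeadModel.from_pretrained("microsoft/DialoGPT-small")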


To interact with the loaded model, we need to prompt it with text, and we need to tokenize that text in a way the model understands. To do this, we use the tokenizer we downloaded earlier. We also use the tokenizer to decode the model’s output and recover our response. We can write a simple function that handles all of this neatly using 🤗 Transformers’ generate method.
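A minimal sketch of such a helper is shown below; the article’s original function may differ in its defaults, but the shape is the same: encode the prompt, call generate, and decode only the newly produced IDs.

import torch

def generate(text, model, tokenizer, **generate_kwargs):
    # Switch to evaluation mode so layers such as dropout behave deterministically.
    model.eval()
    # Encode the prompt and append the end-of-sequence token that DialoGPT expects.
    input_ids = tokenizer.encode(text + tokenizer.eos_token, return_tensors="pt")
    with torch.no_grad():
        history_ids = model.generate(input_ids, **generate_kwargs)
    # Decode only the newly generated IDs; generate() also returns the prompt IDs.
    return [
        tokenizer.decode(output[input_ids.shape[-1]:], skip_special_tokens=True)
        for output in history_ids
    ]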

To focus only on the parameters passed to our model, we can bind the model and the tokenizer with functools.partial. Then we can call our generator.

from functools import partial

generator = partial(generate, model=model, tokenizer=tokenizer)
generator("Try this cake. I baked it myself.")
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Output
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[“I’m not sure if I’m missing something,”]

There are two other things to note about the generate function. First, we switch the model to evaluation mode. This is necessary because some layers, such as dropout layers, behave differently during training and inference. Second, we only decode the portion of history_ids that the model appends, because generate also returns the text we prompted the model with. We can confirm this by decoding all of history_ids. Give it a try!

We have already demonstrated one decoding strategy above: our generation function uses greedy search by default. At each step, the greedy search algorithm simply selects the word/ID with the highest probability as the model’s next output.

Source: How to generate text: using different decoding methods for language generation with Transformers, by Patrick von Platen

Following the diagram above, the algorithm generates “The nice woman” by choosing the most probable word at each step. Prompt the model in different contexts to get a feel for the kinds of responses it produces.
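To make the mechanics concrete, here is a hand-rolled greedy loop (an illustrative sketch, equivalent in spirit to what generate does with its default arguments):

import torch

def greedy_decode(text, model, tokenizer, max_new_tokens=20):
    model.eval()
    ids = tokenizer.encode(text + tokenizer.eos_token, return_tensors="pt")
    with torch.no_grad():
        for _ in range(max_new_tokens):
            next_logits = model(ids).logits[:, -1, :]           # scores for the next token
            next_id = next_logits.argmax(dim=-1, keepdim=True)  # greedy: pick the most probable ID
            ids = torch.cat([ids, next_id], dim=-1)
            if next_id.item() == tokenizer.eos_token_id:        # stop at end of turn
                break
    return tokenizer.decode(ids[0], skip_special_tokens=True)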

1. Beam search

We can quickly see that greedy search often produces generic responses. Consider the scenario below (source: Decoding Strategies You Need to Know for Response Generation, by Vitou Phy).

Context: Try this cake. I baked it myself.

Optimal answer: This tastes good!

Model Response: This is ok.

The model produces a suboptimal response even though it starts with the same token as the optimal response, “This”. This can happen because “is” is a more common token after “This” in the training data. In terms of probabilities, the more probable sequence can be ‘hidden’ behind a low-probability token. Referring to the tree diagram above as an example: “The dog has” (0.4 × 0.9 = 0.36) is more likely than “The nice woman” (0.5 × 0.4 = 0.2), but greedy search never finds it because “dog” is less probable than “nice” at the first step.

Beam search circumvents this problem by keeping track of a predetermined number of the most probable sequences (beams) at each step and finally selecting the sequence with the highest overall probability. We can use beam search with our generation function as follows:

generator(
    "Try this cake. I baked it myself.",
    num_beams=5,
    early_stopping=True,
    num_return_sequences=5
)
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Output
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
[‘How did you make it?’, ‘I baked it myself.’, ‘How did you make this?’, ‘How did you make the cake?’, ‘I baked it myself’]

These responses are good, and we can improve them further by constraining the search. no_repeat_ngram_size ensures that n-grams already generated are not repeated later in the response. min_length and max_length ensure the lengths of the generated responses fall within a predefined range. We can also return multiple beams by setting num_return_sequences to a value less than or equal to num_beams.

generator(
    "Try this cake. I baked it myself.",
    max_length=50,
    num_beams=5,
    early_stopping=True,
    no_repeat_ngram_size=2,
    min_length=25,
    num_return_sequences=5
)
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Output
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
[‘How did you make the cake? I want to make one too!!’, ‘How did you make the cake? I want to make one too! :D’, ‘How did you make the cake? I want to make one myself!!’, ‘How did you make the cake? I want to make one too. It looks amazing!’, ‘How did you make the cake? I want to make one too. It looks amazing.’]

Now, these answers seem much more natural. However, finding a balance between all the penalties we impose may require some tuning.

2. Random sampling

Human conversation is fairly unpredictable: we don’t simply pick the most likely word or sentence when responding to a friend. We can introduce this randomness through sampling, selecting the next ID by drawing from the model’s conditional probability distribution.

import torch

torch.manual_seed(42)
generator(
    "Try this cake. I baked it myself.",
    do_sample=True,
    top_k=0,
    max_length=50
)
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Output
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
[‘Surely it was caused by some sort of cake’]

We set top_k to zero for now; we will come to top-k sampling shortly.

This response is indeed more surprising than the others so far, but it is also not very coherent: it is unclear how baking a cake was “caused by some sort of cake”. We can improve this by lowering the probability of unlikely words using the softmax temperature.
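For intuition, the temperature rescales the logits before the softmax; values below 1 sharpen the distribution so that unlikely tokens are sampled less often. A toy illustration (my own sketch, not part of the article’s code):

import torch

logits = torch.tensor([2.0, 1.0, 0.1])              # toy next-token scores
for T in (1.0, 0.7):
    probs = torch.softmax(logits / T, dim=-1)        # temperature-scaled softmax
    print(T, [round(p, 3) for p in probs.tolist()])  # lower T gives a peakier distribution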

generator(
    "Try this cake. I baked it myself.",
    do_sample=True,
    top_k=0,
    max_length=50,
    temperature=0.7
)
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Output
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
[“Can you post the recipe? I’m really interested in trying it.”]

Now we get a response that connects better to the prompt. However, random sampling sometimes introduces errors. Fan et al. (2018) pointed out that for words such as cannot, which is tokenized into two IDs (can and not), the model may produce the first ID but skip the second. Random sampling can also hurt longer generations, because sampling breaks the coherence across consecutive sentences.

3. Top-K sampling

This strategy, used by GPT-2, improves story generation. The K most likely next words are filtered and become the sampling pool. This prevents highly unlikely IDs from being sampled and improves the quality of the generated text.

generator(
    "Try this cake. I baked it myself.",
    do_sample=True,
    top_k=50,
    max_length=50
)
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Output
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
[“I had to google what a dutch oven is because i couldn’t figure it out. I’m not the expert. Who knew?”]

This response sounds even more natural than the response obtained by random sampling.

4. Nucleus (top-p) sampling

Top-k sampling restricts sampling to the K most likely candidates. However, the probability distribution over the next word can range from a sharp distribution (left side of the figure below) to a much flatter one.

Source: How to generate text: using different decoding methods for language generation with Transformers, by Patrick von Platen

With a sharp distribution, a fixed K can pull unlikely tokens into the sampling pool, which hurts the naturalness of the generated sentence. Holtzman et al. (2019) introduced nucleus (top-p) sampling, in which the model samples from the smallest possible set of words whose cumulative probability exceeds a predefined value p.
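A toy illustration of how the nucleus is chosen (my own sketch over a made-up, already-sorted distribution, not the library’s implementation):

import torch

probs = torch.tensor([0.45, 0.25, 0.15, 0.10, 0.05])  # sorted next-token probabilities
p = 0.9
cumulative = torch.cumsum(probs, dim=0)
keep = int((cumulative < p).sum().item()) + 1          # smallest prefix whose cumulative probability exceeds p
nucleus = probs[:keep] / probs[:keep].sum()            # renormalise the kept probability mass
next_token = torch.multinomial(nucleus, 1)             # sample only from the nucleus
print(keep, nucleus.tolist(), next_token.item())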

We can generate with this strategy as follows:

generator(
    "Try this cake. I baked it myself.",
    do_sample=True,
    top_p=0.9,
    top_k=0,
    max_length=50
)
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Output
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
[‘Then my work here is done.’]

That’s it! We’ve explored four important strategies for generating text and what to look out for when working with each. Decoding is still an open area of research, so it’s best to try them all and decide what works best for your use case.

1. Colab notebook
2. Hierarchical Neural Story Generation, Fan et al. (2018)
3. The Curious Case of Neural Text Degeneration, Holtzman et al. (2019)
