
TechTalks – How Language Models Can Learn to Follow Instructions


There is growing interest in techniques that allow large language models (LLMs) to improve their capabilities with little or no human intervention. One of the areas where LLMs can improve themselves is instruction fine-tuning (IFT), where a model is trained to follow human instructions.

IFT is one of the main reasons why models like ChatGPT and Claude have been so successful. However, IFT is a complex process that requires a lot of time and human effort. A new technique described in a paper by Meta and New York University, called “Self-Rewarding Language Models,” provides a recipe for a pre-trained language model to generate and evaluate its own training examples and teach itself to follow instructions better.

The advantage of this method is that the model keeps improving as the process is repeated. Self-rewarding language models not only improve their ability to follow instructions, but also become better at assigning rewards to responses.

Self-rewarding language models

A common way to fine-tune LLMs for instruction following is reinforcement learning from human feedback (RLHF).

In RLHF, the language model learns to optimize its responses based on feedback from a reward model. The reward model is trained on human annotations, which helps align the model’s responses with people’s preferences. RLHF consists of three steps: pre-training the LLM, training a reward model on human-rated outputs, and reinforcement learning, in which the LLM is fine-tuned against the reward model’s scores to produce high-quality text aligned with human judgments.

Reinforcement Learning from Human Feedback (RLHF) (Source: arXiv)
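
To make the three stages concrete, here is a minimal sketch of the RLHF pipeline. Every function in it (`pretrain_lm`, `train_reward_model`, `rl_step`) is a hypothetical toy placeholder chosen to show how data flows between the stages, not a real library API or a working training loop.

```python
# Minimal sketch of the three RLHF stages; all functions are toy placeholders.

def pretrain_lm(corpus):
    """Stage 1: pre-train a base language model on a large text corpus."""
    return {"name": "base-lm", "corpus_size": len(corpus)}

def train_reward_model(base_lm, human_ratings):
    """Stage 2: train a reward model on human-rated outputs."""
    # In practice, human_ratings would hold (prompt, response, rating) tuples.
    return lambda prompt, response: float(len(response)) / 100.0  # toy scorer

def rl_step(policy, prompt, reward_fn):
    """Stage 3: fine-tune the policy with RL against the frozen reward model."""
    response = f"{policy['name']} answers: {prompt}"
    return dict(policy, last_reward=reward_fn(prompt, response))

base = pretrain_lm(corpus=["web text", "books", "code"])
reward_fn = train_reward_model(base, human_ratings=[])
policy = rl_step(dict(base), "Explain RLHF in one sentence.", reward_fn)
print(policy["last_reward"])
```
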
An alternative is Direct Preference Optimization (DPO), where the model generates multiple responses and receives direct feedback from humans about which one is better. DPO does not require a separate reward model.
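
As a rough illustration of the idea, the sketch below computes the standard DPO loss for a single preference pair, assuming the sequence-level log-probabilities have already been computed under the policy and a frozen reference model; the variable names and the value of beta are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one (chosen, rejected) response pair."""
    # Implicit rewards are the scaled log-ratios between the policy and the reference model.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Logistic loss on the margin: push the chosen response above the rejected one.
    margin = chosen_reward - rejected_reward
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# The loss shrinks as the policy favors the chosen response more than the reference does.
print(dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
               ref_logp_chosen=-13.0, ref_logp_rejected=-14.0))
```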

Although these techniques have been shown to be effective, they are both limited by the size and quality of human preference data. RLHF has the additional limitation that the reward model is frozen after training and its quality does not change during fine-tuning of the main LLM.

The idea behind Self-Rewarding Language Models (SRLM) is to create a training algorithm that overcomes these limitations. “The key to such an approach is to develop an agent that has all the desired skills during training, instead of separating them into separate models such as a reward model and a language model,” the researchers write in their paper.

SRLM has two main functions. First, it can provide helpful and harmless responses to instructions given by the user. Second, it can generate new instruction examples and candidate responses, and evaluate them.

This allows it to retrain itself with Artificial Intelligence Feedback (AIF) and gradually evolve by generating and training on its own data.

With each iteration, the model becomes better at following instructions. Correspondingly, it also becomes better at generating examples for the next round of training.

How SRLM works

Self-rewarding language models (SRLM) generate and evaluate their own training examples (source: arXiv)
Self-rewarding language models start with a base LLM trained on a large corpus of text. The model is then fine-tuned on a small seed of human-annotated examples. The seed data contains instruction fine-tuning (IFT) examples consisting of instruction-response pairs.

To improve results, the seed data can also include evaluation fine-tuning (EFT) examples. In an EFT example, the prompt contains an instruction and a set of responses, and the model must rank the responses according to how well they address the input prompt. The evaluation output consists of a reasoning chain followed by a final score. With these examples, the LLM can play the role of a reward model.
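
For illustration, the two kinds of seed records might look like the sketch below. The field names, the example text, and the 5-point scale are assumptions chosen for readability, not the paper's exact data format.

```python
# Illustrative IFT and EFT seed records; the schema and wording are hypothetical.
ift_example = {
    "instruction": "Summarize the water cycle in two sentences.",
    "response": "Water evaporates, condenses into clouds, and falls as precipitation. "
                "It then flows back into oceans and lakes, where the cycle begins again.",
}

eft_example = {
    "instruction": "Summarize the water cycle in two sentences.",
    "candidate_responses": [
        "Water evaporates, condenses into clouds, and falls as precipitation...",
        "The water cycle is a thing that happens with water.",
    ],
    # Target output: a short reasoning chain followed by a final score per response.
    "evaluation": "Response 1 is accurate and complete. Score: 5\n"
                  "Response 2 is vague and uninformative. Score: 1",
}

print(eft_example["evaluation"])
```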

After the model is trained on the seed dataset, it can generate data for subsequent training iterations. In this step, the model samples examples from the original IFT dataset and creates a new prompt. It then generates several candidate responses for the newly created prompt.
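
A rough sketch of this step is shown below. The `ToyModel` class, the few-shot count, and the sampling settings are illustrative assumptions rather than the paper's exact setup.

```python
import random

class ToyModel:
    """Stand-in for the fine-tuned LLM; returns canned text instead of real generations."""
    def generate(self, prompt, temperature=0.7, top_p=0.9):
        return f"sampled response (T={temperature}, p={top_p}) for: {prompt.splitlines()[-1]}"

def propose_new_prompt(model, seed_prompts, k=3):
    """Show the model a few existing IFT prompts and ask it to write a new one."""
    shots = "\n".join(f"Task: {p}" for p in random.sample(seed_prompts, k))
    return model.generate(shots + "\nWrite one more task in the same style.\nTask:")

def generate_candidates(model, new_prompt, n=4):
    """Sample several candidate responses for the newly created prompt."""
    return [model.generate(new_prompt) for _ in range(n)]

model = ToyModel()
seeds = ["Summarize the water cycle.", "Explain DNS to a beginner.", "Write a haiku about autumn."]
new_prompt = propose_new_prompt(model, seeds)
print(generate_candidates(model, new_prompt))
```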

Finally, the model uses LLM-as-a-Judge to evaluate the responses. LLM-as-a-Judge requires a special prompt that includes the original instruction, the candidate responses, and guidelines for evaluating them.

LLM-as-a-Judge prompt (source: arXiv)
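
The sketch below shows the general shape of such a judge step. The rubric text only paraphrases the idea of an additive scoring prompt and is not the paper's exact wording, and `generate_fn` stands in for a call to the model.

```python
import re

JUDGE_TEMPLATE = """Review the instruction and the candidate response below.
Award points (up to 5) for relevance, coverage, helpfulness, clarity, and overall
quality. Explain your reasoning, then finish with a line of the form "Score: <0-5>".

Instruction: {instruction}
Response: {response}"""

def judge(generate_fn, instruction, response):
    """Ask the model to rate one candidate response and parse the final score."""
    verdict = generate_fn(JUDGE_TEMPLATE.format(instruction=instruction, response=response))
    match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", verdict)
    return float(match.group(1)) if match else None

# Toy generator that returns a canned verdict, just to exercise the score parser.
print(judge(lambda p: "Clear, correct, and concise. Score: 4",
            "Explain DNS to a beginner.", "DNS translates domain names into IP addresses."))
```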

After the model has generated instruction examples and scored the candidate responses, SRLM uses them to build the AI feedback training (AIFT) dataset. There are two ways to create this training data. One is a preference dataset that contains the instruction together with a preferred and a rejected response. This dataset can be used with Direct Preference Optimization (DPO) to train the model to distinguish between good and bad responses. Alternatively, a supervised fine-tuning (SFT) dataset can be created that contains only the highest-scored response. The researchers found that including the preference information improved the performance of the trained model.
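
A minimal sketch of the two options follows, assuming each candidate already carries a judge score; the field names are illustrative.

```python
def to_preference_pair(prompt, scored_candidates):
    """Pair the best- and worst-scored responses as (chosen, rejected) data for DPO."""
    ranked = sorted(scored_candidates, key=lambda c: c["score"], reverse=True)
    if ranked[0]["score"] == ranked[-1]["score"]:
        return None  # all candidates tied: no usable preference signal
    return {"prompt": prompt, "chosen": ranked[0]["text"], "rejected": ranked[-1]["text"]}

def to_sft_example(prompt, scored_candidates):
    """Alternative: keep only the top-scored response for supervised fine-tuning."""
    best = max(scored_candidates, key=lambda c: c["score"])
    return {"prompt": prompt, "response": best["text"]}

candidates = [{"text": "a thorough, correct answer", "score": 4.0},
              {"text": "a vague answer", "score": 2.0}]
print(to_preference_pair("Explain DNS to a beginner.", candidates))
print(to_sft_example("Explain DNS to a beginner.", candidates))
```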

After adding the newly created examples to the original dataset, the model can be retrained. This process is repeated several times, each cycle creating a model that is better at both following instructions and evaluating responses.
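
Putting the steps together, the overall loop might look like the following control-flow sketch. All model operations are injected as callables, and the trivial stand-ins at the bottom only demonstrate that the flow runs end to end; nothing here reflects the paper's actual training code.

```python
def self_rewarding_loop(fine_tune, propose_prompt, sample_candidates, score,
                        seed_data, iterations=3):
    """Each iteration trains model M_{t+1} on data generated and judged by M_t."""
    model, dataset = fine_tune(None, seed_data), list(seed_data)
    for _ in range(iterations - 1):
        prompt = propose_prompt(model, dataset)
        scored = [(c, score(model, prompt, c)) for c in sample_candidates(model, prompt)]
        best, worst = max(scored, key=lambda s: s[1]), min(scored, key=lambda s: s[1])
        dataset.append({"prompt": prompt, "chosen": best[0], "rejected": worst[0]})
        model = fine_tune(model, dataset)  # the judge improves along with the generator
    return model

# Trivial stand-ins, only to show that the control flow runs.
final_model = self_rewarding_loop(
    fine_tune=lambda m, d: {"trained_on": len(d)},
    propose_prompt=lambda m, d: "Write a haiku about autumn.",
    sample_candidates=lambda m, p: ["first draft", "a noticeably better second draft"],
    score=lambda m, p, c: float(len(c)),
    seed_data=[{"prompt": "Explain DNS.", "response": "DNS maps names to IP addresses."}],
)
print(final_model)
```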

“Importantly, because the model can both improve its generation capability and act as its own reward model through the same generation mechanism, this means that the reward model itself can evolve through these iterations, deviating from conventional practices where the reward model is fixed,” the researchers write. “We believe this could raise the ceiling of the self-improvement capabilities of these learning models in the future and remove a limiting bottleneck.”

Experimenting with SRLM

The researchers tested self-rewarding language models on the Llama-2-70B base model. As seed data for instruction fine-tuning, they used the Open Assistant dataset, which contains thousands of instruction fine-tuning examples. Open Assistant also has examples of instructions with multiple ranked responses, which can be used for EFT.

Their experiments show that each iteration of self-rewarding language modeling improves the LLM’s ability to follow instructions. The LLM also becomes a better reward model, which in turn yields better training examples for the next iteration. Their tests on the AlpacaEval benchmark show that Llama-2 with three iterations of SRLM outperformed Claude 2, Gemini Pro, and GPT-4-0613.
