Usage#

Basic Usage#

The procedure for using this library involves:

  1. Loading a pre-trained language model,

  2. Deciding which modules in your model to quantize to int8 (note: only nn.Linear and nn.Embedding modules can be quantized at the moment),

  3. (Optional) Explicitly quantizing the pretrained model to save memory,

  4. Creating one or more fine-tuned models.
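
Putting these steps together, here is a condensed sketch of the whole workflow, using only the model name and functions covered in the sections below:

import transformers
import finetuna as ft

# Load a pre-trained language model from HuggingFace
base_model = transformers.AutoModelForCausalLM.from_pretrained('facebook/opt-1.3b')

# (Optional) quantize the frozen base weights to int8 to save memory
ft.prepare_base_model(base_model)

# Create a model to fine-tune: LoRA adapters on top of the frozen base weights
ft1 = ft.new_finetuned(base_model)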

1. Loading a Pre-Trained Language Model#

All we require at this point is an nn.Module that we can use in subsequent steps. While finetuna is intended for use with language models, this doesn’t preclude its use with other types of models, since it can adapt any model’s nn.Linear or nn.Embedding layers.

This follows the standard procedure for loading models from e.g. HuggingFace.

import transformers

model_name = 'facebook/opt-1.3b'
base_model = transformers.AutoModelForCausalLM.from_pretrained(model_name)

However, if you face memory issues, you can install HF accelerate (pip install accelerate) to use low_cpu_mem_usage=True, and also load the weights in float16:

import torch as t

base_model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=t.float16,
    low_cpu_mem_usage=True,
    use_cache=False,
)

Please don’t load the model with HuggingFace’s recent load_in_8bit=True as this will interfere with finetuna. Of course if you are only interested in quantization, then you should just use this feature and not use finetuna!

2. Viewing Adaptable Modules#

finetuna fine-tunes pre-trained models by freezing the pre-trained weights (quantized or not) in each network module, and adding low-rank adapters on top of these modules.

By default, the library will add adapters to all the layers, but this is often unnecessary: you can save a lot of memory and computation by carefully selecting the modules you add adapters to.

To get a list of your model’s modules that you can add adapters to, first quantize the model, then use the get_lora_adaptable_modules helper function:

import finetuna as ft

ft.prepare_base_model(base_model)
print(ft.get_lora_adaptable_modules(base_model))
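
Assuming the helper returns an iterable of module name strings (the same kind of names passed as sets in the examples below), you can filter it to pick out just the layers you care about, for example the attention projections, before handing them to new_finetuned later on:

adaptable = ft.get_lora_adaptable_modules(base_model)

# Keep only attention projection matrices (the 'proj' suffix is OPT-specific and illustrative)
attn_projections = {name for name in adaptable if name.endswith('proj')}
print(attn_projections)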

3. Quantizing the Model#

Quantizing the model refers to converting all (or a subset of) the frozen pretrained model weights to int8, using the quantization scheme described in LLM.int8().

This is an optional step and will be done automatically when creating new finetuned models in the next step if base_model is not yet quantized.

If, however, memory constraints mean that you can’t keep a full-precision model loaded in memory, then use prepare_base_model(base_model) to convert all the nn.Linear and nn.Embedding layers to 8-bit:

ft.prepare_base_model(base_model)

Note

The prepare_base_model function will modify the base_model in-place, although it returns a reference to it for convenience.

This function also accepts an additional modules_not_to_quantize argument: this does what it says on the tin, and doesn’t quantize the modules listed in this set. By default, it is set to {"lm_head"} (a module name shared by GPT- and OPT-style models in HuggingFace), since we often want to retain full precision for the language model head.

If you do want to quantize the language modelling head, you can set this to the empty set:

ft.prepare_base_model(base_model, modules_not_to_quantize=set())

Also note that if you quantize a module in the prepare_base_model function, subsequently requiring that this module is no longer quantized when calling new_finetuned will result in an error. Later versions of finetuna may support this, but the loss of accuracy owing to the round-trip from float32 -> int8 -> float32 is clearly sub-optimal. Un-quantized modules can however later be quantized ad-hoc in new_finetuned.

As a result, if you think you may require a module not to be quantized in the future, it is safer to add it to the modules_not_to_quantize set, assuming you have the memory overhead.
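
For example, the following sketch keeps both the language model head and the attention output projection in full precision (out_proj is the OPT module name used elsewhere on this page; substitute the names from your own model):

ft.prepare_base_model(
    base_model,
    modules_not_to_quantize={'lm_head', 'out_proj'},
)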

4. Creating Fine-Tuned models#

Now that we have the base model in hand, we are ready to create some new models to fine-tune using the new_finetuned function. The most basic invocation, called without any extra arguments, will:

  • freeze and quantize all the pretrained nn.Linear and nn.Embedding modules (with lm_head in the modules_not_to_quantize set by default). If you previously quantized the base model, then this step is skipped.

  • add LoRA adapters to all Embedding and Linear layers using the default adapter configs (once again, the unquantized lm_head is treated as an exception by default, and optimised directly in its original datatype).

  • freeze all other base_model parameters which cannot be adapted.

ft1 = ft.new_finetuned(base_model)
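
From here, ft1 behaves like any other nn.Module, so a standard PyTorch training loop applies. A minimal sketch, assuming a HuggingFace-style causal LM forward signature and a dataloader yielding tokenized batches (both are assumptions, not part of finetuna’s API):

import torch as t

# Only parameters with requires_grad=True (the adapters and any plain layers) are optimized
opt = t.optim.AdamW(
    (p for p in ft1.parameters() if p.requires_grad),
    lr=1e-4,
)

for batch in dataloader:
    out = ft1(input_ids=batch['input_ids'], labels=batch['labels'])
    out.loss.backward()
    opt.step()
    opt.zero_grad()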

Using the adapt_layers argument#

If you only wish to adapt certain layers, then you can specify these layers in the adapt_layers argument:

ft1 = ft.new_finetuned(base_model, adapt_layers={"q_proj", "v_proj"})

In the above, we

  • freeze and quantize all pretrained Embedding and Linear layers in base_model (excluding lm_head)

  • add LoRA adapters to q_proj and v_proj matrices only

  • freeze everything else

In general, adapting just the query and value projection matrices in the attention modules will be effective in fine-tuning the model, while greatly decreasing the memory and computation required to do so.

See Section 7.1 of the LoRA paper for a discussion of which layers are worth adapting.

Using the plain_layers argument#

Occasionally, we want to keep a layer the same as in the base model, and fine-tune it directly.

The running example of this has been the lm_head module, which is neither frozen nor quantized by default. When we call opt.step() on our optimizer, we update its parameters directly, not those of an adapter.

You can specify other layers to keep exactly as in the underlying base_model by adding them to the plain_layers argument when creating a new finetuned model:

ft2 = ft.new_finetuned(
    base_model,
    adapt_layers={"q_proj", "v_proj"},
    plain_layers={"lm_head", "out_proj", "layer_norm"}
)

In the above, we

  • freeze and quantize all pretrained Embedding and Linear layers in base_model (excluding lm_head, out_proj and layer_norm)

  • add LoRA adapters to q_proj and v_proj matrices only

  • freeze all base_model parameters, except those in lm_head, out_proj and layer_norm.
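
If you want to confirm which parameters will be optimised directly, you can inspect the trainable parameters of the new model; a quick sketch using standard nn.Module introspection:

# Names of all parameters that will receive gradient updates:
# the LoRA adapter parameters plus those of lm_head, out_proj and layer_norm
trainable = [name for name, p in ft2.named_parameters() if p.requires_grad]
print(trainable)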

Specifying Adapter Configurations#

By default, we use a LoRA adapter with a rank of 4 (r=4) and a scaling factor of \(\alpha / r\), where alpha=1. For Linear adapters, we additionally set the dropout probability to 0.0 and use a bias.

For Embedding adapters, the embedding_config argument to new_finetuned can either be:

  • None, in which case the following default configuration is used:

    EmbeddingAdapterConfig(r=4, alpha=1)
    
  • A single EmbeddingAdapterConfig, which is applied to all Embedding layers to adapt.

  • A dictionary of type dict[str, EmbeddingAdapterConfig], which specifies the adapter configuration for each module to adapt. An error is raised if a module is left out.

Similarly, for Linear adapters, the linear_config argument can also be None, a single LinearAdapterConfig, or a dictionary of type dict[str, LinearAdapterConfig].

See Section 7.2 of the LoRA paper for a discussion of what rank to use. In summary, performance seems to improve through r=1 and r=2, plateaus at r=4, and falls back down at r=8. Setting the rank very high (e.g. r=64) yields no benefit.

Owing to the size of the layers, a lot of memory and computation can be saved for each incremental decrease in r.
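
For instance, here is a sketch that drops the linear adapter rank to r=2 everywhere and uses a per-module dict for the embedding adapters. The config classes are those shown in the signature below (they live in finetuna.main), and embed_tokens is an illustrative embedding module name:

from finetuna.main import EmbeddingAdapterConfig, LinearAdapterConfig

ft3 = ft.new_finetuned(
    base_model,
    adapt_layers={'q_proj', 'v_proj', 'embed_tokens'},
    linear_config=LinearAdapterConfig(r=2, alpha=1, dropout=0.0, bias=True),
    embedding_config={
        'embed_tokens': EmbeddingAdapterConfig(r=2, alpha=1),
    },
)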

More Controls#

The options described thus far should be all you need for most cases. The options in this section should not have to be used very often.

The new_finetuned function also has two other arguments called modules_not_to_freeze and modules_not_to_quantize.

The plain_layers argument is really just for convenience, and inserts its contents into both modules_not_to_freeze and modules_not_to_quantize. Using these two arguments however allows you more fine-grained control, such as adapting a non-quantized layer.
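
For example, a sketch of adapting a layer while keeping it un-quantized (this assumes q_proj and v_proj were not already quantized during an earlier prepare_base_model call, per restriction 1 below):

ft4 = ft.new_finetuned(
    base_model,
    adapt_layers={'q_proj', 'v_proj'},
    # adapt these projections without quantizing their frozen base weights
    modules_not_to_quantize={'q_proj', 'v_proj'},
)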

Note that for now it is an error to:

  1. place a module in modules_not_to_quantize if it has previously been quantized in the base model during a call to prepare_base_model.

  2. place a module in modules_not_to_freeze if it is quantized (directly optimising int8 weights is possible, and will be supported in the future)

For completeness, the full signature of the new_finetuned function is:

finetuna.new_finetuned(
    model: torch.nn.modules.module.Module,
    adapt_layers: Optional[Set[str]] = None,
    plain_layers: Set[str] = {'lm_head'},
    embedding_config: Union[Dict[str, finetuna.main.EmbeddingAdapterConfig], finetuna.main.EmbeddingAdapterConfig] = EmbeddingAdapterConfig(r=4, alpha=1),
    linear_config: Union[Dict[str, finetuna.main.LinearAdapterConfig], finetuna.main.LinearAdapterConfig] = LinearAdapterConfig(r=4, alpha=1, dropout=0.0, bias=True),
    modules_not_to_quantize: Set[str] = {},
    modules_not_to_freeze: Set[str] = {},
    do_not_quantize: bool = False,
) → torch.nn.modules.module.Module#

Create a new finetuned model from a pretrained model whose weights will be shared.

Parameters
  • model – base pretrained model. Can be quantized already.

  • adapt_layers – the layers onto which to add LoRA adapters. Only Linear and Embedding layers are suitable for this. If omitted, all suitable layers are adapted.

  • plain_layers – layers not to quantize or freeze (or LoRA-adapt).

  • embedding_config – the configuration to use for the added LoRA EmbeddingAdapters. Either a single config to apply to all layers, or a dict of configs for each embedding layer in adapt_layers.

  • linear_config – the configuration to use for the added LoRA LinearAdapters. Either a single config to apply to all layers, or a dict of configs for each linear layer in adapt_layers.

  • modules_not_to_quantize – don’t quantize these modules. It is an error to include a module that was already quantized in a previous prepare_base_model() call.

  • modules_not_to_freeze – don’t freeze these modules’ weights. It is an error for a module to be both adapted and unfrozen (i.e. base weights not frozen while an adapter is present).

  • do_not_quantize – a switch to turn off quantization entirely

Raises

ValueError – if the adapter configs are not exhaustively specified when using a dict for either embedding_config or linear_config.