Usage

After installing, you will be able to run Reinforcement Learning experiments using the package.

Experiment

An experiment consists of two parts:

  • Training your model using a Reinforcement Learning algorithm

  • (Optional) Evaluating the resulting model
    • Harness Evaluation

    • Local Evaluation

Important

All the configurations and hyperparameters for an experiment must be specified in a configuration file in YAML.

When you run an experiment, training jobs and evaluation jobs are created and submitted to MareNostrum5 using SLURM.

For evaluation, some tasks are implemented in the Evaluation Harness. For those which are not, we use custom scripts (This is what we call “Local Evaluation”).

All evaluation jobs are submited to MareNostrum5 with dependencies to their respective training jobs.

Note

Note that evaluation is optional: if you do not specify an "evaluation" field in your config.yaml file, then only training will happen.

Configuration file for an Experiment

First, you will need a configuration file in YAML for your experiment (The values marked with None are computed internally by the package):

Example YAML configuration file
execution:
    algorithm: "dpo" # Reinforcement Learning Algorithm
    venv: "<path_to_your_venv>"
    output_dir: "<path_to_your_output_dir>"
    distributed_config: "DSZero3Offload" # For distributed training in MN5

slurm:
    job-name: "<your_slurm_job_name>"
    # output: None
    # error: None
    nodes: 2
    cpus-per-task: 80
    gres: "gpu:4"
    time: "2:00:00"
    account: "bsc88"
    qos: "acc_debug"

rl_script_args:
    dataset_name: "<path_to_rl_dataset>"

model_config_args:
    model_name_or_path: "<path_to_model>"
    # output_dir: None
    attn_implementation: "flash_attention_2"
    torch_dtype: "bfloat16"

rl_config_args:
    # RL configs are subclasses of transformers.TrainingArguments

    # Different RL algorithms have different uses of beta.
    # However, in most of them, it is the weight of the KL-divergence (Loss=reward+Beta*KL)
    beta: 0.2
    max_length: 8192
    max_prompt_length: 128 # Default. When specified, you use the default data collator
    remove_unused_columns: false
    dataset_num_proc: 1
    # ====
    # From `TrainingArguments`:
    # ===
    learning_rate: 5.0e-6
    num_train_epochs: 2
    bf16: true
    eval_strategy: "steps"
    eval_steps: 0.05

    # logging_dir: None
    # local_rank: None
    report_to: "wandb"
    # These arguments help to manage GPU memory
    per_device_train_batch_size: 2
    per_device_eval_batch_size: 2
    gradient_accumulation_steps: 8
    gradient_checkpointing: true

environment:
    # Bash environment variables
    WANDB_PROJECT: "salamandra_alignment"
    WANDB_NAME: "test_alignment"
    # WANDB_DIR: None

# Evaluation is optional
evaluation:
    harness_tasks:
        - "flores_en-es"
        - "flores_es-ca"
        - "wnli_es"
        - "xlsum_es"
    harness_slurm:
    # job name, logs, and gpus are automatically computed
        qos: "acc_bscls"
        account: "bsc88"
        nodes: 2
        time: "12:00:00"
        # job-name: None
        # output: None
        # error: None
        # cpus-per-task: None #
        # gres : None  # "gpu:4"

Running an experiment

You can use your config.yaml file to run an experiment, using the CLI:

$ rl_salamandra_mn5 config.yaml

This will generate and submit SLURM jobs to MareNostrum 5, you can find the trained models, slurm scripts, slurm logs, and evaluation results in your output_dir.

Debugging

For debugging, use the --debug flag:

$ rl_salamandra_mn5 config.yaml --debug

In debugging mode, SLURM scripts will be generated but not submitted.

Skipping evaluation

If you only want to train but not evaluate nmodels, you can use the --no_evaluation flag

$ rl_salamandra_mn5 config.yaml --no_evaluation

This will create the training and evaluation jobs for SLURM, but it will only submit the training jobs. This may be useful when the evaluation queue is long, or when you want to make a quick experiment.

Subexperiments

To experiment with different configurations of values, you can use lists in your config.yaml file.

For example, the following config.yaml for one experiment executes 12 subexperiments:

  • 6 runs of DPO: on 2 models with 3 learning rates, and

  • 6 runs of KTO: on the same 2 models with the same 3 learning rates

Note that both hyphens (-) and square brackes ([]) work for writing lists in YAML.

Setting up subexperiments
...
execution:
    algorithm:
        - "dpo"
        - "kto"
...
model_config_args:
    model_name_or_path:
        - "model_1"
        - "model_2"
...
rl_config_args:
    learning_rate: [5.0e-6, 1.0e-5, 1.0e-6]
...

Warning

Note that any of the values in the configuration can be a list, except output_dir under execution. The output_dir must always be an absolute path.

Furthermore, for a given configuration file, all subexperiments generated from it share the same evaluation field, which will not be unfolded. This means that you can specify lists inside the evaluation field (for example, lists of evaluation tasks), and doing so will not create more subexperiments.