Fine-tuning ChatGLM2-6B with LoRA on CPU

The open-source dataset I found contains fewer than 50,000 Q&A pairs, and the recommended setup is more than 200 GB of memory; my local machine with 60 GB cannot run it.

The LoRA implementation uses Hugging Face's peft:

https://github.com/huggingface/peft

I wrote two versions of the training code:

The first version follows the peft examples: https://github.com/huggingface/peft/tree/main/examples. With 60 GB of memory and a batch size of 1 it ran for 60:06:35; when the batch size was pushed to around 200, it ran out of memory. The second version uses the transformers Trainer directly, which makes it easy to compare different rank and scaling settings.

Model setup (note: I actually have a Tesla M40 locally, and with this configuration it needs only about 5 GB of VRAM to run, though that is limited to local testing):

from peft import (
    get_peft_config,
    get_peft_model,
    get_peft_model_state_dict,
    set_peft_model_state_dict,
    LoraConfig,
    PeftType,
    PrefixTuningConfig,
    PromptEncoderConfig,
)

peft_config = LoraConfig(task_type="SEQ_CLS", inference_mode=False, r=8, lora_alpha=16, lora_dropout=0.1)
print('peft_config', peft_config)

from transformers import AutoTokenizer, AutoModel

model_path = ".../chatglm2-6b"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, padding_side="left", no_cuda=True)
if getattr(tokenizer, "pad_token_id") is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id
    pass

# .float() casts the half-precision weights to float32 so the model can train on CPU (see the note below)
model = AutoModel.from_pretrained(model_path, return_dict=True, trust_remote_code=True).float()
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
The second version ran on 512 GB of memory and a 112-core CPU, comparing different rank settings and their execution times. Parameter reference: https://huggingface.co/docs/peft/main/en/quicktour#peftconfig

r is the dimension of the low-rank matrices; lora_alpha is the scaling factor for the low-rank matrices. These two parameters are easier to understand through comparison than through their definitions alone.

r=8, lora_alpha=16
trainable params: 1949696 || all params: 6245533696 || trainable%: 0.031217444255383614
time: 372:16:31

r=512, lora_alpha=16
trainable params: 124780544 || all params: 6368364544 || trainable%: 1.9593813001418532
time: 352:06:08

r=1024, lora_alpha=16
trainable params: 2495610880 || all params: 8739194880 || trainable%: 28.556530827700237
time: 798:26:06

r=1024, lora_alpha=512
trainable params: 249561088 || all params: 6493145088 || trainable%: 3.843454668234883
time: 352:07:33

Not using LoRA: 18:30:54<500:35:51
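
For reference, a minimal sketch of how such a comparison can be generated, assuming the same model_path and loading arguments as in the code above, and reloading the base model for each configuration so the adapters do not stack. In peft the LoRA update is scaled by lora_alpha / r, which is why raising lora_alpha at a fixed r changes the effective strength of the adaptation without changing the trainable parameter count:

from peft import LoraConfig, get_peft_model
from transformers import AutoModel

for r, lora_alpha in [(8, 16), (512, 16), (1024, 16), (1024, 512)]:
    # reload the base model so each configuration starts from clean weights
    base = AutoModel.from_pretrained(model_path, return_dict=True, trust_remote_code=True).float()
    cfg = LoraConfig(task_type="SEQ_CLS", inference_mode=False,
                     r=r, lora_alpha=lora_alpha, lora_dropout=0.1)
    peft_model = get_peft_model(base, cfg)
    # prints the "trainable params || all params || trainable%" lines quoted above
    peft_model.print_trainable_parameters()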

A note on the model-setup code:

In the third line from the end of that block, model = AutoModel.from_pretrained(...): if you don't add .float() and then use something like collate_fn = DataCollatorForSeq2Seq(tokenizer, return_tensors="pt", padding=True), it will raise the error "addmm_impl_cpu_" not implemented for 'Half', because the weights stay in half precision and PyTorch's CPU matmul kernels do not support it. If you handle it yourself, it generally isn't a problem; it comes down to whether CUDA is actually available and the no_cuda setting used during training.
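
To see the dtype issue directly, here is a small check (a sketch assuming the same model_path as above) that inspects the parameter dtype before and after the cast:

from transformers import AutoModel

model = AutoModel.from_pretrained(model_path, return_dict=True, trust_remote_code=True)
print(next(model.parameters()).dtype)  # if this shows torch.float16, CPU matmul (addmm) will fail
model = model.float()                  # cast all weights to float32 for CPU training
print(next(model.parameters()).dtype)  # torch.float32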

Data loading:

The data-reading part is copied directly from the official example, so I won't post it; every dataset is different, and posting mine would be of no use.

from datasets import Dataset
from torch.utils.data import DataLoader
from torch.utils.data.dataloader import default_collate
from transformers import DataCollatorForSeq2Seq

valid_dataset = dataset[40000:]
valid_dataset = Dataset.from_dict(valid_dataset)
train_dataset = Dataset.from_dict(dataset[:40000])

batch_size = 1  # batch size 1, as in the 60 GB run described above
collate_fn = DataCollatorForSeq2Seq(tokenizer, return_tensors="pt", padding=True)
train_dataloader = DataLoader(train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size)
valid_dataloader = DataLoader(valid_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size)
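
As a rough illustration of what the omitted reading code needs to produce, dataset should behave like a dict of lists whose examples already contain input_ids and labels. The helper below is a hypothetical sketch, not the original preprocessing; the field names question and answer and the length limits are assumptions:

def preprocess(example, max_source_len=256, max_target_len=256):
    # "question"/"answer" are placeholder field names; adapt them to the actual data
    prompt_ids = tokenizer.encode(example["question"], max_length=max_source_len, truncation=True)
    answer_ids = tokenizer.encode(example["answer"], max_length=max_target_len, truncation=True,
                                  add_special_tokens=False)
    input_ids = prompt_ids + answer_ids + [tokenizer.eos_token_id]
    # mask the prompt tokens with -100 so only the answer contributes to the loss;
    # DataCollatorForSeq2Seq also pads labels with -100
    labels = [-100] * len(prompt_ids) + answer_ids + [tokenizer.eos_token_id]
    return {"input_ids": input_ids, "labels": labels}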

Training code based on the peft example:

import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup  # , set_seed
from tqdm import tqdm

lr = 1e-4          # placeholder value; the original learning rate is not given
num_epochs = 3     # placeholder value; the original epoch count is not given

optimizer = AdamW(params=model.parameters(), lr=lr)
# linear warmup/decay schedule, as in the peft example this code references
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=len(train_dataloader) * num_epochs,
)

device = 'cpu'
model.to(device)

for epoch in range(num_epochs):
    model.train()
    for step, batch in enumerate(tqdm(train_dataloader)):
        batch.to(device)
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    # metric is an evaluate-style metric object; its construction is omitted here
    model.eval()
    for step, batch in enumerate(tqdm(valid_dataloader)):
        batch.to(device)
        with torch.no_grad():
            outputs = model(**batch)
        predictions = outputs.logits.argmax(dim=-1)
        predictions, references = predictions, batch["labels"]
        metric.add_batch(
            predictions=predictions,
            references=references,
        )
    eval_metric = metric.compute()
    print(f"epoch {epoch}:", eval_metric)
    pass

save_directory = '6b2peft-lora'
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

Then I figured that, since I was already using transformers anyway, I might as well use its Trainer throughout for cleaner code:

from transformers import TrainingArguments, Trainer

# model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to('cpu')
model.to('cpu')
seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

save_directory = '6b2peft-lora-t'
args = TrainingArguments(
    output_dir='6b2peft-lora-t',
    warmup_steps=100,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-5,  # assumed value; adjust as needed
    num_train_epochs=3,
    weight_decay=0.01,
    save_steps=500,
    eval_steps=500,
    logging_steps=300,
    gradient_accumulation_steps=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=1,
    no_cuda=True,  # force CPU training
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=seq2seq_data_collator,
    compute_metrics=eval_metric,  # should be a callable taking an EvalPrediction and returning a dict of metrics
    tokenizer=tokenizer,
)
trainer.train()
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)
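
Once training finishes, the saved adapter can be attached back onto the base model for inference. A minimal sketch, assuming the model_path and save_directory values from the code above:

from peft import PeftModel
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
base = AutoModel.from_pretrained(model_path, return_dict=True, trust_remote_code=True).float()
model = PeftModel.from_pretrained(base, save_directory)  # attaches the LoRA weights saved above
model.eval()
# from here the wrapped model is used the same way as the base ChatGLM2 model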
