The open-source dataset I found contains fewer than 50,000 Q&A pairs. The recommended setup is more than 200 GB of memory, and my local machine with 60 GB cannot run it.
For LoRA I use Hugging Face's peft:
https://github.com/huggingface/peft
I wrote two versions of the training code:
The first follows the peft examples: https://github.com/huggingface/peft/tree/main/examples. With 60 GB of memory and a batch size of 1 it ran for 60:06:35; when the batch size reached around 200 it ran out of memory. The second uses the transformers Trainer directly, to compare different rank and scaling settings.
Model (note: I actually have a Tesla M40 locally, and with this configuration it only needs 5 GB of VRAM to run, so it is limited to local testing):
from peft import (
    get_peft_config,
    get_peft_model,
    get_peft_model_state_dict,
    set_peft_model_state_dict,
    LoraConfig,
    PeftType,
    PrefixTuningConfig,
    PromptEncoderConfig,
)

peft_config = LoraConfig(task_type="SEQ_CLS", inference_mode=False, r=8, lora_alpha=16, lora_dropout=0.1)
print('peft_config', peft_config)

from transformers import AutoTokenizer, AutoModel

model_path = ".../chatglm2-6b"
# note: no_cuda is a TrainingArguments option, not a tokenizer one; it is ignored here
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, padding_side="left", no_cuda=True)
if getattr(tokenizer, "pad_token_id") is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# load in fp32 (.float()) so the model can run on the CPU; see the note below
model = AutoModel.from_pretrained(model_path, return_dict=True, trust_remote_code=True).float()
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
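To see exactly what get_peft_model left trainable, it helps to list the parameters with requires_grad=True right after the call above. A minimal sketch (plain PyTorch, nothing ChatGLM2-specific):

for name, param in model.named_parameters():
    if param.requires_grad:
        # with a LoRA config, only the injected lora_A / lora_B matrices
        # (plus any modules_to_save head) should appear here
        print(name, tuple(param.shape))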
The second version runs on 512 GB of memory and a 112-core CPU, comparing different rank settings and their run times. From https://huggingface.co/docs/peft/main/en/quicktour#peftconfig: r is the dimension of the low-rank matrices, and lora_alpha is the scaling factor for the low-rank matrices. These two parameters are easier to understand side by side than from the definitions alone.

r=8, lora_alpha=16
    trainable params: 1949696 || all params: 6245533696 || trainable%: 0.0312
    time: 372:16:31

r=512, lora_alpha=16
    trainable params: 124780544 || all params: 6368364544 || trainable%: 1.9594
    time: 352:06:08

r=1024, lora_alpha=16
    trainable params: 2495610880 || all params: 8739194880 || trainable%: 28.5565
    time: 98:26:06

r=1024, lora_alpha=512
    trainable params: 249561088 || all params: 6493145088 || trainable%: 3.8435
    time: 352:07:33

Without LoRA: 18:30:54 < 500:35:51 (elapsed < estimated remaining)
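One detail that makes the comparison easier to read: in peft's LoRA implementation the low-rank update is scaled by lora_alpha / r, so raising r while keeping lora_alpha fixed also shrinks how strongly each adapted direction is applied. A quick sketch over the configurations above:

configs = [(8, 16), (512, 16), (1024, 16), (1024, 512)]
for r, lora_alpha in configs:
    # the LoRA delta added to a weight is multiplied by lora_alpha / r
    print(f"r={r:<5} lora_alpha={lora_alpha:<4} scaling={lora_alpha / r:g}")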
A note on the model code above:
In the third-to-last line, AutoModel.from_pretrained(...): if you don't append .float() and then use collate_fn = DataCollatorForSeq2Seq(tokenizer, return_tensors="pt", padding=True) and the like, you get the error "addmm_impl_cpu_" not implemented for 'Half', because the weights load in half precision and CPU matrix multiplication has no fp16 kernel. If you handle the casting yourself it generally doesn't matter; what decides it is whether CUDA is actually in use and the no_cuda setting during training.
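A minimal sketch of handling the precision explicitly instead of always calling .float(); the use_cuda flag here is just illustrative, not part of the original code:

import torch
from transformers import AutoModel

use_cuda = torch.cuda.is_available()
model = AutoModel.from_pretrained(model_path, return_dict=True, trust_remote_code=True)
if use_cuda:
    model = model.half().cuda()  # fp16 is fine on the GPU
else:
    model = model.float()        # the CPU "addmm" kernel has no fp16 version, so cast to fp32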
Data loading:
The code that reads the raw data is copied straight from the official example, so I won't post it; every dataset is different anyway, so posting it would not be much use.

from datasets import Dataset
from torch.utils.data import DataLoader
from transformers import DataCollatorForSeq2Seq

valid_dataset = dataset[40000:]
valid_dataset = Dataset.from_dict(valid_dataset)
train_dataset = Dataset.from_dict(dataset[:40000])

collate_fn = DataCollatorForSeq2Seq(tokenizer, return_tensors="pt", padding=True)
train_dataloader = DataLoader(train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size)
valid_dataloader = DataLoader(valid_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size)
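Before training, it is worth pulling one batch through the collator to confirm the padding and shapes look right; a small sanity-check sketch (the field names depend on how the data was tokenized):

batch = next(iter(train_dataloader))
for key, value in batch.items():
    # typically input_ids / attention_mask / labels, padded to the longest sequence in the batch
    print(key, value.shape, value.dtype)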
Training code, following the peft example:
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup  # , set_seed
from tqdm import tqdm

# lr, num_epochs and batch_size are assumed to be set earlier;
# `metric` is an evaluate-style metric object created beforehand (e.g. metric = evaluate.load(...))
optimizer = AdamW(params=model.parameters(), lr=lr)
# learning-rate schedule used by lr_scheduler.step() below
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=len(train_dataloader) * num_epochs,
)

device = 'cpu'
model.to(device)

for epoch in range(num_epochs):
    model.train()
    for step, batch in enumerate(tqdm(train_dataloader)):
        batch.to(device)
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    model.eval()
    for step, batch in enumerate(tqdm(valid_dataloader)):
        batch.to(device)
        with torch.no_grad():
            outputs = model(**batch)
        predictions = outputs.logits.argmax(dim=-1)
        predictions, references = predictions, batch["labels"]
        metric.add_batch(
            predictions=predictions,
            references=references,
        )
    eval_metric = metric.compute()
    print(f"epoch {epoch}:", eval_metric)

save_directory = '6b2peft-lora'
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)
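Once the adapter is saved, it can be loaded back on top of the base model for inference with peft's PeftModel; a minimal sketch using the same model_path and save directory as above:

from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

base = AutoModel.from_pretrained(model_path, return_dict=True, trust_remote_code=True).float()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = PeftModel.from_pretrained(base, '6b2peft-lora')  # directory written by save_pretrained above
model.eval()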
Then I figured that since I am already using transformers, I might as well use its Trainer throughout for cleaner code:
from transformers import TrainingArguments, Trainer

# model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to('cpu')
model.to('cpu')
seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
save_directory = '6b2peft-lora-t'

def compute_metrics(eval_pred):
    # Trainer expects a callable that takes an EvalPrediction and returns a dict of metrics;
    # this is a rough token-level stand-in that reuses the `metric` object from above
    predictions, labels = eval_pred
    if isinstance(predictions, tuple):  # some models return extra tensors besides the logits
        predictions = predictions[0]
    predictions = predictions.argmax(axis=-1)
    mask = labels != -100               # skip padded label positions
    return metric.compute(predictions=predictions[mask], references=labels[mask])

args = TrainingArguments(
    output_dir='6b2peft-lora-t',
    warmup_steps=100,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    save_steps=500,
    eval_steps=500,
    logging_steps=300,
    gradient_accumulation_steps=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=1,
    no_cuda=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=seq2seq_data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)
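For deployment without the peft wrapper, the LoRA weights can also be folded back into the base model with merge_and_unload(); a sketch assuming the adapter saved above (the merged output directory name is just an example):

from transformers import AutoModel
from peft import PeftModel

base = AutoModel.from_pretrained(model_path, return_dict=True, trust_remote_code=True).float()
peft_model = PeftModel.from_pretrained(base, '6b2peft-lora-t')
merged = peft_model.merge_and_unload()  # adds (lora_alpha / r) * B @ A into the original weights
merged.save_pretrained('6b2peft-lora-t-merged')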