March 2nd, 2025

How to create a ChatGPT-like application for your business needs

Author: Gurkirt Pal


In today's competitive landscape, custom AI solutions have become a key requirement for many businesses. Learn how enterprise-grade fine-tuning can transform your customer interactions.

In this guide, you will learn how to fine-tune the LLaMA 3.2 model on your own dataset to create a customized AI assistant for your business.

We'll walk you through the entire process using a dataset containing conversations between doctors and patients, then fine-tune the AI to mimic the doctor's responses to a given query, so that we can send automated responses to patients' initial queries and capture leads through them.

What you will learn in this guide:

➤ What are LLM models and how to use them

➤ What are the prerequisites before starting

➤ How to fine-tune Llama 3.2 on your custom dataset

➤ How to prepare a dataset for fine-tuning

➤ How to run the fine-tuned model

➤ What are the common mistakes to avoid while fine-tuning LLM models

➤ How to push the fine-tuned model to Hugging Face

➤ Why fine-tuning works for custom AI chatbots

What are LLM models and why we use them

Large Language Models (LLMs) are AI systems trained on massive text datasets to understand and generate human-like responses. Meta's flagship model Llama 3.2 stands out with:

  1. Multiple parameter sizes, from lightweight 1B and 3B text models up to 11B and 90B vision models, which gives a lot of flexibility to choose the right version for your requirements
  2. Llama 3.2 is the latest model in the Llama family, with enhanced reasoning capabilities
  3. Most importantly, Llama family models come with a free commercial-use license, which means you can legally deploy your own custom chatbot built on these models without any licensing fees

What are the prerequisites before starting

1. A Hugging Face token with read and write access. You can get it from this official guide.

2. Request Llama model access from this link.

3. A Google Colab notebook session with a T4 GPU enabled; check this guide to get started.

How to fine-tune Llama 3.2 on your custom dataset

For fine-tuning, we will use a Google Colab notebook with free resources.

Please note that due to limited GPU memory, I used a few settings that will affect the overall quality of the trained model but drastically reduce memory consumption and total training time. I will point these out while explaining the code step by step, so that if you have a bigger and better GPU you can take full advantage of it.

1. Setup Environment

First, we need to install the required packages. Run the following command in the first cell of your Google Colab notebook. This will install the necessary packages for fine-tuning the Llama 3.2 model.

Why do we need these packages?
  1. transformers: the base library for model loading and handling
  2. datasets: for efficient dataset management
  3. accelerate: enables efficient GPU and multi-GPU training
  4. peft/trl: provide LoRA (Low-Rank Adaptation) and supervised fine-tuning utilities
  5. bitsandbytes: enables 4-bit quantization
  6. huggingface_hub: for model storage and sharing
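A minimal install cell, assuming the package set listed above (exact version pins are up to you):

    # Install the libraries used throughout this guide (run in the first Colab cell).
    !pip install -q transformers datasets accelerate peft trl bitsandbytes huggingface_hub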

2. Log in to Hugging Face

In this step we will log in using an access token with read access so that we can download and load the model. Paste your token in place of “YOUR_HF_TOKEN”.

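A minimal login cell using the huggingface_hub client could look like this (the token string is a placeholder):

    from huggingface_hub import login

    # Paste your Hugging Face access token here; read access is enough for downloading,
    # write access is needed later to push the fine-tuned model.
    login(token="YOUR_HF_TOKEN")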

3. Configurations

Run the code below to define a few important variables. We are using the 3B model because we only have 16 GB of VRAM available on Google Colab, which is sufficient to load this model.

We also set num_epochs equal to 1 to reduce the training time. For optimal results you can consider a higher number, but note that this will increase the training time.

To avoid OOM (out-of-memory) errors we have used a batch size equal to 1; if you have access to bigger or multiple GPUs, you can increase it to train on more samples in parallel.

And for the dataset we are using a well-managed doctor-patient conversation dataset from the Hugging Face Hub.

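A sketch of such a configuration cell; the variable names are my own, the model id assumes the 3B instruct checkpoint, and the dataset id is a placeholder for the dataset you choose:

    # Core settings referenced in later steps (names and values are illustrative).
    MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"          # 3B fits on a 16 GB T4 when quantized
    DATASET_NAME = "your-org/doctor-patient-conversations"   # placeholder dataset id
    OUTPUT_DIR = "llama-3.2-3b-doctor-chatbot"

    NUM_EPOCHS = 1    # kept at 1 to reduce training time; raise for better quality
    BATCH_SIZE = 1    # per-device batch size of 1 avoids OOM on a 16 GB GPU
    MAX_ROWS = 1000   # number of dataset rows used for training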

4. Dataset preparation

In the code below we prepare the dataset by downloading it from Hugging Face and loading it into memory. Please note that we have only used 1,000 rows from the dataset to reduce the overall training time; you can use a higher number of rows if you want more accurate results.

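A sketch of this step, assuming the configuration variables above and a dataset with patient/doctor text columns (the column names and prompt template are my assumptions; adapt them to your dataset's schema):

    from datasets import load_dataset

    # Download the dataset from the Hugging Face Hub and keep only the first 1,000 rows.
    dataset = load_dataset(DATASET_NAME, split=f"train[:{MAX_ROWS}]")

    # Turn each row into a single training string the model can learn from.
    def format_example(row):
        return {"text": f"### Patient:\n{row['patient']}\n\n### Doctor:\n{row['doctor']}"}

    dataset = dataset.map(format_example)
    dataset = dataset.train_test_split(test_size=0.1, seed=42)  # small eval split for eval_steps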

5. Quantization Configurations

Now let's prepare our configuration for loading the model with 4-bit quantization. The main benefit of loading the model in 4-bit is that it reduces memory consumption from about 12 GB to just 4 GB, which allows us to load and train the model more efficiently.

Here are some other facts about quantization:

  1. 4-bit over 8-bit: with 4-bit we can expect roughly a 75% memory reduction, but it will also slightly affect model accuracy. Consider loading in 8-bit for more accuracy, but note that it will increase overall memory consumption.
  2. NF4 quantization: better than standard INT4 for LLMs because its quantization levels are tailored to the roughly normal distribution of model weights, preserving more precision.
  3. float16 compute: float16 provides a balance between precision and speed. It also cuts memory consumption in half, making it feasible to run larger models on GPUs with limited VRAM, but you can consider float32 if you have a larger GPU available.
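Using the bitsandbytes integration in transformers, the 4-bit NF4 configuration described above looks roughly like this:

    import torch
    from transformers import BitsAndBytesConfig

    # Load weights in 4-bit NF4 with float16 compute to cut memory from ~12 GB to ~4 GB.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # NF4 suits normally distributed LLM weights
        bnb_4bit_compute_dtype=torch.float16,  # consider float32 if you have a larger GPU
        bnb_4bit_use_double_quant=True,        # extra memory saving at negligible cost
    )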

6. Model Initialization

Now it's time for model initialization, where we load the model into memory by downloading it directly from Hugging Face. Please note that if you have not filled in this form, you will get a 'permission denied' error from Hugging Face. Apart from this, we use eager attention mode for model initialization, which offers better compatibility at the cost of some speed.

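A sketch of the loading step, assuming the quantization config from step 5 and approved access to the Llama 3.2 repository:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

    # Fails with a permission error if your account has not been granted Llama access.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=bnb_config,
        device_map="auto",
        attn_implementation="eager",  # eager attention: broader compatibility, slower
    )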

7. LoRA Adapter Configuration

Now it's time for the LoRA adapter, which is the most important configuration for fine-tuning. LoRA (Low-Rank Adaptation) is a fine-tuning technique that significantly reduces the number of trainable parameters while preserving model quality. It makes fine-tuning faster because, instead of modifying all of the model's weights, LoRA inserts small low-rank matrices into specific model layers. This allows for domain-specific adaptation.

In the following configuration:

  1. r=12: a sweet spot for a 3B model (8-16 is typical); this reduces the trainable parameters to roughly 0.5% of the model.
  2. lora_dropout=0.05: lora_dropout introduces regularization to prevent overfitting.
  3. target_modules: ensures only the most critical layers (query, key, value projections) are modified, preserving overall model knowledge.
  4. lm_head: ensures the final layer remains trainable for response generation.
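With peft, the adapter configuration described above can be sketched as follows (lora_alpha is my assumption, commonly set to about twice r):

    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    model = prepare_model_for_kbit_training(model)  # required prep for training a 4-bit model

    lora_config = LoraConfig(
        r=12,                                          # adapter rank
        lora_alpha=24,                                 # scaling factor (assumption: ~2x r)
        lora_dropout=0.05,                             # regularization against overfitting
        target_modules=["q_proj", "k_proj", "v_proj"], # attention projections only
        modules_to_save=["lm_head"],                   # keep the output head trainable
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # confirms how small the trainable fraction is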

8. Training Arguments

Now let's define the training arguments, especially the hyperparameters and settings for the fine-tuning process. This configuration is optimized for memory efficiency and stable training in a resource-constrained environment like Google Colab. I highly encourage you to change the following settings if you have a better GPU and time to invest.

Settings you can play with are:
  1. eval_steps: currently set to 50, which means that after every 50 steps the model is tested on a validation dataset to track its learning progress. You can adjust this number to monitor training as often as needed.
  2. Batch size = 1: this keeps memory usage minimal; you can increase this value to process more samples in parallel if you have larger or multiple GPUs.
  3. Gradient Accumulation (Steps = 2): This simulates larger batch sizes by accumulating gradients before updating weights which will reduce the memory pressure.
  4. Mixed Precision Training (fp16=True): This reduces memory usage and speeds up computation.
  5. Maximum Gradient Norm (max_grad_norm=0.5): This prevents exploding gradients, ensuring stable training.
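A sketch of the trainer setup with trl's SFTTrainer, assuming the dataset split and model from the earlier steps; argument names vary slightly between transformers/trl releases, and the learning rate is my assumption:

    from transformers import TrainingArguments
    from trl import SFTTrainer

    training_args = TrainingArguments(
        output_dir=OUTPUT_DIR,
        num_train_epochs=NUM_EPOCHS,
        per_device_train_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=2,   # simulate a larger effective batch size
        fp16=True,                       # mixed-precision training
        max_grad_norm=0.5,               # clip gradients for stability
        eval_strategy="steps",           # older transformers versions call this evaluation_strategy
        eval_steps=50,
        logging_steps=10,
        learning_rate=2e-4,              # common LoRA starting point (assumption)
        report_to="none",
    )

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        # depending on your trl version, also pass the tokenizer and
        # dataset_text_field="text" (or a formatting function) here
    )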

9. Testing before training

Now let's test the current model with the kind of queries patients actually ask, so that we can compare pre- and post-training AI responses for performance evaluation.

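A simple helper for this comparison; the helper name, prompt template, and sample question are my own illustrations:

    import torch

    def ask_model(question: str) -> str:
        # Format the question the same way as the training data, then generate an answer.
        prompt = f"### Patient:\n{question}\n\n### Doctor:\n"
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
        return tokenizer.decode(output[0], skip_special_tokens=True)

    test_question = "I have had a persistent headache and mild fever for three days. What should I do?"
    print(ask_model(test_question))  # baseline answer before fine-tuning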

10. Optimizations before training (optional)

These are optimization tricks for getting the most out of the Colab environment. The commands below adjust PyTorch's CUDA memory allocator to use expandable memory segments rather than fixed blocks, and force an immediate release of unused GPU memory cached by PyTorch, which improves overall performance.

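Both adjustments can be made from a Colab cell as shown below; note that the allocator setting is most effective when applied before large allocations happen:

    import gc
    import os
    import torch

    # Use expandable CUDA memory segments instead of fixed blocks
    # (ideally set before the model is loaded).
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    # Release unused GPU memory cached by PyTorch.
    gc.collect()
    torch.cuda.empty_cache()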

11. Start Training

With all configurations set up, it's now time to initiate the fine-tuning process by calling the train function on the trainer object. This starts the training, and you can monitor its progress in the console.

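The call itself is a one-liner:

    # Kick off fine-tuning; loss and eval metrics are printed as training progresses.
    trainer.train()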

12. Save trained model

After training is finished, let's save and upload the fine-tuned model to the Hugging Face Hub so that we can use the trained model later when we deploy it. For deployment steps, read this blog.

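A sketch of saving locally and pushing to the Hub; the repository id is a placeholder, and pushing requires a token with write access:

    HUB_REPO = "your-username/llama-3.2-3b-doctor-chatbot"  # placeholder repo id

    # Save the LoRA adapter and tokenizer locally, then push both to the Hugging Face Hub.
    trainer.save_model(OUTPUT_DIR)
    tokenizer.save_pretrained(OUTPUT_DIR)

    model.push_to_hub(HUB_REPO)
    tokenizer.push_to_hub(HUB_REPO)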

13. Post training test

After saving our fine-tuned AI model, let's see how it responds to the same question we asked earlier in step 9.

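Reusing the ask_model helper sketched in step 9, the comparison is a single call:

    # Same patient question as in step 9, now answered by the fine-tuned model.
    print(ask_model(test_question))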

Why This Architecture Works for Chatbots

• 4-bit quantization + LoRA enables fine-tuning and deploying 3B-parameter models on a single 16 GB GPU.

• By targeting the attention projections (q_proj, v_proj, etc.), we modify how the model processes conversational context while preserving its general knowledge.

• The formatted dataset teaches the model how a doctor responds to a given query.

• Keeping more than 95% of the original weights frozen prevents catastrophic forgetting of general conversation skills.

What are the common mistakes to avoid while fine-tuning LLMs?

  1. Over-tuning: more epochs ≠ better performance. For style adaptation, 1-3 epochs often suffice for LLMs.
  2. Wrong modules: targeting all linear layers increases VRAM usage by around 40% while providing minimal quality gain for LLMs.
  3. Quantization artifacts: if responses become nonsensical or unclear, try the adjustments sketched below:
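One possible adjustment, sketched under the same bitsandbytes setup as step 5: trade some memory for accuracy by moving from 4-bit NF4 to 8-bit loading, or switch the compute dtype to bfloat16 on GPUs that support it.

    import torch
    from transformers import BitsAndBytesConfig

    # Option A: load in 8-bit instead of 4-bit (roughly doubles memory, improves fidelity).
    bnb_config = BitsAndBytesConfig(load_in_8bit=True)

    # Option B: keep 4-bit NF4 but use bfloat16 compute (Ampere or newer GPUs).
    # bnb_config = BitsAndBytesConfig(
    #     load_in_4bit=True,
    #     bnb_4bit_quant_type="nf4",
    #     bnb_4bit_compute_dtype=torch.bfloat16,
    # )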

Conclusion

The main reason to have an AI-powered chatbot is to automate general customer queries based on your own business needs.

The quality of the AI's responses depends on the quality of your dataset and the number of epochs you train for. A well-built chatbot gives more accurate responses with minimal investment.

If you want your own AI application developed, contact us.

Now the next step is to deploy your model. To run it on low-power devices, you will need to:

  1. Convert the model to GGUF format
  2. Quantize the model
  3. Deploy it via Ollama

Note: Our deployment guide covers optimization techniques for production environments and quantization methods to reduce model size while maintaining performance.

For the full deployment guide with step-by-step code examples, read this blog.
