March 2nd, 2025

How to create a ChatGPT-like application for your business needs

Author: Gurkirt Pal


In today's competitive landscape, custom AI solutions have become a key requirement for many businesses. Learn how enterprise-grade fine-tuning can transform your customer interactions.

In this guide, you will learn how to fine-tune the LLaMA 3.2 model on your own dataset to create a customized AI assistant for your business.

We'll walk you through the entire process using a dataset containing conversations between doctors and patients, then fine-tune the AI to mimic the doctor's responses to a given query, so that we can send automated responses to patients' initial queries and capture leads through them.

What you will learn in this guide:

➤ What are LLM models and how to use them

➤ What are the prerequisites before starting

➤ How to fine-tune Llama 3.2 on your custom dataset

➤ How to prepare a dataset for fine-tuning

➤ How to run the fine-tuned model

➤ What are the common mistakes to avoid while fine-tuning LLM models

➤ How to push the fine-tuned model to Hugging Face

➤ Why fine-tuning works for custom AI chatbots

What are LLM models and why we use them

Large Language Models (LLMs) are AI systems trained on massive text datasets to understand and generate human-like responses. Meta's flagship model Llama 3.2 stands out with:

  1. Multiple parameter sizes, from lightweight 1B and 3B text models up to 11B and 90B vision models, which gives a lot of flexibility to choose the right version for your requirements
  2. Llama 3.2 is the latest model in the Llama family, with enhanced reasoning capabilities
  3. Most importantly, Llama family models come with a free commercial-use license, which means you can legally deploy your own custom chatbot built on these models without any licensing fees

What are the prerequisites before starting

1. A Hugging Face token with read and write access. You can get it from this official guide.

2. Request Llama model access from this link.

3. A Google Colab notebook session with a T4 GPU enabled; check this guide to get started.

How to fine-tune Llama 3.2 on your custom dataset

For fine-tuning, we will use a Google Colab notebook with free resources.

Please note that due to limited GPU memory, I used a few settings that will affect the overall quality of the trained model but drastically reduce memory consumption and total training time. I will point these out while explaining the code step by step, so that if you have a bigger and better GPU you can take full advantage of it.

1. Setup Environment

First, we need to install the required packages. Run the following command in the first cell of your Google Colab notebook. This will install the necessary packages for fine-tuning the Llama 3.2 model.

Why do we need these packages?
  1. transformers: the base library for model loading and handling
  2. datasets: for efficient dataset management
  3. accelerate: enables efficient GPU and multi-GPU training
  4. peft/trl: provide LoRA (Low-Rank Adaptation) and supervised fine-tuning utilities
  5. bitsandbytes: enables 4-bit quantization
  6. huggingface_hub: for model storage and sharing
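A minimal install cell, assuming the package set listed above (exact version pins are up to you):

    # Install the libraries used throughout this guide (run in the first Colab cell).
    !pip install -q transformers datasets accelerate peft trl bitsandbytes huggingface_hub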

2. Log in to Hugging Face

In this step we will log in using an access token with read access so that we can download and load the model. Paste your token in place of “YOUR_HF_TOKEN”.

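A minimal login cell using the huggingface_hub client could look like this (the token string is a placeholder):

    from huggingface_hub import login

    # Paste your Hugging Face access token here; read access is enough for downloading,
    # write access is needed later to push the fine-tuned model.
    login(token="YOUR_HF_TOKEN")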

3. Configurations

Run the code below to define a few important variables. We are using the 3B model because we only have 16 GB of VRAM available on Google Colab, which is sufficient to load this model.

We also set num_epochs equal to 1 to reduce the training time. For optimal results you can consider a higher number, but note that this will increase the training time.

To avoid OOM (out-of-memory) errors we have used a batch size equal to 1; if you have access to bigger or multiple GPUs, you can increase it to train on more samples in parallel.

And for the dataset we are using a well-managed doctor-patient conversation dataset from the Hugging Face Hub.

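A sketch of such a configuration cell; the variable names are my own, the model id assumes the 3B instruct checkpoint, and the dataset id is a placeholder for the dataset you choose:

    # Core settings referenced in later steps (names and values are illustrative).
    MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"          # 3B fits on a 16 GB T4 when quantized
    DATASET_NAME = "your-org/doctor-patient-conversations"   # placeholder dataset id
    OUTPUT_DIR = "llama-3.2-3b-doctor-chatbot"

    NUM_EPOCHS = 1    # kept at 1 to reduce training time; raise for better quality
    BATCH_SIZE = 1    # per-device batch size of 1 avoids OOM on a 16 GB GPU
    MAX_ROWS = 1000   # number of dataset rows used for training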

4. Dataset preparation

In the code below we prepare the dataset by downloading it from Hugging Face and loading it into memory. Please note that we have only used 1,000 rows from the dataset to reduce the overall training time; you can use a higher number of rows if you want more accurate results.

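A sketch of this step, assuming the configuration variables above and a dataset with patient/doctor text columns (the column names and prompt template are my assumptions; adapt them to your dataset's schema):

    from datasets import load_dataset

    # Download the dataset from the Hugging Face Hub and keep only the first 1,000 rows.
    dataset = load_dataset(DATASET_NAME, split=f"train[:{MAX_ROWS}]")

    # Turn each row into a single training string the model can learn from.
    def format_example(row):
        return {"text": f"### Patient:\n{row['patient']}\n\n### Doctor:\n{row['doctor']}"}

    dataset = dataset.map(format_example)
    dataset = dataset.train_test_split(test_size=0.1, seed=42)  # small eval split for eval_steps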

5. Quantization Configurations

Now let's prepare our configuration for loading the model with 4-bit quantization. The main benefit of loading the model in 4-bit is that it reduces memory consumption from about 12 GB to just 4 GB, which allows us to load and train the model more efficiently.

Here are some other facts about quantization:

  1. 4-bit over 8-bit: with 4-bit we can expect roughly a 75% memory reduction, but it will also slightly affect model accuracy. Consider loading in 8-bit for more accuracy, but note that it will increase overall memory consumption.
  2. NF4 quantization: better than standard INT4 for LLMs because its quantization levels are tailored to the roughly normal distribution of model weights, preserving more precision.
  3. float16 compute: float16 provides a balance between precision and speed. It also cuts memory consumption in half, making it feasible to run larger models on GPUs with limited VRAM, but you can consider float32 if you have a larger GPU available.
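Using the bitsandbytes integration in transformers, the 4-bit NF4 configuration described above looks roughly like this:

    import torch
    from transformers import BitsAndBytesConfig

    # Load weights in 4-bit NF4 with float16 compute to cut memory from ~12 GB to ~4 GB.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # NF4 suits normally distributed LLM weights
        bnb_4bit_compute_dtype=torch.float16,  # consider float32 if you have a larger GPU
        bnb_4bit_use_double_quant=True,        # extra memory saving at negligible cost
    )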

6. Model Initialization

Now it's time for model initialization, where we load the model into memory by downloading it directly from Hugging Face. Please note that if you have not filled in this form, you will get a 'permission denied' error from Hugging Face. Apart from this, we use eager attention mode for model initialization, which offers better compatibility at the cost of some speed.

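A sketch of the loading step, assuming the quantization config from step 5 and approved access to the Llama 3.2 repository:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

    # Fails with a permission error if your account has not been granted Llama access.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=bnb_config,
        device_map="auto",
        attn_implementation="eager",  # eager attention: broader compatibility, slower
    )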

7. LoRA Adapter Configuration

Now it's time for the LoRA adapter, which is the most important configuration for fine-tuning. LoRA (Low-Rank Adaptation) is a fine-tuning technique that significantly reduces the number of trainable parameters while preserving model quality. It makes fine-tuning faster because, instead of modifying all of the model's weights, LoRA inserts small low-rank matrices into specific model layers. This allows for domain-specific adaptation.

In the following configuration:

  1. r=12: a sweet spot for a 3B model (8-16 is typical); this reduces the trainable parameters to roughly 0.5% of the model.
  2. lora_dropout=0.05: lora_dropout introduces regularization to prevent overfitting.
  3. target_modules: ensures only the most critical layers (query, key, value projections) are modified, preserving overall model knowledge.
  4. lm_head: ensures the final layer remains trainable for response generation.
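With peft, the adapter configuration described above can be sketched as follows (lora_alpha is my assumption, commonly set to about twice r):

    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    model = prepare_model_for_kbit_training(model)  # required prep for training a 4-bit model

    lora_config = LoraConfig(
        r=12,                                          # adapter rank
        lora_alpha=24,                                 # scaling factor (assumption: ~2x r)
        lora_dropout=0.05,                             # regularization against overfitting
        target_modules=["q_proj", "k_proj", "v_proj"], # attention projections only
        modules_to_save=["lm_head"],                   # keep the output head trainable
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # confirms how small the trainable fraction is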

8. Training Arguments

Now let's define the training arguments, especially the hyperparameters and settings for the fine-tuning process. This configuration is optimized for memory efficiency and stable training in a resource-constrained environment like Google Colab. I highly encourage you to change the following settings if you have a better GPU and time to invest.

Settings you can play with are:
  1. eval_steps: currently set to 50, which means that after every 50 steps the model is tested on a validation dataset to track its learning progress. You can adjust this number to monitor training as often as needed.
  2. Batch size = 1: this keeps memory usage minimal; you can increase this value to process more samples in parallel if you have larger or multiple GPUs.
  3. Gradient Accumulation (Steps = 2): This simulates larger batch sizes by accumulating gradients before updating weights which will reduce the memory pressure.
  4. Mixed Precision Training (fp16=True): This reduces memory usage and speeds up computation.
  5. Maximum Gradient Norm (max_grad_norm=0.5): This prevents exploding gradients, ensuring stable training.
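A sketch of the trainer setup with trl's SFTTrainer, assuming the dataset split and model from the earlier steps; argument names vary slightly between transformers/trl releases, and the learning rate is my assumption:

    from transformers import TrainingArguments
    from trl import SFTTrainer

    training_args = TrainingArguments(
        output_dir=OUTPUT_DIR,
        num_train_epochs=NUM_EPOCHS,
        per_device_train_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=2,   # simulate a larger effective batch size
        fp16=True,                       # mixed-precision training
        max_grad_norm=0.5,               # clip gradients for stability
        eval_strategy="steps",           # older transformers versions call this evaluation_strategy
        eval_steps=50,
        logging_steps=10,
        learning_rate=2e-4,              # common LoRA starting point (assumption)
        report_to="none",
    )

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        # depending on your trl version, also pass the tokenizer and
        # dataset_text_field="text" (or a formatting function) here
    )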

9. Testing before training

Now let's test the current model with the kind of queries patients actually ask, so that we can compare pre- and post-training AI responses for performance evaluation.

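A simple helper for this comparison; the helper name, prompt template, and sample question are my own illustrations:

    import torch

    def ask_model(question: str) -> str:
        # Format the question the same way as the training data, then generate an answer.
        prompt = f"### Patient:\n{question}\n\n### Doctor:\n"
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
        return tokenizer.decode(output[0], skip_special_tokens=True)

    test_question = "I have had a persistent headache and mild fever for three days. What should I do?"
    print(ask_model(test_question))  # baseline answer before fine-tuning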

10. Optimizations before training (optional)

These are optimization tricks for getting the most out of the Colab environment. The commands below adjust PyTorch's CUDA memory allocator to use expandable memory segments rather than fixed blocks, and force an immediate release of unused GPU memory cached by PyTorch, which improves overall performance.

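Both adjustments can be made from a Colab cell as shown below; note that the allocator setting is most effective when applied before large allocations happen:

    import gc
    import os
    import torch

    # Use expandable CUDA memory segments instead of fixed blocks
    # (ideally set before the model is loaded).
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    # Release unused GPU memory cached by PyTorch.
    gc.collect()
    torch.cuda.empty_cache()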

11. Start Training

With all configurations set up, it's now time to initiate the fine-tuning process by calling the train function on the trainer object. This starts the training, and you can monitor its progress in the console.

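The call itself is a one-liner:

    # Kick off fine-tuning; loss and eval metrics are printed as training progresses.
    trainer.train()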

12. Save trained model

After training is finished, let's save and upload the fine-tuned model to the Hugging Face Hub so that we can use the trained model later when we deploy it. For deployment steps, read this blog.

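A sketch of saving locally and pushing to the Hub; the repository id is a placeholder, and pushing requires a token with write access:

    HUB_REPO = "your-username/llama-3.2-3b-doctor-chatbot"  # placeholder repo id

    # Save the LoRA adapter and tokenizer locally, then push both to the Hugging Face Hub.
    trainer.save_model(OUTPUT_DIR)
    tokenizer.save_pretrained(OUTPUT_DIR)

    model.push_to_hub(HUB_REPO)
    tokenizer.push_to_hub(HUB_REPO)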

13. Post training test

After saving our fine-tuned AI model, let's see how it responds to the same question we asked earlier in step 9.

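Reusing the ask_model helper sketched in step 9, the comparison is a single call:

    # Same patient question as in step 9, now answered by the fine-tuned model.
    print(ask_model(test_question))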

Why This Architecture Works for Chatbots

• 4-bit quantization + LoRA enables fine-tuning and deploying 3B-parameter models on a single 16 GB GPU.

• By targeting the attention projections (q_proj, v_proj, etc.), we modify how the model processes conversational context while preserving its general knowledge.

• The formatted dataset teaches the model how a doctor responds to a given query.

• Keeping more than 95% of the original weights frozen prevents catastrophic forgetting of general conversation skills.

What are the common mistakes to avoid while fine-tuning LLMs?

  1. Over-tuning: more epochs ≠ better performance. For style adaptation, 1-3 epochs often suffice for LLMs.
  2. Wrong modules: targeting all linear layers increases VRAM usage by around 40% while providing minimal quality gain for LLMs.
  3. Quantization artifacts: if responses become nonsensical or unclear, try the adjustments sketched below:
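One possible adjustment, sketched under the same bitsandbytes setup as step 5: trade some memory for accuracy by moving from 4-bit NF4 to 8-bit loading, or switch the compute dtype to bfloat16 on GPUs that support it.

    import torch
    from transformers import BitsAndBytesConfig

    # Option A: load in 8-bit instead of 4-bit (roughly doubles memory, improves fidelity).
    bnb_config = BitsAndBytesConfig(load_in_8bit=True)

    # Option B: keep 4-bit NF4 but use bfloat16 compute (Ampere or newer GPUs).
    # bnb_config = BitsAndBytesConfig(
    #     load_in_4bit=True,
    #     bnb_4bit_quant_type="nf4",
    #     bnb_4bit_compute_dtype=torch.bfloat16,
    # )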

Conclusion

The main reason to have an AI-powered chatbot is to automate general customer queries based on your own business needs.

The quality of the AI's responses depends on the quality of your dataset and the number of epochs you train for. A well-built chatbot gives more accurate responses with minimal investment.

If you want your own AI application developed, contact us.

Now the next step is to deploy your model. To run it on low-power devices, you will need to:

  1. Convert the model to GGUF format
  2. Quantize the model
  3. Deploy it via Ollama

Note: Our deployment guide covers optimization techniques for production environments and quantization methods to reduce model size while maintaining performance.

For the full deployment guide with step-by-step code examples, read this blog.
