
Customize Generative AI Models for Enterprise Applications with Llama 3.1


The newly unveiled Llama 3.1 collection of 8B, 70B, and 405B large language models (LLMs) is narrowing the gap between proprietary and open-source models. Their open nature is attracting more developers and enterprises to integrate these models into their AI applications.

These models excel at various tasks including content generation, coding, and deep reasoning, and can be used to power enterprise applications for use cases like chatbots, natural language processing, and language translation.

The Llama 3.1 405B model, thanks to the sheer size of its training data, is an excellent candidate for generating synthetic data to tune other LLMs. This is especially useful in industries like healthcare, finance, and retail where real-world data is out of reach due to compliance requirements.

Llama 3.1 405B can also be tuned with domain-specific data to serve enterprise use cases.

Enterprises see better accuracy when they customize LLMs with domain knowledge and skills, company-specific vocabulary, and other cultural nuances that reflect their organizational requirements.

Build custom generative AI models with NVIDIA AI Foundry

NVIDIA AI Foundry is a platform and service for building custom generative AI models with enterprise data and domain-specific knowledge. Just as TSMC manufactures chips designed by other companies, NVIDIA AI Foundry enables organizations to develop their own AI models.

A chip foundry provides advanced transistor technology, manufacturing processes, large chip fabs, expertise, and a rich ecosystem of third-party tools and library providers. Similarly, NVIDIA AI Foundry includes NVIDIA-created AI models like Nemotron and Edify, popular open foundation models, NVIDIA NeMo software for customizing models, and dedicated capacity on NVIDIA DGX Cloud—built and backed by NVIDIA AI experts. 

Figure 1. NVIDIA AI Foundry is a service and platform where developers can start building custom models using optimized foundation models, advanced customization software, leading AI infrastructure, and NVIDIA AI experts

The foundry outputs performance-optimized custom models packaged as NVIDIA NIM inference microservices for easy deployment on any accelerated cloud, data center, or workstation.

In this post, we’ll dive into how to create a custom LLM. You can use this as a reference to create other custom generative AI models such as multimodal language models, vision-language models, or text-to-image models.

Generate proprietary synthetic domain data with Llama 3.1

Enterprises often need to overcome a lack of domain data, or limited access to it, due to compliance and security requirements.

The Llama 3.1 405B model is ideal for synthetic data generation due to its enhanced ability to recognize complex patterns, generate high-quality data, generalize well, scale efficiently, reduce bias, and preserve privacy. 

The Nemotron-4 340B Reward model judges the data generated by the Llama 3.1 405B model and scores the data across various categories, filtering out lower-scored data and providing high-quality datasets that align with human preferences (Figure 2).

Figure 2. Synthetic data generation pipeline powered by Llama 3.1 405B Instruct and Nemotron-4 340B Reward models

Achieving best-in-class performance with an overall score of 92.0, the reward model tops the RewardBench leaderboard. It excels in the Chat-Hard subset, which tests the model’s ability to handle trick questions and subtle differences in instruction responses.

Additionally, the models provide a permissive license for enterprises to freely use the generated datasets in commercial applications.
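To make the generation step concrete, here is a minimal sketch of driving Llama 3.1 405B Instruct through an OpenAI-compatible endpoint such as the NVIDIA API catalog; the endpoint URL, model identifier, and prompt are illustrative assumptions, and scoring the output with the Nemotron-4 340B Reward model follows the same request pattern.

# Minimal sketch: generate synthetic Q&A pairs with Llama 3.1 405B Instruct through
# an OpenAI-compatible endpoint (the endpoint URL and model ID are assumptions).
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",   # assumed NVIDIA API catalog endpoint
    api_key=os.environ["NVIDIA_API_KEY"],
)

prompt = (
    "Generate five question-and-answer pairs about US highway regulations. "
    'Return each pair as a JSON object with "question" and "answer" fields.'
)

completion = client.chat.completions.create(
    model="meta/llama-3.1-405b-instruct",              # assumed model identifier
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,
    max_tokens=1024,
)

print(completion.choices[0].message.content)           # raw synthetic Q&A pairs to curate next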

Once the dataset is ready, it can be used to fine-tune other foundation models with the NeMo Aligner library, available on GitHub.

Curate, customize, and evaluate models with NVIDIA NeMo

NVIDIA NeMo is an end-to-end platform for developing custom generative AI, anywhere. It includes tools for data curation, model pretraining and customization, retrieval-augmented generation (RAG), and guardrails, offering enterprises an easy, cost-effective, and fast way to adopt generative AI.

In this post, we’ll show you how to create a custom Llama 3.1 model with AI Foundry. Key steps include domain-specific data preparation, LLM customization, and evaluation. 

Figure 3. Custom AI workflow with NVIDIA NeMo tools and microservices helps build custom generative AI models that can be delivered with NVIDIA NIM

Domain-specific data preparation 

Once the synthetic data is generated, you can use NeMo Curator iteratively to curate high-quality data and improve the custom model’s performance. For instance, if a company wants to build a model that excels at answering law questions, it can use the synthetic data generation pipeline with the Llama 3.1 405B model to create realistic question-and-answer pairs. 

An example question-answer pair is shown below:

{
  "question": "Which government department is responsible for setting design and signage standards for federally funded highways in the US?",
  "answer": "The U.S. Department of Transportation sets uniform design and signage standards for federally funded highways, which most states adopt for all roads."
}

Check out this NeMo Curator notebook for detailed instructions on how to generate synthetic data and use filtering techniques like quality filtering, semantic deduplication, and HTML tag stripping.

NeMo Curator can seamlessly scale across thousands of compute cores and uses highly optimized CUDA kernels to effortlessly perform a variety of data acquisition, preprocessing, and cleaning tasks, enabling enterprise developers to save time and create high-quality data faster. 

NeMo Curator offers various out-of-the-box functionalities such as text cleaning, language identification, quality filtering, privacy filtering, domain and toxicity classification, deduplication, and streamlined scalability. These features enable developers to create high-quality data for customization.
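As a rough illustration, a curation pass over the synthetic Q&A records could look like the sketch below; the module and field names are assumptions based on NeMo Curator’s filter and modifier primitives, so treat the notebook above as the authoritative reference.

# Minimal NeMo Curator sketch (module and field names are assumptions; see the
# NeMo Curator notebook for the authoritative workflow).
from nemo_curator import Modify, ScoreFilter, Sequential
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import WordCountFilter
from nemo_curator.modifiers import UnicodeReformatter

# Load the synthetic Q&A records (one JSON document per line).
dataset = DocumentDataset.read_json("synthetic_legal_qa/*.jsonl", add_filename=True)

pipeline = Sequential([
    Modify(UnicodeReformatter(), text_field="answer"),                # normalize Unicode artifacts
    ScoreFilter(WordCountFilter(min_words=10), text_field="answer"),  # drop near-empty answers
])

curated = pipeline(dataset)
curated.to_json("curated_legal_qa/", write_to_filename=True)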

Customization and evaluation 

Once the data is prepared, developers can easily customize and evaluate their models to get the best models for their specific needs using NeMo. NeMo leverages advanced parallelism techniques to maximize NVIDIA GPU performance, helping businesses develop solutions quickly. It effectively manages GPU resources and memory across multiple nodes, improving performance. By splitting the model and data, NeMo enables smooth multi-node and multi-GPU training, cutting down training time and boosting productivity.

The table below shows the pretraining performance of the Llama 3.1 405B model with NeMo on NVIDIA H100 GPUs. To achieve this performance, we used these parameters: micro-batch size of 1, global batch size of 252, tensor parallelism of 8, pipeline parallelism of 9, virtual pipeline parallelism of 7, and context parallelism of 2.

For pretraining, using FP8 can achieve about a 56.5% increase in throughput (tokens per second) compared to BF16.

Mode | Precision | #GPUs | Seq-len | Tokens/sec/GPU | TFLOP/sec/GPU | Time to train 10T tokens on 1k GPUs (days)
Pretraining | BF16 | 576 | 8192 | 193 | 512 | 600
Pretraining | FP8 | 576 | 8192 | 302 | 802 | 369
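As a quick sanity check, the parallelism layout and the time-to-train column follow directly from these figures:

# Sanity check of the pretraining configuration and the time-to-train column.
tp, pp, cp, gpus = 8, 9, 2, 576
dp = gpus // (tp * pp * cp)                     # 4 data-parallel replicas; the global batch
                                                # size must divide evenly across them

tokens_per_sec_per_gpu = 193                    # BF16 row
days = 10e12 / (tokens_per_sec_per_gpu * 1000) / 86400
print(dp, round(days))                          # 4, ~600 days on 1k GPUs, matching the table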

Similarly, for LoRA fine-tuning, FP8 can achieve about a 55% increase in performance compared to BF16. 

To achieve this performance, we used these parameters: micro-batch size of 1, global batch size of 24, tensor parallelism of 4, pipeline parallelism of 6, virtual pipeline parallelism of 7, and context parallelism of 1. To reproduce these results, refer to our guidelines for pretraining or fine-tuning Llama 3.1 in the NeMo documentation.

Mode | Precision | #GPUs | Seq-len | Tokens/sec/GPU | TFLOP/sec/GPU | Time to finish training in minutes (10M tokens)
LoRA | BF16 | 24 | 2048 | 376 | 621 | 18.5
LoRA | FP8 | 24 | 2048 | 583 | 962 | 11.9

NeMo supports several parameter-efficient fine-tuning techniques, such as p-tuning, low-rank adaptation (LoRA), and its quantized version (QLoRA). These techniques are useful for creating custom models without requiring a lot of computing power.

In addition to these parameter-efficient fine-tuning techniques, NeMo also supports supervised fine-tuning (SFT) and alignment techniques such as reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), and NeMo SteerLM. These techniques enable steering the model responses and aligning them with human preferences, making the LLMs ready to integrate into customer-facing applications.

To simplify generative AI customization, the NeMo team has announced an early access program for the NVIDIA NeMo Customizer microservice. This high-performance, scalable service simplifies fine-tuning and aligning LLMs for domain-specific use cases. Its familiar microservices and API architecture helps enterprises bring solutions to market faster.

Enterprises using LLMs should regularly test them on old and new tasks. Customizing these models can cause them to forget previous knowledge, a problem known as catastrophic forgetting. To keep the models working well and improve user experience, companies must keep optimizing the models, retest them on basic skills and alignment with company goals, and check for any loss of previous abilities. 

To streamline LLM evaluation, NVIDIA offers early access to the NeMo Evaluator microservice which supports: 

  1. Automated evaluation on a curated set of academic benchmarks. 
  2. Automated evaluations using user-provided evaluation datasets.
  3. Evaluation using LLM-as-a-judge to perform a holistic evaluation of model responses, which is relevant for generative tasks where the ground truth could be undefined. 

To get started, apply for NeMo Evaluator early access.  In the meantime, you can use the NeMo framework to perform evaluations on metrics like Exact Match, F1, and Recall-Oriented Understudy for Gisting Evaluation (ROUGE).  
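As a simple illustration of those metrics, the snippet below scores a model answer against a reference using exact match, token-level F1, and ROUGE-L (via the rouge-score package); this mirrors common QA-evaluation conventions rather than the NeMo Evaluator implementation.

# Score a model answer against a reference with Exact Match, token-level F1, and ROUGE-L.
# This follows common QA-evaluation conventions, not the NeMo Evaluator internals.
from collections import Counter
from rouge_score import rouge_scorer

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "The U.S. Department of Transportation sets design and signage standards for federally funded highways."
prediction = "The Department of Transportation sets signage standards for federally funded highways."

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print(exact_match(prediction, reference))                          # 0.0, strings differ
print(round(token_f1(prediction, reference), 2))                   # token-overlap F1
print(round(scorer.score(reference, prediction)["rougeL"].fmeasure, 2))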

To customize the model for the legal domain, developers first curate the data. Then, they use NeMo to apply the LoRA technique. This enhances the Llama 3.1 8B model’s ability to respond more effectively to legal questions. Check out this notebook for detailed instructions on how to fine-tune the Llama 3.1 models with LoRA.
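A condensed sketch of that flow with the NeMo 2.0 recipe API and NeMo-Run is shown below; the recipe name, data module, and paths are assumptions for illustration, and the linked notebook remains the authoritative walkthrough.

# Illustrative LoRA fine-tuning sketch using the NeMo 2.0 recipe API and NeMo-Run.
# The recipe name, data module, and paths are assumptions; follow the linked notebook.
import nemo_run as run
from nemo.collections import llm

recipe = llm.llama31_8b.finetune_recipe(
    name="llama31_8b_legal_lora",
    dir="/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
    peft_scheme="lora",          # parameter-efficient fine-tuning with LoRA adapters
)

# Point the recipe at the curated legal Q&A dataset (hypothetical data module configuration).
recipe.data = run.Config(
    llm.FineTuningDataModule,
    dataset_root="curated_legal_qa/",
    seq_length=2048,
    micro_batch_size=1,
    global_batch_size=8,
)

run.run(recipe, executor=run.LocalExecutor())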

Connecting to enterprise data for insights

NVIDIA NeMo Retriever is a collection of generative AI microservices that enables organizations to seamlessly connect custom models to diverse business data and deliver highly accurate responses. NeMo Retriever provides world-class information retrieval with high-accuracy retrieval pipelines and maximum data privacy, enabling organizations to make better use of their data and generate business insights in real time.

NeMo Retriever improves generative AI applications with enterprise-grade RAG capabilities, which can be connected to business data wherever it resides. NeMo Retriever provides advanced open and commercial retrieval pipelines that deliver 30% fewer incorrect answers on enterprise text QA evaluations. Learn how to use NeMo Retriever for building a RAG-powered app.
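To see how retrieval slots into an application, here is a minimal RAG sketch using NVIDIA-hosted embedding and chat endpoints through LangChain; the model identifiers and in-memory FAISS index are illustrative assumptions rather than a full NeMo Retriever pipeline.

# Minimal RAG sketch with NVIDIA-hosted embedding and chat endpoints via LangChain.
# Model identifiers and the toy document set are assumptions; requires NVIDIA_API_KEY.
from langchain_community.vectorstores import FAISS
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings

documents = [
    "The U.S. Department of Transportation sets design and signage standards for federally funded highways.",
    "Most states adopt the federal standards for all public roads.",
]

# Embed the documents and build an in-memory index for similarity search.
embeddings = NVIDIAEmbeddings(model="nvidia/nv-embedqa-e5-v5")    # assumed retrieval embedding model
store = FAISS.from_texts(documents, embeddings)

question = "Who sets signage standards for federally funded highways?"
context = "\n".join(doc.page_content for doc in store.similarity_search(question, k=2))

llm = ChatNVIDIA(model="meta/llama-3.1-8b-instruct")              # assumed chat model identifier
answer = llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer.content)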

Safeguarding LLM responses

NVIDIA NeMo Guardrails enables developers to add programmable guardrails to LLM-based conversational applications, ensuring trustworthiness, safety, security, and controlled dialog while protecting against common LLM vulnerabilities.

NeMo Guardrails is an extensible toolkit designed to integrate with third-party and community guardrails and safety models, such as Meta’s latest Llama Guard. It can also be used with popular LLM application development frameworks like LangChain and LlamaIndex. Learn how to get started with NeMo Guardrails.
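As a starting point, a minimal rails setup might look like the sketch below; the Colang flow and the model configuration (here pointing at a Llama 3.1 endpoint) are illustrative assumptions.

# Minimal NeMo Guardrails sketch: define a simple off-topic rail and wrap LLM calls with it.
# The model engine/name and Colang content are illustrative assumptions.
from nemoguardrails import LLMRails, RailsConfig

yaml_content = """
models:
  - type: main
    engine: nvidia_ai_endpoints
    model: meta/llama-3.1-8b-instruct
"""

colang_content = """
define user ask off topic
  "What do you think about politics?"

define bot refuse off topic
  "I can only help with questions about our legal knowledge base."

define flow off topic
  user ask off topic
  bot refuse off topic
"""

config = RailsConfig.from_content(yaml_content=yaml_content, colang_content=colang_content)
rails = LLMRails(config)

response = rails.generate(messages=[{"role": "user", "content": "What do you think about politics?"}])
print(response["content"])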

High-performance inference with NVIDIA NIM

The custom models from AI Foundry can be packaged as NVIDIA NIM inference microservices, part of NVIDIA AI Enterprise, for secure, reliable deployment of high-performance inference across clouds, data centers, and workstations. Supporting a wide range of AI models, including open foundation models, NIM ensures seamless, scalable AI inference using industry-standard APIs.

Figure 4. Llama 3.1 8B NIM achieves 2.5X throughput

Use NIM for local deployment with a single command, or autoscale on Kubernetes on NVIDIA accelerated infrastructure, anywhere. Get started with a simple guide to NIM deployment. NIM also supports deployment of models customized with LoRA.
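Once a NIM is up, it exposes an OpenAI-compatible API, so querying a locally deployed Llama 3.1 8B NIM can be as simple as the sketch below (assuming the container is serving on the default port 8000).

# Query a locally deployed Llama 3.1 8B NIM through its OpenAI-compatible API.
# Assumes the NIM container is already running and listening on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

completion = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Which US department sets signage standards for federally funded highways?"}],
    max_tokens=128,
)
print(completion.choices[0].message.content)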

Start building your custom models

Depending on where you are in your AI journey, there are different ways to get started.

  • To build a custom Llama NIM for your enterprise, check out the Llama 3.1 NIM customization notebook.
  • Experience the new Llama 3.1 NIMs and other popular foundation models at ai.nvidia.com. You can access the model endpoints directly or download the NIMs and run them locally.
  • Get help accessing and setting up infrastructure in the cloud to run a Llama 3.1 NIM or other model for evaluation and prototyping.
  • Learn more about NVIDIA AI Foundry here.