Conversation

@akaashrp akaashrp commented Jan 21, 2026

  • Vision support will be addressed in a future commit
  • Instruct variant (block scale quantization) is buggy and will require additional changes for CUDA deployment

@gemini-code-assist

Summary of Changes

Hello @akaashrp, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for the Ministral 3 model within the MLC LLM framework. This includes adding the model architecture, a dedicated conversation template, parameter mapping from HuggingFace, and several quantization methods to optimize performance and reduce memory footprint. The changes enable users to run Ministral 3 models within the MLC LLM ecosystem.

Highlights

  • Model Support: Adds support for the Ministral 3 model architecture within the MLC LLM framework.
  • Conversation Template: Introduces a new conversation template specifically designed for Ministral 3 models, including system prompts and role templates.
  • Parameter Mapping: Implements parameter mapping from HuggingFace PyTorch models to MLC LLM, facilitating model loading and conversion.
  • Quantization Support: Adds quantization support for Ministral 3 models, including group quantization, FasterTransformer quantization, AWQ quantization, no quantization, and block-scale quantization.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the Ministral 3 model, including its conversation template, model configuration, parameter loader, and quantization methods. It also updates various model presets and quantization configurations to accommodate the new model and enhance existing FP8 block-scale quantization with static activation scaling. The changes are well-structured and integrate seamlessly with the existing framework.

Comment on lines +396 to +397
if isinstance(shard, tp.ShardSingleDim) and shard.segs is not None:
shard.segs = [x // weight_block_size[shard.dim] for x in shard.segs]

high

The assignment shard.segs = [x // weight_block_size[shard.dim] for x in shard.segs] mutates shard in place. If shard is a shared object, the mutation could have unintended side effects on other parts of the code that use the same shard instance. It would be safer to create a copy of shard before modifying its segs attribute.
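A minimal sketch of that fix, assuming tp.ShardSingleDim is a dataclass (if it is not, copy.copy of the shard would serve the same purpose):

import dataclasses

if isinstance(shard, tp.ShardSingleDim) and shard.segs is not None:
    # Work on a rescaled copy instead of mutating the (possibly shared)
    # shard instance in place.
    shard = dataclasses.replace(
        shard, segs=[x // weight_block_size[shard.dim] for x in shard.segs]
    )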

Comment on lines +141 to +150
if (
result.ndim == 1
and result.size > 1
and len(expected_weight_scale_shape) >= 2
and expected_weight_scale_shape[0] % result.size == 0
):
rows_per_segment = expected_weight_scale_shape[0] // result.size
tiled = np.repeat(result, rows_per_segment)
tiled = tiled.reshape(expected_weight_scale_shape[0], 1)
return np.broadcast_to(tiled, expected_weight_scale_shape).astype(dtype)

medium

In _weight_scale_transform, the logic for handling various result.shape scenarios is quite complex, especially the if result.ndim == 1 and result.size > 1 block. This might indicate that the _transform function or the input arrays could sometimes produce unexpected shapes. It would be beneficial to add comments explaining the rationale behind each shape-handling branch, or to simplify the _transform function if possible to ensure more predictable outputs.
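For example, the branch could carry comments that spell out the assumed layout (a sketch; the per-row-segment interpretation is inferred from the repeat/broadcast pattern rather than confirmed by the PR):

if (
    result.ndim == 1
    and result.size > 1
    and len(expected_weight_scale_shape) >= 2
    and expected_weight_scale_shape[0] % result.size == 0
):
    # Assumed layout: the checkpoint stores one scale per contiguous
    # segment of block rows rather than one per (row, column) block.
    # Repeat each scale down its segment, then broadcast across columns.
    rows_per_segment = expected_weight_scale_shape[0] // result.size
    tiled = np.repeat(result, rows_per_segment)
    tiled = tiled.reshape(expected_weight_scale_shape[0], 1)
    return np.broadcast_to(tiled, expected_weight_scale_shape).astype(dtype)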

Comment on lines +167 to +185
def _activation_scale_transform(*arrays, dtype: str, _transform=transform):
result = _transform(*arrays, dtype=dtype)
result = np.asarray(result, dtype=dtype)
if result.shape == expected_shape:
return result
if result.shape == ():
# HF checkpoint stores a single scale; broadcast across the expected dimension.
return np.full(expected_shape, result.item(), dtype=dtype)
if result.shape == (1,) and expected_shape != (1,):
return np.broadcast_to(result, expected_shape).astype(dtype)
if (
result.ndim == 1
and result.size > 1
and len(expected_shape) >= 1
and expected_shape[0] % result.size == 0
):
rows_per_segment = expected_shape[0] // result.size
tiled = np.repeat(result, rows_per_segment)
return tiled.reshape(expected_shape).astype(dtype)

medium

Similar to _weight_scale_transform, the _activation_scale_transform function also has complex logic for handling different result.shape scenarios. Specifically, the broadcasting logic for result.shape == () and result.shape == (1,) could be made more explicit with comments, or potentially refactored if there's a common utility for such broadcasting operations.
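If the two transforms were refactored around a shared helper, it could look like the following sketch (_broadcast_scale is hypothetical, not part of the PR; it mirrors the branches shown above):

import numpy as np

def _broadcast_scale(result: np.ndarray, expected_shape: tuple, dtype: str) -> np.ndarray:
    """Broadcast a scalar or 1-D per-segment scale array to expected_shape."""
    result = np.asarray(result, dtype=dtype)
    if result.shape == expected_shape:
        return result
    if result.size == 1:
        # A single stored scale covers the whole tensor.
        return np.full(expected_shape, result.reshape(()).item(), dtype=dtype)
    if result.ndim == 1 and expected_shape[0] % result.size == 0:
        # One scale per segment of rows: repeat each scale down its segment,
        # then broadcast across any remaining dimensions.
        tiled = np.repeat(result, expected_shape[0] // result.size)
        tiled = tiled.reshape((expected_shape[0],) + (1,) * (len(expected_shape) - 1))
        return np.broadcast_to(tiled, expected_shape).astype(dtype)
    raise ValueError(f"cannot broadcast scale {result.shape} to {expected_shape}")

Each transform would then reduce to result = _broadcast_scale(_transform(*arrays, dtype=dtype), expected_shape, dtype).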

Comment on lines +93 to +101
else:
self.weight_block_size = [128, 128]
logger.info(
"Setting default weight_block_size=%s since quantization_config does not provide "
"FP8 block-scale details required by MLC (activation_scheme=%s, quant_method=%s)",
self.weight_block_size,
activation_scheme,
quant_method,
)

medium

The else branch sets self.weight_block_size = [128, 128] and logs an info message. While this provides a default, it might be better to define the default weight_block_size in the Ministral3Config dataclass itself, or to emit a warning (rather than an info log) when the quantization_config is malformed, instead of quietly falling back at load time. This keeps the configuration explicit.
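A sketch of that alternative, assuming Ministral3Config is a standard dataclass (only the relevant field is shown):

from dataclasses import dataclass, field
from typing import List

@dataclass
class Ministral3Config:  # abbreviated: other fields omitted
    # Explicit default, rather than a fallback assigned at config-load time.
    weight_block_size: List[int] = field(default_factory=lambda: [128, 128])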

Comment on lines +133 to +137
raise ValueError(
"Unable to determine the maximum sequence length, because none of "
"`context_window_size`, `max_position_embeddings` or "
"`max_sequence_length` is provided in `config.json`."
)

medium

The ValueError here is quite generic. It would be more helpful to the user if the error message explicitly listed which of context_window_size, max_position_embeddings, or max_sequence_length were checked and found missing, and perhaps suggested where to define them.
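For instance, a sketch of a more actionable message (wording and the suggested field value are illustrative only):

raise ValueError(
    "Unable to determine the maximum sequence length: none of "
    "`context_window_size`, `max_position_embeddings`, or "
    "`max_sequence_length` is set in `config.json`. Set one of these "
    "fields explicitly, for example \"max_position_embeddings\": 32768."
)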

Comment on lines +153 to +158
logger.info(
"%s defaults to %d",
bold("prefill_chunk_size"),
min(*prefill_chunk_size_candidates, 8192),
)
self.prefill_chunk_size = min(*prefill_chunk_size_candidates, 8192)

medium

The min(*prefill_chunk_size_candidates, 8192) logic for prefill_chunk_size is a bit implicit. It would be clearer to explicitly state the order of preference or the conditions under which each candidate is chosen. For example, if sliding_window_size is preferred over context_window_size when both are present, that should be clear.
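One way to state the preference order explicitly (a sketch; the candidate field names are assumptions based on the surrounding config logic):

# Prefer the tightest applicable bound: a positive sliding window caps the
# useful chunk size, otherwise the context window does; never exceed the
# 8192 default ceiling.
candidates = [
    c
    for c in (self.sliding_window_size, self.context_window_size)
    if c is not None and c > 0
]
self.prefill_chunk_size = min(candidates + [8192])
logger.info(
    "%s defaults to %d", bold("prefill_chunk_size"), self.prefill_chunk_size
)

Computing the minimum once also removes the duplicated min(...) call in the original snippet.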

"rope_theta": 1000000.0,
"rope_type": "yarn",
"type": "yarn",
},

medium

The sliding_window parameter is set to None. If this means the feature is disabled or not applicable, it might be more explicit to use -1 or 0 (if 0 implies disabled) for consistency with other int parameters that represent sizes, or add a comment explaining why None is used here.

Comment on lines +415 to +417
shape_supported_by_cutlass = (
self.weight.shape[0] % 128 == 0 and self.weight.shape[1] % 128 == 0
)

medium

The shape_supported_by_cutlass check only verifies self.weight.shape[0] % 128 == 0 and self.weight.shape[1] % 128 == 0. This condition may be too restrictive, or may not fully capture the requirements for Cutlass support, especially given the "Todo: check shape supported by cutlass for Hopper" comment in BlockScaleQuantizeLinear. It is important to ensure this condition is accurate for all target hardware, or to provide a more robust fallback.
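A hedged sketch of a more defensive variant (the exact Cutlass/Hopper shape requirements are not verified here; the 128x128 condition is kept as a placeholder and the warning is illustrative):

# Route only shapes that are exact multiples of the 128x128 block to the
# Cutlass kernel; warn and take the generic path otherwise.
shape_supported_by_cutlass = (
    self.weight.shape[0] % 128 == 0 and self.weight.shape[1] % 128 == 0
)
if not shape_supported_by_cutlass:
    logger.warning(
        "Weight shape %s is not a multiple of 128x128; using the "
        "non-Cutlass fallback path.",
        tuple(self.weight.shape),
    )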

Copilot AI left a comment

Pull request overview

This pull request adds support for the Ministral 3 model architecture to MLC LLM, including FP8 block-scale quantization with static activation support. The PR also adds several other model presets that appear to be part of a broader batch of model additions.

Changes:

  • Adds Ministral 3 model implementation with support for YARN RoPE and sliding window attention
  • Introduces FP8 block-scale quantization with static activation scales for improved performance
  • Adds multiple model presets including Ministral 3, Gemma variants, Qwen3 variants, SmolLM2 variants, and uncomments previously disabled presets

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
python/mlc_llm/quantization/quantization.py Adds new fp8_e4m3fn_bf16_block_scale_static_activation quantization scheme
python/mlc_llm/quantization/block_scale_quantization.py Implements BlockScaleQuantizeLinearStaticActivation class with static activation quantization support
python/mlc_llm/model/model_preset.py Adds Ministral 3 preset and multiple other model presets (Gemma, Qwen3, SmolLM2 variants)
python/mlc_llm/model/model.py Registers the ministral3 model with appropriate loaders and quantization methods
python/mlc_llm/model/ministral3/ministral3_quantization.py Provides quantization functions for Ministral 3 (group, ft, awq, no_quant, block_scale)
python/mlc_llm/model/ministral3/ministral3_model.py Implements the Ministral 3 model architecture including attention, MLP, and decoder layers
python/mlc_llm/model/ministral3/ministral3_loader.py Handles parameter mapping from HuggingFace format including FP8 quantized weights
python/mlc_llm/interface/gen_config.py Adds ministral3 to the list of supported model types
python/mlc_llm/conversation_template/ministral3.py Defines the conversation template for Ministral 3 including system prompts
python/mlc_llm/conversation_template/__init__.py Imports the ministral3 conversation template
python/mlc_llm/model/ministral3/__init__.py Empty __init__.py for the ministral3 module

Comment on lines +12 to +17
from mlc_llm.quantization import Quantization

from .ministral3_model import Ministral3Config, Mistral3ForConditionalGeneration
from .ministral3_quantization import BlockScaleQuantize


Copilot AI Jan 21, 2026

The import statement imports BlockScaleQuantize from the wrong module. It should be imported from mlc_llm.quantization instead of from .ministral3_quantization. The ministral3_quantization module doesn't export BlockScaleQuantize; it only imports it from mlc_llm.quantization to use in its functions.

Suggested change
- from mlc_llm.quantization import Quantization
- from .ministral3_model import Ministral3Config, Mistral3ForConditionalGeneration
- from .ministral3_quantization import BlockScaleQuantize
+ from mlc_llm.quantization import Quantization, BlockScaleQuantize
+ from .ministral3_model import Ministral3Config, Mistral3ForConditionalGeneration

class Ministral3Attention(nn.Module): # pylint: disable=too-many-instance-attributes
"""Same as LlamaAttention, but with sliding window attention using a rolling buffer cache."""

def __init__(self, config: Ministral3Config):
Copilot AI Jan 21, 2026

The __init__ method is missing a call to super().__init__(). This should be added as the first line after the method signature to properly initialize the parent nn.Module class.

Suggested change
- def __init__(self, config: Ministral3Config):
+ def __init__(self, config: Ministral3Config):
+     super().__init__()

class Ministral3DecoderLayer(nn.Module):
"""Exact same as LlamaDecoderLayer."""

def __init__(self, config: Ministral3Config):
Copilot AI Jan 21, 2026

The __init__ method is missing a call to super().__init__(). This should be added as the first line after the method signature to properly initialize the parent nn.Module class.

Suggested change
- def __init__(self, config: Ministral3Config):
+ def __init__(self, config: Ministral3Config):
+     super().__init__()

class Ministral3Model(nn.Module):
"""Exact same as LlamaModel."""

def __init__(self, config: Ministral3Config):
Copilot AI Jan 21, 2026

The __init__ method is missing a call to super().__init__(). This should be added as the first line after the method signature to properly initialize the parent nn.Module class.

Suggested change
- def __init__(self, config: Ministral3Config):
+ def __init__(self, config: Ministral3Config):
+     super().__init__()

class Mistral3ForConditionalGeneration(nn.Module): # pylint: disable=too-many-instance-attributes
def __init__(self, config: Ministral3Config):
Copilot AI Jan 21, 2026

The __init__ method is missing a call to super().__init__(). This should be added as the first line after the method signature to properly initialize the parent nn.Module class.

Suggested change
- def __init__(self, config: Ministral3Config):
+ def __init__(self, config: Ministral3Config):
+     super().__init__()
