Conversation

@akaashrp akaashrp commented Jan 21, 2026

  • Vision support will be addressed in a future commit
  • Instruct variant (block scale quantization) is buggy and will require additional changes for CUDA deployment

@gemini-code-assist

Summary of Changes

Hello @akaashrp, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for the Ministral 3 model within the MLC LLM framework. This includes adding the model architecture, a dedicated conversation template, parameter mapping from HuggingFace, and several quantization methods to optimize performance and reduce memory footprint. The changes enable users to run Ministral 3 models within the MLC LLM ecosystem.

Highlights

  • Model Support: Adds support for the Ministral 3 model architecture within the MLC LLM framework.
  • Conversation Template: Introduces a new conversation template specifically designed for Ministral 3 models, including system prompts and role templates.
  • Parameter Mapping: Implements parameter mapping from HuggingFace PyTorch models to MLC LLM, facilitating model loading and conversion.
  • Quantization Support: Adds quantization support for Ministral 3 models, including group quantization, FasterTransformer quantization, AWQ quantization, no quantization, and block-scale quantization.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the Ministral 3 model, including its conversation template, model configuration, parameter loader, and quantization methods. It also updates various model presets and quantization configurations to accommodate the new model and enhance existing FP8 block-scale quantization with static activation scaling. The changes are well-structured and integrate seamlessly with the existing framework.

Comment on lines +396 to +397
if isinstance(shard, tp.ShardSingleDim) and shard.segs is not None:
shard.segs = [x // weight_block_size[shard.dim] for x in shard.segs]

high

The assignment shard.segs = [x // weight_block_size[shard.dim] for x in shard.segs] mutates shard in place. If shard is a shared object, the mutation could have unintended side effects on other parts of the code that use the same shard instance. It would be safer to create a copy of shard before modifying its segs attribute.
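A minimal sketch of that fix, assuming tp.ShardSingleDim is a dataclass (if it is not, copy.copy of the shard would serve the same purpose):

import dataclasses

if isinstance(shard, tp.ShardSingleDim) and shard.segs is not None:
    # Work on a rescaled copy instead of mutating the (possibly shared)
    # shard instance in place.
    shard = dataclasses.replace(
        shard, segs=[x // weight_block_size[shard.dim] for x in shard.segs]
    )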

Comment on lines +141 to +150
if (
result.ndim == 1
and result.size > 1
and len(expected_weight_scale_shape) >= 2
and expected_weight_scale_shape[0] % result.size == 0
):
rows_per_segment = expected_weight_scale_shape[0] // result.size
tiled = np.repeat(result, rows_per_segment)
tiled = tiled.reshape(expected_weight_scale_shape[0], 1)
return np.broadcast_to(tiled, expected_weight_scale_shape).astype(dtype)

medium

In _weight_scale_transform, the logic for handling various result.shape scenarios is quite complex, especially the if result.ndim == 1 and result.size > 1 block. This might indicate that the _transform function or the input arrays could sometimes produce unexpected shapes. It would be beneficial to add comments explaining the rationale behind each shape-handling branch, or to simplify the _transform function if possible to ensure more predictable outputs.
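For example, the branch could carry comments that spell out the assumed layout (a sketch; the per-row-segment interpretation is inferred from the repeat/broadcast pattern rather than confirmed by the PR):

if (
    result.ndim == 1
    and result.size > 1
    and len(expected_weight_scale_shape) >= 2
    and expected_weight_scale_shape[0] % result.size == 0
):
    # Assumed layout: the checkpoint stores one scale per contiguous
    # segment of block rows rather than one per (row, column) block.
    # Repeat each scale down its segment, then broadcast across columns.
    rows_per_segment = expected_weight_scale_shape[0] // result.size
    tiled = np.repeat(result, rows_per_segment)
    tiled = tiled.reshape(expected_weight_scale_shape[0], 1)
    return np.broadcast_to(tiled, expected_weight_scale_shape).astype(dtype)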

Comment on lines +167 to +185
def _activation_scale_transform(*arrays, dtype: str, _transform=transform):
result = _transform(*arrays, dtype=dtype)
result = np.asarray(result, dtype=dtype)
if result.shape == expected_shape:
return result
if result.shape == ():
# HF checkpoint stores a single scale; broadcast across the expected dimension.
return np.full(expected_shape, result.item(), dtype=dtype)
if result.shape == (1,) and expected_shape != (1,):
return np.broadcast_to(result, expected_shape).astype(dtype)
if (
result.ndim == 1
and result.size > 1
and len(expected_shape) >= 1
and expected_shape[0] % result.size == 0
):
rows_per_segment = expected_shape[0] // result.size
tiled = np.repeat(result, rows_per_segment)
return tiled.reshape(expected_shape).astype(dtype)

medium

Similar to _weight_scale_transform, the _activation_scale_transform function also has complex logic for handling different result.shape scenarios. Specifically, the broadcasting logic for result.shape == () and result.shape == (1,) could be made more explicit with comments, or potentially refactored if there's a common utility for such broadcasting operations.
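If the two transforms were refactored around a shared helper, it could look like the following sketch (_broadcast_scale is hypothetical, not part of the PR; it mirrors the branches shown above):

import numpy as np

def _broadcast_scale(result: np.ndarray, expected_shape: tuple, dtype: str) -> np.ndarray:
    """Broadcast a scalar or 1-D per-segment scale array to expected_shape."""
    result = np.asarray(result, dtype=dtype)
    if result.shape == expected_shape:
        return result
    if result.size == 1:
        # A single stored scale covers the whole tensor.
        return np.full(expected_shape, result.reshape(()).item(), dtype=dtype)
    if result.ndim == 1 and expected_shape[0] % result.size == 0:
        # One scale per segment of rows: repeat each scale down its segment,
        # then broadcast across any remaining dimensions.
        tiled = np.repeat(result, expected_shape[0] // result.size)
        tiled = tiled.reshape((expected_shape[0],) + (1,) * (len(expected_shape) - 1))
        return np.broadcast_to(tiled, expected_shape).astype(dtype)
    raise ValueError(f"cannot broadcast scale {result.shape} to {expected_shape}")

Each transform would then reduce to result = _broadcast_scale(_transform(*arrays, dtype=dtype), expected_shape, dtype).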

Comment on lines +93 to +101
else:
self.weight_block_size = [128, 128]
logger.info(
"Setting default weight_block_size=%s since quantization_config does not provide "
"FP8 block-scale details required by MLC (activation_scheme=%s, quant_method=%s)",
self.weight_block_size,
activation_scheme,
quant_method,
)

medium

The else branch sets self.weight_block_size = [128, 128] and logs an info message. While this provides a default, it might be better to define the default weight_block_size in the Ministral3Config dataclass itself, or to emit a warning (rather than an info log) when the quantization_config is malformed, instead of quietly falling back at load time. This keeps the configuration explicit.
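A sketch of that alternative, assuming Ministral3Config is a standard dataclass (only the relevant field is shown):

from dataclasses import dataclass, field
from typing import List

@dataclass
class Ministral3Config:  # abbreviated: other fields omitted
    # Explicit default, rather than a fallback assigned at config-load time.
    weight_block_size: List[int] = field(default_factory=lambda: [128, 128])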

Comment on lines +133 to +137
raise ValueError(
"Unable to determine the maximum sequence length, because none of "
"`context_window_size`, `max_position_embeddings` or "
"`max_sequence_length` is provided in `config.json`."
)

medium

The ValueError here is quite generic. It would be more helpful to the user if the error message explicitly listed which of context_window_size, max_position_embeddings, or max_sequence_length were checked and found missing, and perhaps suggested where to define them.
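For instance, a sketch of a more actionable message (wording and the suggested field value are illustrative only):

raise ValueError(
    "Unable to determine the maximum sequence length: none of "
    "`context_window_size`, `max_position_embeddings`, or "
    "`max_sequence_length` is set in `config.json`. Set one of these "
    "fields explicitly, for example \"max_position_embeddings\": 32768."
)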

Comment on lines +153 to +158
logger.info(
"%s defaults to %d",
bold("prefill_chunk_size"),
min(*prefill_chunk_size_candidates, 8192),
)
self.prefill_chunk_size = min(*prefill_chunk_size_candidates, 8192)

medium

The min(*prefill_chunk_size_candidates, 8192) logic for prefill_chunk_size is a bit implicit. It would be clearer to explicitly state the order of preference or the conditions under which each candidate is chosen. For example, if sliding_window_size is preferred over context_window_size when both are present, that should be clear.
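One way to state the preference order explicitly (a sketch; the candidate field names are assumptions based on the surrounding config logic):

# Prefer the tightest applicable bound: a positive sliding window caps the
# useful chunk size, otherwise the context window does; never exceed the
# 8192 default ceiling.
candidates = [
    c
    for c in (self.sliding_window_size, self.context_window_size)
    if c is not None and c > 0
]
self.prefill_chunk_size = min(candidates + [8192])
logger.info(
    "%s defaults to %d", bold("prefill_chunk_size"), self.prefill_chunk_size
)

Computing the minimum once also removes the duplicated min(...) call in the original snippet.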

"rope_theta": 1000000.0,
"rope_type": "yarn",
"type": "yarn",
},

medium

The sliding_window parameter is set to None. If this means the feature is disabled or not applicable, it might be more explicit to use -1 or 0 (if 0 implies disabled) for consistency with other int parameters that represent sizes, or add a comment explaining why None is used here.

Comment on lines +415 to +417
shape_supported_by_cutlass = (
self.weight.shape[0] % 128 == 0 and self.weight.shape[1] % 128 == 0
)

medium

The shape_supported_by_cutlass check only verifies self.weight.shape[0] % 128 == 0 and self.weight.shape[1] % 128 == 0. This condition may be too restrictive, or may not fully capture the requirements for Cutlass support, especially given the "Todo: check shape supported by cutlass for Hopper" comment in BlockScaleQuantizeLinear. It is important to ensure this condition is accurate for all target hardware, or to provide a more robust fallback.
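A hedged sketch of a more defensive variant (the exact Cutlass/Hopper shape requirements are not verified here; the 128x128 condition is kept as a placeholder and the warning is illustrative):

# Route only shapes that are exact multiples of the 128x128 block to the
# Cutlass kernel; warn and take the generic path otherwise.
shape_supported_by_cutlass = (
    self.weight.shape[0] % 128 == 0 and self.weight.shape[1] % 128 == 0
)
if not shape_supported_by_cutlass:
    logger.warning(
        "Weight shape %s is not a multiple of 128x128; using the "
        "non-Cutlass fallback path.",
        tuple(self.weight.shape),
    )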

Copilot AI left a comment

Pull request overview

This pull request adds support for the Ministral 3 model architecture to MLC LLM, including FP8 block-scale quantization with static activation support. The PR also adds several other model presets that appear to be part of a broader batch of model additions.

Changes:

  • Adds Ministral 3 model implementation with support for YARN RoPE and sliding window attention
  • Introduces FP8 block-scale quantization with static activation scales for improved performance
  • Adds multiple model presets including Ministral 3, Gemma variants, Qwen3 variants, SmolLM2 variants, and uncomments previously disabled presets

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
python/mlc_llm/quantization/quantization.py Adds new fp8_e4m3fn_bf16_block_scale_static_activation quantization scheme
python/mlc_llm/quantization/block_scale_quantization.py Implements BlockScaleQuantizeLinearStaticActivation class with static activation quantization support
python/mlc_llm/model/model_preset.py Adds Ministral 3 preset and multiple other model presets (Gemma, Qwen3, SmolLM2 variants)
python/mlc_llm/model/model.py Registers the ministral3 model with appropriate loaders and quantization methods
python/mlc_llm/model/ministral3/ministral3_quantization.py Provides quantization functions for Ministral 3 (group, ft, awq, no_quant, block_scale)
python/mlc_llm/model/ministral3/ministral3_model.py Implements the Ministral 3 model architecture including attention, MLP, and decoder layers
python/mlc_llm/model/ministral3/ministral3_loader.py Handles parameter mapping from HuggingFace format including FP8 quantized weights
python/mlc_llm/interface/gen_config.py Adds ministral3 to the list of supported model types
python/mlc_llm/conversation_template/ministral3.py Defines the conversation template for Ministral 3 including system prompts
python/mlc_llm/conversation_template/__init__.py Imports the ministral3 conversation template
python/mlc_llm/model/ministral3/__init__.py Empty __init__.py for the ministral3 module

Comment on lines +12 to +17
from mlc_llm.quantization import Quantization

from .ministral3_model import Ministral3Config, Mistral3ForConditionalGeneration
from .ministral3_quantization import BlockScaleQuantize


Copilot AI Jan 21, 2026

The import statement imports BlockScaleQuantize from the wrong module. It should be imported from mlc_llm.quantization instead of from .ministral3_quantization. The ministral3_quantization module doesn't export BlockScaleQuantize; it only imports it from mlc_llm.quantization to use in its functions.

Suggested change
- from mlc_llm.quantization import Quantization
- from .ministral3_model import Ministral3Config, Mistral3ForConditionalGeneration
- from .ministral3_quantization import BlockScaleQuantize
+ from mlc_llm.quantization import Quantization, BlockScaleQuantize
+ from .ministral3_model import Ministral3Config, Mistral3ForConditionalGeneration

class Ministral3Attention(nn.Module): # pylint: disable=too-many-instance-attributes
"""Same as LlamaAttention, but with sliding window attention using a rolling buffer cache."""

def __init__(self, config: Ministral3Config):
Copilot AI Jan 21, 2026

The __init__ method is missing a call to super().__init__(). This should be added as the first line after the method signature to properly initialize the parent nn.Module class.

Suggested change
- def __init__(self, config: Ministral3Config):
+ def __init__(self, config: Ministral3Config):
+     super().__init__()

class Ministral3DecoderLayer(nn.Module):
"""Exact same as LlamaDecoderLayer."""

def __init__(self, config: Ministral3Config):
Copilot AI Jan 21, 2026

The __init__ method is missing a call to super().__init__(). This should be added as the first line after the method signature to properly initialize the parent nn.Module class.

Suggested change
- def __init__(self, config: Ministral3Config):
+ def __init__(self, config: Ministral3Config):
+     super().__init__()

class Ministral3Model(nn.Module):
"""Exact same as LlamaModel."""

def __init__(self, config: Ministral3Config):
Copilot AI Jan 21, 2026

The __init__ method is missing a call to super().__init__(). This should be added as the first line after the method signature to properly initialize the parent nn.Module class.

Suggested change
- def __init__(self, config: Ministral3Config):
+ def __init__(self, config: Ministral3Config):
+     super().__init__()

class Mistral3ForConditionalGeneration(nn.Module): # pylint: disable=too-many-instance-attributes
def __init__(self, config: Ministral3Config):
Copilot AI Jan 21, 2026

The __init__ method is missing a call to super().__init__(). This should be added as the first line after the method signature to properly initialize the parent nn.Module class.

Suggested change
- def __init__(self, config: Ministral3Config):
+ def __init__(self, config: Ministral3Config):
+     super().__init__()
