[Model] Support Ministral 3 #3409
base: main
Conversation
Summary of Changes

Hello @akaashrp, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces comprehensive support for the Ministral 3 model within the MLC LLM framework: it adds the model architecture, a dedicated conversation template, parameter mapping from HuggingFace, and several quantization methods to optimize performance and reduce memory footprint. The changes enable users to seamlessly use Ministral 3 models within the MLC LLM ecosystem.
Code Review
This pull request introduces support for the Ministral 3 model, including its conversation template, model configuration, parameter loader, and quantization methods. It also updates various model presets and quantization configurations to accommodate the new model and enhance existing FP8 block-scale quantization with static activation scaling. The changes are well-structured and integrate seamlessly with the existing framework.
```python
if isinstance(shard, tp.ShardSingleDim) and shard.segs is not None:
    shard.segs = [x // weight_block_size[shard.dim] for x in shard.segs]
```
The `shard.segs` modification `shard.segs = [x // weight_block_size[shard.dim] for x in shard.segs]` is applied directly to `shard`. If `shard` is a shared object, this modification could have unintended side effects on other parts of the code that might be using the same `shard` instance. It might be safer to create a copy of `shard` before modifying its `segs` attribute.
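As an illustration, here is a minimal sketch of the copy-before-mutate pattern, assuming a shallow `copy.copy` is sufficient for `ShardSingleDim`; this is not the PR's actual code:

```python
import copy

# Sketch: copy the shard before rescaling its segment boundaries, so any other
# references to the original ShardSingleDim instance keep their unscaled segs.
if isinstance(shard, tp.ShardSingleDim) and shard.segs is not None:
    shard = copy.copy(shard)  # shallow copy; segs is reassigned below, not mutated in place
    shard.segs = [x // weight_block_size[shard.dim] for x in shard.segs]
```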
```python
if (
    result.ndim == 1
    and result.size > 1
    and len(expected_weight_scale_shape) >= 2
    and expected_weight_scale_shape[0] % result.size == 0
):
    rows_per_segment = expected_weight_scale_shape[0] // result.size
    tiled = np.repeat(result, rows_per_segment)
    tiled = tiled.reshape(expected_weight_scale_shape[0], 1)
    return np.broadcast_to(tiled, expected_weight_scale_shape).astype(dtype)
```
In `_weight_scale_transform`, the logic for handling various `result.shape` scenarios is quite complex, especially the `if result.ndim == 1 and result.size > 1` block. This might indicate that the `_transform` function or the input arrays could sometimes produce unexpected shapes. It would be beneficial to add comments explaining the rationale behind each shape-handling branch, or to simplify the `_transform` function if possible to ensure more predictable outputs.
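For example, the branch could document its assumptions inline. The comments below reflect one reading of the logic (that the 1-D vector holds one scale per contiguous block of rows) and are illustrative, not part of the PR:

```python
if (
    result.ndim == 1                                        # scales arrived as a flat vector
    and result.size > 1                                     # more than one scale, so tiling is needed
    and len(expected_weight_scale_shape) >= 2               # target is a 2-D (rows, cols) scale grid
    and expected_weight_scale_shape[0] % result.size == 0   # rows split evenly across the scales
):
    # Repeat each scale over its block of rows, then broadcast across columns.
    rows_per_segment = expected_weight_scale_shape[0] // result.size
    tiled = np.repeat(result, rows_per_segment).reshape(expected_weight_scale_shape[0], 1)
    return np.broadcast_to(tiled, expected_weight_scale_shape).astype(dtype)
```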
```python
def _activation_scale_transform(*arrays, dtype: str, _transform=transform):
    result = _transform(*arrays, dtype=dtype)
    result = np.asarray(result, dtype=dtype)
    if result.shape == expected_shape:
        return result
    if result.shape == ():
        # HF checkpoint stores a single scale; broadcast across the expected dimension.
        return np.full(expected_shape, result.item(), dtype=dtype)
    if result.shape == (1,) and expected_shape != (1,):
        return np.broadcast_to(result, expected_shape).astype(dtype)
    if (
        result.ndim == 1
        and result.size > 1
        and len(expected_shape) >= 1
        and expected_shape[0] % result.size == 0
    ):
        rows_per_segment = expected_shape[0] // result.size
        tiled = np.repeat(result, rows_per_segment)
        return tiled.reshape(expected_shape).astype(dtype)
```
Similar to `_weight_scale_transform`, the `_activation_scale_transform` function also has complex logic for handling different `result.shape` scenarios. Specifically, the broadcasting logic for `result.shape == ()` and `result.shape == (1,)` could be made more explicit with comments, or potentially refactored if there's a common utility for such broadcasting operations.
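A sketch of what such a shared helper could look like; the name `_broadcast_scale` and the exact behaviour of its last branch are assumptions, not code from the PR. Both `_weight_scale_transform` and `_activation_scale_transform` could then reduce to a call into this helper after running `_transform`:

```python
import numpy as np

def _broadcast_scale(result, expected_shape, dtype: str) -> np.ndarray:
    """Expand a scalar, length-1, or per-segment scale array to expected_shape."""
    result = np.asarray(result, dtype=dtype)
    if result.shape == tuple(expected_shape):
        return result
    if result.size == 1:
        # A single scale stored in the checkpoint: replicate it everywhere.
        return np.full(expected_shape, result.reshape(()).item(), dtype=dtype)
    if result.ndim == 1 and len(expected_shape) >= 1 and expected_shape[0] % result.size == 0:
        # One scale per row segment: repeat row-wise, then broadcast remaining dims.
        repeated = np.repeat(result, expected_shape[0] // result.size)
        shape = (expected_shape[0],) + (1,) * (len(expected_shape) - 1)
        return np.broadcast_to(repeated.reshape(shape), expected_shape).astype(dtype)
    raise ValueError(f"Cannot broadcast scale of shape {result.shape} to {expected_shape}")
```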
```python
else:
    self.weight_block_size = [128, 128]
    logger.info(
        "Setting default weight_block_size=%s since quantization_config does not provide "
        "FP8 block-scale details required by MLC (activation_scheme=%s, quant_method=%s)",
        self.weight_block_size,
        activation_scheme,
        quant_method,
    )
```
The `else` branch sets `self.weight_block_size = [128, 128]` and logs an info message. While this provides a default, it might be better to explicitly define a default `weight_block_size` in the `Ministral3Config` dataclass definition itself, or raise a more specific warning if the `quantization_config` is malformed rather than silently defaulting. This ensures that the configuration is always explicit.
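For illustration, a minimal excerpt of how the default could live on the config dataclass itself; the field list is heavily trimmed and this sketch is not the PR's actual `Ministral3Config`:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Ministral3Config:
    # Explicit default: FP8 block-scale quantization assumes 128x128 weight blocks
    # unless quantization_config in config.json overrides it.
    weight_block_size: List[int] = field(default_factory=lambda: [128, 128])
```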
```python
raise ValueError(
    "Unable to determine the maximum sequence length, because none of "
    "`context_window_size`, `max_position_embeddings` or "
    "`max_sequence_length` is provided in `config.json`."
)
```
```python
logger.info(
    "%s defaults to %d",
    bold("prefill_chunk_size"),
    min(*prefill_chunk_size_candidates, 8192),
)
self.prefill_chunk_size = min(*prefill_chunk_size_candidates, 8192)
```
The `min(*prefill_chunk_size_candidates, 8192)` logic for `prefill_chunk_size` is a bit implicit. It would be clearer to explicitly state the order of preference or the conditions under which each candidate is chosen. For example, if `sliding_window_size` is preferred over `context_window_size` when both are present, that should be clear.
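One way to make the intent explicit is to compute the value once, name it, and document the cap. This is a sketch; the exact contents of `prefill_chunk_size_candidates` are as in the PR and are assumed to hold the window-size candidates:

```python
# Sketch: prefill_chunk_size defaults to the smallest candidate window size,
# capped at 8192, the default upper bound used here.
default_prefill_chunk_size = min(*prefill_chunk_size_candidates, 8192)
logger.info(
    "%s defaults to %d (min of candidate window sizes, capped at 8192)",
    bold("prefill_chunk_size"),
    default_prefill_chunk_size,
)
self.prefill_chunk_size = default_prefill_chunk_size
```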
```python
"rope_theta": 1000000.0,
"rope_type": "yarn",
"type": "yarn",
},
```
```python
shape_supported_by_cutlass = (
    self.weight.shape[0] % 128 == 0 and self.weight.shape[1] % 128 == 0
)
```
The `shape_supported_by_cutlass` check only tests `self.weight.shape[0] % 128 == 0 and self.weight.shape[1] % 128 == 0`. This check might be too restrictive or not fully capture all conditions for Cutlass support, especially given the `Todo: check "shape supported by cutlass" for Hopper` comment in `BlockScaleQuantizeLinear`. It's important to ensure this condition is accurate for all target hardware or to provide a more robust fallback.
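As a sketch of a more explicit fallback, the check could be factored into a named helper so the non-Cutlass path is visible at the call site; the helper name and the else-branch behaviour are illustrative assumptions, not code from the PR:

```python
def _cutlass_block_scale_supported(weight_shape) -> bool:
    # Mirrors the PR's current condition: both dimensions must be multiples of the
    # 128x128 block size. Any additional Hopper-specific constraints would go here.
    return weight_shape[0] % 128 == 0 and weight_shape[1] % 128 == 0

if _cutlass_block_scale_supported(self.weight.shape):
    ...  # dispatch to the Cutlass block-scale GEMM
else:
    ...  # fall back to a generic dequantize-then-matmul path
```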
Pull request overview
This pull request adds support for the Ministral 3 model architecture to MLC LLM, including FP8 block-scale quantization with static activation support. The PR also adds several other model presets that appear to be part of a broader batch of model additions.
Changes:
- Adds Ministral 3 model implementation with support for YARN RoPE and sliding window attention
- Introduces FP8 block-scale quantization with static activation scales for improved performance
- Adds multiple model presets including Ministral 3, Gemma variants, Qwen3 variants, SmolLM2 variants, and uncomments previously disabled presets
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 5 comments.
Summary per file:
| File | Description |
|---|---|
| python/mlc_llm/quantization/quantization.py | Adds new fp8_e4m3fn_bf16_block_scale_static_activation quantization scheme |
| python/mlc_llm/quantization/block_scale_quantization.py | Implements BlockScaleQuantizeLinearStaticActivation class with static activation quantization support |
| python/mlc_llm/model/model_preset.py | Adds Ministral 3 preset and multiple other model presets (Gemma, Qwen3, SmolLM2 variants) |
| python/mlc_llm/model/model.py | Registers the ministral3 model with appropriate loaders and quantization methods |
| python/mlc_llm/model/ministral3/ministral3_quantization.py | Provides quantization functions for Ministral 3 (group, ft, awq, no_quant, block_scale) |
| python/mlc_llm/model/ministral3/ministral3_model.py | Implements the Ministral 3 model architecture including attention, MLP, and decoder layers |
| python/mlc_llm/model/ministral3/ministral3_loader.py | Handles parameter mapping from HuggingFace format including FP8 quantized weights |
| python/mlc_llm/interface/gen_config.py | Adds ministral3 to the list of supported model types |
| python/mlc_llm/conversation_template/ministral3.py | Defines the conversation template for Ministral 3 including system prompts |
| python/mlc_llm/conversation_template/__init__.py | Imports the ministral3 conversation template |
| python/mlc_llm/model/ministral3/__init__.py | Empty `__init__.py` file for the ministral3 module |
```python
from mlc_llm.quantization import Quantization

from .ministral3_model import Ministral3Config, Mistral3ForConditionalGeneration
from .ministral3_quantization import BlockScaleQuantize
```
Copilot AI · Jan 21, 2026
The import statement imports `BlockScaleQuantize` from the wrong module. It should be imported from `mlc_llm.quantization` instead of from `.ministral3_quantization`. The `ministral3_quantization` module doesn't export `BlockScaleQuantize`; it only imports it from `mlc_llm.quantization` to use in its functions.
Suggested change:

```diff
-from mlc_llm.quantization import Quantization
-from .ministral3_model import Ministral3Config, Mistral3ForConditionalGeneration
-from .ministral3_quantization import BlockScaleQuantize
+from mlc_llm.quantization import Quantization, BlockScaleQuantize
+from .ministral3_model import Ministral3Config, Mistral3ForConditionalGeneration
```
```python
class Ministral3Attention(nn.Module):  # pylint: disable=too-many-instance-attributes
    """Same as LlamaAttention, but with sliding window attention using a rolling buffer cache."""

    def __init__(self, config: Ministral3Config):
```
Copilot AI · Jan 21, 2026
The `__init__` method is missing a call to `super().__init__()`. This should be added as the first line after the method signature to properly initialize the parent `nn.Module` class.
Suggested change:

```diff
 def __init__(self, config: Ministral3Config):
+    super().__init__()
```
```python
class Ministral3DecoderLayer(nn.Module):
    """Exact same as LlamaDecoderLayer."""

    def __init__(self, config: Ministral3Config):
```
Copilot AI · Jan 21, 2026
The `__init__` method is missing a call to `super().__init__()`. This should be added as the first line after the method signature to properly initialize the parent `nn.Module` class.
Suggested change:

```diff
 def __init__(self, config: Ministral3Config):
+    super().__init__()
```
```python
class Ministral3Model(nn.Module):
    """Exact same as LlamaModel."""

    def __init__(self, config: Ministral3Config):
```
Copilot AI · Jan 21, 2026
The `__init__` method is missing a call to `super().__init__()`. This should be added as the first line after the method signature to properly initialize the parent `nn.Module` class.
Suggested change:

```diff
 def __init__(self, config: Ministral3Config):
+    super().__init__()
```
```python
class Mistral3ForConditionalGeneration(nn.Module):  # pylint: disable=too-many-instance-attributes
    def __init__(self, config: Ministral3Config):
```
Copilot AI · Jan 21, 2026
The `__init__` method is missing a call to `super().__init__()`. This should be added as the first line after the method signature to properly initialize the parent `nn.Module` class.
Suggested change:

```diff
 def __init__(self, config: Ministral3Config):
+    super().__init__()
```