Fixing Qwen3-VL-2B ONNX Export Logits
Hey guys, this is a deep dive into an issue I faced while exporting the Qwen3-VL-2B model to ONNX format. The problem? Incorrect logits distribution, leading to the model generating gibberish instead of coherent text. Let's break down the problem, the steps I took to troubleshoot, and the potential causes. This is pretty technical, so buckle up!
The Problem: Garbage In, Garbage Out
My initial goal was to get the Qwen3-VL-2B model running in ONNX for faster inference. I used the llmexport tool, which is pretty handy for this. After exporting the model and running a simple inference with an image input and the generation prompt `<|im_start|>assistant\n`, I ran into a major roadblock. Instead of predicting expected words like "The", "This", or "A" with reasonable probabilities, the model was spitting out a stream of spaces, commas, and newlines. Specifically, the first-token predictions looked something like this:
- " " (space): ~13%
- "," (comma): ~11%
- "\n" (newline): ~9%
- "a": ~3%
This is a classic sign of the model's output distribution being completely off. Instead of generating meaningful text, it was getting stuck on repetitive, non-sensical tokens. Clearly, something was going wrong in the ONNX export or inference process.
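For reference, here's roughly how I was inspecting those first-token probabilities. This is a minimal sketch: the `session`, `prefill_inputs`, and `tokenizer` names in the usage comment are hypothetical stand-ins for the exported model's actual interface, not llmexport's API.

```python
import numpy as np

def top_k_first_token(logits: np.ndarray, tokenizer, k: int = 5) -> None:
    """Softmax the last-position logits and print the k most likely next tokens."""
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    for tok_id in probs.argsort()[::-1][:k]:
        print(repr(tokenizer.decode([tok_id])), f"{probs[tok_id]:.1%}")

# Usage (hypothetical names), after the prefill run of the exported model:
#   logits = session.run(None, prefill_inputs)[0][0, -1]
#   top_k_first_token(logits, tokenizer)
```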
Debugging and Troubleshooting Steps
I needed to find out why the model was behaving this way. So, I went through a rigorous debugging process to isolate the issue. Here's a rundown of my troubleshooting steps:
- Visual Encoder Check: I confirmed that the visual encoder, which processes the image input, was working correctly. It was producing valid embeddings, which are numerical representations of the image's content.
- Position ID Verification: I made sure that the position IDs, which encode token order for the model, were being constructed properly for the MRoPE (Multimodal Rotary Position Embedding) mechanism. This is crucial for the model to understand the sequence of words.
- Deepstack Integration: I checked the integration of Deepstack features, which provide additional visual information, into the transformer layers. I confirmed that these features were being applied to the correct layers (5, 11, and 17). This is an important part of the model's visual reasoning capabilities.
- KV Cache Inspection: I verified that the KV cache (Key-Value cache), used to store previous token information for efficient generation, was growing correctly. This cache is essential for the model to maintain context during the generation process.
- Attention Mask Check: I ensured that the attention masks, which control which parts of the input the model should pay attention to, were correct. This is critical for preventing the model from attending to irrelevant parts of the input.
- NaN/Inf Value Check: I thoroughly examined the model's weights and activations for any NaN (Not a Number) or Inf (Infinity) values, which can wreck the calculations. No such values were found, which ruled out numerical blow-ups as the cause.
- ONNX Graph Inspection: I inspected the ONNX graph structure to make sure everything looked correct. Specifically, I was looking for any unexpected nodes or operations, especially leftover "FakeLinear" nodes, which can indicate problems during the conversion process (a small script covering this check and the NaN/Inf sweep follows this list).
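For those last two checks, a script like this covers both the node scan and the NaN/Inf sweep over the weights. It's a sketch using the standard `onnx` and `numpy` APIs; the model path is a placeholder for whatever llmexport actually writes out.

```python
import numpy as np
import onnx
from onnx import numpy_helper

model = onnx.load("llm.onnx")  # placeholder path for the exported model

# Scan for leftover custom ops such as "FakeLinear" that should have been
# rewritten into standard MatMul nodes during conversion.
leftovers = [n.name for n in model.graph.node if n.op_type == "FakeLinear"]
print("FakeLinear nodes:", leftovers or "none")

# Sweep every initializer (weight tensor) for NaN/Inf values.
for init in model.graph.initializer:
    w = numpy_helper.to_array(init)
    if np.issubdtype(w.dtype, np.floating) and not np.isfinite(w).all():
        print("non-finite values in", init.name)
```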
The Culprit: Where Things Might Go Wrong
After ruling out many potential issues on the inference side, the problem seems to be happening before text generation. The first token probabilities after prefill are already incorrect. This points towards issues during the export process itself. I have some hunches about what might be causing this issue:
- Weight Transpose Issues: During the conversion from the original model to ONNX, something might be off with the weight transpose in the `FakeLinear`-to-`MatMul` conversion in the `onnx_rebuilder.py` script. Getting the transpose right is essential for the matrix multiplications to compute what the original linear layers computed (see the first sketch after this list).
- MRoPE Export: MRoPE (Multimodal Rotary Position Embedding) is the 3D rotary position encoding Qwen3-VL uses for temporal, height, and width positions. The model is very sensitive to position information, and ONNX export tools can struggle to translate this kind of embedding faithfully into the graph (see the second sketch after this list).
- Deepstack Integration Errors: The integration of Deepstack features is another potential area where things could go wrong. Deepstack provides additional visual context to the model, and it can be difficult for the export process to translate the feature injection correctly into the ONNX graph.
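To make the transpose suspicion concrete: PyTorch's `nn.Linear` stores its weight as `[out_features, in_features]` and computes `y = x @ W.T`, while an ONNX `MatMul` of `x @ W` expects `W` as `[in_features, out_features]`. A rebuilder that gets the layout wrong can still produce shape-compatible tensors whose values are garbage, which is exactly the failure mode I'm seeing. A quick demonstration (a sketch, not llmexport's actual code):

```python
import torch

lin = torch.nn.Linear(4, 3, bias=False)
x = torch.randn(2, 4)

# Correct: MatMul against the transpose of the [out, in] weight.
correct = x @ lin.weight.T
# Wrong: reinterpreting the same buffer as [in, out] without transposing.
# The shapes still line up, but the values are scrambled.
wrong = x @ lin.weight.reshape(4, 3)

print(torch.allclose(lin(x), correct))  # True
print(torch.allclose(lin(x), wrong))    # False
```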
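For the MRoPE concern, the thing to verify is that the exported graph receives position IDs shaped `[3, seq_len]` (temporal, height, width) rather than the usual flat `[seq_len]`. For text-only positions the three components are identical, which makes a simple test case. Here's a sketch of the layout as I understand it from the Qwen-VL papers; treat the exact shape convention as an assumption to double-check against the original modeling code:

```python
import numpy as np

def text_only_mrope_position_ids(seq_len: int) -> np.ndarray:
    """For pure text, the temporal/height/width positions all equal arange(seq_len)."""
    base = np.arange(seq_len, dtype=np.int64)
    return np.stack([base, base, base])  # shape [3, seq_len]

print(text_only_mrope_position_ids(6))
# If the exported graph only accepts a flat [seq_len] input, the per-axis
# rotary rotation can't be applied and the logits will come out wrong.
```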
Code Modifications and Experiments
To try and fix the issue, I made some changes to the export code. Specifically, I focused on properly applying the Deepstack features to the correct layers. I modified `llmexport/llmexport.py` to inject Deepstack features in layers 5, 11, and 17. The updated code looks like this:
```python
# Inside the per-layer loop: inject Deepstack visual features into the
# hidden states of the layers listed in deepstack_visual_indexes.
if deepstack_embeds is not None:
    if hasattr(self, 'visual') and self.visual is not None and hasattr(self.visual, 'deepstack_visual_indexes'):
        if i in self.visual.deepstack_visual_indexes:
            # Map the layer index i to its slot in the deepstack tensor.
            idx = self.visual.deepstack_visual_indexes.index(i)
            if idx < deepstack_embeds.shape[0]:
                hidden_states += deepstack_embeds[idx]
```
I also updated the shape of the Deepstack embeddings from `[3, 1, hidden_size]` to `[3, seq_len, hidden_size]` so it matches the sequence length dynamically:

```python
# Mark axis 1 of deepstack_embeds as symbolic so the exported graph
# accepts any sequence length rather than a fixed one.
qwen3_dynamic_axes = self.model_dynamic_axes.copy()
qwen3_dynamic_axes['deepstack_embeds'] = {1: 'seq_len'}
```
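For context, that dictionary is the standard `dynamic_axes` mapping that `torch.onnx.export` consumes. Here's a self-contained toy export showing the mechanism; the `Dummy` module and all names are mine, not llmexport's:

```python
import torch

class Dummy(torch.nn.Module):
    """Stand-in for the wrapped LLM block, just to demonstrate dynamic_axes."""
    def forward(self, input_ids, deepstack_embeds):
        # Broadcast-add so both inputs influence the output shape.
        return deepstack_embeds.sum(dim=0) + input_ids.unsqueeze(-1).float()

input_ids = torch.zeros(1, 4, dtype=torch.int64)
deepstack_embeds = torch.randn(3, 4, 8)  # [3, seq_len, hidden_size]

torch.onnx.export(
    Dummy(), (input_ids, deepstack_embeds), "dummy.onnx",
    input_names=["input_ids", "deepstack_embeds"],
    output_names=["hidden_states"],
    dynamic_axes={
        "input_ids": {1: "seq_len"},
        "deepstack_embeds": {1: "seq_len"},  # axis 1 is now symbolic
    },
)
```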
Unfortunately, despite these changes, the issue persisted. This suggests that the problem might lie in other areas of the export process. It is possible that the original model is not fully compatible with the ONNX export tool.
What's Next?
So, where do we go from here? Here are some possible next steps:
- Deep Dive into ONNX Conversion: Spend more time studying how the model is converted to ONNX, checking for weight-transposition errors and incorrect handling of the 3D rotary embeddings.
- ONNX Runtime Debugging: Inspect the graph statically with a tool like Netron, then expose intermediate tensors as extra graph outputs so I can watch what happens step by step during inference. This should pinpoint where the model starts to go wrong (a minimal sketch of this follows the list).
- Community Collaboration: I'm posting this here in the hopes that others can learn from my mistakes or contribute fresh insights. If anyone has experience with ONNX exports or similar issues, your input is greatly appreciated!
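The usual trick for that runtime inspection is to append an intermediate tensor to the graph's outputs and re-run the session. A sketch using the standard `onnx` helper API; the tensor name is a placeholder for whatever Netron shows for the layer you want to probe:

```python
import onnx
import onnxruntime as ort

model = onnx.load("llm.onnx")  # placeholder path

# Expose an intermediate tensor as an extra graph output. The name below is
# hypothetical; copy the real tensor name from Netron.
probe = onnx.helper.make_tensor_value_info(
    "layer5_hidden", onnx.TensorProto.FLOAT, None)
model.graph.output.append(probe)
onnx.save(model, "llm_probe.onnx")

sess = ort.InferenceSession("llm_probe.onnx",
                            providers=["CPUExecutionProvider"])
# Re-run the same prefill inputs; the probed tensor now comes back as the
# last element of sess.run(None, prefill_inputs).
```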
Thanks for reading, guys! Hopefully, this helps someone else facing the same challenges. Let me know if you have any questions or suggestions!