Mixture of Experts (MoE) divides computation among multiple expert subnetworks. In DeepSeek, the MoE layer replaces the transformer feed-forward block with a sparse mixture of many MLP experts, where a gating network selects only a few experts for each token.

Axes (9 dimensions):
- seq_len: variable
- num_experts, num_local_experts, hidden_size, intermediate_size, gemm1_out_size, num_hidden_blocks, num_intermediate_blocks, num_gemm1_out_blocks: constant
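The block-count axes are determined by the size axes once the quantization block width is fixed. As a minimal sketch: the block width of 128 (typical for DeepSeek-style FP8 block-wise quantization) and the relation gemm1_out_size = 2 * intermediate_size (fused gate and up projections) are assumptions here, not stated above.

```python
import math

# Assumed FP8 quantization block width (128 is typical for DeepSeek-style
# block-wise scaling; this value is an assumption, not given in the spec).
BLOCK = 128

def derived_axes(hidden_size: int, intermediate_size: int) -> dict:
    """Derive the constant block-count axes from the size axes."""
    # Assumption: gemm1 fuses the gate and up projections, so its output
    # dimension is twice the expert intermediate size.
    gemm1_out_size = 2 * intermediate_size
    return {
        "gemm1_out_size": gemm1_out_size,
        "num_hidden_blocks": math.ceil(hidden_size / BLOCK),
        "num_intermediate_blocks": math.ceil(intermediate_size / BLOCK),
        "num_gemm1_out_blocks": math.ceil(gemm1_out_size / BLOCK),
    }

# Example with DeepSeek-V3-like expert sizes (hidden 7168, expert FFN 2048):
print(derived_axes(7168, 2048))
```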
Inputs:
- routing_logits: Tensor of routing logits for expert selection [seq_len, num_experts]
- routing_bias: Bias tensor for routing; pass all zeros for no bias [num_experts]
- hidden_states: Input hidden states tensor (FP8 quantized) [seq_len, hidden_size]
- hidden_states_scale: Block-wise scaling factors for hidden states [num_hidden_blocks, seq_len]
- gemm1_weights: First GEMM weights for all local experts (gate and up projections) [num_local_experts, gemm1_out_size, hidden_size]
- gemm1_weights_scale: Block-wise scaling factors for first GEMM weights [num_local_experts, num_gemm1_out_blocks, num_hidden_blocks]
- gemm2_weights: Second GEMM weights for all local experts (down projection) [num_local_experts, hidden_size, intermediate_size]
- gemm2_weights_scale: Block-wise scaling factors for second GEMM weights [num_local_experts, num_hidden_blocks, num_intermediate_blocks]
- local_expert_offset: Offset of local experts in the global expert space (scalar)
- routed_scaling_factor: Scaling factor for routing weights (scalar)

Output:
- output: Final MoE output tensor [seq_len, hidden_size]
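How the inputs above combine can be sketched with a dequantized NumPy reference (scales already folded into the tensors, so the FP8 block-scale arguments drop out). This is a minimal sketch, not the kernel's actual algorithm: top-k softmax routing with an additive bias, a SwiGLU expert MLP, the gate-before-up ordering inside gemm1, and the `top_k` parameter are all assumptions; DeepSeek's production routing (sigmoid scores, expert groups) is more involved.

```python
import numpy as np

def _softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_reference(routing_logits, routing_bias, hidden_states,
                  gemm1_weights, gemm2_weights,
                  local_expert_offset, routed_scaling_factor, top_k=2):
    """Dequantized MoE reference (hypothetical simplification of the kernel)."""
    seq_len, hidden_size = hidden_states.shape
    num_local_experts = gemm1_weights.shape[0]
    # Routing: biased logits -> softmax -> top-k experts per token.
    scores = _softmax(routing_logits + routing_bias)      # [seq_len, num_experts]
    topk_ids = np.argsort(-scores, axis=-1)[:, :top_k]
    output = np.zeros((seq_len, hidden_size))
    for t in range(seq_len):
        for e in topk_ids[t]:
            le = e - local_expert_offset   # map global expert id to local slot
            if le < 0 or le >= num_local_experts:
                continue                   # expert is owned by another rank
            h = gemm1_weights[le] @ hidden_states[t]      # [gemm1_out_size]
            gate, up = np.split(h, 2)                     # assumed ordering
            act = up * (gate / (1.0 + np.exp(-gate)))     # SwiGLU: up * SiLU(gate)
            down = gemm2_weights[le] @ act                # [hidden_size]
            output[t] += scores[t, e] * down              # weight by routing score
    return routed_scaling_factor * output
```

Note how `local_expert_offset` is used: each rank holds only a contiguous slice of the global expert set, so tokens routed to experts outside [offset, offset + num_local_experts) contribute nothing locally and are reduced across ranks elsewhere.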

