Grouped Query Attention (GQA) with a ragged (variable-length) tensor layout. Sequences of different lengths are concatenated into a single packed tensor and addressed through offset arrays, which eliminates padding and improves memory efficiency for variable-length batches. A packing sketch follows this paragraph.
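
The following is a minimal PyTorch sketch of the ragged layout: a batch of variable-length sequences is packed into one tensor and the corresponding `qo_indptr` offsets are built as an exclusive prefix sum. The sequence lengths and head sizes are illustrative values, not part of the specification below.

```python
import torch

# Hypothetical per-request query lengths for a batch of 3 sequences.
q_lens = [5, 2, 7]
num_qo_heads, head_dim = 8, 128

# Ragged layout: sequences are concatenated along the first axis,
# so q has shape [total_q, num_qo_heads, head_dim] with no padding.
total_q = sum(q_lens)
q = torch.randn(total_q, num_qo_heads, head_dim, dtype=torch.float16)

# qo_indptr is the exclusive prefix sum of sequence lengths:
# sequence i occupies rows qo_indptr[i] : qo_indptr[i + 1] of q.
qo_indptr = torch.zeros(len(q_lens) + 1, dtype=torch.int32)
qo_indptr[1:] = torch.cumsum(torch.tensor(q_lens), dim=0)
# qo_indptr == tensor([0, 5, 7, 14]); note qo_indptr[-1] == total_q.
```

The key/value tensors and `kv_indptr` are packed the same way from the per-sequence key/value lengths.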

prefill

Axes (6 dimensions):
  • total_q, total_kv, len_indptr: variable
  • num_qo_heads, num_kv_heads, head_dim: constant
Inputs (5 tensors + 1 scalar):
  • q: query tensor [total_q, num_qo_heads, head_dim]
  • k, v: key-value tensors [total_kv, num_kv_heads, head_dim]
  • qo_indptr, kv_indptr: per-sequence start offsets into q and k/v (exclusive prefix sums of sequence lengths, each of length len_indptr)
  • sm_scale: softmax scale (scalar)
Outputs (2 tensors):
  • output: attention output [total_q, num_qo_heads, head_dim]
  • lse: log-sum-exp values [total_q, num_qo_heads]
Constraints:
  • total_q == qo_indptr[-1]
  • total_kv == kv_indptr[-1]
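
As a minimal, unoptimized reference for what this prefill computation produces, the PyTorch sketch below loops over the sequences delimited by the indptr arrays, broadcasts each KV head to its group of query heads (standard GQA, assuming num_qo_heads is a multiple of num_kv_heads), and returns both the attention output and the per-query log-sum-exp. It applies no causal mask and uses natural-log LSE; these are assumptions, as the specification above does not state either. Function and local names are illustrative.

```python
import torch

def gqa_ragged_prefill(q, k, v, qo_indptr, kv_indptr, sm_scale):
    """Reference (non-causal) GQA attention over a ragged batch.

    q:    [total_q,  num_qo_heads, head_dim]
    k, v: [total_kv, num_kv_heads, head_dim]
    qo_indptr, kv_indptr: offset arrays of length len_indptr
    Returns:
      output: [total_q, num_qo_heads, head_dim]
      lse:    [total_q, num_qo_heads]  (natural-log sum-exp of scaled scores)
    """
    total_q, num_qo_heads, head_dim = q.shape
    num_kv_heads = k.shape[1]
    assert num_qo_heads % num_kv_heads == 0, "GQA needs num_qo_heads % num_kv_heads == 0"
    group = num_qo_heads // num_kv_heads
    assert total_q == int(qo_indptr[-1]) and k.shape[0] == int(kv_indptr[-1])

    output = torch.empty_like(q)
    lse = torch.empty(total_q, num_qo_heads, dtype=torch.float32, device=q.device)

    batch = qo_indptr.numel() - 1
    for i in range(batch):
        qs, qe = int(qo_indptr[i]), int(qo_indptr[i + 1])
        ks, ke = int(kv_indptr[i]), int(kv_indptr[i + 1])
        # Slice out sequence i and broadcast each KV head to its query-head group.
        qi = q[qs:qe].float()                                  # [Lq, Hq, D]
        ki = k[ks:ke].float().repeat_interleave(group, dim=1)  # [Lk, Hq, D]
        vi = v[ks:ke].float().repeat_interleave(group, dim=1)  # [Lk, Hq, D]
        # Scaled dot-product scores per head: [Hq, Lq, Lk].
        scores = torch.einsum("qhd,khd->hqk", qi, ki) * sm_scale
        # Log-sum-exp over keys, stored per (query position, head).
        lse[qs:qe] = torch.logsumexp(scores, dim=-1).transpose(0, 1)
        attn = torch.softmax(scores, dim=-1)
        output[qs:qe] = torch.einsum("hqk,khd->qhd", attn, vi).to(q.dtype)
    return output, lse
```

A causal prefill would additionally mask scores where the key index exceeds the query index before the softmax; the lse output is what allows partial attention results (e.g., over split KV segments) to be merged correctly afterwards.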