
Overview

This document describes the schema for a workload Solution. A Solution provides a concrete, high-performance implementation of a given Definition. Each Solution is a self-contained entry submitted by community members or autonomous agents, encapsulating the source code and all metadata required for building, interfacing, and benchmarking. Every Solution is automatically benchmarked to collect evaluation statistics, which are included in a Trace.

JSON Schema Description

Top-Level Object Structure

  • name (string, required): A unique, human-readable name for this specific solution (e.g., rmsnorm_triton_v1_h100).
  • definition (string, required): The name of the Definition this implementation solves.
  • description (string, optional): A brief, human-readable description of the solution's technique or agent policy.
  • author (string, required): The name of the author or agent system.
  • spec (object, required): An object containing detailed technical specifications for the implementation.
  • sources (array, required): An array of file objects representing the complete source code, including any files necessary for building and execution.

sources : Source Code Files

The sources array contains any number of file objects, each representing a single source file in the project. The flashinfer-bench benchmarker reconstructs the project's directory structure from these paths before building the binaries/executables, as sketched after the table below.
  • path (string, required): The relative path of the file, including its name and extension (e.g., src/kernel.cu, main.py).
  • content (string, required): The complete text content of the source file.
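
To make the reconstruction step concrete, below is a minimal sketch of how a benchmarker could materialize a sources array onto disk. The helper name materialize_sources and its behavior are assumptions for illustration, not part of the flashinfer-bench API.

import os

def materialize_sources(sources: list[dict], root: str) -> None:
    """Write each file object under `root`, recreating its relative path."""
    for src in sources:
        dest = os.path.join(root, src["path"])
        os.makedirs(os.path.dirname(dest), exist_ok=True)  # e.g. creates root/src for src/kernel.cu
        with open(dest, "w") as f:
            f.write(src["content"])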

spec : Technical Specifications

This object details the technical requirements and properties of the source code.
  • language (string, required): The primary programming language (e.g., Triton, CUDA). The source code should always be launched from Python.
  • target_hardware (array, required): A list of hardware architectures the solution is compatible with (e.g., NVIDIA_H100, NVIDIA_B200).
  • dependencies (array, optional): A list of required libraries or toolchains.
  • entry_point (string, required): The exact path to the function to be called, in the format {file_path}::{function_name} (e.g., main.py::run).
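
For illustration, an entry_point in the {file_path}::{function_name} format can be resolved with standard Python import machinery. The loader below is a hypothetical sketch of that resolution step; the benchmarker's actual loading logic may differ.

import importlib.util
import os

def load_entry_point(entry_point: str, project_root: str):
    """Resolve an entry point such as 'main.py::run' to a callable."""
    file_path, func_name = entry_point.split("::")
    spec = importlib.util.spec_from_file_location("solution_module", os.path.join(project_root, file_path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # execute the solution's module body
    return getattr(module, func_name)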

Dependencies Handling

The dependencies field is an array of strings declaring the third-party packages needed to build and run the solution. Two classes of dependencies are handled: third-party CUDA libraries and Python packages:
  • CUDA/C++: Use version-pinned tokens. Example: CUTLASS_3_7 → the builder injects CUTLASS 3.7 header paths during compilation. Supported dependencies: (TODO)
  • Python libs: You may list package specifiers, but Python package installs are not currently managed; the builder only validates them against the current environment. If a listed library or version is not satisfied, the build fails fast with a compile error. Example: torch, triton >= 2.3 → the builder checks that the current environment provides these packages at the required versions.
This field is optional. In all other cases, solution building proceeds on a best-effort basis. A sketch of the Python-side validation follows.
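
The fail-fast check for Python packages can be expressed with importlib.metadata and the packaging library, as in the hypothetical sketch below; validate_python_deps is an assumed name, and the CUDA/C++ tokens are deliberately left out since the builder handles those separately.

from importlib.metadata import PackageNotFoundError, version
from packaging.requirements import Requirement

def validate_python_deps(deps: list[str]) -> None:
    """Fail fast if any listed Python dependency is unsatisfied in the current environment."""
    for dep in deps:
        req = Requirement(dep)  # parses specifiers like 'triton >= 2.3'
        try:
            installed = version(req.name)
        except PackageNotFoundError:
            raise RuntimeError(f"missing dependency: {dep}")
        if req.specifier and installed not in req.specifier:
            raise RuntimeError(f"unsatisfied dependency: {dep} (installed: {installed})")

validate_python_deps(["torch", "triton >= 2.3"])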

Signature and Naming Convention

The schema enforces a strict naming convention to eliminate ambiguity. The function specified in spec.entry_point must accept arguments whose names exactly match the keys in the inputs and outputs objects of the corresponding Definition.
  • For JIT-compiled languages like Triton, the source code should provide a Python launcher function as the entry_point. This launcher must have a signature that strictly matches the workload's defined names and will be called using keyword arguments (see the call sketch after this list).
  • For AOT-compiled languages like CUDA, the source should ideally include a C++ host-side launcher function, exposed through entry_point with the required named-argument signature.
  • CUDA solutions must provide a C/CUDA symbol as the entry_point. If users prefer to provide their own Python wrapper, they should set language to python and ensure compilation and binding are properly handled.
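
As a concrete illustration, consider the rmsnorm example later in this document, whose Definition inputs are named input, weight, and eps. A conforming launcher is invoked with exactly those keyword arguments. This sketch reuses the hypothetical load_entry_point helper from the spec section and an assumed solution directory; neither is a published flashinfer-bench API.

import torch

x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)
w = torch.ones(4096, device="cuda", dtype=torch.float16)

# Keyword arguments must match the Definition's input keys exactly.
run = load_entry_point("main.py::run", "/tmp/solution")  # hypothetical helper and path
output = run(input=x, weight=w, eps=1e-6)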

Example: Triton Implementation for GEMM

{
  "name": "gemm_triton_h100_v1",
  "definition": "gemm",
  "description": "A high-performance GEMM implementation (C = A @ B.T) using Triton. Generated by one-shot inquiry with Gemini-2.5-Pro.",
  "author": "gemini-2.5-pro-mystery-agent",
  "spec": {
    "language": "triton",
    "target_hardware": [
      "NVIDIA_H100"
    ],
    "dependencies": [
      "triton >= 2.3",
      "torch"
    ],
    "entry_point": "main.py::run"
  },
  "sources": [
    {
      "path": "main.py",
      "content": "import torch\nimport triton\nimport triton.language as tl\n\n@triton.autotune(\n    configs=[\n        triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 64, 'GROUP_SIZE_M': 8}, num_stages=3, num_warps=8),\n        triton.Config({'BLOCK_SIZE_M': 256, 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 64, 'GROUP_SIZE_M': 8}, num_stages=3, num_warps=8)\n    ],\n    key=['M', 'N', 'K'],\n)\n@triton.jit\ndef _gemm_kernel(\n    A, B, C, M, N, K, stride_am, stride_ak, stride_bn, stride_bk, stride_cm, stride_cn, BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr, BLOCK_SIZE_K: tl.constexpr, GROUP_SIZE_M: tl.constexpr\n):\n    # ... (Triton kernel logic as before)\n    pid = tl.program_id(axis=0)\n    num_pid_m = tl.cdiv(M, BLOCK_SIZE_M)\n    num_pid_n = tl.cdiv(N, BLOCK_SIZE_N)\n    num_pid_in_group = GROUP_SIZE_M * num_pid_n\n    group_id = pid // num_pid_in_group\n    first_pid_m = group_id * GROUP_SIZE_M\n    group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)\n    pid_m = first_pid_m + (pid % group_size_m)\n    pid_n = (pid % num_pid_in_group) // group_size_m\n\n    offs_am = (pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M))[:, None]\n    offs_bn = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N))[None, :]\n    offs_k = tl.arange(0, BLOCK_SIZE_K)\n    a_ptrs = A + (offs_am * stride_am + offs_k[None, :] * stride_ak)\n    b_ptrs = B + (offs_bn * stride_bn + offs_k[:, None] * stride_bk)\n\n    accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)\n    for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)):\n        a = tl.load(a_ptrs)\n        b = tl.load(b_ptrs)\n        accumulator += tl.dot(a, b)\n        a_ptrs += BLOCK_SIZE_K * stride_ak\n        b_ptrs += BLOCK_SIZE_K * stride_bk\n    c = accumulator.to(C.dtype.element_ty)\n\n    offs_cm = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)\n    offs_cn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)\n    c_ptrs = C + stride_cm * offs_cm[:, None] + stride_cn * offs_cn[None, :]\n    c_mask = (offs_cm[:, None] < M) & (offs_cn[None, :] < N)\n    tl.store(c_ptrs, c, mask=c_mask)\n\ndef run(A, B):\n    M, K = A.shape\n    N, _ = B.shape\n    C = torch.empty((M, N), device=A.device, dtype=A.dtype)\n    grid = lambda META: (triton.cdiv(M, META['BLOCK_SIZE_M']) * triton.cdiv(N, META['BLOCK_SIZE_N']), )\n    _gemm_kernel[grid](A, B, C, M, N, K, A.stride(0), A.stride(1), B.stride(0), B.stride(1), C.stride(0), C.stride(1))\n    return C"
    }
  ]
}
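
A quick smoke test for this solution, assuming the gemm Definition names its inputs A and B (as the launcher signature above implies) and that main.py has been materialized into the working directory; shapes and tolerances are illustrative.

import torch
from main import run  # the solution file above

A = torch.randn(512, 256, device="cuda", dtype=torch.float16)
B = torch.randn(1024, 256, device="cuda", dtype=torch.float16)  # (N, K), since C = A @ B.T
C = run(A=A, B=B)
# Loose tolerances: the kernel accumulates in fp32 but stores fp16.
torch.testing.assert_close(C, A @ B.T, rtol=1e-2, atol=1e-2)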

Example: Triton Implementation for RMS Norm

{
  "name": "rmsnorm_triton_v1",
  "definition": "rmsnorm",
  "description": "A high-performance RMSNorm implementation using Triton. Generated by one-shot inquiry with Gemini-2.5-Pro.",
  "author": "gemini-2.5-pro-mystery-agent",
  "spec": {
    "language": "triton",
    "target_hardware": [
      "NVIDIA_H100",
      "NVIDIA_A100",
      "NVIDIA_B200"
    ],
    "dependencies": [
      "triton >= 2.3",
      "torch"
    ],
    "entry_point": "main.py::run"
  },
  "sources": [
    {
      "path": "main.py",
      "content": "import torch\nimport triton\nimport triton.language as tl\n\n@triton.jit\ndef _rmsnorm_kernel(x_ptr, weight_ptr, output_ptr,\n                    x_row_stride, output_row_stride,\n                    n_cols, eps, \n                    BLOCK_SIZE: tl.constexpr):\n    # Get the row index for this program instance\n    row_idx = tl.program_id(0)\n\n    # Create pointers to the beginning of the current row\n    row_x_ptr = x_ptr + row_idx * x_row_stride\n    row_output_ptr = output_ptr + row_idx * output_row_stride\n\n    # --- Pass 1: Calculate mean of squares ---\n    var_acc = tl.zeros([BLOCK_SIZE], dtype=tl.float32)\n    for off in range(0, n_cols, BLOCK_SIZE):\n        cols = off + tl.arange(0, BLOCK_SIZE)\n        mask = cols < n_cols\n        # Load input data, converting to float32 for accumulation\n        x = tl.load(row_x_ptr + cols, mask=mask, other=0.0).to(tl.float32)\n        var_acc += x * x\n    \n    # Reduce the block-level accumulators to a single scalar value for the row variance\n    row_var = tl.sum(var_acc, axis=0) / n_cols\n    rstd = tl.rsqrt(row_var + eps)\n\n    # --- Pass 2: Normalize and apply weight ---\n    for off in range(0, n_cols, BLOCK_SIZE):\n        cols = off + tl.arange(0, BLOCK_SIZE)\n        mask = cols < n_cols\n\n        # Load input and weight\n        x = tl.load(row_x_ptr + cols, mask=mask, other=0.0)\n        w = tl.load(weight_ptr + cols, mask=mask, other=0.0)\n\n        # Normalize, apply weight, and store\n        x_normalized = x * rstd\n        output = x_normalized * w\n        tl.store(row_output_ptr + cols, output, mask=mask)\n\ndef run(input: torch.Tensor, weight: torch.Tensor, eps: float):\n    \"\"\"\n    Launcher function for the RMSNorm Triton kernel.\n\n    Args:\n        input (torch.Tensor): The input tensor of shape (batch_size, hidden_size).\n        weight (torch.Tensor): The weight tensor of shape (hidden_size).\n        eps (float): A small value to prevent division by zero.\n\n    Returns:\n        dict: A dictionary containing the output tensor under the key 'output'.\n    \"\"\"\n    # Ensure input tensor is contiguous in the last dimension\n    input = input.contiguous()\n    n_rows, n_cols = input.shape\n\n    # Create the output tensor\n    output = torch.empty_like(input)\n\n    # Pick a block size. 1024 is a good default for typical hidden sizes.\n    BLOCK_SIZE = 1024\n\n    # Define the grid for launching the kernel\n    # One program instance per row\n    grid = (n_rows,)\n\n    # Launch the kernel\n    _rmsnorm_kernel[grid](input, weight, output,\n                         input.stride(0), output.stride(0),\n                         n_cols, eps, \n                         BLOCK_SIZE=BLOCK_SIZE)\n    \n    return output"
    }
  ]
}
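
A similar smoke test for the RMSNorm solution, comparing against a plain PyTorch reference computed in fp32 to mirror the kernel's accumulation; shapes, eps, and tolerances are again illustrative.

import torch
from main import run  # the solution file above

x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)
w = torch.randn(4096, device="cuda", dtype=torch.float16)
out = run(input=x, weight=w, eps=1e-6)
ref = (x.float() * torch.rsqrt(x.float().pow(2).mean(-1, keepdim=True) + 1e-6) * w.float()).to(x.dtype)
torch.testing.assert_close(out, ref, rtol=1e-2, atol=1e-2)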