Qwen/Qwen3.5-4B

Primitive: /generate · Generate · Qwen3 MoE

Multimodal · Long context · Tool calling · Constrained output · Streaming

Overview

Hardware drives latency, throughput, and cost.

Cost is approximate: it is computed from list GPU prices, and your actual price depends on the provider you deploy SIE with.

Capabilities	Tool calling · Constrained output (JSON Schema, Regex) · Streaming
Context length	8,192
Max output tokens	4,096
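
The two limits above interact when sizing a request. The sketch below assumes that prompt tokens and generated tokens share the same 8,192-token context window, which is common for decoder models but is not stated on this page. The function name `output_budget` is hypothetical and is not part of any SIE API.

```python
CONTEXT_LENGTH = 8192      # "Context length" from the table above
MAX_OUTPUT_TOKENS = 4096   # "Max output tokens" from the table above


def output_budget(prompt_tokens: int) -> int:
    """Return how many tokens can still be generated for a prompt of this size.

    Assumption: prompt and output tokens share one context window.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    remaining = CONTEXT_LENGTH - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt does not fit in the context window")
    # The output can never exceed the model's own output cap.
    return min(MAX_OUTPUT_TOKENS, remaining)
```

Under that assumption, a prompt of up to 4,096 tokens leaves the full 4,096-token output budget, while a 6,000-token prompt leaves room for 2,192 output tokens.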

Quality and performance by task (performance measured on RTX-PRO-6000 b1 c4)

Task	Accuracy	Throughput (tok/s)	p50 latency (ms)
legal generation en	0.5867	234	788.1
scientific generation en	0.4495	343	863.8
medical generation en	0.6700	364	735.2
general generation en	0.5767	390	587.2
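
The cost note in the Overview says that cost is derived from list GPU prices. One plausible way to do that arithmetic is to divide an hourly GPU price by the tokens generated per hour. The sketch below is an illustration only: the hourly price is a made-up placeholder, the function name is hypothetical, and it assumes the throughput figures above are aggregate tokens per second for one fully utilized GPU. The page does not say how SIE actually computes cost.

```python
def cost_per_million_tokens(gpu_price_per_hour: float, throughput_tok_s: float) -> float:
    """Estimate USD per one million generated tokens on a single, fully busy GPU."""
    if throughput_tok_s <= 0:
        raise ValueError("throughput_tok_s must be positive")
    tokens_per_hour = throughput_tok_s * 3600
    return gpu_price_per_hour / tokens_per_hour * 1_000_000


# Placeholder price in USD per hour; NOT a real quote for any provider.
HYPOTHETICAL_GPU_PRICE = 2.00

# Throughput values (tok/s) taken from the benchmark results above.
THROUGHPUT = {
    "legal": 234,
    "scientific": 343,
    "medical": 364,
    "general": 390,
}

for task, tok_s in THROUGHPUT.items():
    cost = cost_per_million_tokens(HYPOTHETICAL_GPU_PRICE, tok_s)
    print(f"{task}: ${cost:.2f} per 1M tokens")
```

Because cost scales inversely with throughput in this model, the legal workload (234 tok/s) comes out more expensive per token than the general workload (390 tok/s) at any fixed hourly price.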