Qwen/Qwen3.5-4B

Primitive: /generate · Generate · Qwen3 MoE

Multimodal · Long context · Tool calling · Constrained output · Streaming

Overview

Hardware: none selected (hardware drives latency, throughput, and cost)

Size: 4.0B params
Tasks: /generate
License: apache-2.0
Latency: 762 ms
Throughput: 353 tok/s
Cost: $2.38 /1M tok

Cost is approximate: it is computed from list GPU prices, so your actual price depends on the provider you deploy SIE with.
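The page does not state the formula behind the cost figure. A minimal sketch, assuming cost is simply the hourly GPU price divided by the tokens generated in an hour at the listed throughput; the function name and the $3.00/hour price are hypothetical, not values from this page:

```python
def cost_per_million_tokens(gpu_price_per_hour: float, throughput_tok_s: float) -> float:
    """Estimate USD per 1M generated tokens from an hourly GPU price.

    Assumption: the GPU is fully utilized at the given throughput and
    nothing else (storage, networking, idle time) is billed.
    """
    if throughput_tok_s <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = throughput_tok_s * 3600
    return gpu_price_per_hour / tokens_per_hour * 1_000_000


# Hypothetical list price of $3.00/hour with the 353 tok/s listed above.
estimate = cost_per_million_tokens(3.00, 353)
print(f"${estimate:.2f} /1M tok")
```

Swapping in your provider's hourly rate gives a rough idea of how far the real price may sit from the listed one.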

Generation

Capabilities: Tool calling · Constrained output (JSON Schema, Regex) · Streaming
Context length: 8,192
Max output tokens: 4,096
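The two limits interact when sizing a request. A minimal sketch, assuming the 8,192-token context is shared between the prompt and the generated output (the page does not confirm this); `output_budget` is a hypothetical helper, not part of any documented API:

```python
CONTEXT_LENGTH = 8192      # from the Generation section
MAX_OUTPUT_TOKENS = 4096   # from the Generation section


def output_budget(prompt_tokens: int, requested: int = MAX_OUTPUT_TOKENS) -> int:
    """Return how many output tokens can be requested for a given prompt size.

    Assumption: prompt and output tokens count against the same context window.
    """
    if prompt_tokens < 0 or requested < 0:
        raise ValueError("token counts must be non-negative")
    remaining = CONTEXT_LENGTH - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt already fills the context window")
    # The tightest of: what was asked for, the model cap, and the space left.
    return min(requested, MAX_OUTPUT_TOKENS, remaining)


print(output_budget(1000))   # short prompt: limited by the output cap
print(output_budget(6000))   # long prompt: limited by the remaining context
```

Under that assumption, any prompt longer than 4,096 tokens reduces the output that can be generated below the stated maximum.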

Benchmarks

CaseHOLD

legal generation en

Quality
accuracy 0.5867
Performance RTX-PRO-6000 b1 c4
Throughput 234 tok/s
p50 latency 788.1 ms

GPQA Diamond

scientific generation en

Quality
accuracy 0.4495
Performance RTX-PRO-6000 b1 c4
Throughput 343 tok/s
p50 latency 863.8 ms

MedQA

medical generation en

Quality
accuracy 0.6700
Performance RTX-PRO-6000 b1 c4
Throughput 364 tok/s
p50 latency 735.2 ms

MMLU-Pro

general generation en

Quality
accuracy 0.5767
Performance RTX-PRO-6000 b1 c4
Throughput 390 tok/s
p50 latency 587.2 ms
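The page does not say how the Overview latency and throughput are derived from these runs. A short sketch that takes the median across the four benchmarks; the results (353.5 tok/s and about 761.65 ms) sit close to the Overview figures of 353 tok/s and 762 ms, which suggests, but does not confirm, that the headline numbers are medians:

```python
from statistics import median

# Per-benchmark figures copied from the Benchmarks section (RTX-PRO-6000 b1 c4).
throughput_tok_s = {
    "CaseHOLD": 234,
    "GPQA Diamond": 343,
    "MedQA": 364,
    "MMLU-Pro": 390,
}
p50_latency_ms = {
    "CaseHOLD": 788.1,
    "GPQA Diamond": 863.8,
    "MedQA": 735.2,
    "MMLU-Pro": 587.2,
}

# With four values, the median is the mean of the two middle ones.
median_throughput = median(throughput_tok_s.values())
median_latency = median(p50_latency_ms.values())

print(f"median throughput: {median_throughput:.1f} tok/s")
print(f"median p50 latency: {median_latency:.2f} ms")
```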
