Benchmark & route LLMs on real Android devices

Evalyard combines a hosted dashboard with a real Android device lab, reporting TTFT, tokens/sec, P50/P95, throttling, temperature, and battery metrics.

Self-service dashboard is coming soon.


Try the demo
TTFT
First-token latency per device & model
Tokens/sec
Sustained throughput under load
Thermals & Battery
Temperature, throttle, drain
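
For readers who want to see how the headline metrics above are commonly derived, here is a minimal sketch in Python. It assumes you have a request timestamp and one arrival timestamp per generated token (both in seconds). The function names and the nearest-rank percentile method are illustrative choices, not a description of how Evalyard itself computes these values.

```python
import math


def ttft_ms(request_sent_s, token_times_s):
    """Time to first token, in milliseconds (assumed definition:
    first token arrival minus request send time)."""
    if not token_times_s:
        raise ValueError("no tokens were produced")
    return (token_times_s[0] - request_sent_s) * 1000.0


def tokens_per_sec(token_times_s):
    """Decode throughput: tokens after the first, divided by the
    time elapsed since the first token arrived."""
    if len(token_times_s) < 2:
        return 0.0
    elapsed = token_times_s[-1] - token_times_s[0]
    return (len(token_times_s) - 1) / elapsed if elapsed > 0 else 0.0


def percentile(values, pct):
    """Nearest-rank percentile (e.g. pct=50 for P50, pct=95 for P95)."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Example: one run's token timestamps, plus TTFT samples from several runs.
run_tokens = [0.25, 0.30, 0.35, 0.40, 0.45]
print(ttft_ms(0.0, run_tokens), tokens_per_sec(run_tokens))
print(percentile([180, 210, 250, 260, 900], 50), percentile([180, 210, 250, 260, 900], 95))
```

Excluding the first token from the throughput figure keeps prefill time out of the tokens/sec number, which is one common convention; other tools divide total tokens by total wall time instead.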
Evalyard dashboard preview
Qualcomm
Samsung
Google
MediaTek
Arm
OpenAI

Data in. Phones optional.

Stream your own metrics into Evalyard, then add real devices when you need them.

Bring your metrics

Latency, quality, power — sent via API.

Logs
Evalyard

Unified bench view

API metrics and device runs in one UI.

Model Device Scenario

Real devices layer

Attach phones for TTFT & on-device perf.

TTFT tokens/sec

Explore & export

Slice, share, export CSVs & snapshots.

CSV PNG Link

Two ways to use Evalyard

Run on real Android devices – or plug in your own metrics via API.

Devices path

Run on real phones

Use our Android lab or your own phones for real-device TTFT & throughput.

1. Install
2. Models
3. Run
4. Metrics

API path

Send metrics via API

Stream latency / quality logs into Evalyard and reuse the same dashboards.

1. API
2. Dashboard
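
This page does not document the API's endpoint, authentication, or schema, so the following is only a sketch of what a single streamed record might look like. Every field name here (`model`, `device`, `scenario`, `ttft_ms`, `tokens_per_sec`) is a hypothetical choice based on the dimensions and metrics mentioned above, and the code only builds the JSON line rather than sending it anywhere.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class MetricRecord:
    # All field names are assumptions for illustration, not Evalyard's schema.
    model: str
    device: str
    scenario: str
    ttft_ms: float
    tokens_per_sec: float


def to_json_line(record):
    """Serialize one record as a single JSON line, ready to be streamed."""
    return json.dumps(asdict(record), sort_keys=True)


record = MetricRecord(
    model="example-model",
    device="example-phone",
    scenario="chat-short-prompt",
    ttft_ms=250.0,
    tokens_per_sec=20.0,
)
print(to_json_line(record))
```

Keeping one record per line makes it straightforward to append to a log file first and forward the same lines to an ingestion API later.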

Get early access

We can provision specific phones, build adapters, and share a read-only dashboard for your team. No spam.

FAQ

Quick answers
Do the plans include devices?
By default, plans are BYOD (bring your own Android phones). Device rental and dedicated racks are available for Fabric and Enterprise on request.
What are device-hours?
Device-hours are the time a phone spends actively running your jobs. If you hit the limit, pause runs or enable pay-as-you-go overage.
How do I access the dashboard now?
The self-service dashboard is not publicly available yet. Please book a private demo. We’ll walk you through the metrics and provide screenshots from the current version.
Can I cancel anytime?
Yes. Billing is monthly and you can cancel anytime, with no long-term lock-in.
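
As a rough illustration of the device-hours answer above, this sketch totals the active run time on a phone and compares it with a plan limit. It assumes each job is recorded as a (start, end) pair in seconds. The function names and the limit value are hypothetical, and actual metering rules may differ.

```python
def device_hours(runs):
    """Sum active run time across (start_s, end_s) pairs, in hours."""
    total_seconds = sum(end_s - start_s for start_s, end_s in runs)
    return total_seconds / 3600.0


def hours_remaining(runs, limit_hours):
    """Hours left before the limit; negative means overage."""
    return limit_hours - device_hours(runs)


# Two jobs: a 30-minute run and a 90-minute run on the same phone.
runs = [(0, 1800), (3600, 9000)]
print(device_hours(runs), hours_remaining(runs, limit_hours=10))
```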

Need fully isolated infrastructure or shipped devices? Ask about Enterprise Fabric.

Vote on what we build next

Tell us what to build →
High-load stress testing for on-device LLMs
Automated output grading / evals
Image-based / multimodal models
Plugins / SDK for game engines
Per-device battery & thermal tracking
Have questions or an unusual setup? Talk to us.