| title | domain | category | tags | status | created | updated |
|---|---|---|---|---|---|---|
| Ollama: `ollama run` with Piped Stdin Bypasses Chat Template + SYSTEM Prompt | troubleshooting | ai-inference | | published | 2026-04-25 | 2026-04-25 |
Ollama: ollama run with Piped Stdin Bypasses Chat Template + SYSTEM Prompt
When evaluating or smoke-testing an Ollama model, piping a prompt via stdin to `ollama run` skips the model's chat template and the SYSTEM prompt baked into the Modelfile. The output looks like raw base-model completion (often Mastodon-shaped or otherwise training-data-shaped), which makes a working model look broken.
The Short Answer
For evals and any test where you want the model's actual chat behavior, use the HTTP API at `/api/chat`. Never pipe the prompt into the CLI with `echo "..." | ollama run model`.
import json, urllib.request
body = json.dumps({
    "model": "majortwin-v8",
    "messages": [{"role": "user", "content": "What's your name?"}],
    "stream": False,
}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=body, headers={"Content-Type": "application/json"}, method="POST",
)
r = json.loads(urllib.request.urlopen(req).read())
print(r["message"]["content"])
Or with curl piped through jq:
curl -s http://localhost:11434/api/chat -d '{
"model": "majortwin-v8",
"messages": [{"role": "user", "content": "What is your name?"}],
"stream": false
}' | jq -r .message.content
How to Notice
Symptom: model responses are weirdly raw (Mastodon-style hashtag rants, news headlines, multiple unrelated thoughts strung together), even though the same model behaves normally in Open WebUI or via the chat API. This is the telltale fingerprint of a call that bypassed the chat template.
Why This Happens
`ollama run` is the CLI's interactive REPL. When stdin is a TTY, it reads input as user turns and applies the chat template. When stdin is a pipe (`echo "..." | ollama run model`), the CLI, as observed during the v8 smoke test, treats stdin as raw text and forwards it to `/api/generate` (the completion endpoint) instead of `/api/chat`. On that path the chat template is not applied, and because the SYSTEM prompt only takes effect when the chat template wraps it, the SYSTEM prompt is lost as well.
The two endpoints serve different purposes:
- `/api/generate`: raw completion, good for fill-in-the-blank or non-instruct base models
- `/api/chat`: applies the model's chat template, includes SYSTEM, handles multi-turn message arrays
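The difference is visible in the request and response shapes. The sketch below builds a non-streaming payload for each endpoint and pulls the reply text out of each response; the helper names (`generate_payload`, `chat_payload`, `reply_text`) are made up for illustration and are not part of Ollama or the MajorTwin scripts.

```python
def generate_payload(model, prompt):
    # /api/generate takes a single "prompt" string.
    return {"model": model, "prompt": prompt, "stream": False}


def chat_payload(model, prompt, history=()):
    # /api/chat takes a "messages" array of role/content turns,
    # so earlier turns can be passed along as history.
    messages = [*history, {"role": "user", "content": prompt}]
    return {"model": model, "messages": messages, "stream": False}


def reply_text(endpoint, response):
    # Non-streaming responses carry the text in different fields.
    if endpoint == "/api/chat":
        return response["message"]["content"]
    if endpoint == "/api/generate":
        return response["response"]
    raise ValueError(f"unknown endpoint: {endpoint}")
```

Keeping the payload builders separate from the HTTP call makes it easy to log exactly what was sent when an eval result looks suspicious.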
For an instruct-tuned model (Qwen2.5-Instruct, Llama-3.1-Instruct, etc.), bypassing the chat template means the model never sees the role framing it was trained to expect (for example, the ChatML-style `<|im_start|>system ... <|im_end|>` markers used by Qwen2.5-Instruct; Llama-3.1-Instruct uses its own header tokens), and its responses regress toward base-model behavior.
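To make the framing concrete, here is a minimal sketch of what a ChatML-style template does to a message list. It is illustrative only: `render_chatml` is a hypothetical helper, not Ollama's template engine, the system text is a placeholder, and the real template for any given model is whatever its Modelfile defines.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render role/content dicts as a ChatML-style prompt string."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model answers instead of continuing the user text.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},  # placeholder SYSTEM text
    {"role": "user", "content": "What's your name?"},
]
templated = render_chatml(messages)  # what a templated call would feed the model
raw = "What's your name?"            # what a template-bypassed call feeds the model
print(templated)
```

The `raw` string has no role markers and no system turn, so the model has nothing telling it that a question was asked or who it is supposed to be.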
When You Actually Want /api/generate
Almost never, for instruct models. The legitimate use cases are base models without a chat template, or completion-style prompts where you want the model to continue a string verbatim. For evals of a fine-tuned Modelfile, always use `/api/chat`.
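For the completion-style case, a request could look roughly like the sketch below. It assumes the `raw` option of `/api/generate` (described in the Ollama API documentation as sending the prompt with no formatting applied); the function name `build_completion_request` and the model name `some-base-model` are placeholders, not names from the MajorTwin setup.

```python
import json
import urllib.request


def build_completion_request(host, model, prefix):
    """Build a POST to /api/generate asking the model to continue `prefix` as-is."""
    body = json.dumps({
        "model": model,
        "prompt": prefix,
        "raw": True,      # assumption: send the prompt with no template applied
        "stream": False,
    }).encode()
    return urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request(
    "http://localhost:11434", "some-base-model", "Once upon a time"
)
# Sending it requires a running Ollama server:
# with urllib.request.urlopen(req, timeout=180) as r:
#     print(json.loads(r.read())["response"])
```

Note that the reply text comes back under `response` here, not under `message.content` as it does for `/api/chat`.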
Reusable Eval Pattern
A minimal stdlib-only eval harness used for MajorTwin evals lives at `~/MajorTwin/scripts/eval_v8.py`. The key call is the `chat()` helper:
import json
import urllib.request

def chat(host, model, prompt, timeout=180):
    # Send one user turn to /api/chat and return the assistant's reply text.
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as r:
        return json.loads(r.read())["message"]["content"].strip()
This applies the chat template and the SYSTEM prompt baked into the Modelfile. No need to re-specify SYSTEM per-call.
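A helper like this can be driven by a small loop. The following is a sketch of one way to do that, not the contents of `eval_v8.py`: `run_evals`, the substring check, and the sample case are all assumptions made for illustration. The reply function is passed in as `ask`, so the loop can be exercised offline with a stand-in before pointing it at a live server.

```python
def run_evals(cases, ask):
    """Run (prompt, expected_substring) cases through `ask` and record pass/fail."""
    results = []
    for prompt, expected in cases:
        reply = ask(prompt)
        results.append({
            "prompt": prompt,
            "reply": reply,
            # Case-insensitive substring match; a deliberately crude check.
            "passed": expected.lower() in reply.lower(),
        })
    return results


def fake_ask(prompt):
    # Offline stand-in for a real call such as:
    #   chat("http://localhost:11434", "majortwin-v8", prompt)
    return "My name is MajorTwin."


results = run_evals([("What's your name?", "majortwin")], fake_ask)
passed = sum(r["passed"] for r in results)
print(f"{passed}/{len(results)} passed")
```

Injecting `ask` also makes it straightforward to run the same cases against two models or two hosts and compare the results.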
Related
- ollama-macos-sleep-tailscale-disconnect — different Ollama gotcha (sleep + Tailscale)
- 20-Projects/MajorTwin/majortwin-v8-eval-report — caught this issue during initial smoke test on 2026-04-25