---
title: "Ollama: `ollama run` with Piped Stdin Bypasses Chat Template + SYSTEM Prompt"
domain: troubleshooting
category: ai-inference
tags: [ollama, eval, chat-template, system-prompt, majortwin, gotcha]
status: published
created: 2026-04-25
updated: 2026-04-25
---

# Ollama: `ollama run` with Piped Stdin Bypasses Chat Template + SYSTEM Prompt

When evaluating or smoke-testing an Ollama model, piping a prompt into `ollama run` via stdin skips the model's chat template **and** the SYSTEM prompt baked into the Modelfile. The output looks like raw base-model completion (often Mastodon-shaped or training-data-shaped), so you'll think the model is broken when it isn't.

## The Short Answer

For evals and any test where you want the model's actual chat behavior, **use the HTTP API at `/api/chat`**. Never pipe the prompt into the CLI (`echo "..." | ollama run model`).

```python
import json
import urllib.request

# Build a non-streaming chat request; /api/chat applies the chat template.
body = json.dumps({
    "model": "majortwin-v8",
    "messages": [{"role": "user", "content": "What's your name?"}],
    "stream": False,
}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=body, headers={"Content-Type": "application/json"}, method="POST",
)
r = json.loads(urllib.request.urlopen(req).read())
print(r["message"]["content"])
```

Or with curl piped through jq:

```bash
curl -s http://localhost:11434/api/chat -d '{
  "model": "majortwin-v8",
  "messages": [{"role": "user", "content": "What is your name?"}],
  "stream": false
}' | jq -r .message.content
```

## How to Notice

Symptom: model responses are weirdly raw (Mastodon-style hashtag rants, news headlines, multiple unrelated thoughts strung together) even though the same model behaves normally in Open WebUI or via the chat API. This is the canonical fingerprint of a chat-template-bypassed call.
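
## Why This Happens

`ollama run` is the CLI's interactive REPL. When stdin is a TTY, it reads input as user turns and applies the chat template. When stdin is a **pipe** (`echo "..." | ollama run model`), the CLI treats stdin as a single raw prompt and forwards it to `/api/generate` (the completion endpoint), not `/api/chat`. In the v8 smoke test, that path produced output with neither the chat template nor the SYSTEM prompt applied; the SYSTEM prompt only takes effect when the chat template wraps it. Ollama's API reference describes `/api/generate` as formatting the prompt with the Modelfile template unless `raw` is set to `true`, so the exact mechanism may depend on the Ollama version and on the Modelfile's TEMPLATE; the symptom and the fix are the same either way.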

The two endpoints serve different purposes:
- `/api/generate`: single-prompt completion; good for fill-in-the-blank prompts or non-instruct base models
- `/api/chat`: applies the model's chat template, includes SYSTEM, handles multi-turn message arrays

For an instruct-tuned model (Qwen2.5-Instruct, Llama-3.1-Instruct, etc.), bypassing the chat template means the model never sees the turn framing it was trained to expect (for ChatML-style models such as Qwen, `<|im_start|>system ... <|im_end|>`; Llama 3.1 uses its own header tokens), and its responses regress toward base-model behavior.
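
To make that framing concrete, here is a minimal sketch of what a ChatML-style template produces from a message array. It is an illustration only, not Ollama's actual template engine: `render_chatml` is a hypothetical helper, the system text is a placeholder rather than the real MajorTwin SYSTEM prompt, and real Modelfile templates may add details (tool blocks, default system text) that are omitted here.

```python
def render_chatml(messages):
    """Render a list of {"role", "content"} dicts as a ChatML-style prompt string."""
    parts = []
    for m in messages:
        # Each turn is wrapped in <|im_start|>ROLE ... <|im_end|> markers.
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # Leave an assistant turn open so the model knows it should answer next.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    templated = render_chatml([
        {"role": "system", "content": "You are MajorTwin."},  # placeholder SYSTEM text
        {"role": "user", "content": "What's your name?"},
    ])
    print(templated)
```

A template-bypassed call sends only the bare user text (`What's your name?`) with none of these markers, which is why an instruct model tends to continue it like arbitrary training text instead of answering.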

## When You Actually Want `/api/generate`

Almost never, for instruct models. The legitimate use cases are base models without a chat template, or specific completion-style prompts where you want the model to continue a string verbatim. For evals of a fine-tuned Modelfile, always use `/api/chat`.
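
For the completion-style case, a request body might look like the sketch below. It relies on the `raw` option that Ollama's API reference documents for `/api/generate` (no template formatting is applied to the prompt when it is `true`). `build_generate_body` is a hypothetical helper and `my-base-model` is a placeholder model name, not something from the MajorTwin setup.

```python
import json


def build_generate_body(model, prompt, raw=True):
    """Build a JSON request body for POST /api/generate."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "raw": raw,       # True: send the prompt as-is, with no template formatting
        "stream": False,  # ask for a single JSON response instead of a stream
    }).encode()


if __name__ == "__main__":
    # Placeholder model name; POST this body to http://localhost:11434/api/generate.
    print(build_generate_body("my-base-model", "The capital of France is").decode())
```

Setting `raw` explicitly keeps the intent visible in the eval code, so a completion-style call is never mistaken for a chat call.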

## Reusable Eval Pattern

A minimal stdlib-only harness used for MajorTwin evals lives at `~/MajorTwin/scripts/eval_v8.py`. The key call is the `chat()` helper:

```python
import json
import urllib.request


def chat(host, model, prompt, timeout=180):
    """Send one user turn to /api/chat and return the reply text."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as r:
        return json.loads(r.read())["message"]["content"].strip()
```

This applies the chat template and the SYSTEM prompt baked into the Modelfile, so there is no need to re-specify SYSTEM per call.
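
A loop around such a helper could look like the following sketch. It is not taken from `eval_v8.py`: `run_eval`, the substring check, and the sample case are all assumptions for illustration. The chat function is passed in as a parameter so the loop can be exercised without a running Ollama server.

```python
def run_eval(cases, chat_fn):
    """Run each (prompt, expected_substring) case through chat_fn.

    Returns a list of (prompt, passed, reply) tuples.
    """
    results = []
    for prompt, expected in cases:
        reply = chat_fn(prompt)
        # Case-insensitive substring match: a deliberately simple pass criterion.
        passed = expected.lower() in reply.lower()
        results.append((prompt, passed, reply))
    return results


if __name__ == "__main__":
    # Stand-in for the real helper; in practice something like
    # chat_fn = lambda p: chat("http://localhost:11434", "majortwin-v8", p)
    fake_chat = lambda p: "My name is MajorTwin."
    for prompt, passed, reply in run_eval([("What's your name?", "majortwin")], fake_chat):
        print("PASS" if passed else "FAIL", prompt)
```

Injecting `chat_fn` also makes it easy to point the same cases at a different host or model when comparing versions.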

## Related

- [[ollama-macos-sleep-tailscale-disconnect]]: a different Ollama gotcha (sleep + Tailscale)
- [[20-Projects/MajorTwin/majortwin-v8-eval-report|MajorTwin v8 eval report]]: caught this issue during the initial smoke test on 2026-04-25