majorwiki/05-troubleshooting/ollama-chat-template-pipe-stdin-bypass.md
Marcus Summers 0996861512 wiki: add troubleshooting articles from MajorTwin v8 cycle
Two articles surfaced during the v8 deploy + eval on 2026-04-25:

- Ollama: `ollama run` with piped stdin bypasses the chat template and
  SYSTEM prompt — output looks like raw base-model completion. Caught
  during initial v8 smoke test. Fix: use /api/chat HTTP endpoint.

- rsync over Tailscale can hang in TCP teardown after the data has
  fully transferred. Verify with md5sum, then kill the hung pipeline.
  Includes a watcher-threshold gotcha (set below true file size, not
  above) and prevention tips.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 12:57:39 -04:00


---
title: "Ollama: `ollama run` with Piped Stdin Bypasses Chat Template + SYSTEM Prompt"
domain: troubleshooting
category: ai-inference
tags: [ollama, eval, chat-template, system-prompt, majortwin, gotcha]
status: published
created: 2026-04-25
updated: 2026-04-25
---
# Ollama: `ollama run` with Piped Stdin Bypasses Chat Template + SYSTEM Prompt
When evaluating or smoke-testing an Ollama model, piping a prompt via stdin to `ollama run` skips the model's chat template **and** the SYSTEM prompt baked into the Modelfile. Output looks like raw base-model completion (often Mastodon-shaped or training-data-shaped), and you'll think the model is broken when it isn't.
## The Short Answer
For evals and any test where you want the model's actual chat behavior, **use the HTTP API at `/api/chat`**, never a piped invocation such as `echo "..." | ollama run model`.
```python
import json
import urllib.request

body = json.dumps({
    "model": "majortwin-v8",
    "messages": [{"role": "user", "content": "What's your name?"}],
    "stream": False,
}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=body, headers={"Content-Type": "application/json"}, method="POST",
)
r = json.loads(urllib.request.urlopen(req).read())
print(r["message"]["content"])
```
Or with curl piped through jq:
```bash
curl -s http://localhost:11434/api/chat -d '{
  "model": "majortwin-v8",
  "messages": [{"role": "user", "content": "What is your name?"}],
  "stream": false
}' | jq -r .message.content
```
## How to Notice
Symptom: model responses are weirdly raw — Mastodon-style hashtag rants, news headlines, multiple unrelated thoughts strung together — even though the same model behaves normally in Open WebUI or via the chat API. This is the canonical fingerprint of a chat-template-bypassed call.
## Why This Happens
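`ollama run` is the CLI's interactive REPL. When stdin is a TTY, it reads input as user turns and applies the chat template. When stdin is a **pipe** (`echo "..." | ollama run model`), the CLI treats stdin as a single block of raw text and forwards it to `/api/generate` (the completion endpoint), not `/api/chat`. In the v8 smoke test, output from that path showed no sign of the chat template or of the SYSTEM prompt, which only takes effect when the chat template wraps it. Ollama's API documentation describes `/api/generate` as applying the Modelfile template unless `raw` is set to true, so the exact behavior may vary with the Ollama version and the Modelfile; `/api/chat` is the path that gave correctly templated output here.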
The two endpoints serve different purposes:
- `/api/generate` — raw completion, good for fill-in-the-blank or non-instruct base models
- `/api/chat` — applies the model's chat template, includes SYSTEM, handles multi-turn message arrays
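To make the difference concrete, the sketch below builds the request body for each endpoint without sending anything. The field names (`prompt`, `messages`, `stream`) follow Ollama's public API; the helper names `generate_body` and `chat_body` are hypothetical and exist only for illustration.

```python
def generate_body(model, prompt):
    """Body for POST /api/generate: a single prompt string, no message list."""
    return {"model": model, "prompt": prompt, "stream": False}


def chat_body(model, user_text):
    """Body for POST /api/chat: a list of role-tagged messages."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
    }
```

The structural difference is the point: `/api/generate` receives one opaque string, while `/api/chat` receives turns with roles that a chat template can wrap.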
For an instruct-tuned model (Qwen2.5-Instruct, Llama-3.1-Instruct, etc.), bypassing the chat template means the model never sees the `<|im_start|>system ... <|im_end|>` framing it was trained to expect, and its responses regress toward base-model behavior.
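As an illustration of that framing, the sketch below renders a message list in the ChatML style used by Qwen2.5-Instruct. It is a simplified stand-in, not Ollama's template engine: the real template comes from the Modelfile's `TEMPLATE` directive, other model families (Llama 3.1, for example) use different special tokens, `render_chatml` is a hypothetical name, and the SYSTEM text is a placeholder.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Wrap role-tagged messages in ChatML markers (simplified illustration)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are MajorTwin."},  # placeholder SYSTEM text
    {"role": "user", "content": "What's your name?"},
]
templated = render_chatml(messages)   # what a templated call would feed the model
bypassed = messages[-1]["content"]    # what the model sees when the template is skipped
```

Comparing `templated` with `bypassed` shows why a bypassed call looks like plain text continuation: the bare string carries no role markers and no system turn.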
## When You Actually Want `/api/generate`
Almost never, for instruct models. The legitimate use case is base models without a chat template, or specific completion-style prompts where you want the model to continue a string verbatim. For evals of a fine-tuned Modelfile, always use `/api/chat`.
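For that completion-style case, a minimal sketch of a helper that calls `/api/generate` follows. It sets `raw` to true so the request states explicitly that no template should be applied, and it reads the text from the `response` field that this endpoint returns. The helper names `build_generate_request` and `complete` are assumptions made for illustration.

```python
import json
import urllib.request


def build_generate_request(host, model, prompt):
    """Build (but do not send) a raw-completion request for /api/generate."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "raw": True,      # send the prompt as-is, with no template applied
        "stream": False,
    }).encode()
    return urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(host, model, prompt, timeout=180):
    """Send the request and return the continuation text."""
    req = build_generate_request(host, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as r:
        return json.loads(r.read())["response"]
```

A call might look like `complete("http://localhost:11434", "some-base-model", "The capital of France is")`, where `some-base-model` is a placeholder for whatever base model is installed locally.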
## Reusable Eval Pattern
A minimal stdlib-only eval harness used for MajorTwin evals lives at `~/MajorTwin/scripts/eval_v8.py`. The key call is the `chat()` helper:
```python
import json
import urllib.request


def chat(host, model, prompt, timeout=180):
    """Send one user turn to /api/chat and return the reply text."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return a single JSON object instead of a stream
    }).encode()
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as r:
        return json.loads(r.read())["message"]["content"].strip()
```
This applies the chat template and the SYSTEM prompt baked into the Modelfile. No need to re-specify SYSTEM per-call.
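One way to build on this helper is a small loop that runs a list of prompts and records each reply. The sketch below is an assumption about how such a loop might look, not the contents of `eval_v8.py`; `run_eval` is a hypothetical name, and the chat function is passed in as a parameter so the loop can be exercised without a running server.

```python
def run_eval(chat_fn, host, model, prompts):
    """Run each prompt through chat_fn and collect prompt/response records."""
    results = []
    for prompt in prompts:
        try:
            reply = chat_fn(host, model, prompt)
            results.append({"prompt": prompt, "response": reply, "error": None})
        except OSError as exc:
            # urllib.error.URLError and socket timeouts are OSError subclasses,
            # so one failed request is recorded without aborting the run.
            results.append({"prompt": prompt, "response": None, "error": str(exc)})
    return results
```

With the `chat()` helper defined, a run might look like `run_eval(chat, "http://localhost:11434", "majortwin-v8", ["What's your name?"])`.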
## Related
- [[ollama-macos-sleep-tailscale-disconnect]] — different Ollama gotcha (sleep + Tailscale)
- [[20-Projects/MajorTwin/majortwin-v8-eval-report|MajorTwin v8 eval report]] — caught this issue during initial smoke test on 2026-04-25