majorwiki/05-troubleshooting/ollama-chat-template-pipe-stdin-bypass.md
majorlinux 91455fac39 Add 7 articles; update nav and existing articles (2026-04-25)
2026-04-25 17:52:48 +00:00


---
title: "Ollama: `ollama run` with Piped Stdin Bypasses Chat Template + SYSTEM Prompt"
domain: troubleshooting
category: ai-inference
tags: [ollama, eval, chat-template, system-prompt, majortwin, gotcha]
status: published
created: 2026-04-25
updated: 2026-04-25
---
# Ollama: `ollama run` with Piped Stdin Bypasses Chat Template + SYSTEM Prompt
When eval'ing or smoke-testing an Ollama model, piping a prompt via stdin to `ollama run` skips the model's chat template **and** the SYSTEM prompt baked into the Modelfile. Output looks like raw base-model completion (often Mastodon-shaped or training-data-shaped), and you'll think the model is broken when it isn't.
## The Short Answer
For evals and any test where you want the model's actual chat behavior, **use the HTTP API at `/api/chat`**, not a prompt piped into the CLI (`echo "..." | ollama run model`).
```python
import json
import urllib.request

body = json.dumps({
    "model": "majortwin-v8",
    "messages": [{"role": "user", "content": "What's your name?"}],
    "stream": False,
}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=body, headers={"Content-Type": "application/json"}, method="POST",
)
r = json.loads(urllib.request.urlopen(req).read())
print(r["message"]["content"])
```
Or with curl piped through jq:
```bash
curl -s http://localhost:11434/api/chat -d '{
  "model": "majortwin-v8",
  "messages": [{"role": "user", "content": "What is your name?"}],
  "stream": false
}' | jq -r .message.content
```
## How to Notice
Symptom: model responses are weirdly raw — Mastodon-style hashtag rants, news headlines, multiple unrelated thoughts strung together — even though the same model behaves normally in Open WebUI or via the chat API. This is the canonical fingerprint of a chat-template-bypassed call.
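Before blaming the call path, it can help to confirm that the server actually holds a chat template and SYSTEM prompt for the model. The sketch below queries Ollama's `/api/show` endpoint and reports what it finds. The helper names `fetch_show` and `summarize_show` are hypothetical, and the response fields read here (`template`, `system`, `modelfile`) are assumptions about the response shape that may vary by Ollama version, so every lookup is defensive.

```python
import json
import urllib.request


def summarize_show(show_response):
    """Report whether a /api/show response carries a template and a SYSTEM prompt."""
    template = show_response.get("template") or ""
    system = show_response.get("system") or ""
    modelfile = show_response.get("modelfile") or ""
    # Fall back to scanning the Modelfile text for a SYSTEM instruction.
    system_in_modelfile = any(
        line.lstrip().upper().startswith("SYSTEM ")
        for line in modelfile.splitlines()
    )
    return {
        "has_template": bool(template.strip()),
        "has_system": bool(system.strip()) or system_in_modelfile,
    }


def fetch_show(model, host="http://localhost:11434", timeout=30):
    """POST to /api/show and return the decoded JSON (assumes a local Ollama)."""
    req = urllib.request.Request(
        f"{host}/api/show",
        data=json.dumps({"model": model}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as r:
        return json.loads(r.read())
```

Calling `summarize_show(fetch_show("majortwin-v8"))` against a running server should show both flags as true for a fine-tuned Modelfile. If it does, the model definition is intact and the raw output is more likely coming from how the prompt was sent.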
## Why This Happens
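`ollama run` is the CLI's interactive REPL. When stdin is a TTY, it reads input as user turns and applies the chat template. When stdin is a **pipe** (`echo "..." | ollama run model`), the CLI treats stdin as a single raw prompt and forwards it to `/api/generate` (the completion endpoint), not `/api/chat`. In the behavior observed here, that path did **not** apply the chat template, and the SYSTEM prompt only takes effect when the chat template wraps it. Ollama's API documentation also describes `raw`, `template`, and `system` fields on `/api/generate`, so how much formatting that endpoint applies can depend on the Ollama version and the request options. `/api/chat` is the path that matched the model's behavior in Open WebUI.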
The two endpoints serve different purposes:
- `/api/generate` — raw completion, good for fill-in-the-blank or non-instruct base models
- `/api/chat` — applies the model's chat template, includes SYSTEM, handles multi-turn message arrays
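To check how the two endpoints differ on a particular Ollama version, one option is to send the same prompt down both and compare the replies side by side. This is a minimal sketch that assumes a local server at the default address. The helper names (`build_payloads`, `post_json`, `compare`) are hypothetical and not part of any existing script.

```python
import json
import urllib.request

HOST = "http://localhost:11434"  # assumption: default local Ollama address


def build_payloads(model, prompt):
    """Return (chat_payload, generate_payload) carrying the same prompt."""
    chat_payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    generate_payload = {"model": model, "prompt": prompt, "stream": False}
    return chat_payload, generate_payload


def post_json(url, payload, timeout=180):
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as r:
        return json.loads(r.read())


def compare(model, prompt, host=HOST):
    """Send one prompt to both endpoints and return both replies."""
    chat_payload, generate_payload = build_payloads(model, prompt)
    chat_reply = post_json(f"{host}/api/chat", chat_payload)["message"]["content"]
    gen_reply = post_json(f"{host}/api/generate", generate_payload)["response"]
    return {"chat": chat_reply.strip(), "generate": gen_reply.strip()}
```

Running `compare("majortwin-v8", "What's your name?")` and reading the two strings next to each other shows whether the completion path is producing the raw, training-data-shaped output described above.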
For an instruct-tuned model (Qwen2.5-Instruct, Llama-3.1-Instruct, etc.), bypassing the chat template means the model never sees the `<|im_start|>system ... <|im_end|>` framing it was trained to expect, and its responses regress toward base-model behavior.
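To make the framing concrete, here is a rough sketch of what a ChatML-style template (the `<|im_start|>` / `<|im_end|>` format) does to a message list before the model sees it. It is an illustration only: the real template is defined in the model's Modelfile, other model families such as Llama 3.1 use different markers, and `render_chatml` and the placeholder system prompt are hypothetical.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render role/content messages into ChatML-style text (illustrative only)."""
    parts = []
    for m in messages:
        # Each turn is wrapped in start/end markers with the role on the first line.
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},  # placeholder
    {"role": "user", "content": "What's your name?"},
]
framed = render_chatml(messages)
unframed = messages[-1]["content"]  # what a template-bypassed call would send
```

The difference between `framed` and `unframed` is the whole problem: without the markers and the system turn, the model is simply asked to continue the bare string.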
## When You Actually Want `/api/generate`
Almost never, for instruct models. The legitimate use case is base models without a chat template, or specific completion-style prompts where you want the model to continue a string verbatim. For evals of a fine-tuned Modelfile, always use `/api/chat`.
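For the completion-style case, a request along these lines makes the intent explicit. It is a sketch under assumptions: `build_completion_payload` is a hypothetical helper, the model name in the usage note is a placeholder, and `raw` and `num_predict` are used as described in Ollama's API documentation for `/api/generate`.

```python
def build_completion_payload(model, text, num_predict=64):
    """Build a /api/generate body that asks for a verbatim continuation of `text`."""
    return {
        "model": model,
        "prompt": text,
        "raw": True,        # documented flag: send the prompt with no formatting applied
        "stream": False,    # one JSON object back instead of a stream of chunks
        "options": {"num_predict": num_predict},  # cap the continuation length
    }
```

A payload such as `build_completion_payload("some-base-model", "The capital of France is")` would be POSTed to `/api/generate`, and the continuation read from the `response` field of the reply.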
## Reusable Eval Pattern
A minimal stdlib-only eval harness used for MajorTwin evals lives at `~/MajorTwin/scripts/eval_v8.py`. The key call is the `chat()` helper:
```python
import json
import urllib.request


def chat(host, model, prompt, timeout=180):
    """Send one user turn to /api/chat and return the reply text."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON object back instead of a stream of chunks
    }).encode()
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as r:
        return json.loads(r.read())["message"]["content"].strip()
```
This applies the chat template and the SYSTEM prompt baked into the Modelfile. No need to re-specify SYSTEM per-call.
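A helper like `chat()` can be wrapped in a small check loop for smoke tests. The sketch below is not the contents of `eval_v8.py`. `run_checks`, the case format, and the substring match are all assumptions chosen for illustration. The chat function is passed in so the loop can be exercised without a running server.

```python
def run_checks(chat_fn, cases):
    """Run (prompt, expected_substring) cases through chat_fn and record pass/fail."""
    results = []
    for prompt, expected in cases:
        reply = chat_fn(prompt)
        results.append({
            "prompt": prompt,
            "reply": reply,
            # Case-insensitive substring match: crude, but enough for a smoke test.
            "passed": expected.lower() in reply.lower(),
        })
    return results


def pass_rate(results):
    """Fraction of cases that passed (0.0 when there are no cases)."""
    return sum(r["passed"] for r in results) / len(results) if results else 0.0
```

With the helper above in scope, a call such as `run_checks(lambda p: chat("http://localhost:11434", "majortwin-v8", p), cases)` sends every case through `/api/chat`, so the results reflect the templated behavior rather than raw completion.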
## Related
- [[ollama-macos-sleep-tailscale-disconnect]] — different Ollama gotcha (sleep + Tailscale)
- [[20-Projects/MajorTwin/majortwin-v8-eval-report|MajorTwin v8 eval report]] — caught this issue during initial smoke test on 2026-04-25