title: Ollama: `ollama run` with Piped Stdin Bypasses Chat Template + SYSTEM Prompt
domain: troubleshooting
category: ai-inference
tags: ollama, eval, chat-template, system-prompt, majortwin, gotcha
status: published
created: 2026-04-25
updated: 2026-04-25

Ollama: ollama run with Piped Stdin Bypasses Chat Template + SYSTEM Prompt

When evaluating or smoke-testing an Ollama model, piping a prompt to ollama run via stdin skips the model's chat template and the SYSTEM prompt baked into the Modelfile. The output looks like raw base-model completion (often Mastodon-shaped or otherwise training-data-shaped), so the model appears broken when it isn't.

The Short Answer

For evals and any test where you want the model's actual chat behavior, use the HTTP API at /api/chat. Never use a piped invocation such as echo "..." | ollama run model.

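import json
import urllib.request

# POST one user turn to /api/chat so the chat template and SYSTEM prompt apply.
body = json.dumps({
    "model": "majortwin-v8",
    "messages": [{"role": "user", "content": "What's your name?"}],
    "stream": False,  # return one JSON object instead of a stream of chunks
}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=body, headers={"Content-Type": "application/json"}, method="POST",
)
# Close the HTTP response deterministically once the body has been read.
with urllib.request.urlopen(req) as resp:
    r = json.loads(resp.read())
print(r["message"]["content"])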

Or with curl piped through jq:

curl -s http://localhost:11434/api/chat -d '{
  "model": "majortwin-v8",
  "messages": [{"role": "user", "content": "What is your name?"}],
  "stream": false
}' | jq -r .message.content

How to Notice

Symptom: model responses are weirdly raw — Mastodon-style hashtag rants, news headlines, multiple unrelated thoughts strung together — even though the same model behaves normally in Open WebUI or via the chat API. This is the canonical fingerprint of a chat-template-bypassed call.

Why This Happens

ollama run is the CLI's interactive REPL. When stdin is a TTY, it reads input as user turns and applies the chat template. When stdin is a pipe (echo "..." | ollama run model), the CLI treats stdin as raw text and sends it to /api/generate (the completion endpoint) instead of /api/chat. In the behavior observed here, that request reaches the model without the chat template applied, and the SYSTEM prompt only takes effect when the chat template wraps it.

The two endpoints serve different purposes:

  • /api/generate — raw completion, good for fill-in-the-blank or non-instruct base models
  • /api/chat — applies the model's chat template, includes SYSTEM, handles multi-turn message arrays
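To check the difference between the two endpoints on your own install, the same prompt can be sent to both and the replies compared by eye. The sketch below is illustrative only: the helper names (build_request, generate_payload, chat_payload, compare) are hypothetical, and the host is assumed to be the default local address used elsewhere in this article.

```python
import json
import urllib.request

HOST = "http://localhost:11434"  # assumption: default local Ollama address


def build_request(endpoint, payload, host=HOST):
    """Build a JSON POST request for an Ollama endpoint such as /api/chat."""
    return urllib.request.Request(
        f"{host}{endpoint}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_payload(model, prompt):
    # /api/generate takes a single prompt string; the reply text is in "response".
    return {"model": model, "prompt": prompt, "stream": False}


def chat_payload(model, prompt):
    # /api/chat takes role-tagged messages; the reply text is in "message.content".
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def compare(model, prompt, host=HOST, timeout=180):
    """Return (generate_text, chat_text) for the same prompt."""
    gen_req = build_request("/api/generate", generate_payload(model, prompt), host)
    chat_req = build_request("/api/chat", chat_payload(model, prompt), host)
    with urllib.request.urlopen(gen_req, timeout=timeout) as r:
        gen_text = json.loads(r.read())["response"]
    with urllib.request.urlopen(chat_req, timeout=timeout) as r:
        chat_text = json.loads(r.read())["message"]["content"]
    return gen_text, chat_text
```

Calling compare("majortwin-v8", "What is your name?") against a running server returns both replies as a tuple, which makes it easy to see whether the two paths diverge for a given model.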

For an instruct-tuned model (Qwen2.5-Instruct, Llama-3.1-Instruct, etc.), bypassing the chat template means the model never sees the role framing it was trained to expect (for Qwen2.5's ChatML, <|im_start|>system ... <|im_end|>; Llama 3.1 uses its own header tokens), and its responses regress toward base-model behavior.
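To make the framing concrete, here is a simplified sketch of what a ChatML-style template produces from a message list. It is not Ollama's template engine (Ollama renders the Modelfile's TEMPLATE itself); render_chatml is a hypothetical helper, and the system text is a placeholder rather than the real MajorTwin prompt.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render role-tagged messages into a ChatML-style prompt string."""
    parts = []
    for m in messages:
        # Each turn is wrapped in start/end markers with its role on the first line.
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model answers instead of continuing the user text.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},  # placeholder
    {"role": "user", "content": "What is your name?"},
]

templated = render_chatml(messages)
bypassed = messages[-1]["content"]  # what the model sees with no template at all

print(templated)
print(repr(bypassed))
```

With the template, the model is handed an open assistant turn to fill in. Without it, the model only sees the bare question and is free to continue it like any other text.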

When You Actually Want /api/generate

Almost never, for instruct models. The legitimate use case is base models without a chat template, or specific completion-style prompts where you want the model to continue a string verbatim. For evals of a fine-tuned Modelfile, always use /api/chat.
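For the completion-style case, a minimal sketch might look like the following. It assumes the raw flag that Ollama's API documents for /api/generate (no formatting is applied to the prompt) and the num_predict option for capping output length. The helper names and the 64-token default are illustrative choices, not part of the MajorTwin scripts.

```python
import json
import urllib.request


def completion_payload(model, prefix, max_tokens=64):
    """Payload asking the model to continue `prefix` verbatim."""
    return {
        "model": model,
        "prompt": prefix,
        "raw": True,  # send the prompt as-is, with no template or SYSTEM wrapping
        "stream": False,
        "options": {"num_predict": max_tokens},  # cap the continuation length
    }


def complete(host, model, prefix, timeout=180):
    """Return the model's continuation of `prefix` from /api/generate."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(completion_payload(model, prefix)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as r:
        return json.loads(r.read())["response"]
```

Setting raw explicitly keeps the intent visible in the request: anyone reading the eval code can tell that the template bypass is deliberate.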

Reusable Eval Pattern

A minimal stdlib-only eval harness used for MajorTwin evals lives at ~/MajorTwin/scripts/eval_v8.py. The key call is the chat() helper:

import json
import urllib.request


def chat(host, model, prompt, timeout=180):
    """Send one user turn to /api/chat and return the reply text."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON object back, not a stream of chunks
    }).encode()
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as r:
        return json.loads(r.read())["message"]["content"].strip()

Because the helper goes through /api/chat, the chat template and the SYSTEM prompt baked into the Modelfile are applied on every call. There is no need to re-specify SYSTEM per call.
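A helper like chat() can be driven by a small loop of prompt and expectation pairs. The sketch below is a hypothetical harness, not the contents of eval_v8.py: run_cases and summarize are made-up names, and the pass criterion (a case-insensitive substring match) is an assumption chosen for simplicity.

```python
def run_cases(ask, cases):
    """Run eval cases through `ask`.

    ask:   callable taking a prompt string and returning the reply string.
    cases: iterable of (prompt, expected_substring) pairs.
    """
    results = []
    for prompt, expected in cases:
        reply = ask(prompt)
        results.append({
            "prompt": prompt,
            "reply": reply,
            # Assumed criterion: the expected text appears anywhere in the reply.
            "passed": expected.lower() in reply.lower(),
        })
    return results


def summarize(results):
    """Return a short pass-count string such as '3/4 passed'."""
    passed = sum(1 for r in results if r["passed"])
    return f"{passed}/{len(results)} passed"
```

To wire it up, pass a one-argument wrapper around the helper, for example lambda p: chat("http://localhost:11434", "majortwin-v8", p). Taking ask as a parameter keeps the loop testable without a running server.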