Operating tg-proxy in production
A reference for setting up, monitoring, and maintaining a Tool Guard proxy under production load.
Deployment shapes
Single-binary on a VM
The simplest deployment: copy bin/tg-proxy to the host, write a systemd unit, and load the policies from a directory mounted read-only.
# /etc/systemd/system/tg-proxy.service
[Unit]
Description=Tool Guard Core policy decision service
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=tgproxy
Group=tgproxy
WorkingDirectory=/var/lib/tg-proxy
ExecStart=/usr/local/bin/tg-proxy \
-listen 127.0.0.1:9090 \
-policy-dir /etc/tg-proxy/policies \
-audit-log /var/lib/tg-proxy/audit/decisions.jsonl \
-default-mode enforcement \
-fail-closed=true \
-unknown-tools-deny=true \
-rate-limit-rps 20 \
-rate-limit-burst 100 \
-audit-sync-mode interval \
-audit-sync-every 10 \
-audit-rotate-bytes 104857600 \
-approver-token-file /etc/tg-proxy/approver.token \
-max-envelope-depth 32
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=2
LimitNOFILE=65536
ProtectSystem=full
ProtectHome=true
NoNewPrivileges=true
PrivateTmp=true
[Install]
WantedBy=multi-user.target
After systemctl daemon-reload && systemctl enable --now tg-proxy, the proxy reads /etc/tg-proxy/policies/*.yaml, fails closed if any policy file is malformed, and writes the audit log to /var/lib/tg-proxy/audit/.
Docker / Kubernetes
The shipped Dockerfile produces a distroless-nonroot image with tg-proxy as the default entrypoint. The build is multi-stage, and the final image is about 10 MB with a statically linked binary.
docker build -t ghcr.io/dimaggi-ai/tool-guard-core:0.1.0 .
docker run --rm \
-p 9090:9090 \
-v "$(pwd)/policies:/policies:ro" \
-v "$(pwd)/audit:/var/lib/tg" \
ghcr.io/dimaggi-ai/tool-guard-core:0.1.0 \
-policy-dir /policies \
-audit-log /var/lib/tg/decisions.jsonl \
-listen :9090
For Kubernetes, mount the policy directory as a ConfigMap (or use a git-sync sidecar for live edits) and the audit-log directory as a PersistentVolumeClaim. Point the readinessProbe at /readyz (returns 200 once at least one policy is loaded) and the livenessProbe at /healthz.
Behind an API gateway
If a tool runs inside a managed agent runtime (LangChain on Cloud Run, MCP servers on Fly.io, etc.), point the runtime's tool-call interceptor at tg-proxy's /evaluate. The proxy returns 200 OK with decision: allowed when the call may pass through, 200 OK with decision: denied when the call must be blocked, or 202 Accepted with a poll_url when the call is escalated.
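The three response shapes above can be mapped to caller actions roughly as follows. This is a minimal sketch, not a client shipped with Tool Guard: the `decision` and `poll_url` field names come from the description above, while the `Verdict` type, the function name, and the choice to block on anything that is not an explicit `allowed` are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    action: str                      # "run", "block", or "wait"
    poll_url: Optional[str] = None   # only set when action == "wait"


def interpret_evaluate_response(status: int, body: str) -> Verdict:
    """Map an /evaluate HTTP response to what the caller should do next."""
    if status not in (200, 202):
        # Assumption: any other status (for example 503 while the proxy
        # is failing closed) means the tool call must not run.
        return Verdict("block")
    try:
        payload = json.loads(body)
    except ValueError:
        # An unreadable body is treated as a denial (conservative choice).
        return Verdict("block")
    if not isinstance(payload, dict):
        return Verdict("block")
    if status == 202:
        # Escalation: wait for an approver and poll the returned URL.
        return Verdict("wait", payload.get("poll_url"))
    if payload.get("decision") == "allowed":
        return Verdict("run")
    # "denied" or any value this sketch does not know about: do not run.
    return Verdict("block")
```

Keeping the mapping in one pure function makes it easy to unit-test the interceptor without a running proxy; the HTTP call itself can live in whatever client the agent runtime already uses.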
The proxy is stateless beyond its in-memory escalation store and the on-disk audit chain. It scales horizontally: N proxies share the same policy directory and write to N independent audit logs. Because the escalation store is in memory, an escalation is only visible on the proxy instance that created it. Each log is its own hash chain: run tg verify against each file separately. There is no tooling to merge or cross-link independent chains.
Network exposure and authentication
tg-proxy has no built-in authentication or TLS. /evaluate, /reload, /policies, /metrics, and the read-only escalation listing are all unauthenticated; /evaluate and the audit log carry tool-call payloads (potentially sensitive). Bind to 127.0.0.1 or a private network and put authentication, TLS, and rate limiting at an API gateway or service mesh in front of it. The only built-in secret is the optional -approver-token, which gates the escalation approve/deny endpoints.
Flag reference
The full flag list, copied from tg-proxy -help:
-listen string
host:port to bind (default ":9090")
-policy-dir string
directory of YAML policy files to load (default "./policies")
-audit-log string
path to append the JSONL audit chain (default "./decisions.jsonl")
-default-mode string
shadow | enforcement (default "enforcement")
-fail-closed
deny calls when no policies are loaded (default true)
-unknown-tools-deny
deny any tool_name not in scope.tool_names of some loaded
enforcement policy (closes the tool-name-spoofing class)
-max-envelope-depth int
reject /evaluate envelopes whose JSON nests deeper than this
(DoS defense) (default 32)
-audit-sync-mode string
audit fsync mode: every | interval | none (default "every")
-audit-sync-every int
when audit-sync-mode=interval, fsync once every N appends (default 100)
-audit-rotate-bytes int
rotate audit log when active file exceeds this many bytes
(0 = never rotate)
-rate-limit-rps float
per-agent steady-state limit (req/s); 0 disables
-rate-limit-burst float
per-agent burst capacity used when rate-limit-rps > 0 (default 50)
-rate-limit-key-by string
envelope field to key the limiter on: agent_id | session_id | org_id
(default "agent_id")
-tools-yaml string
path to a tools.yaml function classification registry
-approver-token string
static bearer token required on POST /escalations/<id>/approve|deny
-approver-token-file string
read the approver token from this file instead of the command line
(keeps it out of /proc cmdline); mutually exclusive with -approver-token
-escalation-default-timeout-min int
default timeout (minutes) for an escalation that doesn't
specify one (default 15)
-version
print build version and exit
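To make the interaction between -rate-limit-rps, -rate-limit-burst, and -rate-limit-key-by concrete, here is a small token-bucket sketch. The flag help does not say which algorithm tg-proxy uses, so the token bucket is an assumption, and the class names `TokenBucket` and `KeyedLimiter` are hypothetical. The sketch only illustrates the semantics the flags describe: a steady per-key rate, a burst ceiling, and a limiter disabled when the rate is 0.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """One bucket: refills at `rps` tokens per second, capped at `burst`."""

    def __init__(self, rps: float, burst: float, now: Callable[[], float]):
        self.rps = rps
        self.burst = burst
        self.now = now
        self.tokens = burst      # start full so an idle key can burst
        self.last = now()

    def allow(self) -> bool:
        t = self.now()
        # Refill for the elapsed time, never exceeding the burst capacity.
        self.tokens = min(self.burst, self.tokens + (t - self.last) * self.rps)
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class KeyedLimiter:
    """One bucket per value of the chosen envelope field (assumed design)."""

    def __init__(self, rps: float, burst: float, key_by: str = "agent_id",
                 now: Callable[[], float] = time.monotonic):
        self.rps, self.burst, self.key_by, self.now = rps, burst, key_by, now
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, envelope: dict) -> bool:
        if self.rps <= 0:
            return True          # rps of 0 disables limiting
        key = str(envelope.get(self.key_by, ""))
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = TokenBucket(self.rps, self.burst, self.now)
            self.buckets[key] = bucket
        return bucket.allow()
```

With the values from the systemd unit (20 req/s, burst 100), a key that has been idle can send 100 requests at once and is then held to 20 per second. Under this model, two agents that share an agent_id drain the same bucket.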
Observability
Health endpoints
| Endpoint | Returns |
|---|---|
| GET /healthz | 200 OK if the process is alive |
| GET /readyz | 200 OK if at least one policy is loaded |
| GET /policies | JSON snapshot of loaded policy IDs (debugging) |
| GET /escalations | JSON snapshot of pending+resolved escalations |
Metrics
GET /metrics returns Prometheus-format counters and gauges:
tg_proxy_uptime_seconds 65
tg_proxy_policies_loaded 4
tg_proxy_policy_reloads_total 2
tg_proxy_evaluations_total 12451
tg_proxy_evaluations_allowed_total 9120
tg_proxy_evaluations_denied_total 2401
tg_proxy_evaluations_escalated_total 612
tg_proxy_evaluations_flagged_total 318
tg_proxy_evaluations_fail_closed_total 0
tg_proxy_audit_append_failures_total 0
tg_proxy_audit_current_bytes 16384231
tg_proxy_audit_appends_total 12453
tg_proxy_regex_cache_size 7
tg_proxy_rate_limit_keys 312
The audit counters are read under the audit mutex so /metrics does not race with append.
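A scrape of /metrics can be turned into simple health findings with a few lines of code. The sketch below is an illustration, not part of Tool Guard: it assumes unlabelled samples like the ones shown above, and the function names and the three checks it applies are assumptions about what an operator might want to alert on.

```python
from typing import Dict, List


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse unlabelled Prometheus text-format samples into a dict."""
    metrics: Dict[str, float] = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blanks, comments (# HELP / # TYPE), and malformed lines.
        if len(parts) < 2 or parts[0].startswith("#"):
            continue
        try:
            metrics[parts[0]] = float(parts[1])
        except ValueError:
            continue
    return metrics


def health_findings(metrics: Dict[str, float]) -> List[str]:
    """Return human-readable problems; an empty list means nothing to flag."""
    findings: List[str] = []
    if metrics.get("tg_proxy_policies_loaded", 0) < 1:
        findings.append("no policies loaded")
    if metrics.get("tg_proxy_audit_append_failures_total", 0) > 0:
        findings.append("audit append failures")
    if metrics.get("tg_proxy_evaluations_fail_closed_total", 0) > 0:
        findings.append("fail-closed evaluations")
    return findings
```

In a real deployment the same conditions would more likely live in Prometheus alerting rules; the script form is handy for a quick check from a shell or a cron job.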
Logs
Every /evaluate writes an access-log line:
2026/06/08 14:03:08 POST /evaluate → 200 in 145us
Errors (audit append failures, classifier timeouts, Ollama unreachable) are logged to stderr and increment the corresponding *_failures_total counter.
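If the access log is the only latency signal available, the line format shown above can be parsed with a regular expression. This is a sketch based solely on that one sample line: the exact format, the set of duration units, and the function name `parse_access_line` are assumptions, so lines that do not match are returned as None rather than guessed at.

```python
import re
from typing import Optional, Tuple

# Pattern derived from the single sample line; other formats will not match.
_LINE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<method>[A-Z]+) (?P<path>\S+) → (?P<status>\d{3}) "
    r"in (?P<value>\d+(?:\.\d+)?)(?P<unit>ns|us|µs|ms|s)$"
)

# Conversion factors from each assumed unit to microseconds.
_TO_MICROS = {"ns": 0.001, "us": 1.0, "µs": 1.0, "ms": 1_000.0, "s": 1_000_000.0}


def parse_access_line(line: str) -> Optional[Tuple[str, int, float]]:
    """Return (path, status, latency in microseconds), or None if no match."""
    m = _LINE.match(line.strip())
    if m is None:
        return None
    micros = float(m["value"]) * _TO_MICROS[m["unit"]]
    return m["path"], int(m["status"]), micros
```

Feeding every /evaluate line through this function and keeping the third element gives a latency series that can be compared before and after a policy or model change.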
Policy lifecycle
Authoring
1. Write the policy YAML.
2. tg lint -policy - fix any error-severity findings.
3. tg evaluate -policy - sanity check against representative tool calls.
4. Stage in shadow mode (mode: shadow in YAML) for a week and read the near-miss column on each trace to verify the policy behaves as intended.
5. Promote to enforcement (mode: enforcement).
Deploying
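Drop the new file into -policy-dir, then either restart the proxy or trigger a reload with kill -HUP $(pidof tg-proxy) or POST /reload. Validation runs on every load; if any file fails during a reload, the old policy set stays live. There is no half-load state. On a fresh start there is no old set to fall back to, so a malformed file makes the proxy fail closed instead.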
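The all-or-nothing reload described above follows a common pattern: build a complete candidate set off to the side and swap it in only if every file validates. The sketch below illustrates that pattern and is not tg-proxy's code. `PolicyStore` is a hypothetical name, and `json.loads` stands in for the real YAML parser and validator so the example runs with the standard library alone.

```python
import json
from typing import Any, Callable, Dict


class PolicyStore:
    """Holds the live policy set and replaces it atomically on reload."""

    def __init__(self) -> None:
        self.active: Dict[str, Any] = {}

    def reload(self, files: Dict[str, str],
               parse: Callable[[str], Any] = json.loads) -> bool:
        """Try to load `files` (name -> text); return True if swapped in."""
        candidate: Dict[str, Any] = {}
        for name, text in files.items():
            try:
                candidate[name] = parse(text)
            except ValueError:
                # One bad file rejects the whole batch; self.active is
                # untouched, so the previous set keeps serving.
                return False
        # Every file validated: a single assignment makes the new set live.
        self.active = candidate
        return True
```

Because the live set is only ever replaced by one assignment, a caller never observes a mix of old and new policies, which is the "no half-load state" property.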
Retiring a policy
Set status: archived and reload. Archived policies are skipped at evaluation (only approved policies are evaluated), but the file stays in place so the policy history remains reviewable in version control. Delete the file once you've confirmed nothing references it.
Backup and recovery
Audit chain
The audit log is the legal record. Treat it like any other append-only ledger:
- Storage - on a filesystem with atomic writes (ext4, xfs, zfs, btrfs all fine). The proxy uses O_APPEND, which is atomic at the page level.
- Rotation - `-audit-rotate-bytes` rotates the active file when it crosses the cap. Rotated files are named , , ... tg verify reads the rotation set in order.
- Off-host backup - `cron` an rsync to a separate host every hour. The hash chain links across rotations, so tg verify on the backup is the same operation as on the live host.
- Verification cadence - `tg verify` once a day at minimum. If it returns intact: false with exit 5, you have on-disk tampering or a corrupted write. Stop the proxy (tg-proxy refuses to start with a tampered tail anyway) and triage.
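To show why a hash chain detects on-disk tampering and why it survives rotation, here is a toy chain in a few lines. The record layout is entirely hypothetical: the `prev_hash`, `payload`, and `hash` fields, the all-zero genesis value, and the SHA-256-over-canonical-JSON construction are assumptions for illustration only. The real format is whatever tg-proxy writes and tg verify checks.

```python
import hashlib
import json
from typing import List, Optional

GENESIS = "0" * 64  # assumed starting value for the first record


def record_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous record's hash together with this record's payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def append(chain: List[dict], payload: dict) -> None:
    """Append a record whose hash commits to everything before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev_hash": prev, "payload": payload,
                  "hash": record_hash(prev, payload)})


def first_broken_index(chain: List[dict]) -> Optional[int]:
    """Return the index of the first record that fails to verify, or None."""
    prev = GENESIS
    for i, rec in enumerate(chain):
        expected = record_hash(prev, rec["payload"])
        if rec["prev_hash"] != prev or rec["hash"] != expected:
            return i
        prev = rec["hash"]
    return None
```

In this model, splitting the chain into rotated files changes nothing: verifying the files in order is the same walk as verifying one long file, and editing any earlier record breaks the check at that record.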
Disaster recovery
If the audit log is destroyed or corrupted past the tail:
1. Stop tg-proxy (it refuses to start without a verifiable tail).
2. Restore the most recent verified backup.
3. Start tg-proxy - it resumes the chain from the restored tail.
4. The gap between the restored tail and the destroyed live tail is PERMANENTLY lost from the audit record. There is no way to reconstruct decisions made between the restore point and the crash.
This is the standard append-only-ledger recovery semantic: gaps are gaps. The system is fail-safe (decisions made after the gap are hashed and chained correctly), not fail-recoverable.
Upgrade path
Tool Guard follows semver. Between minor versions the canonical trace schema is locked at CanonicalTraceVersion = v1. A future v2 schema bump will be opt-in via build flag; existing v1 chains will remain tg verify-able forever.
To upgrade:
1. git pull && make build.
2. Stop the old proxy.
3. Start the new proxy. It resumes the chain from the same tail.
There is no migration step. Policy files do not change shape between patch versions; minor versions may add new condition forms (e.g. llm_classify shipped in 0.1.x) but existing policies continue to load.
Common operational issues
| Symptom | Likely cause | Action |
|---|---|---|
| Proxy refuses to start, "audit-log tail integrity check failed" | Audit log was tampered with or corrupted | Run tg verify to locate the failure line; restore from backup; start with -audit-log pointing at the restored file |
| Proxy returns 503 on every /evaluate | -fail-closed=true and no policies loaded | Check that -policy-dir exists and contains a valid *.yaml |
| Every llm_classify rule times out | Ollama unreachable | Check -ollama_url in policy and that Ollama is running on the configured endpoint: curl http://localhost:11434/api/tags |
| Latency suddenly 10x worse | Cold start of a freshly pulled Ollama model | First call after a model swap takes ~5-20s; subsequent calls take ~600ms |
| Escalation poll returns 404 | The proxy restarted (in-memory store) or the entry expired | Restart the agent; its next call will be re-evaluated |
| Rate limit fires on the wrong agent | Multiple agents share the same agent_id | Make the agent_id unique per logical agent identity |
For anything not on this list, file an issue with the audit log line and the proxy stderr output.