"Guardrails for Agent Operations (3/4) — Designing Permissions, Approval Loops, Sandboxing, and Audit Logs"
Operational guardrails start from a simple assumption: the model can make mistakes. Good operations do not depend on perfect model behavior. They define what is allowed, what requires human approval, what must stay inside a sandbox, and what must always leave an audit trail.
Key Takeaways
- The first unit of operational safety is not capability expansion, but permission boundary design.
- Permissions are usually easier to manage when separated into categories like `allow` / `ask` / `deny`, especially because edit, execute, and external-send actions do not carry the same risk.
- Approval loops are not anti-automation. They are how you route high-cost or hard-to-reverse actions to humans before damage is done.
- Sandboxing does not make the model smarter, but it does make failure smaller.
- Without audit logs, post-incident diagnosis, policy tuning, and accountability all become weak.
- Strong guardrails optimize less for "how much can the agent do?" and more for "where does it stop, and what survives after it stops?"
1. Why guardrails deserve their own design layer
An AI agent is not just a chatbot producing prose. It reads files, runs commands, touches networks, and sometimes leaves changes in external systems. Once that happens, quality is no longer the only concern. Permission and responsibility enter the design.
Operationally, the real questions usually look like this:
- which files is the agent allowed to modify
- should external posting or transmission ever happen automatically
- if something goes wrong, can we reconstruct what changed
- where must a human intervene
That is why guardrails are not optional polish. They are a minimum operating condition.
2. The first principle is least privilege
The Chapter 10 notes in sources/260518_하네스엔지니어링_15장_블로그활용노트.md keep returning to the same idea: before you widen the agent's powers, start narrow and open only what is necessary.
In practice, that usually becomes something like this:
| Permission level | Meaning | Example |
|---|---|---|
| `allow` | repetitive and low-risk work | reading files, drafting inside constrained paths |
| `ask` | legitimate but higher-impact actions | file edits, command execution, external API use |
| `deny` | actions blocked by default | credential access, unauthorized publishing, broad deletion |
The important detail is that reading, editing, executing, and external transmission should not be treated as one undifferentiated tool surface.
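A minimal sketch of such a permission table in Python follows. The action class names (`read`, `edit`, `execute`, `external_send`, `credential_access`) and the `decide` helper are hypothetical, chosen only to mirror the table above. Anything not listed falls through to `deny`, which is one way to express least privilege.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    ASK = "ask"
    DENY = "deny"


# Hypothetical action classes and defaults; adjust to your own workspace.
POLICY = {
    "read": Decision.ALLOW,
    "edit": Decision.ASK,
    "execute": Decision.ASK,
    "external_send": Decision.ASK,
    "credential_access": Decision.DENY,
}


def decide(action_class: str) -> Decision:
    """Return the policy decision, denying anything not explicitly listed."""
    return POLICY.get(action_class, Decision.DENY)
```

Defaulting unknown classes to `deny` means a newly added tool stays blocked until someone classifies it deliberately.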
3. Approval loops are a prerequisite for safe automation
Approval loops can slow things down. That is true. But they pay for themselves whenever an action is expensive to undo.
Typical examples include:
- live-environment changes
- external posting or outbound transmission
- broad file edits
- costly long-running execution
- tool calls that may touch sensitive data
For this class of action, "we can inspect the logs later" is not enough. It is better to stop before the step than to explain it after the fact.
The purpose of an approval loop is not to put humans into every step. It is to pull hard-to-reverse actions forward into explicit review.
That is why approval criteria are often better defined by reversibility cost than by technology type.
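One way to express "approval by reversibility cost" is sketched below. The 0 to 3 cost scale, the `APPROVAL_THRESHOLD` value, and the `approve` callback are all assumptions made for illustration; a real system would substitute its own scoring and its own way of asking a human.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    name: str
    # Hypothetical scale: 0 = trivially undone, 3 = effectively irreversible.
    reversibility_cost: int


APPROVAL_THRESHOLD = 2  # assumed cut-off; tune from real cases


def needs_approval(action: Action) -> bool:
    """Gate on how costly the action is to undo, not on which tool runs it."""
    return action.reversibility_cost >= APPROVAL_THRESHOLD


def run_with_approval(
    action: Action,
    execute: Callable[[], None],
    approve: Callable[[Action], bool],
) -> str:
    """Run the action, asking a human first when it is hard to reverse."""
    if needs_approval(action) and not approve(action):
        return "denied"
    execute()
    return "executed"
```

Low-cost actions never reach the `approve` callback, so the human only sees the steps that are expensive to undo.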
4. Sandboxing is about reducing blast radius
Sandboxing is often perceived as friction. Operationally, it is one of the most practical defenses available.
A sandbox usually does some combination of the following:
- restrict accessible file paths
- limit or isolate network access
- constrain command execution
- keep failures inside a local boundary
It is more reliable to reduce what can be broken than to assume the model will never break it.
This also aligns with the enforcement logic described in drafts/blog/260425_claude_code_hooks_complete_en.md: advice can be ignored, but boundaries can be enforced.
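The file-path part of a sandbox can be sketched with the standard library alone. The function name `is_inside_sandbox` and the example paths are hypothetical. This check covers only path restriction; network and command isolation need OS-level or container-level mechanisms that are outside this sketch.

```python
from pathlib import Path


def is_inside_sandbox(candidate: str, sandbox_root: str) -> bool:
    """Return True only if the path resolves to a location under sandbox_root.

    Resolving first collapses '..' segments and follows existing symlinks,
    so a path cannot escape the root by traversal.
    """
    root = Path(sandbox_root).resolve()
    target = (root / candidate).resolve()
    return target == root or root in target.parents
```

A caller would run this check before every write and refuse the operation when it returns `False`, regardless of what the model asked for.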
5. Permissions and sandboxing still are not enough
Even narrow permissions and strong sandboxes do not eliminate operational incidents. Two recurring reasons are:
- external documents are treated as trusted instructions
- nobody can later explain who did what and why
The first often appears as prompt injection. If web pages, repository files, and user notes are all treated as equally trustworthy, the agent may interpret external text as operational instruction.
That is why durable operations usually require separation like this:
- do not place system rules and external inputs in the same trust layer
- separate trusted from untrusted inputs
- do not auto-execute instructions embedded in outside documents
- route risky actions through approval or additional verification
So guardrails are not only about tool permissions. They are also about input trust architecture.
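The separation above can be made explicit in the data model. In this sketch the `Trust` levels, the `Message` type, and `build_context` are hypothetical names. Labelling inputs does not by itself stop prompt injection, but it gives later stages a reliable way to tell rules from data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Trust(Enum):
    SYSTEM = "system"      # operator-authored rules
    EXTERNAL = "external"  # web pages, repository files, user notes


@dataclass(frozen=True)
class Message:
    trust: Trust
    text: str


def build_context(messages: List[Message]) -> Dict[str, List[str]]:
    """Keep rules and untrusted text in separate layers.

    External text is carried as data only; nothing in it is promoted
    to the rules layer, whatever it claims about itself.
    """
    rules = [m.text for m in messages if m.trust is Trust.SYSTEM]
    data = [m.text for m in messages if m.trust is Trust.EXTERNAL]
    return {"rules": rules, "untrusted_data": data}
```

An instruction embedded in a fetched document ends up under `untrusted_data`, where it can be quoted or summarized but not executed.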
6. Audit logs matter during operations, not just after incidents
Audit logs are often framed as forensic tools. That is incomplete. In day-to-day operations, logs are what make policy improvement possible.
A useful audit trail should at least preserve:
| Log field | Why it matters |
|---|---|
| who or which session executed | separates responsibility scope |
| which tools were invoked | reconstructs action path |
| which approval steps occurred | reveals human intervention points |
| what was modified | defines impact radius |
| why something failed or was denied | supports policy tuning |
Without these fields, teams end up guessing both why an action was blocked and why another was allowed.
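The fields in the table map naturally onto a structured record. The sketch below is one possible shape, not a prescribed schema: the field names are hypothetical, and the `timestamp` field is an added assumption that the table does not list.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class AuditRecord:
    session_id: str          # who or which session executed
    tool: str                # which tool was invoked
    approval: str            # e.g. "auto", "approved", "denied"
    modified: List[str]      # what was modified
    outcome: str             # e.g. "success", "failed", "denied"
    reason: Optional[str] = None  # why it failed or was denied
    # Assumption: a UTC timestamp is recorded alongside the table's fields.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record: AuditRecord) -> str:
    """Serialize one record as a single JSON line for append-only logs."""
    return json.dumps(asdict(record), sort_keys=True)
```

Writing one JSON object per line keeps the log easy to filter by tool, outcome, or denial reason later.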
7. What guardrails look like in a workspace like ours
This repository already encodes several practical boundaries:
| Boundary | Meaning |
|---|---|
| no editing credential-bearing files under `config/` | secret separation |
| external publishing requires explicit user confirmation | approval loop |
| edit only assigned files | file-ownership boundary |
| do not revert others' changes | concurrent-work safety |
| user-facing output stays in Korean | protocol adherence |
These are not style preferences. They are operational guardrails. In a mixed content-and-tooling workspace especially, drafting content and publishing content should never be treated as the same permission class.
8. Practical starting point
You do not need a giant policy engine to begin.
- Separate permissions for read, edit, execute, and external-send actions.
- Route only hard-to-reverse actions to `ask` first.
- Add a sandbox that limits work scope.
- Define the minimum audit fields and require them consistently.
- Tune the permission table based on real approval and denial cases.
The core requirement is not policy complexity. It is consistency in action classification.
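Tuning from real cases can start with something as small as an approval rate per action class. The sketch below assumes each logged approval request is reduced to a hypothetical `(action_class, decision)` pair, where `decision` is `"approved"` or `"denied"`.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def approval_rates(records: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Return, per action class, the share of approval requests that were approved."""
    asked: Counter = Counter()
    approved: Counter = Counter()
    for action_class, decision in records:
        asked[action_class] += 1
        if decision == "approved":
            approved[action_class] += 1
    return {cls: approved[cls] / asked[cls] for cls in asked}
```

A class that is approved almost every time may be a candidate for a narrower `allow` rule, while a class that is often denied may need a stricter default. Either change is a judgment call that the numbers inform rather than decide.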
9. Common failure modes
Treating all tools as one permission class
Reading and deleting, drafting and publishing, do not have the same risk profile.
Running approvals without approval criteria
If humans are asked to approve actions without a consistent decision rule, operations slow down without becoming safer.
Deferring sandboxing as a later concern
Sandboxing is not a model-quality issue. It is a damage-containment issue.
Logging only raw output
If success, failure, and denial reasons are not structured, the logs do little to improve policy later.
10. A better completion test for operational guardrails
Operational safety is not mature until you can answer questions like:
- is it clear which actions are auto-allowed and which require approval
- if the agent fails, does the impact stay inside the sandbox boundary
- are outside inputs prevented from silently overriding system rules
- after an incident, can the action path be reconstructed from logs
If these answers are weak, raising the automation level is premature. Tighten the guardrails first.
References
- docs/blog_series_하네스엔지니어링_총괄_design.md
- sources/260518_하네스엔지니어링_15장_블로그활용노트.md
- drafts/blog/260425_claude_code_hooks_complete_en.md
- drafts/blog/260429_harness_series_04_tools_sandboxing_en.md
- drafts/blog/260519_long_running_agents_c02_en.md
This is Part 3 of the Operations, Evaluation, and Memory series. Previous: long-running agents. Next: memory ownership and AI-agent lock-in.
Series overview: Harness Engineering Series Guide