Agent Evaluation, Ops, and Memory Series (4 parts)


How to stabilize long-running agents through evaluation, handoff, guardrails, and memory ownership

Prerequisites	Harness Engineering Basics Series (recommended)
Next series	Harness Patterns, Strategy, and Cases Series (4 parts)

All parts

1	Agent Evaluation Harnesses (1/4): How to Validate AI Results with Tests, Rubrics, and Regression Loops
	The most common AI-agent illusion is mistaking "it worked a few times" for "it now works…
2	Long-Running Agents (2/4): Designing Handoff Structure So Work Survives Context Breaks
	It is easy to frame long-running agents as a memory problem. That framing is incomplete.…
3	Guardrails for Agent Operations (3/4): Designing Permissions, Approval Loops, Sandboxing, and Audit Logs
	Operational guardrails start from a simple assumption: the model can make mistakes. Good…
4	Memory Ownership (4/4): Why You Must Own an AI Agent's Memory
	Agent memory is not an accessory. What the system remembers, where it stores that memory…

Recommended pace

Each part takes 25–40 minutes on average. One to three parts per week is the sweet spot for retention.




MaJu Tech Notes