The Systemd Bug

29 VMs crashed. 1,600+ VMs were mid-rollout. PID 1 — systemd itself — hit assert_not_reached() and froze. Every systemctl command after that returned "Transport endpoint is not connected." The VMs were effectively bricked.

This is the story of how we tracked down a race condition in systemd's unit alias deserialization — a bug that only triggers under a specific sequence of operations, depends on hashmap iteration ordering, and is completely invisible after a reboot.

Step 1: Finding the Crash

The first reports came in as a rolling upgrade failure. The bootstrap script that configures services on fresh VMs was failing repeatedly:

...
Created symlink /etc/systemd/system/syslog.service → /usr/lib/systemd/system/rsyslog.service.
Restarting service rsyslog.service
Failed to restart rsyslog.service: Transport endpoint is not connected
Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
... (5 retries, all fail) ...
Aborting

The error message "Transport endpoint is not connected" is misleading. It suggests a D-Bus communication issue, maybe a socket problem. But the real issue was far worse: systemd (PID 1) itself had crashed. Every subsequent systemctl command failed because there was no init process to talk to.

Digging into the journal logs from the previous boot (journalctl -b -1) revealed the actual crash sequence:

02:37:20  systemd[1]: Stopping rsyslog.service...          ← stop begins
              ----------- rsyslog SIGCHLD pending -----------
02:37:21  systemd[1]: Reloading requested from PID 3267    ← daemon-reload
          systemd[1]: Reloading finished in 204 ms.
          systemd[1]: service_sigchld_event() ABORT         ← SIGCHLD arrives
          systemd[1]: Freezing execution.                   ← PID 1 CRASH

It's a self-induced crash. The systemd developers never expect execution to reach that code path, so they placed a freeze instruction there. The assertion was not present before version 254, so older versions would have silently ignored this condition.

Step 2: Reading the Source Code

The crash message pointed directly to service_sigchld_event() in src/core/service.c. Here's the simplified version of the code:

// src/core/service.c — systemd 255 (simplified)

static void service_sigchld_event(Unit *u, pid_t pid, int code, int status) {
    Service *s = SERVICE(u);

    if (s->main_pid.pid == pid) {
        s->main_pid.pid = 0;

        switch (s->state) {
            case SERVICE_START:
            case SERVICE_START_POST:
                /* handle start completion */  break;
            case SERVICE_RUNNING:
                /* handle process exit */       break;
            case SERVICE_STOP:
            case SERVICE_STOP_SIGTERM:
            case SERVICE_STOP_SIGKILL:
                /* handle normal stop */        break;
            default:
                assert_not_reached();       // line 3863 — CRASH
        }
    }
}

The function handles SIGCHLD — the signal the kernel sends to a parent process when a child exits. systemd, as PID 1, receives SIGCHLD for every service process it manages.

The switch statement covers all legitimate states a service can be in when its main process exits: starting, running, or being stopped. But when the state is something else — like SERVICE_DEAD — it hits the default branch and calls assert_not_reached().

For PID 1, an assertion failure is catastrophic. There's no parent process to catch it. systemd freezes execution, and the entire system becomes unresponsive.

The Impossible State

The crash tells us something that shouldn't be possible: a service's state was SERVICE_DEAD, but main_pid still held a valid PID. A dead service should never own a running process. systemd treats this contradiction as a logic error — hence the assertion.

But how did we get here?

Normal SIGCHLD Handling

To understand the bug, you first need to understand the normal flow when you stop a service:

  1. Initial state: service is RUNNING, main_pid = 1234
  2. systemctl stop requested: state transitions to STOP → STOP_SIGTERM
  3. systemd sends kill(1234, SIGTERM) to the kernel
  4. Kernel delivers SIGTERM to the service process
  5. Process begins graceful shutdown and eventually calls exit(0)
  6. Kernel reaps process 1234 and queues SIGCHLD for PID 1
  7. systemd's event loop picks up SIGCHLD, calls service_sigchld_event()
  8. switch matches STOP_SIGTERM — the expected case
  9. main_pid cleared to 0, state transitions through STOP_POST → DEAD

The key invariant: by the time a service reaches DEAD, main_pid has already been cleared to 0 by the SIGCHLD handler. No SIGCHLD should ever arrive for a DEAD service — unless something corrupts the state without clearing the PID.

The Crash Scenario

What actually happened was different. Between steps 2 and 6, something intervened:

  1. Service is RUNNING, main_pid = 1234
  2. systemctl stop → state becomes STOP_SIGTERM
  3. systemd sends SIGTERM to the process
  4. Process begins graceful shutdown...
  5. ⚠ UNKNOWN EVENT — state is overwritten from STOP_SIGTERM → DEAD, but main_pid stays 1234
  6. Process exits, kernel queues SIGCHLD
  7. systemd receives SIGCHLD, calls service_sigchld_event()
  8. state == DEAD → no case matches!
  9. assert_not_reached() → PID 1 CRASH

The question became: what is this unknown event?

Step 3: Building a Hypothesis

We needed to find an operation that:

  • Overwrites a unit's state field (e.g., STOP_SIGTERM → DEAD)
  • Does not clear main_pid

We pointed GitHub Copilot at the systemd code base and asked it the question above: "Analyse the code and show me where a state transition is possible without updating the PID." Below is the output we got from the agent.

Direct state setter? State transitions within systemd always go through service_set_state(), which clears main_pid when entering DEAD. This path is safe.

Unit reload / re-read config? Re-reading unit files updates configuration but does not overwrite runtime state or PID tracking. This path is safe too.

Deserialization (daemon-reload)? During daemon-reload, systemd serializes all unit states to a file, re-execs, and deserializes them back. The deserialized state is written directly to s->deserialized_state, bypassing service_set_state()!

Unlike normal state transitions, deserialization writes state directly into the unit object without going through the state setter function. This means it can set a state like DEAD without the usual cleanup that would clear main_pid.
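
To make the distinction concrete, here is a toy Python model of the two paths (my own illustration, not systemd's actual C code; the class and names are invented). The setter path clears the PID when entering DEAD; a direct field write does not:

# Toy model: "safe" setter vs. deserialization-style direct write.
from enum import Enum, auto

class ServiceState(Enum):
    RUNNING = auto()
    STOP_SIGTERM = auto()
    DEAD = auto()

class Service:
    def __init__(self):
        self.state = ServiceState.RUNNING
        self.main_pid = 1234

    def set_state(self, new_state):
        # The setter keeps the invariant: entering DEAD also clears the PID.
        if new_state == ServiceState.DEAD:
            self.main_pid = 0
        self.state = new_state

svc = Service()
svc.set_state(ServiceState.DEAD)
print(svc.state, svc.main_pid)     # DEAD, 0  -- consistent

svc2 = Service()
svc2.state = ServiceState.DEAD     # direct write, as deserialization does
print(svc2.state, svc2.main_pid)   # DEAD, 1234 -- the "impossible" state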

Deserialization was our suspect. But how exactly does it corrupt the state? To understand that, we need to understand aliases.

Background: daemon-reload and Unit Aliases

What is daemon-reload?

When you run systemctl daemon-reload, systemd does three things:

  1. Serialize — save every unit's runtime state (state, PID, etc.) to a temp file
  2. Re-exec — systemd restarts itself, re-reads all unit files from disk
  3. Deserialize — restore each unit's saved state back into memory

This lets systemd pick up config changes without stopping running services. It's an elegant design — unless the serialized data contains contradictions.

What is an alias?

An alias is a symlink that makes one unit reachable by a second name:

# a.service unit file:
[Install]
Alias=b.service

# "systemctl enable a.service" creates:
# /etc/systemd/system/b.service → a.service

After enabling, both names point to the same unit file.

The Normal Case

When both names are fresh (neither has been started or tracked), everything works fine:

  1. a.service and b.service are both INACTIVE
  2. systemctl enable a.service creates the symlink b.service → a.service
  3. systemctl start a.service → both names resolve to the same Unit object, state = RUNNING
  4. If daemon-reload happens, both serialize as RUNNING → deserialization is consistent

No problem.

The Dangerous Case

But what if b.service was already tracked by systemd before the alias was created?

This can happen if something else references b.service — for example, a socket unit (b.socket) that implicitly depends on b.service via socket activation. systemd starts tracking b.service as a separate unit with state DEAD (because it can't find a unit file for it).

Now you have:

  • a.service → RUNNING, main_pid = 5678
  • b.service → DEAD (tracked independently via socket dependency)

When systemctl enable a.service creates the alias symlink and triggers daemon-reload:

  1. systemd serializes both units with their divergent states:
    • a.service → state=RUNNING, pid=5678
    • b.service → state=DEAD, pid=0
  2. During deserialization, the alias symlink makes both names resolve to the same Unit object
  3. b's DEAD state overwrites a's RUNNING state

The result: a.service now has state=DEAD but main_pid=5678 — the process is still alive! The next SIGCHLD for that PID triggers the assertion and crashes PID 1.
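
A minimal Python sketch of that collision (invented names, my own illustration; systemd's real deserialization is C and far more involved). Two serialized entries resolve, via the alias, to the same in-memory object, and whichever entry is applied last wins:

# Hypothetical model of deserializing two names that alias one unit.
serialized = {
    "a.service": {"state": "running", "main_pid": 5678},
    "b.service": {"state": "dead",    "main_pid": 0},
}

unit = {"state": None, "main_pid": 0}
alias_map = {"a.service": unit, "b.service": unit}   # symlink: both names, one object

for name, saved in serialized.items():    # in systemd this order comes from a hashmap
    target = alias_map[name]
    target["state"] = saved["state"]       # direct write, no cleanup
    if saved["main_pid"]:
        target["main_pid"] = saved["main_pid"]   # a pid of 0 never clears the field

print(unit)   # {'state': 'dead', 'main_pid': 5678} when b.service is applied last

Because systemd walks its units in hashmap order with a per-process random seed, which entry lands last varies between runs; that is exactly the non-determinism the reproduction below runs into.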

Step 4: Reproducing the Bug

To confirm the hypothesis, we built a minimal reproduction using two unit files:

a.service — the real service (long-running):

[Unit]
Description=Demo service A

[Service]
ExecStart=/bin/sleep 600
Restart=no

[Install]
WantedBy=multi-user.target
Alias=b.service

b.socket — implicitly depends on b.service (socket activation):

[Unit]
Description=Demo socket B

[Socket]
ListenStream=/run/b.sock

[Install]
WantedBy=sockets.target

The key: b.socket inherently triggers b.service. systemd tracks b.service as not-found / DEAD until the alias from a.service becomes effective.

The Reproduction Steps

# Step 1: Install a.service and b.socket, no b.service yet
$ cp a.service /etc/systemd/system/
$ cp b.socket /etc/systemd/system/
$ systemctl daemon-reload

# Step 2: Start b.socket → implicit dep on b.service makes systemd track b as DEAD
$ systemctl start b.socket
$ systemctl show b.service --property=ActiveState    inactive (tracked!)

# Step 3: Start a.service → RUNNING with a PID
$ systemctl start a.service
$ systemctl show a.service --property=MainPID        4567
$ systemctl show a.service --property=ActiveState     active

# Step 4: Enable a.service → creates symlink b.service → a.service + daemon-reload
$ systemctl enable a.service
# Created symlink /etc/systemd/system/b.service → .../a.service

# Step 5: Check — state INFECTED!
$ systemctl show a.service --property=ActiveState     inactive   was RUNNING!
$ systemctl show a.service --property=MainPID         4567      PID still set!
$ ps -p 4567                                          sleep 600  still alive!

# Step 6: Kill PID → SIGCHLD → CRASH
$ kill 4567   systemd[1]: Freezing execution.

There's a wrinkle: whether b.service is deserialized after a.service depends on hashmap iteration order, which is randomized per process. You might need to run systemctl daemon-reexec to re-seed the hash and retry if the corruption doesn't happen on the first attempt. This is what makes the bug non-deterministic — the same sequence of operations might crash on one VM but not another.

Step 5: Finding the Infected Service in Production

Reproducing in a lab is one thing. Finding which service was actually infected on the crashed production VMs was harder. The corrupted state is transient — it only exists in the brief window between daemon-reload and the next restart. After a reboot, the evidence is gone.

Attempt 1: Boot Script Detection

We added a boot script to detect the impossible state:

#!/bin/bash
# detect_infected_service.sh (runs at boot)
for unit in $(systemctl list-units --type=service --all --no-legend \
              | awk '{print $1}'); do
    state=$(systemctl show "$unit" --property=ActiveState --value)
    pid=$(systemctl show "$unit" --property=MainPID --value)
    if [[ "$state" == "inactive" || "$state" == "dead" ]] \
       && [[ "$pid" -gt 0 ]]; then
        echo "INFECTED: $unit  state=$state  main_pid=$pid"
    fi
done

Result: nothing found. By the time the VM fully boots, the service has likely been restarted multiple times by other scripts. The infection only exists in that brief window.

Attempt 2: Duplicate Process Detection

Then we had a key insight: when systemd thinks a service is DEAD but start is called again, the old process (from the infected state) is still running. systemd detects a duplicate main PID and logs an error before killing the old process. Only the second restart actually starts a fresh process.

# Search for duplicate PID detection in journal
$ journalctl -b -1 | grep "Found left-over process"

systemd[1]: rsyslog.service: Found left-over process 1234 (rsyslogd)
            which is already a command process of rsyslog.service.
            Refusing.

Found it: rsyslog.service. The "Found left-over process" log proves that when systemd tried to restart rsyslog, it found a leftover process from the corrupted state — a process systemd thought shouldn't exist (state=DEAD) but was actually still running.

Step 6: Connecting the Dots

With rsyslog.service identified as the infected service, everything clicked. Let's look at its unit file:

# /usr/lib/systemd/system/rsyslog.service
[Unit]
Description=System Logging Service
Requires=syslog.socket

[Service]
Type=notify
ExecStart=/usr/sbin/rsyslogd -n -iNONE

[Install]
WantedBy=multi-user.target
Alias=syslog.service          ← the alias!

rsyslog.service has Alias=syslog.service. This maps exactly to our a.service / b.service experiment:

Experiment                            Production
a.service (real service, RUNNING)     rsyslog.service (RUNNING, PID 1234)
b.service (alias, tracked as DEAD)    syslog.service (DEAD, tracked via syslog.socket)
b.socket (triggers tracking of b)     syslog.socket (triggers tracking of syslog.service)

The Full Timeline

Here's the complete sequence of events that led to the crash on the production VMs:

1. Image preparation: rsyslog is disabled in the VM image. The alias symlink (syslog.service → rsyslog.service) does not exist.

2. Boot — syslog.socket starts: Socket activation pulls in syslog.service as a dependency. Since there's no unit file or symlink for syslog.service, systemd tracks it as a separate unit with state DEAD.

3. azure-mdsd restarts rsyslog: A monitoring agent runs systemctl restart rsyslog.service. rsyslog is now RUNNING with a PID. But syslog.service is still a separate unit, still DEAD.

4. Bootstrap script enables rsyslog — infection: The bootstrap script runs systemctl enable rsyslog.service, which creates the alias symlink and triggers daemon-reload. During deserialization, syslog.service's DEAD state overwrites rsyslog.service's RUNNING state. Now rsyslog has state=DEAD but main_pid still points to the living rsyslogd process.

5. rsyslogd process exits: The bootstrap script continues and stops/restarts rsyslog. The old rsyslogd process receives SIGTERM and exits. The kernel queues SIGCHLD for PID 1.

6. SIGCHLD → PID 1 CRASH: systemd's event loop picks up the SIGCHLD. It finds state=DEAD with a valid main_pid. The switch statement has no matching case. assert_not_reached(). PID 1 freezes. The VM is bricked.

All of these events must happen in this exact order, within a specific time window, for the crash to occur. If any one of them is missing, the issue never appears. That is why we saw the crash on only 29 machines out of 1,600, and it is what made the debugging so complex.

Lessons Learned

Misleading error messages are the norm, not the exception. "Transport endpoint is not connected" had nothing to do with transport endpoints. PID 1 was dead. Always check systemctl status and journalctl before trusting error messages at face value.

Transient state corruption is hard to catch. The infected state only existed for a few seconds between daemon-reload and the next service restart. Boot-time detection scripts couldn't see it. We had to rely on indirect evidence — the "Found left-over process" log — to identify the victim. That single log line saved us an enormous amount of time.

Non-deterministic bugs need statistical thinking. The same bootstrap script, the same VM image, the same systemd version — but only ~2% of VMs crashed. Hashmap iteration order is the kind of non-determinism that makes you question your sanity until you understand the mechanism. Building a hypothesis and looking for evidence is the better approach in these scenarios.


The root cause: in my view, this is not really a systemd bug. When the user creates a situation where two unit names carry two different states and then asks systemd to converge them, there is no way for systemd to pick the correct state every time. The trigger: systemctl enable + daemon-reload while the aliased service is running and the alias name was previously tracked independently. The result: PID 1 crash, VM freeze.

I presented a talk about this issue at the Linux Kernel Meetup, Bangalore. It was well received by the audience and sparked a lot of discussion.

I Built a Chess App in One Day Without Writing Code — Here's What I Learned

I've been obsessed with chess since I was a kid.

At first, I naively thought I could memorize all the possible opening lines — every trap, every blunder my opponents might stumble into. Then I learned that the number of possible chess games exceeds the number of atoms in the observable universe. That didn't kill my fascination; it deepened it.

Here's the thing: even grandmasters still brute-force memorize opening lines. So, 15 years ago, I bought my first chess book — Mastering Chess Openings - Volume 1.

I couldn't even finish four chapters. The lines were hopelessly confusing. There are transpositions — the same board position can arise from completely different move orders. Between my day job and life, the book collected dust.

Years later, I discovered Anki and the concept of Spaced Repetition. A lightbulb moment: What if someone built an app that converts opening lines into flashcards and lets you drill them? That idea promptly sank to the bottom of my bucket list, filed under "someday."

AI brought "someday" to "today."

One fine Sunday, I cleared my desk and decided to vibe-code this chess app into existence. Now, this wasn't my first rodeo with AI-assisted development. I use coding assistants daily. But I'd handed over complete project development to agents twice before — an Android app and a webapp. Both times, despite Claude Opus being capable enough to rewrite GCC from scratch, it failed spectacularly on my modest little apps.

This time, I chose a different strategy: active participation.

You can find the repo here. It's a completely offline app — no login required. Try it here.


Phase 1: Planning — The 90-Minute Debate

This phase was everything. Chess has unique intricacies that make naive data modeling a nightmare.

The data model dilemma: A document model struggles with transpositions. A relational model creates a tangled web of inbound and outbound edges, making it expensive to extract a single line. Neither option was clean.

The alternative-move problem: Imagine a user is practicing a puzzle and plays a valid chess move — just not the one the puzzle expects. If the app simply shows "Wrong!", the user is confused. They played a legitimate move! The system needs to recognize alternative continuations, gracefully redirect the user, and offer a path back to the intended line.
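
For illustration only, here is the kind of position-keyed structure we talked through (a hypothetical sketch, not the app's actual schema; Node, check_move, and the FEN keys are my own names). Keying on the position collapses transpositions into one node, and a legal move found among a node's known continuations can be acknowledged instead of flatly marked wrong:

# Hypothetical position graph keyed by FEN; not the app's real data model.
from dataclasses import dataclass, field

@dataclass
class Node:
    fen: str
    moves: dict = field(default_factory=dict)   # SAN move -> child position (FEN)
    expected: str | None = None                 # the move this drill wants

def check_move(node: Node, san: str) -> str:
    if san == node.expected:
        return "correct"
    if san in node.moves:
        return "valid alternative: redirect, then offer a path back to the main line"
    return "not in the repertoire"

start = Node(fen="<start>", moves={"e4": "<after e4>", "d4": "<after d4>"}, expected="e4")
print(check_move(start, "d4"))   # a legitimate move, just not the drilled one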

Beyond chess-specific headaches, there were questions about data growth, what to duplicate, what to normalize, and how to handle personalization (e.g., users naming the same line differently). Eventually, we — me and Claude — decided to go fully offline to keep things simple.

This discussion alone consumed over an hour. And critically, I was reading every single word the LLM responded with, counter-questioning relentlessly. The model did a solid job, but it had pitfalls. Those subtle mistakes? I would've completely missed them if I'd given the LLM full autonomy and waited for the final output.

After 90 minutes of intense back-and-forth, I asked the agent to consolidate everything into a high-level design document.

Then I took my first coffee break. I'd earned it.


Phase 2: Design — The Fastest Document Review of My Life

The copilot produced a 20+ page design document. Here's where things got surreal.

In my day job, here's how this usually goes: after a brainstorming session, I'd tell the team, "Create the HLD capturing all our discussions and share it." Two to three days later, a first draft surfaces. I block two uninterrupted hours on my calendar to review it meticulously. A couple of rounds of back-and-forth in document comments follow. Maybe a live meeting. The final document materializes in one to two weeks.

Here? It was ready in 10 minutes.

I was the bottleneck. Reading a dense design document right after an intense brainstorming session is mentally brutal. And there was no "comment and wait for response" cycle — the AI fired answers back instantly. If I paused to check WhatsApp, the delay was entirely mine. The AI just sat there, patiently waiting.

Honestly? It was mildly guilt-tripping.

Quality: Fast but Flawed

After about 45 minutes of reading and Q&A, the design document was finalized. Speed was the clear win. Quality? Just average. A good engineer would have done better.

The LLM missed some crucial points from our discussion. In some places, it chose approaches that were flat-out absurd — methods that wouldn't even work. When I pointed them out, it would shamelessly compliment my insight and correct its mistakes. I genuinely couldn't tell: was it an honest oversight, or was the LLM playing dumb to make me feel superior? (If any Anthropic engineers read this — please answer me.)

Final design document: design_document.md

Resisting the Urge to Skip Ahead

After that exhausting review, every fiber of my being wanted to just say: "Implement this." But a wiser part of me knew — this design was complex enough that even a senior engineer would struggle to build it without proper breakdown. My previous failures had taught me not to rush.

So I asked the LLM to decompose the design into low-level modules. It identified five or six components. Then I asked for a detailed low-level design (LLD) for each one.

Here's where the HLD paid dividends. After designing Module 1 through multiple Q&A rounds, the context window was saturated with Module 1 details — earlier project-level discussions were lost to context compaction. The design document served as persistent memory. After completing each module's LLD, I'd ask the LLM to re-read the HLD and tackle the next module with the full picture in mind.

We ended up with eight modules. (All module LLDs here.) I reviewed every document thoroughly. It felt like one loooong meeting with an engineer who had instant answers to every question — where I was the only bottleneck.

This was the last phase where I held the steering wheel.


Phase 3: Implementation — The Art of Letting Go

By afternoon, the HLD and all eight LLDs were ready. I told the copilot: Go. Build it module by module. I connected the Playwright MCP server so it could verify functionality after each module.

Then I made a deliberate, painful decision: I stopped reading the code.

Think about it — if I'm going to read and understand every line the LLM generates, why not just write the code myself? Reading kills the speed. Instead, I relied on the LLM for unit testing and asked it to pause at milestones for manual functional testing. If you write code for a living, you know how unnatural it feels to accept someone else's code without reviewing it yourself. This was an exercise in pure delegation.

The Results: Mediocre Code, Magnificent Speed

The implementation was... average. Silly bugs cropped up: the undo button wasn't decrementing the move count; the board defaulted to white's turn even when the puzzle started with black. The LLM was a mediocre engineer. But it was a fast mediocre engineer.

My job in this phase boiled down to:

  • Approving tool usage requests
  • Clicking past "Copilot is running for longer. Do you want to continue?" dialogs
  • Occasionally testing new features and catching regressions

By evening, I had a working app deployed on GitHub Pages.

It felt... strange. An app that would've taken me two months as a side project materialized in a single day. The usual "I built this with my own hands" satisfaction was absent. In its place was something different — the satisfaction of orchestration.


What I Learned

AI isn't taking over. But conventional coding is evolving.

The LLM was fast but dumb. I could've built this app without AI — but not in one day. And I seriously doubt AI could've built it from a one-line prompt like "develop a webapp for learning chess openings." The human in this loop was simultaneously the bottleneck and the enabler.

It was exhausting. A month's worth of development and design discussions compressed into a single day. It felt like running a marathon — not everyone can do it, and I definitely don't want to do it every day.

But this might become the norm. And if it does, the winners will be those with:

  • High endurance — the stamina to sit through 4-hour AI sessions daily
  • Strong delegation instincts — knowing when to intervene and when to let go
  • Deep domain knowledge — only developers can converse with LLMs in their native language, catching subtle mistakes that non-technical users would miss

I used to wonder how PMs and executives trust project outcomes without reading a single line of code. Now I understand — it's a muscle. The muscle of trusting output from others and building systems to validate it.

The shift for developers is clear: focus on what to build, not how. Don't trust the LLM blindly. Don't micromanage every line of code. Find the balance. And above all — build the patience to read through every conversation and document the AI produces.

That's where the real skill lies now.

SICA - Self Improving Coding Agent

SICA result

We code AI agents that code. What if we code AI agents that code another agent, which then codes better than we do? 😖 Some researchers at the University of Bristol, UK, with crazy ideas and deep pockets, had this very idea. The result is in this research paper. No kidding: their 15 iterations cost them $7,000. It improved the agent's performance three-fold.

Self Improving Coding Agent aka SICA is a coding agent that can rewrite its own code. The rationale behind this idea is "What if we overlooked something?" By letting LLMs think on their own and explore new possibilities, we might discover better solutions. You might be familiar with AlphaGo's move-37. The AI invented a new move in a game that had been played for thousands of years. This means that, due to our limited cognitive ability, we overlooked some winning moves and never played them until AlphaGo unveiled one in Game-2. The SICA project is a similar search for overlooked solutions. What if there is a better way to solve a coding challenge that we are still missing?

Here is a quote from the paper.

A motivation for a self-improving system is that the improvements in coding abilities may be leveraged during subsequent improvement steps, hopefully compounding.

This is a very simple project; a minimal code sketch of the loop follows the steps below.

  1. Develop a very basic AI agent that can read, write and execute code
  2. Give it 50 test-cases as a benchmark and ask it to solve them.
  3. Based on the results, ask it to rewrite its own code
  4. Goto Step-2
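
In code, the loop looks roughly like this (a minimal runnable sketch; run_benchmark and rewrite_self are random or trivial stand-ins, not the paper's implementation):

# Sketch of the SICA outer loop with stubbed-out benchmark and rewrite steps.
import random

def run_benchmark(version: str) -> float:
    return random.random()            # stand-in for solving the 50 test cases

def rewrite_self(version: str, archive: list) -> str:
    return version + "+"              # stand-in for the agent editing its own code

best, best_score = "v0", run_benchmark("v0")
archive = [("v0", best_score)]

for _ in range(15):
    candidate = rewrite_self(best, archive)
    score = run_benchmark(candidate)  # in SICA this feeds the utility defined below
    archive.append((candidate, score))
    if score > best_score:            # improvement: the candidate becomes the baseline
        best, best_score = candidate, score
    # otherwise fall back to the previous version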

Non Gradient based learning

This research introduces a novel learning method. In a typical machine-learning algorithm, we adjust the weights after every iteration based on a utility or loss function. But in SICA, the weights are fixed. The learning happens online, and the adjustment happens in the agent's code rather than in the weights.

The utility function is simple and direct.

\( U = w_{score}\,p_{score} + w_{cost}\bigl(1 - \min(1, p_{cost}/\$10)\bigr) + w_{time}\bigl(1 - \min(1, p_{time}/300s)\bigr) \)

  • \( p_{score} \) --> Benchmark solving score. Normalized between 0 and 1.
  • \( p_{cost} \) --> Cost in dollars to achieve the result. Anything beyond $10 incurs the same maximum loss.
  • \( p_{time} \) --> Time taken to solve the problems. Anything that runs for more than 5 minutes incurs the same maximum loss.

  • \( w_{score} \) --> Weight of the score. Fixed to 0.5
  • \( w_{cost} \) --> Weight of the cost. Fixed to 0.25
  • \( w_{time} \) --> Weight of the time. Fixed to 0.25

They also added a small penalty for timeouts, which is not captured here for simplicity.
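
As a quick sanity check, the utility can be computed straight from the definition above with the fixed weights (a direct transcription of the formula, not code from the paper):

def utility(p_score: float, p_cost: float, p_time: float) -> float:
    """U with the fixed weights: score 0.5, cost 0.25, time 0.25."""
    return (0.5 * p_score
            + 0.25 * (1 - min(1.0, p_cost / 10.0))    # cost capped at $10
            + 0.25 * (1 - min(1.0, p_time / 300.0)))  # time capped at 300 s

# Example: 60% of tests solved, $4 spent, 120 s elapsed -> 0.30 + 0.15 + 0.15 = 0.60
print(utility(0.6, 4.0, 120.0))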

The human-developed basic AI agent is given the 50 test-case benchmark. The fraction of tests solved, normalized between 0 and 1 (\( p_{score} \)), the LLM token cost incurred to complete the task (\( p_{cost} \)), and the time taken (\( p_{time} \)) are measured and fed to the utility function. The commit SHA of the code (the identifier of the current version), the execution logs, and the utility score are stored in an archive.

Then the LLM is asked to review the logs and rewrite the agent's code to improve its performance. The complete benchmark is run again on the new code and the utility value is captured. If the utility value increases, that becomes the trajectory forward: the new code is taken as the baseline and the next iteration starts from it. Otherwise, the agent falls back to the previous version.

In this novel non-gradient-based approach we don't fit the model by adjusting weights. With the weights fixed, we update the model's code instead. The coefficients of the linear function remain fixed, and we improve the functions that produce its variables.

Guardrails

Neither AI nor humans are mature enough to give AI full autonomy, so SICA would lead to catastrophe without proper failsafe mechanisms. For starters, if the agent introduces a compiler error into its own code, it can never recover. So there are two instances of the agent: one acts as the researcher and the other as the lab rat. The code that selects the way forward and the code that runs the benchmark execute in different instances. If the benchmarking agent falls into an unrecoverable loop, the utility agent has a failsafe to either fix the error or pick a different agent from the archive.

There is always a risk of hallucination when work happens at this scale. So the researchers employed an overseer agent that sits outside the self-improving agent. It runs asynchronously, invoked every 30 seconds, to check whether the self-improving agent is still on the right path. If it finds a sign of hallucination, the current run of the self-improving agent is killed and the iteration starts over. A fairly complex mechanism presents the complete state and trajectory of the self-improving agent to the overseer agent; it is detailed in the paper.
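
As a rough illustration (hypothetical names and logic, not the paper's mechanism), the overseer behaves like a periodic watchdog over the agent's trajectory:

# Hypothetical watchdog sketch; SICA's overseer and its state-presentation
# machinery are considerably more involved.
import time

def looks_hallucinated(trajectory: list) -> bool:
    # Stand-in: in SICA an external LLM judges the agent's recent steps.
    return any("ERROR" in step for step in trajectory[-5:])

def oversee(agent) -> None:
    while agent.is_running():
        if looks_hallucinated(agent.trajectory()):
            agent.kill()          # abort this run; the iteration starts over
            break
        time.sleep(30)            # checked every 30 seconds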

Execution, Results

With the iteration number on the x-axis and the utility score on the y-axis, we get the graph below.

SICA result

The experiment ran for 15 iterations. The basic human-developed agent, which started with only open, read, and write capabilities, gained new tools with every iteration. Over the run, performance improved from 17% to 53%.

At every inflection point there was a new breakthrough in the agent's approach. For example, in the first iteration the agent developed a smart-edit tool, after which performance improved by about 5%. A significant leap occurred when the agent developed an AST symbol locator in the 9th iteration; a 10% improvement was observed after that.

This is the easter egg we're searching for. The LLM understood the core problem and implemented the right tools to solve it. More than proof that an LLM can be self-reliant, it sets a direction for human developers. Just as AlphaGo enlightened us with a new move, SICA introduced the AST symbol locator. It would have taken months of manual trial-and-error to foresee the usefulness of such a tool in daily coding tasks and to measure its impact, but SICA did it in less than 24 hours. With more iterations, SICA has the potential to produce even more innovative tools.

The Influence

Before getting overly optimistic about AI, we should understand that SICA's success was achieved with some intelligent influence from the developers. They introduced two synthetic benchmarks: 1) File editing - ask the LLM to make known changes and score it on how closely the result matches. 2) Code navigation - ask pointed questions about symbol definitions and references.

The major difficulty in solving the 50 test-case benchmark was editing files in the correct place without syntax errors and navigating the code base to update the relevant code. But the LLM wasn't prioritizing these two problems, or at least it didn't within the small number of initial iterations. So, to direct its attention, the researchers introduced these synthetic benchmarks. Defining a utility over the problem is a smarter approach than merely stating the problem in the prompt.
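
A sketch of what a matching-based file-editing score could look like (my own illustration; the paper's exact scoring is not reproduced here):

# Hypothetical score: how closely does the agent's edited file match the
# known target? 1.0 means a perfect edit.
import difflib

def edit_score(agent_output: str, expected_output: str) -> float:
    return difflib.SequenceMatcher(None, agent_output, expected_output).ratio()

print(edit_score("def add(a, b):\n    return a + b\n",
                 "def add(a, b):\n    return a + b\n"))   # 1.0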

Conclusion

The SICA project has opened doors to groundbreaking advancements in AI, showcasing the potential of non-gradient-based learning, collaborative LLM systems, and the strategic use of synthetic benchmarks to guide AI focus. By iteratively improving its own code, SICA demonstrated significant performance gains, culminating in the creation of innovative tools like the AST symbol locator. As highlighted in Antonio Gulli's "Agentic Design Patterns," such agentic systems illustrate how AI can surpass human limitations, offering transformative approaches to autonomous learning and development. With robust safeguards in place, SICA sets a precedent for building safer, smarter, and more adaptive AI systems.

Agent-001 Part-3

Series

  1. Agent-001 Part-1
  2. Agent-001 Part-2
  3. Agent-001 Part-3

In the first part of this series, we explored the problem statement and how to leverage an LLM within a script. The second part covered guiding the LLM to produce structured responses and building automation around those outputs. In this post, we’ll dive into the Agentic model.

With the Agentic model, we don’t prescribe a fixed workflow. Instead, we expose a set of tools to the LLM and provide instructions on when and how to use them. The LLM can then autonomously decide which tools to invoke, in what order, and as many times as needed. Since the LLM operates independently—much like James Bond—we refer to it as an Agent.

As the developer creating these tools for the LLM, you’re essentially playing the role of Q. Pretty cool, right? 😎

The Agentic Architecture

First let's create the tools that we're going to expose to the LLM. In our case we're building two tools.

  1. Browser - browser.py
  2. Send Email - send_email.py

The Browser tool enables the LLM to fetch up-to-date information about a joke, especially when it references recent events that may not be included in the model’s training data. This helps prevent misclassification of jokes that could be offensive due to current global contexts. The LLM can invoke the browser whenever it encounters unfamiliar references.

The send-email tool is responsible for queuing emails to the outbox, and its implementation remains unchanged from the previous post. Both tools are implemented as standalone Python scripts, each accepting command-line arguments to perform their respective actions.

To facilitate integration and add input validation, we also created lightweight wrapper functions around these scripts. While not strictly required, these wrappers give developers more control over parameter handling before executing the underlying scripts.

For example, the run_browse function accepts two parameters: term (the search query) and joke (the context). It then invokes browser.py and returns the script’s output.

agent.py: run_browse
def run_browse(term: str, joke: str) -> str:
    """Invoke the browse.py tool with the search term in the context of the joke and return its stdout."""
    browser_arg = f"Define the term '{term}' in the context of this joke: '{joke}'"
    cmd = ["python", "./browser.py", browser_arg]
    logger.info("Running browse tool for term: %s", term)
    try:
        out = subprocess.check_output(
            cmd, stderr=subprocess.STDOUT, text=True, timeout=600
        )
        logger.debug("browse output: %s", out)
        return out
    except subprocess.CalledProcessError as e:
        logger.error("browse.py failed: %s", e.output)
        return ""
    except Exception:
        logger.exception("Error running browse.py")
        return ""

The send_email tool is the same as the one explained in part 2, so I won't go over it again here.

Expose the tools to the LLM

With our two functions (tools) ready, the next step is to make the LLM aware of them. There are two main ways to provide this information:

  1. Embedding tool descriptions directly in the prompt.
  2. Supplying tool definitions as part of the API call.

In this example, we use both methods. First, we enhance the SYSTEM_PROMPT with clear, unambiguous descriptions of each tool. Precise instructions are essential—any ambiguity can lead to LLM hallucinations. Here’s how we update the SYSTEM_PROMPT to include these details:

agent.py: SYSTEM_PROMPT
SYSTEM_PROMPT = f"""
    You are an helpful assistant that helps me to send a funny morning email to my colleagues.
    You will be provided with a programmer joke.
    Your task is to:
    (1) Decide the safe of the joke (safe: safe/dark/offensive).
    (2) Identify to which group the joke to be sent ({GROUPS.keys()}).
    (3) And briefly explain the joke in 1 paragraph.
    You have multiple steps to complete your task.
    IMPORTANT:
      - If there is ANY technical term you are not 100% certain about, FIRST call the `browse` tool before final JSON.
      - If safe == "safe" you MUST attempt the `send_email` tool once before giving the final JSON.
      - Final JSON ONLY after required tool usage (or explicit determination no browse needed AND email attempted when safe).
    Your final response must be a single JSON object with keys: safe (string), category (string), explanation (string) and is_email_sent (boolean).

    The category must be one of these values: system, oops, web, Other.

    Below you can find relevant keywords for each group to help you decide the correct category:
    {json.dumps({k: v["keywords"] for k, v in GROUPS.items()}, indent=4)}

    The safe value must be one of these values: safe, dark, offensive.
    The explanation must be a brief explanation of the joke.

    You have two tools in your toolbox:
    1) A `browse` tool to look up technical terms you don't understand in the context of the joke. You can use this tool to disambiguate the meaning of the joke before classifying it or deciding whether it is safe for work.
    2) An `send_email` tool to send the joke to the relevant team group once you are confident it's safe and correctly categorized.
    Use the `browse` tool first if you need to look up any terms.
    Only use the `send_email` tool once you are confident in your classification and explanation.

    If the Joke is classified as dark, store that in dark.json in the {OUTPUT_DIR} directory. This is for me to forward to my friends later in the day.

In addition to embedding tool descriptions in the prompt, we’ll also provide function-call definitions directly in the API request. Some LLM APIs may not support passing tool information via the API, in which case prompt heuristics alone are sufficient. However, OpenAI APIs allow us to specify available tools using a JSON schema. We’ll take advantage of this capability.

Let’s define a JSON structure that specifies each function’s name, type, and parameters, making them explicit to the LLM:

agent.py: FUNCTION_TOOLS
FUNCTION_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "browse",
            "description": "Look up a technical term within the context of the joke to disambiguate meaning before classification.",
            "parameters": {
                "type": "object",
                "properties": {
                    "term": {
                        "type": "string",
                        "description": "The technical term or phrase to research.",
                    },
                    "joke": {
                        "type": "string",
                        "description": "(Optional) The original joke for extra context.",
                    },
                },
                "required": ["term"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "send_email",
            "description": "Send the joke via email to the relevant team group once you are confident it's safe and correctly categorized.",
            "parameters": {
                "type": "object",
                "properties": {
                    "group_label": {
                        "type": "string",
                        "enum": ALLOWED_CATEGORIES,
                        "description": "Category/team to notify.",
                    },
                    "joke": {
                        "type": "string",
                        "description": "The original joke.",
                    },
                    "explanation": {
                        "type": "string",
                        "description": "Reason the joke is relevant and safe.",
                    },
                },
                "required": ["group_label", "joke", "explanation"],
            },
        },
    },
]

How is this information communicated to the LLM? As described in part 2, the system prompt—containing the instruction heuristics—is included in the message sequence. Additionally, the JSON construct specifying the tools is attached to the API payload when making the API call.

agent.py: classify_and_act_on_joke
        try:
            data = chat_completion(messages, tools=FUNCTION_TOOLS)
agent.py: chat_completion
    if tools:
        payload["tools"] = tools
        payload["tool_choice"] = "auto"

As shown above, when the tools argument is provided to the chat_completion function (which applies here), the API payload includes a tools key containing the JSON definition of available tools.

In summary, tool information is communicated to the LLM through both the system prompt and the tools field in the API payload.

The agentic loop

Although we've made the tools available to the LLM, it can't directly execute them—these tools exist on our local system. To bridge this gap, we need an environment where the LLM's tool invocation requests are executed and the results are returned. This orchestration happens within what’s called the agentic loop.

The agentic loop operates as follows:

  1. Make the initial LLM call, providing the problem statement and tool information.
  2. Inspect the LLM’s response for tool calls. If present, execute the requested tool and append the result to the message history.
  3. Call the LLM again with the updated messages and repeat step 2.
  4. If no tool calls are detected, consider the task complete and exit the loop.

This loop allows the LLM to function autonomously, deciding which tools to use and when, without developer intervention. The main logic is implemented in the classify_and_act_on_joke function.

To prevent the LLM from entering an infinite loop, we set a maximum number of cycles—here, 10. If the LLM doesn’t finish within these iterations, the loop exits automatically.

agent.py: classify_and_act_on_joke
    max_cycles = 10
    email_sent_flag: bool = False
    last_email_attempt_reason: str = ""
    for cycle in range(max_cycles):
        try:
            data = chat_completion(messages, tools=FUNCTION_TOOLS)

As you see above, the first LLM call is made inside the for loop. Then we capture the response and check for tool calls.

agent.py: classify_and_act_on_joke
        msg = _assistant_message(data)
        tool_calls = msg.get("tool_calls") or []
        content = msg.get("content") or ""

        # ALWAYS append assistant message so tool_call references remain valid
        messages.append(
            {k: v for k, v in msg.items() if k in ("role", "content", "tool_calls")}
        )

        if tool_calls:

When the LLM responds, any tool calls are included in a separate tool_calls key in the structured output (for OpenAI models, the main response is under content, and tool invocations are under tool_calls). We check if tool_calls is present and not empty to determine if a tool needs to be executed.

In the snippet above, the assistant's response is appended to the messages array. This step is essential because LLMs do not retain conversational context between calls. To maintain context, every message in the conversation—including the initial system_prompt, each user_prompt, and every llm_response—must be included in the messages list for each API call.

If tool calls are detected, we parse the tool call data to extract the function name and parameters, then invoke the appropriate tool with the parameters provided by the LLM.

agent.py: classify_and_act_on_joke
                fn = tc["function"]["name"]
                raw_args = tc["function"].get("arguments") or "{}"
                try:
                    args = (
                        json.loads(raw_args) if isinstance(raw_args, str) else raw_args
                    )
                except Exception:
                    args = {}
                if fn == "browse":
                    term = args.get("term", "")
                    logger.info(f" 🌐  Browsing for term: {term}")
                    tool_result = run_browse(term, joke)
                elif fn == "send_email":
                    group_label = args.get("group_label") or "Other"
                    explanation = args.get("explanation", "")
                    logger.info(f" ✉️  Sending email to group: {group_label}")
                    sent = send_email(group_label, joke, explanation)
                    tool_result = {
                        "sent": bool(sent),
                        "reason": "ok" if sent else "failed",
                    }
                    email_sent_flag = email_sent_flag or bool(tool_result.get("sent"))
                    last_email_attempt_reason = tool_result.get("reason", "")
                else:
                    tool_result = {"error": f"Unknown tool {fn}"}

The result of the tool execution is captured in the variable tool_result. Now let's append the result to the messages list as a tool message and continue the loop.

agent.py: classify_and_act_on_joke
                messages.append(
                    {
                        "role": "tool",
                        "tool_call_id": tc.get("id"),
                        "name": fn,
                        "content": tool_result
                        if isinstance(tool_result, str)
                        else json.dumps(tool_result),
                    }
                )
            continue  # next cycle after tools

This loop runs until the LLM stops making tool calls or the maximum number of cycles is exhausted. You can find the full code at the bottom of the page.
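
Putting the fragments together, the skeleton of the loop looks like this (a condensed paraphrase of classify_and_act_on_joke, not the verbatim code; the full listing at the bottom adds logging, error handling, and the email bookkeeping, and joke is the function's input):

# Condensed agentic loop, paraphrased from classify_and_act_on_joke.
messages = [{"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": joke}]

for cycle in range(10):                                   # max_cycles guard
    data = chat_completion(messages, tools=FUNCTION_TOOLS)
    msg = _assistant_message(data)
    messages.append({k: v for k, v in msg.items()
                     if k in ("role", "content", "tool_calls")})

    tool_calls = msg.get("tool_calls") or []
    if not tool_calls:                                    # nothing requested:
        result = _extract_json(msg.get("content") or "")  # final JSON, we're done
        break

    for tc in tool_calls:                                 # execute each requested tool
        fn = tc["function"]["name"]
        args = json.loads(tc["function"].get("arguments") or "{}")
        if fn == "browse":
            out = run_browse(args.get("term", ""), joke)
        else:
            out = send_email(args.get("group_label", "Other"), joke,
                             args.get("explanation", ""))
        messages.append({"role": "tool", "tool_call_id": tc.get("id"), "name": fn,
                         "content": out if isinstance(out, str) else json.dumps(out)})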

The Agent Architecture

We now have a fully functional agent. Let’s break down the core components that make up this architecture:

  1. Tool Implementations: These are standalone utilities that the LLM can invoke. Any command-line tool that a human could use can be exposed to the LLM, though in this example we focus on non-interactive tools. If you wish to support interactive tools (like vim), you’ll need to simulate user interaction within your execution environment, typically by leveraging LLM APIs to handle the input/output flow.
  2. Tool Awareness: The LLM needs to know what tools are available. In our example, we provided this information through both prompt heuristics (in the system prompt) and a tool definition in JSON included as part of the API payload.
  3. Execution Environment: This is where the LLM’s tool invocation requests are executed. In our case, we ran commands directly on the local system. However, for safety, production systems typically use a sandbox environment with only the necessary tools and data.
  4. LLM Model: Here, we used GPT-5 from Azure OpenAI as the reasoning engine.
  5. Agent Loop: This is the main interaction point between the LLM and the environment. The loop orchestrates the conversation, tool calls, and result handling. In fact, the agent loop itself can be considered the core of the agent, with the other components serving as supporting structures. As mentioned earlier, this loop can be implemented in under 100 lines of code.

Together, these components form what’s often called agent scaffolding. There’s no universal best approach—scaffolding should be tailored to the specific task for optimal results. Designing effective scaffolding is as much an art as it is engineering, and it’s a key skill for agentic developers.

Conclusion

Series

  1. Agent-001 Part-1
  2. Agent-001 Part-2
  3. Agent-001 Part-3

Thank you for joining me on this three-part journey into building agentic systems with LLMs. In the first post, we explored the foundational problem and learned how to integrate an LLM into a script to process and analyze data. The second part focused on guiding the LLM to produce structured outputs and demonstrated how to automate actions based on those outputs, laying the groundwork for more complex workflows. In this final installment, we delved into the agentic model, where the LLM is empowered to autonomously select and invoke tools, orchestrated through an agentic loop.

Throughout the series, we covered key concepts such as tool creation, prompt engineering, exposing tool definitions to the LLM, and managing the agentic loop for autonomous decision-making. By combining these elements, you can build flexible, powerful agents capable of handling a wide range of tasks with minimal intervention.

I hope this series has provided you with both the technical know-how and the inspiration to experiment with agentic architectures in your own projects. Thank you for reading, and best of luck on your agentic endeavors—may your agents be resourceful, reliable, and always ready for the next challenge!

Code

agent.py

agent.py
import os
import sys
import json
import time
import logging
import subprocess
import datetime
import glob
import signal
import re
from pathlib import Path
from typing import Dict, Any, Optional

from dotenv import load_dotenv

load_dotenv()

import requests

from datetime import datetime, timezone

OUTPUT_DIR = Path("/tmp/agent-001/")
STATE_FILE = OUTPUT_DIR / "state.json"
DARK_FILE = OUTPUT_DIR / "dark.json"

# Azure OpenAI settings - must be provided as environment variables
AZURE_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT")
AZURE_KEY = os.environ.get("AZURE_OPENAI_API_KEY")
AZURE_DEPLOYMENT = os.environ.get("AZURE_OPENAI_DEPLOYMENT", "gpt-4.1")
API_VERSION = os.environ.get("AZURE_OPENAI_API_VERSION", "2024-12-01-preview")

# Groups mapping (labels expected from the model)
GROUPS = {
    "system": {
        "email": "system@example.com",
        "description": "OS and Platform developers, System administrators and DevOps team",
        "keywords": [
            "operating systems",
            "Linux",
            "Unix",
            "Windows",
            "macOS",
            "DevOps",
            "SysAdmin",
            "infrastructure",
            "cloud",
            "virtualization",
            "containers",
            "Kubernetes",
            "networking",
        ],
    },
    "oops": {
        "email": "oops@example.com",
        "description": "Application and services developers",
        "keywords": [
            "application",
            "services",
            "java",
            "python",
            "c#",
            "go",
            "ruby",
            "php",
            "node.js",
            "dotnet",
            "API",
            "microservices",
            "REST",
            "SOAP",
        ],
    },
    "web": {
        "email": "web-team@example.com",
        "description": "Web technology, front-end, back-end, react, angular, javascript, css developers",
        "keywords": [
            "Web technology",
            "front-end",
            "back-end",
            "react",
            "angular",
            "javascript",
            "css",
            "HTML",
            "web development",
            "UX",
            "UI",
            "web design",
            "web frameworks",
        ],
    },
    "Other": {
        "email": "all@example.com",
        "description": "Everything else, general audience",
        "keywords": [],
    },
}
ALLOWED_CATEGORIES = list(GROUPS.keys())

SYSTEM_PROMPT = f"""
    You are an helpful assistant that helps me to send a funny morning email to my colleagues.
    You will be provided with a programmer joke.
    Your task is to:
    (1) Decide the safe of the joke (safe: safe/dark/offensive).
    (2) Identify to which group the joke to be sent ({GROUPS.keys()}).
    (3) And briefly explain the joke in 1 paragraph.
    You have multiple steps to complete your task.
    IMPORTANT:
      - If there is ANY technical term you are not 100% certain about, FIRST call the `browse` tool before final JSON.
      - If safe == "safe" you MUST attempt the `send_email` tool once before giving the final JSON.
      - Final JSON ONLY after required tool usage (or explicit determination no browse needed AND email attempted when safe).
    Your final response must be a single JSON object with keys: safe (string), category (string), explanation (string) and is_email_sent (boolean).

    The category must be one of these values: system, oops, web, Other.

    Below you can find relevant keywords for each group to help you decide the correct category:
    {json.dumps({k: v["keywords"] for k, v in GROUPS.items()}, indent=4)}

    The safe value must be one of these values: safe, dark, offensive.
    The explanation must be a brief explanation of the joke.

    You have two tools in your toolbox:
    1) A `browse` tool to look up technical terms you don't understand in the context of the joke. You can use this tool to disambiguate the meaning of the joke before classifying it or deciding whether it is safe for work.
    2) An `send_email` tool to send the joke to the relevant team group once you are confident it's safe and correctly categorized.
    Use the `browse` tool first if you need to look up any terms.
    Only use the `send_email` tool once you are confident in your classification and explanation.

    If the Joke is classified as dark, store that in dark.json in the {OUTPUT_DIR} directory. This is for me to forward to my friends later in the day.
"""

# Define tool (function) schemas for GPT-4.1 function calling
FUNCTION_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "browse",
            "description": "Look up a technical term within the context of the joke to disambiguate meaning before classification.",
            "parameters": {
                "type": "object",
                "properties": {
                    "term": {
                        "type": "string",
                        "description": "The technical term or phrase to research.",
                    },
                    "joke": {
                        "type": "string",
                        "description": "(Optional) The original joke for extra context.",
                    },
                },
                "required": ["term"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "send_email",
            "description": "Send the joke via email to the relevant team group once you are confident it's safe and correctly categorized.",
            "parameters": {
                "type": "object",
                "properties": {
                    "group_label": {
                        "type": "string",
                        "enum": ALLOWED_CATEGORIES,
                        "description": "Category/team to notify.",
                    },
                    "joke": {
                        "type": "string",
                        "description": "The original joke.",
                    },
                    "explanation": {
                        "type": "string",
                        "description": "Reason the joke is relevant and safe.",
                    },
                },
                "required": ["group_label", "joke", "explanation"],
            },
        },
    },
]

# Ensure directories exist
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("agent")


def load_state() -> Dict[str, Any]:
    if STATE_FILE.exists():
        try:
            return json.loads(STATE_FILE.read_text(encoding="utf-8"))
        except Exception:
            logger.exception("Failed to load state file, starting fresh")
    # default state
    return {"processed": {}, "last_sent": {}}


def save_state(state: Dict[str, Any]) -> None:
    STATE_FILE.write_text(json.dumps(state, indent=2), encoding="utf-8")


def _extract_json(text: str) -> Optional[dict]:
    """Try to extract the first JSON object from a text blob."""
    try:
        return json.loads(text)
    except Exception:
        m = re.search(r"\{.*\}", text, re.S)
        if m:
            try:
                return json.loads(m.group(0))
            except Exception:
                return None
    return None


def chat_completion(
    messages, tools=None, temperature=0.0, max_tokens=800
) -> Dict[str, Any]:
    """Call Azure OpenAI chat completion returning the full JSON, supporting tool (function) calls."""
    time.sleep(3 + (2 * os.urandom(1)[0] / 255.0))  # jitter
    if not AZURE_ENDPOINT or not AZURE_KEY:
        raise RuntimeError(
            "Azure OpenAI credentials (AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_KEY) not set"
        )

    url = f"{AZURE_ENDPOINT}/openai/deployments/{AZURE_DEPLOYMENT}/chat/completions?api-version={API_VERSION}"
    headers = {"Content-Type": "application/json", "api-key": AZURE_KEY}
    payload: Dict[str, Any] = {
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    if tools:
        payload["tools"] = tools
        payload["tool_choice"] = "auto"
    resp = requests.post(url, headers=headers, json=payload, timeout=90)
    if resp.status_code >= 400:
        logger.error(
            "Azure OpenAI 4xx/5xx response %s: %s", resp.status_code, resp.text
        )
        resp.raise_for_status()
    return resp.json()


def _assistant_message(data: Dict[str, Any]) -> Dict[str, Any]:
    try:
        return data["choices"][0]["message"]
    except Exception:
        raise RuntimeError(f"Unexpected response format: {data}")


def _parse_final_json(content: str) -> Optional[Dict[str, Any]]:
    obj = _extract_json(content)
    if not obj:
        return None
    # Minimal validation (is_email_sent may be absent; we'll add later)
    required = {"safe", "category", "explanation"}
    if not required.issubset(obj.keys()):
        return None
    if obj.get("category") not in GROUPS.keys():
        return None
    if obj.get("safe") not in {"safe", "dark", "offensive"}:
        return None
    return obj


def _append_dark_joke(joke: str, parsed: Dict[str, Any]) -> None:
    """Persist dark jokes to DARK_FILE as an array of entries."""
    try:
        if DARK_FILE.exists():
            arr = json.loads(DARK_FILE.read_text(encoding="utf-8"))
            if not isinstance(arr, list):  # recover if corrupted
                arr = []
        else:
            arr = []
        arr.append(
            {
                "joke": joke,
                "ts": datetime.now(timezone.utc).isoformat(),
                "explanation": parsed.get("explanation", ""),
            }
        )
        DARK_FILE.write_text(json.dumps(arr, indent=2), encoding="utf-8")
    except Exception:
        logger.exception("Failed to append dark joke to %s", DARK_FILE)


def classify_and_act_on_joke(joke: str, state: Dict[str, Any]) -> Dict[str, Any]:
    """Tool (function) calling loop with GPT-4.1 until final JSON classification.

    Guarantees:
      * If classification is safe, an email attempt is performed (tool call or forced local send) before returning.
      * If classification is dark, joke is stored in dark.json.
      * Adds is_email_sent boolean to final JSON.
    """
    messages: list[dict[str, Any]] = [
        {"role": "system", "content": f"{SYSTEM_PROMPT}"},
        {"role": "user", "content": f"joke: {joke}"},
    ]

    max_cycles = 10
    email_sent_flag: bool = False
    last_email_attempt_reason: str = ""
    for cycle in range(max_cycles):
        try:
            data = chat_completion(messages, tools=FUNCTION_TOOLS)
        except Exception:
            logger.exception("chat_completion failed")
            time.sleep(5)
            continue
        msg = _assistant_message(data)
        tool_calls = msg.get("tool_calls") or []
        content = msg.get("content") or ""

        # ALWAYS append assistant message so tool_call references remain valid
        messages.append(
            {k: v for k, v in msg.items() if k in ("role", "content", "tool_calls")}
        )

        if tool_calls:
            for tc in tool_calls:
                if tc.get("type") != "function":
                    continue
                fn = tc["function"]["name"]
                raw_args = tc["function"].get("arguments") or "{}"
                try:
                    args = (
                        json.loads(raw_args) if isinstance(raw_args, str) else raw_args
                    )
                except Exception:
                    args = {}
                if fn == "browse":
                    term = args.get("term", "")
                    logger.info(f" 🌐  Browsing for term: {term}")
                    tool_result = run_browse(term, joke)
                elif fn == "send_email":
                    group_label = args.get("group_label") or "Other"
                    explanation = args.get("explanation", "")
                    logger.info(f" ✉️  Sending email to group: {group_label}")
                    sent = send_email(group_label, joke, explanation)
                    tool_result = {
                        "sent": bool(sent),
                        "reason": "ok" if sent else "failed",
                    }
                    email_sent_flag = email_sent_flag or bool(tool_result.get("sent"))
                    last_email_attempt_reason = tool_result.get("reason", "")
                else:
                    tool_result = {"error": f"Unknown tool {fn}"}
                messages.append(
                    {
                        "role": "tool",
                        "tool_call_id": tc.get("id"),
                        "name": fn,
                        "content": tool_result
                        if isinstance(tool_result, str)
                        else json.dumps(tool_result),
                    }
                )
            continue  # next cycle after tools

        if content:
            parsed = _parse_final_json(content)
            if parsed:
                # Enforce side-effects BEFORE returning.
                if parsed["safe"] == "safe" and not email_sent_flag:
                    # Model skipped tool call; perform mandatory send_email now.
                    group_label = parsed.get("category", "Other")
                    explanation = parsed.get("explanation", "")
                    sent = send_email(group_label, joke, explanation)

                if parsed["safe"] == "dark":
                    _append_dark_joke(joke, parsed)

                parsed["is_email_sent"] = bool(email_sent_flag)
                if email_sent_flag and not parsed["explanation"]:
                    parsed["explanation"] = parsed.get(
                        "explanation", "Sent without explanation provided"
                    )
                logging.info(" ✅  Task complete")
                logging.info(f"joke: {joke}")
                logging.info(f"safe: {parsed['safe']}")
                logging.info(f"category: {parsed['category']}")
                if parsed["safe"] == "safe":
                    logging.info(
                        "email_sent=%s reason=%s",
                        parsed["is_email_sent"],
                        last_email_attempt_reason,
                    )
                time.sleep(1)
                return parsed
            else:
                messages.append(
                    {
                        "role": "user",
                        "content": "Return only the final JSON object now.",
                    }
                )
                continue

    logger.warning(
        "Exceeded max tool cycles without valid final JSON; returning fallback"
    )
    return {
        "safe": "dark",
        "category": "Other",
        "explanation": "Model failed to return final JSON in time",
        "is_email_sent": email_sent_flag,
    }


def run_browse(term: str, joke: str) -> str:
    """Invoke the browse.py tool with the search term in the context of the joke and return its stdout."""
    browser_arg = f"Define the term '{term}' in the context of this joke: '{joke}'"
    cmd = ["python", "./browser.py", browser_arg]
    logger.info("Running browse tool for term: %s", term)
    try:
        out = subprocess.check_output(
            cmd, stderr=subprocess.STDOUT, text=True, timeout=600
        )
        logger.debug("browse output: %s", out)
        return out
    except subprocess.CalledProcessError as e:
        logger.error("browse.py failed: %s", e.output)
        return ""
    except Exception:
        logger.exception("Error running browse.py")
        return ""


def send_email(group_label: str, joke: str, explanation: str) -> bool:
    """Call send_email.py tool. group_label must be one of GROUPS keys."""
    group_email = GROUPS.get(group_label, GROUPS["Other"])["email"]
    # Use current interpreter for portability (virtualenv compatibility)
    cmd = [sys.executable, "send_email.py", group_email, joke, explanation]
    logger.info("Sending email to %s for group %s", group_email, group_label)
    try:
        subprocess.check_call(cmd)
        return True
    except subprocess.CalledProcessError:
        logger.exception("send_email.py returned non-zero")
        return False
    except Exception:
        logger.exception("Error running send_email.py")
        return False


def process_joke_file(path: Path, state: Dict[str, Any]) -> None:
    logger.info("\n\n*** ***")
    logger.info("Processing joke file: %s", path)
    joke = path.read_text(encoding="utf-8").strip()
    file_id = path.name

    if file_id in state.get("processed", {}):
        logger.info("Already processed %s, skipping", file_id)
        return

    try:
        result = classify_and_act_on_joke(joke, state)
    except Exception:
        logger.exception("LLM tool-driven processing failed for %s", file_id)
        sys.exit(1)
        # result = {"safe": False, "category": "Other", "explanation": "LLM error"}

    # Mark processed
    state.setdefault("processed", {})[file_id] = {
        "agent": "003",
        "joke": joke,
        "processed_at": datetime.now(timezone.utc).isoformat(),
        "result": result,
    }
    save_state(state)


def main_loop(poll_interval: int = 60):
    state = load_state()
    logger.info("Agent started, watching %s", OUTPUT_DIR)

    while True:
        txt_files = sorted(glob.glob(str(OUTPUT_DIR / "*.txt")))
        for f in txt_files:
            process_joke_file(Path(f), state)
            # return
        # Sleep and be responsive to shutdown
        for _ in range(int(poll_interval)):
            time.sleep(1)


if __name__ == "__main__":
    main_loop()

Agent-001 Part-2

Series

  1. Agent-001 Part-1
  2. Agent-001 Part-2
  3. Agent-001 Part-3

In the first part of this series, we explored the programming jokes API and built a simple automation to extract the meaning of each joke. In this part, we'll automate the cultural-appropriateness check and email notifications using an LLM.

Developers prefer structured data because it's machine-readable and easy to automate. However, LLMs are primarily designed for conversational, natural language output. With the increasing use of LLMs in programming and automation, model providers have started prioritizing structured outputs for developers. For instance, starting with GPT-4, OpenAI has trained its models to follow user instructions more strictly.

For more details on how OpenAI improved programmer workflows in GPT-5, see my earlier blog: GPT-5 for Programmers.

We'll take advantage of this by instructing the LLM to respond in a structured JSON format. Since we're asking for the meaning of multiple jokes, it's best to separate the instructions for output structure from the actual jokes. The output instructions are generic, while the jokes vary each time. Mixing both in a single prompt would generate unique text combinations, reducing the effectiveness of the KV cache. Therefore, we'll place the output instructions in a special prompt known as the system prompt and the jokes in the user prompt. Here's how we construct our system prompt:

automate_with_ai.py: SYSTEM_PROMPT
SYSTEM_PROMPT = (
    "You are an helpful assistant that explains a programmer joke and identify whether it is culturally appropriate to be shared in a professional office environment.\n"
    "Goals:\n"
    "(1) Decide whether the joke is funny or not (funny: true/false).\n"
    "(2) Categorize the joke into one of these categories: 'Safe for work', 'Offensive', 'Dark humor'.\n"
    "(3) And briefly explain the joke in 1 paragraph.\n"
    "Your response must be a single JSON object with keys: funny (bool), category (string), explanation (string).\n"
)

As shown above, we delegate the task of determining whether a joke is funny and appropriate for the workplace to the LLM itself. Crucially, we instruct the LLM to return its output strictly in JSON format.
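
To make this concrete, a reply for the classic joke "There are only 10 kinds of people in the world: those who understand binary and those who don't" might look like the following (illustrative, not actual model output; only the keys are mandated by the system prompt):

{
    "funny": true,
    "category": "Safe for work",
    "explanation": "The joke reads the number '10' in binary, where it equals two, so the sentence really describes only two kinds of people: those who understand binary and those who don't."
}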

Then, in our process_joke_file function, we make two modifications:

  1. Include the system prompt in the message
  2. Parse the LLM output as a JSON

automate_with_ai.py: process_joke_file
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"joke: `{joke}`"},
        ]
        response = chat_completion(messages)["choices"][0]["message"]["content"]
        result = _parse_final_json(response)

We have also created an external script, send_email.py (full code available at the end of this post). It takes the recipient group, the joke, and its explanation as arguments and queues the email in an outbox file. The send_email function in our code is responsible for invoking this script.
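
Each queued message ends up as one entry in /tmp/agent-001/outbox.json. A single record looks roughly like this (the field names follow the send_email.py script shown at the end of this post; the values are illustrative):

{
    "to": "all@example.com",
    "joke": "There are only 10 kinds of people in the world: those who understand binary and those who don't.",
    "explanation": "The joke reads '10' as a binary number, which equals two.",
    "ts": "2025-01-01T08:00:00+00:00"
}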

Since the LLM now returns structured JSON output, we can easily inspect its response and, based on its assessment, call the send_email function directly from our code.

automate_with_ai.py: process_joke_file
        result = _parse_final_json(response)

        if result['funny'] and result['category'] == 'Safe for work':
            # Send email
            if send_email(joke, result['explanation']):
                logger.info("Email sent for joke %s", file_id)
            else:
                logger.error("Failed to send email for joke %s", file_id)

Conclusion

Series

  1. Agent-001 Part-1
  2. Agent-001 Part-2
  3. Agent-001 Part-3

In this post, we took a significant step forward by automating the evaluation of jokes for cultural appropriateness and streamlining the email sending process. By leveraging the LLM’s ability to return structured JSON, we eliminated the need for tedious manual checks and made it straightforward to plug the model’s output directly into our automation pipeline. This approach not only saves time but also reduces the risk of human error.

Yet, it’s important to recognize that what we’ve built so far is still traditional automation. The LLM serves as a smart evaluator, but all the decision-making logic and possible actions are hardcoded by us. The workflow is predictable and limited to the scenarios we’ve anticipated.

But what if the LLM could do more than just provide information? Imagine a system where the LLM can actively decide which actions to take, adapt to new situations, and orchestrate workflows on its own. This is the promise of agentic workflows—where the LLM becomes an autonomous agent, capable of selecting from a toolkit of actions and dynamically shaping the automation process.

In the next part of this series, we’ll dive into building such agentic systems. We’ll explore how to empower LLMs to not just inform, but to act—unlocking a new level of flexibility and intelligence in automation.

Complete code

automate_with_ai.py

import os
import sys
import json
import time
import logging
import datetime
import glob
import signal
import subprocess
import re
from pathlib import Path
from typing import Dict, Any, Optional

from dotenv import load_dotenv
load_dotenv()

import requests

OUTPUT_DIR = Path("/tmp/agent-001/")
STATE_FILE = OUTPUT_DIR / "state.json"

# Azure OpenAI settings - must be provided as environment variables
AZURE_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT")
AZURE_KEY = os.environ.get("AZURE_OPENAI_API_KEY")
AZURE_DEPLOYMENT = os.environ.get("AZURE_OPENAI_DEPLOYMENT", "gpt-4.1")
API_VERSION = os.environ.get("AZURE_OPENAI_API_VERSION", "2024-12-01-preview")

# Ensure directories exist
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("agent")

shutdown_requested = False


def _signal_handler(signum, frame):
    global shutdown_requested
    logger.info("Signal %s received, will shut down gracefully", signum)
    shutdown_requested = True


signal.signal(signal.SIGINT, _signal_handler)
signal.signal(signal.SIGTERM, _signal_handler)


def load_state() -> Dict[str, Any]:
    if STATE_FILE.exists():
        try:
            return json.loads(STATE_FILE.read_text(encoding="utf-8"))
        except Exception:
            logger.exception("Failed to load state file, starting fresh")
    # default state
    return {"processed": {}, "last_sent": {}}


def save_state(state: Dict[str, Any]) -> None:
    STATE_FILE.write_text(json.dumps(state, indent=2), encoding="utf-8")


SYSTEM_PROMPT = (
    "You are an helpful assistant that explains a programmer joke and identify whether it is culturally appropriate to be shared in a professional office environment.\n"
    "Goals:\n"
    "(1) Decide whether the joke is funny or not (funny: true/false).\n"
    "(2) Categorize the joke into one of these categories: 'Safe for work', 'Offensive', 'Dark humor'.\n"
    "(3) And briefly explain the joke in 1 paragraph.\n"
    "Your response must be a single JSON object with keys: funny (bool), category (string), explanation (string).\n"
)


def _extract_json(text: str) -> Optional[dict]:
    """Try to extract the first JSON object from a text blob."""
    try:
        return json.loads(text)
    except Exception:
        m = re.search(r"\{.*\}", text, re.S)
        if m:
            try:
                return json.loads(m.group(0))
            except Exception:
                return None
    return None


def chat_completion(messages, tools=None, temperature=0.0, max_tokens=800) -> Dict[str, Any]:
    """Call Azure OpenAI chat completion returning the full JSON, supporting tool (function) calls."""
    # Random jitter 3-5s to reduce rate spikes
    time.sleep(3 + (2 * os.urandom(1)[0] / 255.0))

    if not AZURE_ENDPOINT or not AZURE_KEY:
        raise RuntimeError("Azure OpenAI credentials (AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_KEY) not set")

    url = f"{AZURE_ENDPOINT}/openai/deployments/{AZURE_DEPLOYMENT}/chat/completions?api-version={API_VERSION}"
    headers = {
        "Content-Type": "application/json",
        "api-key": AZURE_KEY,
    }
    payload: Dict[str, Any] = {
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    if tools:
        payload["tools"] = tools
        payload["tool_choice"] = "auto"
    resp = requests.post(url, headers=headers, json=payload, timeout=90)
    resp.raise_for_status()
    return resp.json()


def _parse_final_json(content: str) -> Optional[Dict[str, Any]]:
    obj = _extract_json(content)
    if not obj:
        return None
    # Minimal validation against the keys requested in SYSTEM_PROMPT
    if not {"funny", "category", "explanation"}.issubset(obj.keys()):
        logger.warning("LLM response is missing expected keys: %s", list(obj.keys()))
    return obj  # return anyway; caller can decide


def send_email(joke: str, explanation: str) -> bool:
    group_email = "all@example.com"
    cmd = [sys.executable, "send_email.py", group_email, joke, explanation]
    logger.info("Sending email to %s with joke", group_email)
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            logger.error("Failed to send email: %s", result.stderr)
            return False
        logger.info("Email sent successfully")
        return True
    except Exception as e:
        logger.exception("Exception while sending email: %s", e)
        return False


def process_joke_file(path: Path, state: Dict[str, Any]) -> None:
    logger.info("Processing joke file: %s", path)
    joke = path.read_text(encoding="utf-8").strip()
    file_id = path.name

    if file_id in state.get("processed", {}):
        logger.info("Already processed %s, skipping", file_id)
        return

    try:
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"joke: `{joke}`"},
        ]
        response = chat_completion(messages)["choices"][0]["message"]["content"]
        result = _parse_final_json(response)

        if result['funny'] and result['category'] == 'Safe for work':
            # Send email
            if send_email(joke, result['explanation']):
                logger.info("Email sent for joke %s", file_id)
            else:
                logger.error("Failed to send email for joke %s", file_id)

    except Exception as e:
        logger.exception("LLM tool-driven processing failed for %s\nException: %s", file_id, e)
        sys.exit(1)

    # Mark processed
    state.setdefault("processed", {})[file_id] = {
        "agent": "002",
        "joke": joke,
        "processed_at": datetime.datetime.utcnow().isoformat(),
        "funny": result["funny"],
        "explanation": result["explanation"],
        "category": result["category"],
    }
    save_state(state)


def main_loop(poll_interval: int = 60):
    state = load_state()
    logger.info("Agent started, watching %s", OUTPUT_DIR)

    while not shutdown_requested:
        txt_files = sorted(glob.glob(str(OUTPUT_DIR / "*.txt")))
        for f in txt_files:
            if shutdown_requested:
                break
            process_joke_file(Path(f), state)
        # Sleep and be responsive to shutdown
        for _ in range(int(poll_interval)):
            if shutdown_requested:
                break
            time.sleep(1)

    logger.info("Agent shutting down")


if __name__ == "__main__":
    main_loop()

send_email.py

#!/usr/bin/env python3
import sys
import json
import logging
from pathlib import Path
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("send_email")

OUTBOX = Path("/tmp/agent-001/outbox.json")
OUTBOX.parent.mkdir(parents=True, exist_ok=True)


def main():
    if len(sys.argv) < 4:
        print("Usage: send_email.py <to_group> <joke> <explanation>")
        sys.exit(2)
    to_group = sys.argv[1]
    joke = sys.argv[2]
    explanation = sys.argv[3]

    # Append the message to the outbox file as a record
    record = {"to": to_group, "joke": joke, "explanation": explanation, "ts": datetime.now(timezone.utc).isoformat()}
    if OUTBOX.exists():
        arr = json.loads(OUTBOX.read_text(encoding="utf-8"))
    else:
        arr = []
    arr.append(record)
    OUTBOX.write_text(json.dumps(arr, indent=2), encoding="utf-8")
    logger.info("Queued email to %s", to_group)


if __name__ == "__main__":
    main()