Beyond the Chat Box: The Protect Agent — moderation that doesn't make you choose between safety and speed

ReactLive - The Protect Agent

Post five of a series on what live audience engagement actually looks like when there's an AI team in the room.


There's a question every host running an internal event has had to answer at least once.

Should we turn on anonymous Q&A?

The case for is obvious. Anonymous Q&A is where the real questions live. The ones people are too junior, too new, or too cautious to put their name to. The question about pay equity. The question about whether the layoffs are over. The question about why the strategy changed. These are the questions that, if answered well in front of the whole company, build trust — and if dodged or never asked, corrode it.

The case against is also obvious. Anonymous Q&A is where the abuse lives. The personal attack on a manager. The slur. The forty-character rant about the CEO. The submission that's just "you're all terrible." The feed that turns into something the comms team has to apologise for the next morning.

For a decade, the standard answer to this question has been: pre-moderate everything. Every submission goes through a human moderator before it appears in the feed. Safety problem solved. But you've also killed the event. The questions that were timely two minutes ago appear five minutes later, when the speaker has moved on. The audience watches an empty Q&A panel for the first ten minutes while a moderator chews through the backlog. The "live" event isn't live anymore.

The other answer was: post-moderate. Submissions go straight to the feed; moderators remove abuse after the fact. That keeps the event live, but the slur sits on the wall until someone catches it, and screenshots travel faster than moderation does.

Hosts have been stuck choosing between safety and speed for as long as anonymous Q&A has existed.

The Protect Agent is built to dissolve that choice.

The third option

Real-time content classification, running on every submission, before it hits the feed.

The Protect Agent reads each incoming submission against the event's policies and produces a classification: clean, suspicious, or actively abusive. Clean submissions go through. Actively abusive ones get auto-blocked. Suspicious ones get queued for human review.
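
As a rough illustration, the three-way routing described above can be sketched in a few lines of Python. The names here (`Verdict`, `Action`, `route`) are invented for this post and are not the Protect Agent's actual API; the sketch only shows the shape of the decision, not how the classification itself is produced.

```python
from enum import Enum


class Verdict(Enum):
    """The three classifications named above."""
    CLEAN = "clean"
    SUSPICIOUS = "suspicious"
    ABUSIVE = "abusive"


class Action(Enum):
    """What happens to a submission once it has a verdict."""
    PUBLISH = "publish"   # goes straight to the feed
    REVIEW = "review"     # queued for a human moderator
    BLOCK = "block"       # never reaches the feed


# Hypothetical lookup table: one action per verdict.
ROUTES = {
    Verdict.CLEAN: Action.PUBLISH,
    Verdict.SUSPICIOUS: Action.REVIEW,
    Verdict.ABUSIVE: Action.BLOCK,
}


def route(verdict: Verdict) -> Action:
    """Map a classification to the next step for the submission."""
    return ROUTES[verdict]
```

The point of the table is that only one of the three outcomes ever reaches a human.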

The host doesn't choose between pre-moderation and post-moderation anymore. They get a hybrid that's faster than pre-moderation and safer than post-moderation, because the bulk of the work — the obvious cases — is handled at machine speed, and the human moderator only sees the edge cases that actually need judgment.

That's the architectural shift. Now the details.

What the Protect Agent does

The Protect Agent has four core skills:

Content classification. Spam, abuse, off-topic. Each submission gets evaluated in milliseconds against the policies the event is running. Spam is the easiest call — repetitive submissions, obvious bot patterns, link spam. Abuse is harder: the agent has to distinguish between a legitimately harsh question ("why did you take the bonus pool to zero?") and an actual personal attack ("[CEO name] is a [slur]"). Off-topic is judged against the event's agenda and uploaded context — a question about the holiday party at a quarterly all-hands isn't abuse, but it doesn't belong in this room.

Risk detection. Beyond abuse, there are submissions that are problematic for other reasons. Submissions that would leak material non-public information at a regulated event. Submissions that name specific employees in ways that need HR involvement. Submissions that look like they're from outside the intended audience. The agent flags these for the moderator without making the call to publish.

Policy enforcement. Every event has its own rules. Some events allow rough language; others don't. Some events welcome questions about competitors; some don't. Some events are explicitly about politics; most aren't. The Protect Agent applies the policies the event is configured with — defined in SOUL.md and the event setup — rather than applying a one-size-fits-all moderation policy.

Real-time filtering. Speed is the whole point. A moderation system that catches everything but adds five seconds of latency between submission and feed is functionally useless for a live event, because the audience experience breaks. The Protect Agent runs synchronously with the submission flow and returns its decision in under a second, so clean questions appear in the feed effectively instantly.

The output is a feed where the audience sees the legitimate questions, the moderator sees only the calls that actually need a human, and the abuse never sees daylight at all.
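
To make the policy enforcement skill concrete, here is a minimal sketch of per-event rules, assuming each submission has already been tagged with topic labels. Everything in it is hypothetical: the `EventPolicy` fields, the label names, and the two example events are made up for illustration, and it says nothing about how SOUL.md is actually structured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventPolicy:
    """Hypothetical per-event switches, one per rule mentioned above."""
    allow_rough_language: bool = False
    allow_competitor_questions: bool = True
    allow_politics: bool = False


def policy_violations(labels: set[str], policy: EventPolicy) -> set[str]:
    """Return the labels on a submission that this event does not allow."""
    disallowed = set()
    if not policy.allow_rough_language:
        disallowed.add("rough_language")
    if not policy.allow_competitor_questions:
        disallowed.add("competitors")
    if not policy.allow_politics:
        disallowed.add("politics")
    return labels & disallowed


# Two events, two rule sets. Nothing is shared between them.
board_meeting = EventPolicy()
customer_ama = EventPolicy(allow_rough_language=True,
                           allow_competitor_questions=False)
```

The same submission can be fine in one room and out of bounds in another: a question tagged `rough_language` violates the `board_meeting` policy here but not the `customer_ama` one.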

The dial, applied to moderation

In the earlier post on Off, Suggest, Assist, and Auto, we walked through how the four strength levels work in general. They map especially cleanly onto moderation, because the cost-benefit of each level is obvious.

Off. No AI filtering. The moderator sees every submission and decides what publishes. Useful when the audience is small and trusted, the moderator wants full control, or the event is so high-stakes that nothing goes live without human review.

Suggest. The agent flags risky and spam content. The moderator decides what to do. The moderator's queue has the agent's confidence and reasoning attached to each flag, so they can make calls quickly. This is the right setting for a host's first event — they get to see what the agent would have caught without committing to its judgment yet.

Assist. Auto-blocks obvious spam and abuse. Flags edge cases for review. This is the default for most events once a host has run the agent in Suggest mode and confirmed it's calibrated correctly. It removes 80–90% of moderation work without putting any judgment calls on autopilot.

Auto. Fully moderates. Blocks, filters, and prioritises automatically. The agent handles the live moderation on its own, while ambiguous submissions still wait for human review rather than being auto-blocked. The moderator can override anything at any time, but doesn't have to be in the queue to keep the event running cleanly. This is the setting for events where the host has run this configuration enough times to trust the agent's calibration.

The crucial product decision: Auto on Protect doesn't mean blocking aggressively. It means acting autonomously within the rules you set. A conservative policy on Auto is still a conservative policy — the agent just enforces it without waiting for permission. Hosts who want a low intervention rate can have it, on Auto, by configuring the policy to be lenient. The dial controls the autonomy, not the aggression.
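
One way to picture the dial is as a function of two inputs: the strength level and the agent's verdict. The sketch below is an assumption-heavy simplification, not the product's logic. In particular, it assumes clean submissions publish immediately at every level except Off, and it treats Assist and Auto identically because the differences described above (automatic prioritisation, and not needing a moderator in the queue) are not modelled.

```python
from enum import Enum


class Level(Enum):
    OFF = "off"
    SUGGEST = "suggest"
    ASSIST = "assist"
    AUTO = "auto"


def act(level: Level, verdict: str) -> str:
    """Return 'publish', 'review', or 'block'.

    verdict is one of 'clean', 'suspicious', 'abusive' (hypothetical labels).
    """
    if verdict not in {"clean", "suspicious", "abusive"}:
        raise ValueError(f"unknown verdict: {verdict!r}")
    if level is Level.OFF:
        # No AI filtering: the moderator sees every submission.
        return "review"
    if verdict == "clean":
        # Assumption: clean submissions go straight to the feed.
        return "publish"
    if level is Level.SUGGEST:
        # The agent only flags; the moderator decides.
        return "review"
    # Assist and Auto: block the obvious cases, escalate the rest.
    return "block" if verdict == "abusive" else "review"
```

Notice that nothing in this function makes the agent stricter as the level goes up. Strictness lives in whatever produces the verdict, which is where the event's policy applies.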

The unlock: anonymous Q&A becomes default

Here's what changes when moderation runs at machine speed against well-defined policies.

Anonymous Q&A — the feature that was previously a calculated risk for high-stakes internal events — becomes a default. The reason it was a risk was that the moderation cost scaled with the size of the audience: 4,000 employees on an all-hands meant 4,000 potential abuse vectors, and the moderator couldn't keep up. The Protect Agent breaks that scaling. Whether the audience is forty people or forty thousand, the cost of moderation per submission is roughly constant.

This matters more than it sounds. Anonymous Q&A is the single feature that most distinguishes a town hall that builds trust from one that performs trust. The questions people will sign their name to are different from the questions they'll ask anonymously, and the difference is exactly where the most important conversations live. When anonymous Q&A is too risky to enable, the most important conversations don't happen.

The Protect Agent is the reason a host can run a 4,000-person all-hands with anonymous Q&A on, the morning after a hard quarter, and not spend the call refreshing the moderation queue with their stomach in a knot.

What the Protect Agent does not do

This is worth being explicit about, because moderation is one of those areas where over-reach is a real failure mode.

It doesn't auto-block low-confidence cases, even on Auto. The product rule is hard: only auto-block high-confidence abuse. Anything ambiguous — a sharp question that might be abusive depending on context, a submission that uses harsh language about a topic the event is explicitly there to discuss, a question that's borderline off-topic — gets surfaced for human review even at the highest autonomy level. False positives in moderation are corrosive; the agent is tuned to escalate rather than over-block.

It doesn't moderate ideas. A submission that disagrees forcefully with the speaker, challenges the strategy, or expresses dissatisfaction with the company isn't abuse — it's the audience doing exactly what audience engagement is for. The Protect Agent's job is to filter behaviour, not opinion. The most pointed question in the feed should make it through.

It doesn't replace human judgment on hard calls. When the agent isn't sure, the moderator gets the call. Always. The agent's confidence is shown to the moderator, but the agent never claims certainty it doesn't have, and never pretends a borderline call is clean.

It doesn't carry policies between events. Every event has its own SOUL.md and its own configured policies. The Protect Agent's calibration on a board meeting isn't the same as its calibration on a customer AMA — different events, different rules, different rooms.
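
The "only auto-block high-confidence abuse" rule can be sketched as a pair of thresholds on a single abuse-confidence score. The numbers below are placeholders chosen for illustration, and the idea that the agent exposes one score between 0 and 1 is itself an assumption; the post does not specify either.

```python
# Hypothetical thresholds, not real product settings.
PUBLISH_BELOW = 0.10   # at or under this, treat the submission as clean
BLOCK_AT = 0.95        # at or over this, treat it as high-confidence abuse


def decide(abuse_confidence: float) -> str:
    """Return 'publish', 'review', or 'block' for a score in [0, 1]."""
    if not 0.0 <= abuse_confidence <= 1.0:
        raise ValueError("abuse_confidence must be between 0 and 1")
    if abuse_confidence >= BLOCK_AT:
        return "block"
    if abuse_confidence <= PUBLISH_BELOW:
        return "publish"
    # Everything in between is ambiguous, so a human makes the call.
    return "review"
```

The wide middle band is deliberate in this sketch: when in doubt, the submission is escalated rather than blocked.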

Why this is the agent that earns trust first

Protect is, in our experience, the first agent most hosts move from Suggest to Assist, and from Assist to Auto.

The reason is that the cost-benefit is the easiest to verify. A host can review the agent's flagging decisions after one event and immediately tell whether it's calibrated correctly. Did it catch the bad stuff? Did it pass through anything it shouldn't have? Did it block anything it shouldn't have? Three questions, clear answers. No deep judgment about tone, voice, or framing required — just safety calls that are easy to evaluate in hindsight.
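
Those three questions translate directly into three counts. The following is a sketch of a post-event review, assuming the host has gone back through the agent's decisions and marked each submission as bad or fine in hindsight; the `Reviewed` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reviewed:
    agent_blocked: bool   # what the agent did during the event
    actually_bad: bool    # what the host decided afterwards


def calibration_report(decisions: list[Reviewed]) -> dict[str, int]:
    """Count the answers to the three post-event questions."""
    return {
        # Did it catch the bad stuff?
        "caught": sum(d.agent_blocked and d.actually_bad for d in decisions),
        # Did it pass through anything it shouldn't have?
        "missed": sum(not d.agent_blocked and d.actually_bad for d in decisions),
        # Did it block anything it shouldn't have?
        "wrongly_blocked": sum(d.agent_blocked and not d.actually_bad for d in decisions),
    }


sample = [
    Reviewed(agent_blocked=True, actually_bad=True),
    Reviewed(agent_blocked=False, actually_bad=False),
    Reviewed(agent_blocked=False, actually_bad=True),
    Reviewed(agent_blocked=True, actually_bad=False),
]
report = calibration_report(sample)
```

For the four made-up decisions in `sample`, each of the three counts comes out as 1. A host looking for a well-calibrated agent wants the second and third numbers at or near zero.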

That fast feedback loop is why Protect is the agent that builds trust in the rest of the system. Once a host has watched the Protect Agent run cleanly across two or three events, they're far more willing to extend trust to Engage and Answer — the agents whose work is harder to evaluate in retrospect.

The Protect Agent isn't the most magical agent in the system. It's the agent that makes the magical ones credible.


Next up: The Engage Agent — nobody should have to read 400 questions live. How clustering, priority ranking, and topic-shift detection give your moderator their attention back.

Join the waitlist to get early access. Three free events. Locked-in pricing.