How to Deploy AI Agents for Safety

Written by

Juliet Shen, Head of Product, ROOST; Rashmi Raghunandan, Block; Hailey Elizabeth, Discord; Mingyi Zhao, Notion; Vinay Rao, CTO, ROOST

It’s 2016. A bad actor sits at a computer and executes a script that deploys thousands of bots to a social media site, promoting a website that aims to steal end users’ credit card information. Every post is the same, but the bots are sharing it at a pace that is impossible for users to ignore. Elsewhere, a different bad actor leans back in their chair to think about their next response as they work a dating platform’s chat feature. They have been talking to their target for four months, earning trust and goodwill through fake photos and heartfelt stories about their family. Now it’s time to close and squeeze them for tens of thousands of dollars. The safety teams at both platforms roll up their sleeves: they hand-write rules matching the bot posts, review user reports of romance scams in a review console, and deploy machine learning classifiers trained, respectively, for precision and breadth.

Ten years later, on most platforms, neither the adversaries nor the content and actors look like that anymore. The surface area of online harm is expanding rapidly, and many platforms are shifting from static, reviewable content to real-time interactions. Gaming platforms generate experiences on the fly: procedural worlds, user-created levels, dynamic in-game economies. AI chatbots carry on millions of simultaneous one-on-one conversations, each unique and each gone the moment the session ends. Patterns and signals change in near-real time as agents adapt their behavior to evade detection. The challenge of ephemeral content and context is no longer limited to online spaces designed for impermanence.

The abuse and fraud that does persist is getting harder to detect. The wide availability of AI models means that adversaries can perform varied content generation at scale with the press of a key. Every piece of AI-generated spam, misinformation, and harmful imagery is unique. Safety teams are still wrestling with basic hash-matching and duplicate detection, while the rapid growth of novel content evades these tools entirely. AI agents can create accounts from real browsers on common machine configurations with appropriately aged cookies and machine fingerprints. Those accounts can then be used by malicious actors to carry out romance scams and sextortion attacks across multiple platforms.

Meanwhile, the defenders face structural constraints that haven’t changed. The infrastructure is largely bespoke and expensive to build. Platforms are closed silos with limited ability to share threat intel or cross-reference attacks that span multiple services. Teams are constrained by false-positive fears that limit their ability to plan operational headcount, automate, or use AI for speed and scale.

This compounding asymmetry needs to be addressed, and quickly. With the rapid improvements and growth of AI models, evolving agent-to-agent protocols, and breakthroughs in their tool-use and reasoning, the window to act is now. This is an everyday conversation within the ROOST community, and to help accelerate collaborative efforts across the field, we are sharing a template outline of how safety teams can use AI to defend their platforms. As always, this work will grow, evolve, and learn from the community wisdom and other experiments underway.

From promise to production

Defenders can take advantage of open infrastructure and shared playbooks, leading to compounding improvements. AI can and must be a multiplier for safety teams of all sizes (even if it’s a part-time, single-person operation), but only if set up correctly. With the right architecture and constraints, AI can operate within the precision and latency parameters of the organization deploying it to go toe-to-toe with adversaries. The question is what successful agent deployments have in common. Without that understanding, “deploy an agent” is a wish, not a strategy.

To act with agency and not make cascading errors, agents need directions and constraints. Directions give them the data model, entity hierarchy, and harm taxonomy, and point them to canonical data sources and querying techniques. Without them, the agent guesses at what exists and improvises how to find it. Constraints handle the harder problem: uncertainty, and the failure modes that come with it. There are two kinds of confidence that need to be treated separately: grounded-evidence confidence, which reflects how complete and consistent the evidence is, and LLM confidence, which reflects the quality of the reasoning and interpretation. An agent that conflates the two will be confidently wrong in ways that are hard to catch. This design keeps them distinct and leans conservative when either one is low.

With directions and constraints, the agent's behavior is bounded, its decisions are reproducible, and a human reviewer can always reconstruct exactly what evidence was used and why.
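
As a minimal sketch of that separation (the names and thresholds below are ours, not a ROOST API), an agent could carry the two confidences as distinct fields and only act automatically when both clear their floor:

  from dataclasses import dataclass

  @dataclass
  class AgentFinding:
      evidence_confidence: float   # how complete and consistent the grounded evidence is (0-1)
      llm_confidence: float        # how sure the model is of its interpretation (0-1)
      recommended_action: str      # e.g. "remove_content", "limit_account"

  def route_finding(finding: AgentFinding,
                    evidence_floor: float = 0.90,
                    llm_floor: float = 0.85) -> str:
      # Lean conservative: auto-action only when BOTH confidences clear their floor.
      # Never blend the two into one score, or a fluent-but-ungrounded answer slips through.
      if finding.evidence_confidence >= evidence_floor and finding.llm_confidence >= llm_floor:
          return "auto:" + finding.recommended_action
      if finding.evidence_confidence < evidence_floor:
          return "escalate:needs_more_evidence"   # gather more signals or route to review
      return "escalate:needs_human_review"        # evidence is solid, interpretation is not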

Three ingredients

Working with safety teams who are deploying agents in anti-abuse workflows, we’ve found that effective agents require three things to be in place: entity understanding, a safety taxonomy, and playbooks. Entity understanding and the safety taxonomy are directions to the agents; playbooks are constraints.

1. Entity understanding

An agent is only as smart as its map of the terrain and only as useful as the data it can access and reason over. A raw event that says “user X posted content Y in channel Z” is far less actionable than one enriched at ingestion that says “user X (account age: 2 days, reputation: low, failed verification) posted content Y (similarity to bad cluster: 0.94) in channel Z (flags: 12)”. Agents also need to understand the relationships between data models and entities. Every organization has a different data model: accounts, posts, threads, channels, transactions, sessions, etc., with varying interrelationships between the entities. Often the data is spread across different tools at different places in the safety infrastructure. If an AI agent cannot traverse the relationship between a flagged comment, the commenter’s account, the channel it was posted in, and the cluster of similar accounts or comments, then it will always be playing whack-a-mole.

This is why platforms need to describe their entity relationships and signals in a language agents can navigate and verify. We propose an entity schema or entity graph that captures, for example, that an account has many posts, a post belongs to a thread, and a thread belongs to a channel. With that hierarchy, the agent can structure its investigations rather than prompt ad hoc.
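
As a concrete illustration (this is a hand-written sketch, not the ROOST template itself; the entity and field names are placeholders), such a schema might declare the hierarchy and the signals that hang off each entity:

  # Illustrative entity schema: hierarchy plus the signals attached to each entity.
  ENTITY_SCHEMA = {
      "account": {
          "signals":   ["account_age_days", "reputation", "trust_tier",
                        "device_fingerprints", "cluster_memberships"],
          "relations": {"posts": "has_many"},
      },
      "post": {
          "signals":   ["similarity_hash", "classifier_scores"],
          "relations": {"author": "belongs_to:account", "thread": "belongs_to:thread"},
      },
      "thread": {
          "signals":   [],
          "relations": {"channel": "belongs_to:channel", "posts": "has_many"},
      },
      "channel": {
          "signals":   ["flag_count"],
          "relations": {"threads": "has_many"},
      },
  }

With a declaration like this, “fetch the author’s reputation and the sibling posts in the same channel” becomes a verifiable traversal rather than a guess.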

2. Safety taxonomies

Humans and agents need to speak the same language about harm. For content, behaviors, accounts, and clusters, the taxonomy defines which content, events, and aggregations are violative, need review, or need classification. An enforcement agent might use the taxonomy to decide that a piece of content is harmful and auto-action it. An investigation agent might decide that a cluster of comments needs review and route it to a queue of experts. In another case, an investigation agent might find a suspicious network of accounts with various harm signals but nothing that maps one-to-one to the harm taxonomy; that case is routed to a needs-classification queue to understand the new abuse pattern and potentially update the taxonomy.
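
A hypothetical fragment of such a taxonomy, with the three dispositions just described (the categories, severities, and threshold are illustrative, not ROOST’s actual template), might look like this:

  # Illustrative harm-taxonomy fragment plus the three dispositions described above.
  HARM_TAXONOMY = {
      "spam.coordinated_promotion": {"severity": "medium", "auto_action": "remove_content"},
      "fraud.romance_scam":         {"severity": "high",   "auto_action": "suspend_account"},
      "harassment.targeted":        {"severity": "high",   "auto_action": None},  # always human review
  }

  def disposition(category, confidence, auto_threshold=0.95):
      if category is None:
          return "needs_classification"        # new pattern: review it, maybe extend the taxonomy
      entry = HARM_TAXONOMY[category]
      if entry["auto_action"] and confidence >= auto_threshold:
          return "auto_action:" + entry["auto_action"]
      return "needs_review"                    # route to the expert queue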

3. Agent personas with playbooks

A common failure mode in deploying AI agents for safety work is giving them too much latitude. An agent with broad data access, open-ended decision authority, and unrestricted tools will hallucinate, make mistakes, and produce results that no one can justify. OWASP’s Agent Security Guidelines call this out as a violation of least privilege and limited tool access. NIST’s AI Agent Standards Initiative is still in deliberation, but it is converging on similar framing.

One safety-forward approach is to construct each agent from three things defined in lockstep: the job it does, the tools it gets, and the playbook that tells it how to do the job. “Persona” is a shorthand for these three pieces. There are three personas that cover most safety work:

  • An “investigation agent” has read-access to the entity graph, no enforcement tools, and a playbook that tells it the investigation paths, signal extraction methods, and harm thresholds. Its output is a case and not a decision.

  • An “assistant agent”’s role starts when a reviewer opens a case and ends when the verdict is rendered. The agent has read-access to the case in front of the reviewer and a playbook that turns evidence into reviewer-ready summaries, suggests rules, flags policy gaps, or saves cognitive load by evaluating large documents. It never enforces.

  • A “production agent” has narrow write-access to a single decision type, acts only within high-confidence thresholds defined in its playbook and escalates anything ambiguous. Its actions are sampled and monitored continuously.

Together, these personas form a cascade that scales a trust and safety team’s reach without expanding the blast radius of any mistake. Agents operate against declared entity schemas, a shared harm taxonomy, and explicit playbooks to gather signals, handle uncertainty, and produce accurate, auditable, and consistent results.
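
To make the lockstep between job, tools, and playbook tangible, here is a hypothetical persona definition (the field names and file paths are ours, not a Coop or Osprey configuration format):

  # Illustrative persona definitions: each couples a job to its tools and playbook.
  PERSONAS = {
      "investigation_agent": {
          "data_access":  "read:entity_graph",
          "tools":        ["graph_query", "similarity_lookup", "open_case"],
          "write_access": [],                      # its output is a case, never an enforcement
          "playbook":     "playbooks/coordinated_spam_investigation.md",
      },
      "assistant_agent": {
          "data_access":  "read:current_case",
          "tools":        ["summarize_evidence", "suggest_rule", "flag_policy_gap"],
          "write_access": [],                      # never enforces
          "playbook":     "playbooks/reviewer_assist.md",
      },
      "production_agent": {
          "data_access":  "read:entity_graph",
          "tools":        ["remove_content"],      # narrow: exactly one decision type
          "write_access": ["remove_content"],
          "confidence_floor": 0.97,                # below this, escalate instead of acting
          "audit_sample_rate": 0.05,               # its actions are sampled and monitored
          "playbook":     "playbooks/high_confidence_spam_removal.md",
      },
  }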

Here is a starting-point template for setting up entity schemas, safety taxonomy, and playbooks, acknowledging that many platforms will seek to build their own agentic infrastructure.

What this looks like in practice

Here’s what these three personas look like when teams put them to work. First, let’s look at an example, based on the template recommended in this blog, of an agent investigating a suspicious comment case from report to recommendation. Then we’ll survey how Block and Notion are using safety agents in their workflows today.

Example walk for an investigation agent

One way to see what "agent-navigable data" means in practice is to watch an investigation agent work a suspicious commenting case. A reviewer flags a promotional comment on a popular post. The agent's job is to decide whether it is a lone spammer or a coordinated ring and to recommend, not enforce. The following example is based on example schemas in the ROOST model community GitHub repo.

The agent has access to a set of files. An entity schema tells it which entities exist: actors, content, spaces, clusters, cases, and their inter-relationships. A signals file tells it which enrichments attach to each entity: reputation, device fingerprints, classifier scores, cluster memberships. A harm taxonomy tells it what is harmful. And a playbook tells it how to conduct the review.

The walk then looks like this:

[Content Report Generated]
  |
  ├──(author)──► [Actor]
  |      ├ reputation
  |      ├ trust_tier
  |      ├ device_fingerprints
  |      └ cluster_memberships ──► [Cluster]
  |                               ├ coordination_score
  |                               └ confidence
  |
  └──(space)──► [Space]
      └──(child_content)──► [sibling Content]
              ├ similarity_hash
              └ classifier_scores

  Result:

  [Case]  evidence_refs, related_clusters,
          decisions, assigned_reviewer

The agent fans out to sibling comments in the same space, compares similarity hashes, collects the distinct set of authors, fetches their reputations and device fingerprints, and if the evidence supports it, materializes a cluster entity tying them together. It opens a case, points its evidence at the comments and the accounts, and routes it to the review queue. At no point does the agent guess. The hierarchy tells it where to go and the signals tell it what to bring back. Everything it does is a step a human can audit.
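
Expressed in code, the same walk might look like the hypothetical sketch below, run against an in-memory graph shaped like the illustrative schema earlier in this post (the data layout and the “coordinated_spam_review” queue name are assumptions, not ROOST interfaces):

  # Hypothetical sketch of the investigation walk; the agent recommends, it does not enforce.
  def investigate_comment(graph, report_id):
      report  = graph["content_report"][report_id]
      comment = graph["content"][report["content_id"]]
      space   = graph["space"][comment["space_id"]]

      # Fan out to sibling comments in the same space and compare similarity hashes.
      siblings = []
      for content_id in space["child_content"]:
          sibling = graph["content"][content_id]
          if content_id != comment["id"] and sibling["similarity_hash"] == comment["similarity_hash"]:
              siblings.append(sibling)

      # Collect the distinct authors and fetch their reputations and device fingerprints.
      authors = {c["author_id"]: graph["actor"][c["author_id"]] for c in siblings + [comment]}

      # Materialize a cluster only if the evidence supports coordination, then open a case.
      cluster = sorted(authors) if len(authors) >= 3 else None
      return {
          "entity": "case",
          "evidence_refs": [comment["id"]] + [s["id"] for s in siblings],
          "related_clusters": [cluster] if cluster else [],
          "suspect_signals": {a_id: {"reputation": a["reputation"],
                                     "device_fingerprints": a["device_fingerprints"]}
                              for a_id, a in authors.items()},
          "queue": "coordinated_spam_review",
      }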

The same walk applies to platform-specific forks of this schema. The ROOST repo includes example forks for YouTube and Discord based on public APIs that keep the abstract schema vocabulary, specialize enum values for each platform’s API, and drop fields the platform does not expose. These are references designed as starting points.

Example: Block

Block has already put this pattern into production, and the results help make the case for bounded, auditable agents. Their team explored how agentic systems can support label generation at scale without directly taking enforcement actions. In this approach, the agent operates over a bounded evidence set assembled through existing internal tooling and produces structured summaries for a reviewer. The result was a materially faster path to reviewer-ready summaries, improved consistency in how labels were applied, and clearer visibility into the evidence behind a recommendation.

This is intentionally narrower than a fully autonomous production agent. The agent does not make policy decisions, does not expand its own access surface, and does not replace analyst judgment. Instead, it accelerates one of the most operationally expensive parts of the feedback loop: turning fragmented signals and case context into high-quality labels and reviewer-ready summaries. For readers interested in the broader direction of this work, we discussed related ideas in a recent Coop working group meeting.

These battle-tested patterns are an opportunity to standardize the structures that make agentic safety work more auditable and interoperable.

Example: Notion

Notion’s team, on their end, has deployed all three agent personas in production across their safety workflows, notably building on their internal use of Coop. Over the past few months, these agents, built by their Trust teams on top of Notion’s Custom Agent, have worked alongside human team members across several parts of the Trust and Safety workflow:

  • Investigation agents automatically triage alerts and scan logs to detect suspicious activity. Work that previously required an hour of human investigation can now be completed in a few minutes by piecing together evidence from multiple sources.

  • Assistant agents handle operational follow-up and coordination work, such as tracking abuse cases, ensuring issues are followed up on, and identifying new product features that may have safety implications. This reduces manual overhead and gives human reviewers more time for higher-level analysis and strategy.

  • Production agents help respond to spam and phishing attacks to protect the Notion community. They operate in narrowly scoped, high-confidence situations, with limited write access.

Open Infrastructure

ROOST’s tools (notably Osprey, a real-time rules engine for events, and Coop, for case management and review) are designed as the substrate on which agentic safety workflows can be built responsibly. Organizations that integrate with Coop and Osprey will have widely varying entity hierarchies, data storage systems, and harm taxonomies. The roadmap update we recently opened for community review introduces two cross-cutting infrastructure pieces that make this possible.

  • A data abstraction layer gives agents a declarative understanding of an organization’s entity model (e.g., accounts, posts, threads, channels, transactions) and the relationships between them.

  • The safety decision taxonomy provides the shared vocabulary between humans and agents. It defines generic harm categories, severity levels, and corresponding types of actions so that an agent’s output can be audited by a human who understands and can correct its decisions.

These abstractions have been made concrete in this blog with the entity schema and harm taxonomy templates. The playbooks define the precise safety agent personas that teams using Osprey and Coop will deploy. An “investigation agent” writes a detection rule in SML for Osprey, where it can be tested, validated, and version-controlled. When an “assistant agent” helps a reviewer with a large channel with many videos, it does so within the context of the case in the review console. When a “production agent” makes a decision, it is recorded in Coop, where it can be audited by a human expert and, in case of a disagreement, trigger a recalibration.

Built this way, agentic infrastructure can be developed in the open once and reused across organizations with each improvement compounding for everyone.

Let’s build together

ROOST is building infrastructure as a foundation, along with multiple safety teams contributing different playbooks to accelerate innovation throughout the field. A playbook for detecting inauthentic behavior on a social platform is different from one investigating card-testing on a payment platform, which is different from one that focuses on grooming patterns in gaming environments. We are starting a community effort to develop and publish playbook templates. These templates can be adopted, forked, and improved upon by any organization and will serve as reference designs for teams deploying agents in production.

  • If you’re an operations practitioner, we want to work with you to document patterns based on your experience. This does not need to be sensitive or proprietary information; the goal is to capture the structure of how good investigations actually work.

  • If you’re a developer, we want contributions to our infrastructure and want to hear your opinions about what the right abstractions are.

The path forward for trust and safety teams is to encode knowledge gained by one team and make it available to everyone. AI agents with well-defined roles and problem-specific playbooks, governed by clear configurations, will help safety teams scale and amplify their impact.