Scientists developed an AI monitoring agent to detect and stop harmful outputs

The monitoring system is designed to detect and thwart both prompt injection attacks and edge-case threats.

A team of researchers from artificial intelligence (AI) firm AutoGPT, Northeastern University, and Microsoft Research has developed a tool that monitors large language models (LLMs) for potentially harmful outputs and prevents them from executing.

The agent is described in a preprint research paper titled “Testing Language Model Agents Safely in the Wild.” According to the research, the agent is flexible enough to monitor existing LLMs and can stop harmful outputs such as code attacks before they happen.

Per the research:

“Agent actions are audited by a context-sensitive monitor that enforces a stringent safety boundary to stop an unsafe test, with suspect behavior ranked and logged to be examined by humans.”
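The audit loop described in that quote can be pictured with a minimal sketch. This is not the researchers' implementation: the `Monitor` class, the `toy_score` function, and the `0.75` threshold are hypothetical stand-ins chosen for illustration, and a real monitor would presumably ask an LLM for the safety score rather than match keywords.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Monitor:
    """Audits proposed agent actions before they run (illustrative only)."""
    # Hypothetical scorer: (task, action) -> safety score in [0, 1].
    score_fn: Callable[[str, str], float]
    # Assumed safety boundary; this value is not taken from the paper.
    threshold: float = 0.75
    log: List[Tuple[float, str]] = field(default_factory=list)

    def audit(self, task: str, action: str) -> bool:
        """Return True if the action may execute; otherwise log it and block."""
        score = self.score_fn(task, action)
        if score < self.threshold:
            self.log.append((score, action))
            return False
        return True

    def review_queue(self) -> List[Tuple[float, str]]:
        """Suspect actions ranked from least to most safe, for human review."""
        return sorted(self.log, key=lambda item: item[0])


def toy_score(task: str, action: str) -> float:
    """Keyword stand-in for a scorer that would weigh the action against the task."""
    risky_markers = ("rm -rf", "curl ", "DROP TABLE")
    return 0.1 if any(marker in action for marker in risky_markers) else 0.9


def run_test(monitor: Monitor, task: str, actions: List[str]) -> List[str]:
    """Run actions in order, stopping the test at the first blocked action."""
    executed = []
    for action in actions:
        if not monitor.audit(task, action):
            break  # the unsafe test is stopped here
        executed.append(action)
    return executed
```

In this sketch, a test that proposes `ls`, then `rm -rf /`, then `echo done` would run only the first action, and the blocked command would land in the review queue for a human to examine.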

The team writes that existing tools for monitoring LLM outputs for harmful interactions seemingly work well in laboratory settings, but when applied to models already in production on the open internet, they “often fall short of capturing the dynamic intricacies of the real world.”

This, ostensibly, is because of the existence of edge cases. Despite the best efforts of the most talented computer scientists, it is widely considered impossible in the field of AI for researchers to imagine every possible harm vector before it happens.

Even when the humans interacting with AI have the best intentions, unexpected harm can arise from seemingly innocuous prompts.

An illustration of the monitor in action. On the left, a workflow ending in a high safety rating. On the right, a workflow ending in a low safety rating. Source: *Naihin et al., 2023*

To train the monitoring agent, the researchers built a dataset of nearly 2,000 safe human/AI interactions across 29 different tasks ranging from simple text-retrieval tasks and coding corrections all the way to developing entire webpages from scratch.

They also created a competing testing dataset of manually created adversarial outputs, dozens of which were intentionally designed to be unsafe.

The datasets were then used to train an agent built on OpenAI’s GPT-3.5 Turbo, a state-of-the-art system. The resulting agent can distinguish between innocuous and potentially harmful outputs with an accuracy of nearly 90%.
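For readers unfamiliar with the metric, accuracy here means the share of labeled outputs the monitor classifies correctly. The sketch below shows that calculation on a few made-up examples. The `accuracy` helper, the `toy_flagger` classifier, and the four sample outputs are hypothetical and are not drawn from the researchers' datasets.

```python
from typing import Callable, Iterable, Tuple


def accuracy(
    examples: Iterable[Tuple[str, bool]],
    is_unsafe: Callable[[str], bool],
) -> float:
    """Fraction of (output, unsafe_label) pairs the classifier gets right."""
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one labeled example")
    correct = sum(1 for output, label in examples if is_unsafe(output) == label)
    return correct / len(examples)


def toy_flagger(output: str) -> bool:
    """Hypothetical classifier: flags only one obviously destructive command."""
    return "rm -rf" in output


# Made-up labeled outputs: True means the output is unsafe.
toy_examples = [
    ("print('hello')", False),
    ("rm -rf /", True),
    ("read notes.txt", False),
    ("email the password file", True),
]

toy_accuracy = accuracy(toy_examples, toy_flagger)
```

The toy classifier misses the last example, so it scores three out of four on this made-up set. The same ratio, computed over the researchers' labeled data, is what the "nearly 90%" figure refers to.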

This article first appeared at Cointelegraph.com News

Written by Outside Source
