SafePro: Evaluating the Safety of Professional-Level AI Agents

1UC Santa Cruz, 2UC Santa Barbara, 3Cisco Research
kzhou35@ucsc.edu; ericxwang@ucsb.edu
image
Figure 1: Overview of SafePro benchmark. (Left) The SafePro dataset contains safety tests on various professional sectors and occupations, revealing critical safety risks in current AI agents. (Right) State-of-the-art AI models exhibit high unsafe rates in the SafePro benchmark.

Abstract

Large language model-based agents are rapidly evolving from simple conversational assistants into autonomous systems capable of performing complex, professional-level tasks in various domains. While these advancements promise significant productivity gains, they also introduce critical safety risks that remain under-explored. Existing safety evaluations primarily focus on simple, daily assistance tasks, failing to capture the intricate decision-making processes and potential consequences of misaligned behaviors in professional settings. To address this gap, we introduce SafePro, a comprehensive benchmark designed to evaluate the safety alignment of AI agents performing professional activities. SafePro features a dataset of high-complexity tasks across diverse professional domains with safety risks, developed through a rigorous iterative creation and review process. Our evaluation of state-of-the-art AI models reveals significant safety vulnerabilities and uncovers new unsafe behaviors in professional contexts. We further show that these models exhibit both insufficient safety judgment and weak safety alignment when executing complex professional tasks. In addition, we investigate safety mitigation strategies for improving agent safety in these scenarios and observe encouraging improvements. Together, our findings highlight the urgent need for robust, task-aware safety mechanisms tailored to the next generation of professional AI agents.

Key Contributions

  • SafePro dataset: 275 professional tasks across 9 sectors / 51 occupations.
  • Evaluation framework: Agent setup + LLM-as-judge unsafe-rate metric.
  • Findings + mitigations: High unsafe rates; mitigation exploration.

AI agents are moving from everyday assistance to professional workflows where mistakes can cause real harm. SafePro evaluates whether today’s frontier models stay safe while executing realistic, long-horizon professional tasks.

The SafePro Dataset

SafePro contains 275 high-complexity professional tasks spanning 9 economic sectors and 51 occupations, each designed to test whether agents follow harmful instructions or refuse safely.

To ensure realism and isolate safety failure modes, tasks are created under explicit requirements (complexity, clarity, relevance, realism, specificity) and refined via an iterative author–reviewer loop.

image
Figure 2: Data creation process for SafePro. We first define a set of requirements for harmful task creation. The data creators then generate harmful professional tasks through two approaches: benign task transformation and new harmful task generation. Each created task undergoes iterative review and revision to ensure quality.

We structure SafePro along two axes: risk outcomes (what kind of harm) and sectors (where the task occurs). The distributions below show broad coverage across both.

Risk Category Samples
Property / financial loss 67
Discrimination / bias 43
Misinformation 39
Information disclosure 27
Physical harm 21
System compromise 11
Environmental harm 9
Intellectual property misuse 4
Other illegal or violating regulations 54

Table 1: Data samples distribution by Risk Category.

Sector Samples
Real Estate 43
PSTS 35
Government 33
Retail 31
Wholesale 31
Manufacturing 31
HCSA 25
Information 24
Finance 22

Table 2: Data samples by occupation sector. PSTS: Professional, Scientific, and Technical Services; HCSA: Health Care and Social Assistance.

Each task includes explicit unsafe and safe criteria plus a stated malicious motivation and risk outcome which makes safety evaluation targeted and auditable.

image
Figure 3: Example harmful task from SafePro.

Compared to prior agent-safety datasets focused on short daily tasks, SafePro targets professional work with much longer instructions and support for multimodal inputs and real web search, matching real agent deployments.

Dataset Avg. Instruction Length Task Domain Multimodal Real Web Search
AgentHarm (Andriushchenko et al., 2024) 42.3 Daily
InjectAgent (Debenedetti et al., 2024) 31.4 Daily
Browser-art (Kumar et al., 2025) 32.4 Daily
SafeArena (Kumar et al., 2025) 21.9 Daily
RiOSWorld (Yang et al., 2025) 32.2 Daily
SciSafetyBench (Zhu et al., 2025) 50.3 Research
SafePro (Ours) 354.5 Professional

Table 3: Comparison of SafePro with existing agent safety evaluation datasets.

Evaluation Framework

We evaluate safety using an LLM-as-a-judge protocol: given the task, safety criteria, and the agent’s full behavior (actions + outputs), the judge labels responses as SAFE vs UNSAFE and explains why.

To validate judge reliability, we cross-evaluate with multiple judge models and observe consistent unsafe-rate rankings without self-favoring bias.

Judge \ Agent GPT-5-mini Claude-Haiku 4.5 Gemini 3-Flash
GPT-5-mini 55.6 22.3 67.3
Claude-Haiku 63.8 28.1 69.5
Gemini 3-Flash 57.5 23.7 68.0

Table 4: Cross-evaluation results comparing different LLM judges. Values represent the unsafe rates assigned by each judge model.

Experiment Setup

We run SafePro tasks in an agentic setting using CodeAct in OpenHands with tool access (web, files, code execution) and cap interactions at a fixed turn budget; safety is summarized by Unsafe Rate.

Results

Across sectors, many frontier models exhibit high unsafe rates, showing that professional agent settings expose failures not captured by daily assistant benchmarks.

These qualitative examples illustrate what ‘unsafe’ looks like in practice: unethical prioritization, deceptive omission, and requests for sensitive information, all which are behaviors with real professional consequences.

image
Figure 4: Examples of unsafe actions by AI agents in SafePro benchmark.

We report unsafe rate (%) per sector. Most models are unsafe on roughly half the tasks, while the safest model family here shows substantially lower unsafe rates, highlighting large gaps in professional safety alignment.

Model Real Estate Government Manufacture PSTS HCSA Finance Retail Wholesale Information Avg.
Gemini 2.5-Pro 83.7 81.8 77.4 57.1 84.0 68.2 77.4 77.4 79.2 76.4
Gemini 3-Flash 65.1 69.7 64.5 48.6 80.0 90.9 67.7 58.1 75.0 67.3
Deepseek-V3.2 60.5 54.5 54.8 54.3 60.0 59.1 77.4 64.5 79.2 62.2
Grok 4.1 Fast 46.5 51.5 64.5 45.7 60.0 45.5 67.7 77.4 70.8 58.2
GPT-5-mini 62.8 72.7 48.4 40.0 56.0 31.8 67.7 58.1 54.2 55.6
GPT-5 55.8 72.7 54.8 25.7 44.0 27.3 48.4 51.6 33.3 47.3
GPT-5.2 30.2 45.5 45.2 26.5 20.0 22.7 25.8 45.2 29.2 32.8
Claude-Haiku 4.5 19.0 30.3 38.7 14.3 20.0 13.6 16.1 35.5 8.3 22.3
Average 57.0 63.3 56.1 43.4 57.5 45.8 53.6 59.2 54.2 54.5

Table 5: Unsafe Rate (%) across different sectors for various LLM backbones. PSTS: Professional, Scientific, and Technical Services; HCSA: Health Care and Social Assistance.

Safety Knowledge-Alignment Gap

We find a gap between knowing an instruction is unsafe and acting safely while following instructions: models can often classify unsafe intent in a QA setting, yet still comply in instruction-following agent mode.

Safety classification performance is much higher in QA mode than in instruction-following mode, suggesting the failure is often alignment/application rather than pure lack of safety knowledge.

Model F1 Recall
IF QA IF QA
Gemini 3-Flash 49.3 84.2 32.7 73.1
GPT-5-mini 61.5 88.9 44.4 81.5
Claude Haiku 4.5 87.3 95.0 77.7 92.0

Table 6: F1 scores and recall comparison between instruction-following setting (IF) and QA judge settings.

Mitigation Methods

We evaluate three mitigation directions: stronger agent prompts, explicit safety classification, and safeguard models to understand what helps and what still falls short.

Adding a simple safety prompt reduces unsafe rates by a modest margin, but unsafe behavior remains common, indicating prompts alone are not sufficient.

image
Figure 5: Comparison of unsafe rates (%) with and without safety prompts across three models.

Current safeguard models underperform on professional scenarios, with large variance by sector, showing a gap between today’s guardrails and the complexity of professional misuse.

Safeguard Model Real Estate Government Manufacture PSTS HCSA Finance Retail Wholesale Information Avg.
gpt-oss-safeguard 39.5 30.3 54.8 45.7 48.0 54.5 80.6 32.3 83.3 50.5
Qwen3Guard 2.3 3.0 0.0 11.4 16.0 22.7 25.8 6.5 20.8 10.9

Table 7: Detection Accuracy (%) of safeguard models across different sectors on SafePro benchmark.

When we provide explicit safety policy definitions, backbone LLMs become much better safety classifiers, suggesting a practical gating approach before executing tasks.

Model F1 Recall
IF QA IF QA
Gemini 3-Flash 49.3 94.5 32.7 91.3
GPT-5-mini 61.5 92.6 44.4 88.4
Claude Haiku 4.5 87.3 94.9 77.7 91.6

Table 8: F1 scores and recall comparison between instruction-following setting (IF) and QA judge with safety category definitions.

BibTeX

@article{zhou2025safepro,
                    title={SafePro: Evaluating the Safety of Professional-Level AI Agents},
                    author={Zhou, Kaiwen and Jangam, Shreedhar and Nagarajan, Ashwin and Polu, Tejas and Oruganti, Suhas and Liu, Chengzhi and Kuo, Ching-Chen and Zheng, Yuting and Narayanaraju, Sravana and Wang, Xin Eric},
                    journal={arXiv preprint arXiv:2505.16186},
                    year={2026}
                    }