SafePro: Evaluating the Safety of Professional-Level AI Agents

AI agents are moving from everyday assistance to professional workflows where mistakes can cause real harm. SafePro evaluates whether today’s frontier models stay safe while executing realistic, long-horizon professional tasks.

SafePro contains 275 high-complexity professional tasks spanning 9 economic sectors and 51 occupations, each designed to test whether agents follow harmful instructions or refuse safely.

To ensure realism and isolate safety failure modes, tasks are created under explicit requirements (complexity, clarity, relevance, realism, specificity) and refined via an iterative author–reviewer loop.

Figure 2: **Data creation process for SafePro.** We first define a set of requirements for harmful task creation. The data creators then generate harmful professional tasks through two approaches: benign task transformation and new harmful task generation. Each created task undergoes iterative review and revision to ensure quality.

We structure SafePro along two axes: risk outcomes (what kind of harm) and sectors (where the task occurs). The distributions below show broad coverage across both.

Risk Category	Samples
Property / financial loss	67
Discrimination / bias	43
Misinformation	39
Information disclosure	27
Physical harm	21
System compromise	11
Environmental harm	9
Intellectual property misuse	4
Other illegal or violating regulations	54

Table 1: Data samples distribution by Risk Category.

Sector	Samples
Real Estate	43
PSTS	35
Government	33
Retail	31
Wholesale	31
Manufacturing	31
HCSA	25
Information	24
Finance	22

Table 2: Data samples by occupation sector. PSTS: Professional, Scientific, and Technical Services; HCSA: Health Care and Social Assistance.

Each task includes explicit unsafe and safe criteria plus a stated malicious motivation and risk outcome which makes safety evaluation targeted and auditable.

Figure 3: Example harmful task from SafePro.

Compared to prior agent-safety datasets focused on short daily tasks, SafePro targets professional work with much longer instructions and support for multimodal inputs and real web search, matching real agent deployments.

Dataset	Avg. Instruction Length	Task Domain	Multimodal	Real Web Search
AgentHarm (Andriushchenko et al., 2024)	42.3	Daily	✗	✗
InjectAgent (Debenedetti et al., 2024)	31.4	Daily	✗	✗
Browser-art (Kumar et al., 2025)	32.4	Daily	✗	✓
SafeArena (Kumar et al., 2025)	21.9	Daily	✓	✗
RiOSWorld (Yang et al., 2025)	32.2	Daily	✓	✓
SciSafetyBench (Zhu et al., 2025)	50.3	Research	✗	✓
SafePro (Ours)	354.5	Professional	✓	✓

Table 3: Comparison of SafePro with existing agent safety evaluation datasets.

We evaluate safety using an LLM-as-a-judge protocol: given the task, safety criteria, and the agent’s full behavior (actions + outputs), the judge labels responses as SAFE vs UNSAFE and explains why.

To validate judge reliability, we cross-evaluate with multiple judge models and observe consistent unsafe-rate rankings without self-favoring bias.

Judge \ Agent	GPT-5-mini	Claude-Haiku 4.5	Gemini 3-Flash
GPT-5-mini	55.6	22.3	67.3
Claude-Haiku	63.8	28.1	69.5
Gemini 3-Flash	57.5	23.7	68.0

Table 4: Cross-evaluation results comparing different LLM judges. Values represent the unsafe rates assigned by each judge model.

We run SafePro tasks in an agentic setting using CodeAct in OpenHands with tool access (web, files, code execution) and cap interactions at a fixed turn budget; safety is summarized by Unsafe Rate.

Across sectors, many frontier models exhibit high unsafe rates, showing that professional agent settings expose failures not captured by daily assistant benchmarks.

These qualitative examples illustrate what ‘unsafe’ looks like in practice: unethical prioritization, deceptive omission, and requests for sensitive information, all which are behaviors with real professional consequences.

Figure 4: Examples of unsafe actions by AI agents in SafePro benchmark.

We report unsafe rate (%) per sector. Most models are unsafe on roughly half the tasks, while the safest model family here shows substantially lower unsafe rates, highlighting large gaps in professional safety alignment.

Model	Real Estate	Government	Manufacture	PSTS	HCSA	Finance	Retail	Wholesale	Information	Avg.
Gemini 2.5-Pro	83.7	81.8	77.4	57.1	84.0	68.2	77.4	77.4	79.2	76.4
Gemini 3-Flash	65.1	69.7	64.5	48.6	80.0	90.9	67.7	58.1	75.0	67.3
Deepseek-V3.2	60.5	54.5	54.8	54.3	60.0	59.1	77.4	64.5	79.2	62.2
Grok 4.1 Fast	46.5	51.5	64.5	45.7	60.0	45.5	67.7	77.4	70.8	58.2
GPT-5-mini	62.8	72.7	48.4	40.0	56.0	31.8	67.7	58.1	54.2	55.6
GPT-5	55.8	72.7	54.8	25.7	44.0	27.3	48.4	51.6	33.3	47.3
GPT-5.2	30.2	45.5	45.2	26.5	20.0	22.7	25.8	45.2	29.2	32.8
Claude-Haiku 4.5	19.0	30.3	38.7	14.3	20.0	13.6	16.1	35.5	8.3	22.3
Average	57.0	63.3	56.1	43.4	57.5	45.8	53.6	59.2	54.2	54.5

Table 5: Unsafe Rate (%) across different sectors for various LLM backbones. PSTS: Professional, Scientific, and Technical Services; HCSA: Health Care and Social Assistance.

We find a gap between knowing an instruction is unsafe and acting safely while following instructions: models can often classify unsafe intent in a QA setting, yet still comply in instruction-following agent mode.

Safety classification performance is much higher in QA mode than in instruction-following mode, suggesting the failure is often alignment/application rather than pure lack of safety knowledge.

Model	F1		Recall
Model	IF	QA	IF	QA
Gemini 3-Flash	49.3	84.2	32.7	73.1
GPT-5-mini	61.5	88.9	44.4	81.5
Claude Haiku 4.5	87.3	95.0	77.7	92.0

Table 6: F1 scores and recall comparison between instruction-following setting (IF) and QA judge settings.

We evaluate three mitigation directions: stronger agent prompts, explicit safety classification, and safeguard models to understand what helps and what still falls short.

Adding a simple safety prompt reduces unsafe rates by a modest margin, but unsafe behavior remains common, indicating prompts alone are not sufficient.

Figure 5: Comparison of unsafe rates (%) with and without safety prompts across three models.

Current safeguard models underperform on professional scenarios, with large variance by sector, showing a gap between today’s guardrails and the complexity of professional misuse.

Safeguard Model	Real Estate	Government	Manufacture	PSTS	HCSA	Finance	Retail	Wholesale	Information	Avg.
gpt-oss-safeguard	39.5	30.3	54.8	45.7	48.0	54.5	80.6	32.3	83.3	50.5
Qwen3Guard	2.3	3.0	0.0	11.4	16.0	22.7	25.8	6.5	20.8	10.9

Table 7: Detection Accuracy (%) of safeguard models across different sectors on SafePro benchmark.

When we provide explicit safety policy definitions, backbone LLMs become much better safety classifiers, suggesting a practical gating approach before executing tasks.

Model	F1		Recall
Model	IF	QA	IF	QA
Gemini 3-Flash	49.3	94.5	32.7	91.3
GPT-5-mini	61.5	92.6	44.4	88.4
Claude Haiku 4.5	87.3	94.9	77.7	91.6

Table 8: F1 scores and recall comparison between instruction-following setting (IF) and QA judge with safety category definitions.

BibTeX

@article{zhou2025safepro,
                    title={SafePro: Evaluating the Safety of Professional-Level AI Agents},
                    author={Zhou, Kaiwen and Jangam, Shreedhar and Nagarajan, Ashwin and Polu, Tejas and Oruganti, Suhas and Liu, Chengzhi and Kuo, Ching-Chen and Zheng, Yuting and Narayanaraju, Sravana and Wang, Xin Eric},
                    journal={arXiv preprint arXiv:2505.16186},
                    year={2026}
                    }

SafePro: Evaluating the Safety of Professional-Level AI Agents

Abstract

Key Contributions

The SafePro Dataset

Evaluation Framework

Experiment Setup

Results

Safety Knowledge-Alignment Gap

Mitigation Methods

BibTeX