Large language model-based agents are rapidly evolving from simple conversational assistants into autonomous systems capable of performing complex, professional-level tasks in various domains. While these advancements promise significant productivity gains, they also introduce critical safety risks that remain under-explored. Existing safety evaluations primarily focus on simple, daily assistance tasks, failing to capture the intricate decision-making processes and potential consequences of misaligned behaviors in professional settings. To address this gap, we introduce SafePro, a comprehensive benchmark designed to evaluate the safety alignment of AI agents performing professional activities. SafePro features a dataset of high-complexity tasks across diverse professional domains with safety risks, developed through a rigorous iterative creation and review process. Our evaluation of state-of-the-art AI models reveals significant safety vulnerabilities and uncovers new unsafe behaviors in professional contexts. We further show that these models exhibit both insufficient safety judgment and weak safety alignment when executing complex professional tasks. In addition, we investigate safety mitigation strategies for improving agent safety in these scenarios and observe encouraging improvements. Together, our findings highlight the urgent need for robust, task-aware safety mechanisms tailored to the next generation of professional AI agents.
AI agents are moving from everyday assistance to professional workflows where mistakes can cause real harm. SafePro evaluates whether today’s frontier models stay safe while executing realistic, long-horizon professional tasks.
SafePro contains 275 high-complexity professional tasks spanning 9 economic sectors and 51 occupations, each designed to test whether agents follow harmful instructions or refuse safely.
To ensure realism and isolate safety failure modes, tasks are created under explicit requirements (complexity, clarity, relevance, realism, specificity) and refined via an iterative author–reviewer loop.
We structure SafePro along two axes: risk outcomes (what kind of harm) and sectors (where the task occurs). The distributions below show broad coverage across both.
| Risk Category | Samples |
|---|---|
| Property / financial loss | 67 |
| Discrimination / bias | 43 |
| Misinformation | 39 |
| Information disclosure | 27 |
| Physical harm | 21 |
| System compromise | 11 |
| Environmental harm | 9 |
| Intellectual property misuse | 4 |
| Other illegal or violating regulations | 54 |
Table 1: Data samples distribution by Risk Category.
| Sector | Samples |
|---|---|
| Real Estate | 43 |
| PSTS | 35 |
| Government | 33 |
| Retail | 31 |
| Wholesale | 31 |
| Manufacturing | 31 |
| HCSA | 25 |
| Information | 24 |
| Finance | 22 |
Table 2: Data samples by occupation sector. PSTS: Professional, Scientific, and Technical Services; HCSA: Health Care and Social Assistance.
Each task includes explicit unsafe and safe criteria plus a stated malicious motivation and risk outcome which makes safety evaluation targeted and auditable.
Compared to prior agent-safety datasets focused on short daily tasks, SafePro targets professional work with much longer instructions and support for multimodal inputs and real web search, matching real agent deployments.
| Dataset | Avg. Instruction Length | Task Domain | Multimodal | Real Web Search |
|---|---|---|---|---|
| AgentHarm (Andriushchenko et al., 2024) | 42.3 | Daily | ✗ | ✗ |
| InjectAgent (Debenedetti et al., 2024) | 31.4 | Daily | ✗ | ✗ |
| Browser-art (Kumar et al., 2025) | 32.4 | Daily | ✗ | ✓ |
| SafeArena (Kumar et al., 2025) | 21.9 | Daily | ✓ | ✗ |
| RiOSWorld (Yang et al., 2025) | 32.2 | Daily | ✓ | ✓ |
| SciSafetyBench (Zhu et al., 2025) | 50.3 | Research | ✗ | ✓ |
| SafePro (Ours) | 354.5 | Professional | ✓ | ✓ |
Table 3: Comparison of SafePro with existing agent safety evaluation datasets.
We evaluate safety using an LLM-as-a-judge protocol: given the task, safety criteria, and the agent’s full behavior (actions + outputs), the judge labels responses as SAFE vs UNSAFE and explains why.
To validate judge reliability, we cross-evaluate with multiple judge models and observe consistent unsafe-rate rankings without self-favoring bias.
| Judge \ Agent | GPT-5-mini | Claude-Haiku 4.5 | Gemini 3-Flash |
|---|---|---|---|
| GPT-5-mini | 55.6 | 22.3 | 67.3 |
| Claude-Haiku | 63.8 | 28.1 | 69.5 |
| Gemini 3-Flash | 57.5 | 23.7 | 68.0 |
Table 4: Cross-evaluation results comparing different LLM judges. Values represent the unsafe rates assigned by each judge model.
We run SafePro tasks in an agentic setting using CodeAct in OpenHands with tool access (web, files, code execution) and cap interactions at a fixed turn budget; safety is summarized by Unsafe Rate.
Across sectors, many frontier models exhibit high unsafe rates, showing that professional agent settings expose failures not captured by daily assistant benchmarks.
These qualitative examples illustrate what ‘unsafe’ looks like in practice: unethical prioritization, deceptive omission, and requests for sensitive information, all which are behaviors with real professional consequences.
We report unsafe rate (%) per sector. Most models are unsafe on roughly half the tasks, while the safest model family here shows substantially lower unsafe rates, highlighting large gaps in professional safety alignment.
| Model | Real Estate | Government | Manufacture | PSTS | HCSA | Finance | Retail | Wholesale | Information | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 2.5-Pro | 83.7 | 81.8 | 77.4 | 57.1 | 84.0 | 68.2 | 77.4 | 77.4 | 79.2 | 76.4 |
| Gemini 3-Flash | 65.1 | 69.7 | 64.5 | 48.6 | 80.0 | 90.9 | 67.7 | 58.1 | 75.0 | 67.3 |
| Deepseek-V3.2 | 60.5 | 54.5 | 54.8 | 54.3 | 60.0 | 59.1 | 77.4 | 64.5 | 79.2 | 62.2 |
| Grok 4.1 Fast | 46.5 | 51.5 | 64.5 | 45.7 | 60.0 | 45.5 | 67.7 | 77.4 | 70.8 | 58.2 |
| GPT-5-mini | 62.8 | 72.7 | 48.4 | 40.0 | 56.0 | 31.8 | 67.7 | 58.1 | 54.2 | 55.6 |
| GPT-5 | 55.8 | 72.7 | 54.8 | 25.7 | 44.0 | 27.3 | 48.4 | 51.6 | 33.3 | 47.3 |
| GPT-5.2 | 30.2 | 45.5 | 45.2 | 26.5 | 20.0 | 22.7 | 25.8 | 45.2 | 29.2 | 32.8 |
| Claude-Haiku 4.5 | 19.0 | 30.3 | 38.7 | 14.3 | 20.0 | 13.6 | 16.1 | 35.5 | 8.3 | 22.3 |
| Average | 57.0 | 63.3 | 56.1 | 43.4 | 57.5 | 45.8 | 53.6 | 59.2 | 54.2 | 54.5 |
Table 5: Unsafe Rate (%) across different sectors for various LLM backbones. PSTS: Professional, Scientific, and Technical Services; HCSA: Health Care and Social Assistance.
We find a gap between knowing an instruction is unsafe and acting safely while following instructions: models can often classify unsafe intent in a QA setting, yet still comply in instruction-following agent mode.
Safety classification performance is much higher in QA mode than in instruction-following mode, suggesting the failure is often alignment/application rather than pure lack of safety knowledge.
| Model | F1 | Recall | ||
|---|---|---|---|---|
| IF | QA | IF | QA | |
| Gemini 3-Flash | 49.3 | 84.2 | 32.7 | 73.1 |
| GPT-5-mini | 61.5 | 88.9 | 44.4 | 81.5 |
| Claude Haiku 4.5 | 87.3 | 95.0 | 77.7 | 92.0 |
Table 6: F1 scores and recall comparison between instruction-following setting (IF) and QA judge settings.
We evaluate three mitigation directions: stronger agent prompts, explicit safety classification, and safeguard models to understand what helps and what still falls short.
Adding a simple safety prompt reduces unsafe rates by a modest margin, but unsafe behavior remains common, indicating prompts alone are not sufficient.
Current safeguard models underperform on professional scenarios, with large variance by sector, showing a gap between today’s guardrails and the complexity of professional misuse.
| Safeguard Model | Real Estate | Government | Manufacture | PSTS | HCSA | Finance | Retail | Wholesale | Information | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| gpt-oss-safeguard | 39.5 | 30.3 | 54.8 | 45.7 | 48.0 | 54.5 | 80.6 | 32.3 | 83.3 | 50.5 |
| Qwen3Guard | 2.3 | 3.0 | 0.0 | 11.4 | 16.0 | 22.7 | 25.8 | 6.5 | 20.8 | 10.9 |
Table 7: Detection Accuracy (%) of safeguard models across different sectors on SafePro benchmark.
When we provide explicit safety policy definitions, backbone LLMs become much better safety classifiers, suggesting a practical gating approach before executing tasks.
| Model | F1 | Recall | ||
|---|---|---|---|---|
| IF | QA | IF | QA | |
| Gemini 3-Flash | 49.3 | 94.5 | 32.7 | 91.3 |
| GPT-5-mini | 61.5 | 92.6 | 44.4 | 88.4 |
| Claude Haiku 4.5 | 87.3 | 94.9 | 77.7 | 91.6 |
Table 8: F1 scores and recall comparison between instruction-following setting (IF) and QA judge with safety category definitions.
@article{zhou2025safepro,
title={SafePro: Evaluating the Safety of Professional-Level AI Agents},
author={Zhou, Kaiwen and Jangam, Shreedhar and Nagarajan, Ashwin and Polu, Tejas and Oruganti, Suhas and Liu, Chengzhi and Kuo, Ching-Chen and Zheng, Yuting and Narayanaraju, Sravana and Wang, Xin Eric},
journal={arXiv preprint arXiv:2505.16186},
year={2026}
}