MoralReason: Generalizable Moral Decision Alignment For LLM Agents Using Reasoning-Level Reinforcement Learning
Warning: This work contains descriptions of moral scenarios which are controversial and offensive in nature.
Abstract
Large language models are increasingly influencing human moral decisions, yet current approaches focus primarily on evaluating rather than actively steering their moral decisions. We formulate this as an out-of-distribution moral alignment problem, where LLM agents must learn to apply consistent moral reasoning frameworks to scenarios beyond their training distribution. We introduce Moral-Reason-QA, a novel dataset extending 680 human-annotated, high-ambiguity moral scenarios with framework-specific reasoning traces across utilitarian, deontological, and virtue ethics, enabling systematic evaluation of moral generalization in realistic decision contexts. Our learning approach employs Group Relative Policy Optimization with composite rewards that simultaneously optimize decision alignment and framework-specific reasoning processes to facilitate learning of the underlying moral frameworks. Experimental results demonstrate successful generalization to unseen moral scenarios, with softmax-normalized alignment scores improving by +0.757 for utilitarian and +0.450 for deontological frameworks when tested on out-of-distribution evaluation sets. The experiments also reveal training challenges and promising directions that inform future research. These findings establish that LLM agents can be systematically trained to internalize and apply specific moral frameworks to novel situations, providing a critical foundation for AI safety as language models become more integrated into human decision-making processes.
Think: what would you do in these scenarios?
Explore moral scenarios in the Moral-Reason-QA dataset interactively with decisions and detailed reasoning traces across different ethical frameworks.
How Does Our Dataset Compare to Existing Moral Benchmarks?
Existing moral reasoning datasets primarily focus on evaluating LLM behavior using question–answer formats, usually without revealing the underlying reasoning process. They also tend to emphasize one specific ethical perspective, limiting their usefulness for studying broader alignment behaviors.
Moral-Reason-QA extends 680 high-ambiguity moral scenarios with complete reasoning traces across three distinct frameworks—utilitarian, deontological, and virtue ethics. This makes it the first dataset designed explicitly for reasoning-level reinforcement learning and cross-framework moral generalization in LLM agents.
Noticed some consensus between frameworks? We've measured that.
To quantify how often different moral frameworks agree or disagree on the same action, we compute pairwise correlations using the ϕ-coefficient over binary alignment labels. Positive ϕ values indicate that two frameworks tend to endorse the same actions, negative values reveal systematic disagreement, and values near zero suggest little association. This analysis shows that frameworks occasionally converge on the same choice, while still preserving meaningful divergence across Moral-Reason-QA.
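The ϕ-coefficient described above is the Pearson correlation applied to a pair of binary variables (equivalently, the Matthews correlation coefficient for a 2×2 contingency table). A minimal sketch of the pairwise computation over binary alignment labels—function and variable names here are illustrative, not from the paper's code:

```python
import math

def phi_coefficient(a, b):
    """Phi coefficient between two binary (0/1) label sequences.

    Positive values: the two frameworks tend to endorse the same actions.
    Negative values: systematic disagreement. Near zero: little association.
    """
    # Build the 2x2 contingency table.
    n11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    n00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # Convention: return 0.0 when a marginal is empty (undefined phi).
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0
```

Running this for each pair of frameworks (utilitarian–deontological, utilitarian–virtue, deontological–virtue) yields the pairwise agreement matrix discussed above.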
From Moral Scenarios to Formal Alignment Objectives
We formalize moral alignment by (1) defining a set of moral decision frameworks and labeling which actions are compatible with each, and (2) introducing a softmax-normalized alignment score that measures how strongly an LLM’s behavior aligns with each framework across a dataset. This bridges informal moral intuitions and a precise optimization objective that can be used for training and evaluating LLM agents.
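One way to realize the softmax-normalized alignment score is to count, per framework, how often the agent's decisions match that framework's labeled action, then softmax over those counts so scores across frameworks sum to one. This is an illustrative sketch under that assumption; the paper's exact normalization may differ:

```python
import math

def softmax_alignment_scores(alignment_counts):
    """Turn per-framework alignment counts into softmax-normalized scores.

    alignment_counts: dict mapping framework name -> number of scenarios
    on which the agent's decision matched that framework's labeled action.
    Returns a dict of scores in (0, 1) summing to 1 across frameworks.
    """
    # Subtract the max count before exponentiating, for numerical stability.
    m = max(alignment_counts.values())
    exps = {f: math.exp(c - m) for f, c in alignment_counts.items()}
    z = sum(exps.values())
    return {f: e / z for f, e in exps.items()}
```

Because the scores compete via the softmax, a gain for one framework (e.g. the reported +0.757 utilitarian shift) necessarily comes at the expense of the others, which matches the paper's framing of suppressed competing frameworks.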
Rewarding Both Decisions and Moral Reasoning
Our GRPO training uses a composite reward with two components: an alignment reward that encourages selecting the action labeled as aligned with the target framework, and a keyword reward that softly rewards reasoning traces that explicitly invoke framework-specific concepts (e.g., utility, duty, virtue). The keyword reward is capped so it cannot compensate for choosing the wrong action, ensuring that the model must align both its choice and its reasoning style.
How Well Do Moral Frameworks Generalize?
On out-of-distribution evaluation scenarios, GRPO training sharply shifts the model’s moral preference: the utilitarian-aligned agent nearly saturates its utilitarian score, while the deontological-aligned agent substantially boosts deontological alignment and suppresses competing frameworks. These results show that reasoning-level reinforcement learning can reliably steer LLM agents toward specific moral frameworks, even on novel moral dilemmas.
Interestingly, the base model starts with a strong virtue-ethics bias, and our method reveals how different training runs selectively reorient this preference toward utilitarian or deontological behavior—highlighting both the flexibility and limits of moral fine-tuning in LLM agents.
BibTeX
@inproceedings{An2025MoralReason,
title={MoralReason: Generalizable Moral Decision Alignment For LLM Agents Using Reasoning-Level Reinforcement Learning},
author={An, Zhiyu and Du, Wan},
booktitle={AAAI 2026},
year={2025}
}