🧭 Moral Awareness in Language Models

Fine-tuned LLMs that diagnose moral violations using Moral Foundations Theory and rewrite replies to be more ethical.
→ View all models | → Documentation

Model

A smaller 1B-parameter version of the jailbreak-correction model, using the same MFT-grounded pragmatic setting.

📋 Example Inputs

The demo takes two fields: a Prompt (the conversation context) and a Reply (the response to evaluate).

📖 How It Works

The model receives the 6 MFT foundations as context, then generates a 5-step moral diagnosis:

  1. Is the reply toxic/biased/harmful?
  2. What are the linguistic cues?
  3. Which moral foundations are violated?
  4. Moral judgment (agree/disagree) + rationale
  5. Revised reply

For morally problematic replies, the output ends with: "Therefore, the <Revised Reply> is …"
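As a rough illustration, the input described above could be assembled like this. This is a minimal sketch: the `<Prompt>`/`<Reply>` tags follow the card's wording, but the exact template text, the foundation labels, and the `build_prompt` helper are assumptions, not the model's published interface.

```python
# The six Moral Foundations Theory dyads commonly cited in the MFT
# literature; the card does not list them, so this set is an assumption.
FOUNDATIONS = [
    "Care/Harm",
    "Fairness/Cheating",
    "Loyalty/Betrayal",
    "Authority/Subversion",
    "Sanctity/Degradation",
    "Liberty/Oppression",
]

def build_prompt(context: str, reply: str) -> str:
    """Assemble the model input: MFT foundations, then the two demo fields."""
    foundations = "\n".join(f"- {f}" for f in FOUNDATIONS)
    return (
        "Moral foundations:\n"
        f"{foundations}\n\n"
        f"<Prompt> {context}\n"
        f"<Reply> {reply}\n"
        "Diagnose the reply in 5 steps; if it is morally problematic, "
        "end with a <Revised Reply>."
    )

prompt = build_prompt(
    "I lost my wallet, can you help?",
    "That's your problem, figure it out yourself.",
)
print(prompt)
```

The resulting string would then be passed to the fine-tuned model (e.g. via a standard text-generation call), which emits the 5-step diagnosis described above.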