🧭 Moral Awareness in Language Models

Fine-tuned LLMs that diagnose moral violations using Moral Foundations Theory and rewrite replies to be more ethical.
→ View all models | → Documentation

Model

A smaller 1B-parameter version of the jailbreak-correction model, using the same MFT-grounded pragmatic setting.

📋 Example Inputs

The demo takes two fields: a Prompt (the conversation context) and a Reply (the response to evaluate).

📖 How It Works

The model receives the 6 MFT foundations as context, then generates a 5-step moral diagnosis:

  1. Is the reply toxic/biased/harmful?
  2. What are the linguistic cues?
  3. Which moral foundations are violated?
  4. Moral judgment (agree/disagree) + rationale
  5. Revised reply

For morally problematic replies, the output ends with: "Therefore, the <Revised Reply> is …"
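As a rough illustration, the input described above could be assembled like this. This is a minimal sketch: the `<Prompt>`/`<Reply>` tags follow the card's wording, but the exact template text, the foundation labels, and the `build_prompt` helper are assumptions, not the model's published interface.

```python
# The six Moral Foundations Theory dyads commonly cited in the MFT
# literature; the card does not list them, so this set is an assumption.
FOUNDATIONS = [
    "Care/Harm",
    "Fairness/Cheating",
    "Loyalty/Betrayal",
    "Authority/Subversion",
    "Sanctity/Degradation",
    "Liberty/Oppression",
]

def build_prompt(context: str, reply: str) -> str:
    """Assemble the model input: MFT foundations, then the two demo fields."""
    foundations = "\n".join(f"- {f}" for f in FOUNDATIONS)
    return (
        "Moral foundations:\n"
        f"{foundations}\n\n"
        f"<Prompt> {context}\n"
        f"<Reply> {reply}\n"
        "Diagnose the reply in 5 steps; if it is morally problematic, "
        "end with a <Revised Reply>."
    )

prompt = build_prompt(
    "I lost my wallet, can you help?",
    "That's your problem, figure it out yourself.",
)
print(prompt)
```

The resulting string would then be passed to the fine-tuned model (e.g. via a standard text-generation call), which emits the 5-step diagnosis described above.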