How It Works
The model receives the six Moral Foundations Theory (MFT) foundations as context, then generates a 5-step moral diagnosis:
1. Is the reply toxic/biased/harmful?
2. What are the linguistic cues?
3. Which moral foundations are violated?
4. Moral judgment (agree/disagree) + rationale
5. Revised reply
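As a rough illustration of how the foundations and the five steps might be combined into one prompt, here is a minimal Python sketch. The actual prompt template is not given here, so `build_prompt`, the section labels, and the foundation names (taken from the common six-foundation formulation of MFT) are all assumptions made for illustration.

```python
# Hypothetical sketch: the real prompt template is not documented here.
# Foundation names follow the common six-foundation formulation of MFT (assumed).
MFT_FOUNDATIONS = [
    "Care/Harm",
    "Fairness/Cheating",
    "Loyalty/Betrayal",
    "Authority/Subversion",
    "Sanctity/Degradation",
    "Liberty/Oppression",
]

# The five diagnosis steps, as listed above.
DIAGNOSIS_STEPS = [
    "Is the reply toxic/biased/harmful?",
    "What are the linguistic cues?",
    "Which moral foundations are violated?",
    "Moral judgment (agree/disagree) + rationale",
    "Revised reply",
]


def build_prompt(reply: str) -> str:
    """Assemble a prompt: foundations as context, then the reply, then the steps."""
    foundations = "\n".join(f"- {name}" for name in MFT_FOUNDATIONS)
    steps = "\n".join(
        f"{i}. {step}" for i, step in enumerate(DIAGNOSIS_STEPS, start=1)
    )
    return (
        "Moral foundations:\n"
        f"{foundations}\n\n"
        f"Reply: {reply}\n\n"
        "Diagnose the reply in these steps:\n"
        f"{steps}\n"
    )


if __name__ == "__main__":
    print(build_prompt("People like you never get anything right."))
```

Keeping the foundations and steps as plain lists makes it easy to swap in the exact wording the model was trained on, if that differs from the assumed labels.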
For morally problematic replies, it outputs: "Therefore, the <Revised Reply> is …"
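Because the revised reply follows a fixed marker phrase, it can be pulled out of the generated text with a small helper. The sketch below is one possible approach, not part of the model's tooling: `extract_revised_reply` is a hypothetical name, and it assumes the revised reply runs from the marker to the end of the output.

```python
import re
from typing import Optional

# The marker phrase is the one quoted above; treating everything after it
# as the revised reply is an assumption about the output format.
_MARKER = re.compile(r"Therefore, the <Revised Reply> is\s*(.*)", re.DOTALL)


def extract_revised_reply(output: str) -> Optional[str]:
    """Return the text after the marker phrase, or None if it is absent or empty."""
    match = _MARKER.search(output)
    if match is None:
        return None
    text = match.group(1).strip()
    return text or None
```

A `None` result can be read as "no revision was produced", which under this assumption corresponds to a reply the model did not flag as morally problematic.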