Constitutional AI: Harmlessness from AI Feedback, Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, et al., 2022arXiv preprintDOI: 10.48550/arXiv.2212.08073 - Discusses methods for aligning LLMs with human values and safety principles using AI feedback, which is a method for training LLMs to moderate their own behavior without extensive human intervention, relevant to the LLM-as-a-judge technique.
Google Cloud Perspective API Documentation, Google Cloud, 2024 (Google Cloud) - Official documentation for a widely used content moderation API that employs machine learning models to detect various categories of harmful or toxic content, serving as an example of model-based classifiers for output guardrails.