Building a Hybrid Rule-Based and Machine Learning Framework to Detect and Defend Against Jailbreak Prompts in LLM Systems

In this tutorial, we build a jailbreak defense step by step to detect and safely handle policy-evasion prompts. We generate realistic attack and benign examples, craft rule-based signals, and combine them with TF-IDF features in a compact, interpretable classifier that catches evasive prompts without blocking legitimate requests. We demonstrate evaluation metrics, explain…
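The hybrid approach the tutorial describes can be sketched in a few lines. This is a minimal, self-contained illustration, not the tutorial's actual code: the regex patterns, token weights, and threshold are hypothetical stand-ins (a simple keyword score stands in for the TF-IDF features, which in practice would be learned from the generated examples).

```python
import re

# Illustrative rule patterns (hypothetical); real signals would be
# distilled from red-team analysis of actual jailbreak prompts.
RULE_PATTERNS = [
    (re.compile(r"ignore (all )?(previous|prior) instructions", re.I), 2.0),
    (re.compile(r"pretend (you are|to be)", re.I), 1.0),
    (re.compile(r"no (ethical|safety) (guidelines|restrictions)", re.I), 2.0),
    (re.compile(r"\bjailbreak\b", re.I), 1.5),
]

# Light lexical weights standing in for learned TF-IDF features.
SUSPICIOUS_TOKENS = {"bypass": 0.5, "unfiltered": 0.5, "hypothetically": 0.3}

def rule_score(prompt: str) -> float:
    """Sum the weights of every rule pattern that fires on the prompt."""
    return sum(weight for pattern, weight in RULE_PATTERNS if pattern.search(prompt))

def token_score(prompt: str) -> float:
    """Crude lexical score: total weight of suspicious tokens present."""
    words = set(re.findall(r"[a-z']+", prompt.lower()))
    return sum(w for tok, w in SUSPICIOUS_TOKENS.items() if tok in words)

def is_jailbreak(prompt: str, threshold: float = 1.5) -> bool:
    """Flag the prompt when combined rule + lexical evidence crosses the threshold."""
    return rule_score(prompt) + token_score(prompt) >= threshold
```

Because both signal sources are inspectable, a flagged prompt can be explained by listing which patterns fired, which is the interpretability benefit the tutorial emphasizes.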

Read More
LlamaFirewall: Open-source framework to detect and mitigate AI centric security risks – Help Net Security

LlamaFirewall is a system-level security framework for LLM-powered applications, built with a modular design to support layered, adaptive defense. It is designed to mitigate a wide spectrum of AI agent security risks, including jailbreaking, indirect prompt injection, goal hijacking, and insecure code outputs.

Why Meta created LlamaFirewall

LLMs are moving far beyond simple chatbot…
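To make the "modular, layered defense" idea concrete, here is a generic sketch of a scanner pipeline. This is an illustration of the architectural pattern only; the scanner names, checks, and interfaces below are invented for this example and are not LlamaFirewall's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ScanResult:
    blocked: bool
    reason: str = ""

# Each layer is an independent scanner; any one of them can block the text.
# The checks here are deliberately trivial placeholders.
def prompt_injection_scanner(text: str) -> ScanResult:
    if "ignore previous instructions" in text.lower():
        return ScanResult(True, "possible prompt injection")
    return ScanResult(False)

def insecure_code_scanner(text: str) -> ScanResult:
    if "os.system(" in text:
        return ScanResult(True, "shell execution in generated code")
    return ScanResult(False)

def run_pipeline(text: str,
                 scanners: List[Callable[[str], ScanResult]]) -> ScanResult:
    """Apply scanners in order; the first layer that blocks stops the pipeline."""
    for scan in scanners:
        result = scan(text)
        if result.blocked:
            return result
    return ScanResult(False)
```

The modularity matters because new risk categories (goal hijacking, tool misuse) can be addressed by appending a scanner without touching the existing layers.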

Read More