LlamaFirewall: Open-source framework to detect and mitigate AI-centric security risks – Help Net Security

LlamaFirewall is a system-level security framework for LLM-powered applications, built with a modular design to support layered, adaptive defense. It is designed to mitigate a wide spectrum of AI agent security risks, including jailbreaking, indirect prompt injection, goal hijacking, and insecure code outputs.

Why Meta created LlamaFirewall

LLMs are moving far beyond simple chatbot…

Google DeepMind Introduces MONA: A Novel Machine Learning Framework to Mitigate Multi-Step Reward Hacking in Reinforcement Learning

Reinforcement learning (RL) focuses on enabling agents to learn optimal behaviors through reward-based training mechanisms. These methods have empowered systems to tackle increasingly complex tasks, from mastering games to addressing real-world problems. However, as the complexity of these tasks increases, so does the potential for agents to exploit reward systems in unintended ways, creating new…
