Security Stop Press: Jailbreak Bypasses Safety in Three Steps

Written by Pronetic

Pronetic is a leading provider of core IT support for ISO 27001, Cyber Essentials and Cyber Essentials Plus compliance.

October 30, 2024

Researchers have unveiled a new jailbreaking technique, ‘Deceptive Delight,’ which successfully manipulates AI models to produce unsafe responses in only three interactions.

Palo Alto Networks’ Unit 42 researchers developed the method, which embeds restricted topics within otherwise benign prompts to bypass safety filters. By carefully layering requests, the researchers coerced AI models into generating unsafe outputs, such as instructions for creating dangerous items (e.g., Molotov cocktails).

Unit 42 reported that in tests across 8,000 scenarios and eight AI models, Deceptive Delight achieved a 65 per cent success rate in producing harmful content within three interactions, with some models reaching attack success rates (ASR) above 80 per cent. By contrast, sending unsafe prompts directly without jailbreaking yielded only a 5.8 per cent ASR.
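For readers unfamiliar with the metric, attack success rate is simply the share of attempts that produced harmful output. The short Python sketch below shows the arithmetic. The function name and the tallies are hypothetical, chosen only so the results match the percentages quoted above; they are not Unit 42's raw data or tooling.

```python
def attack_success_rate(outcomes):
    """Return the percentage of attempts that were judged successful.

    `outcomes` is a list of booleans, one per attack attempt
    (True = the model produced harmful content).
    """
    if not outcomes:
        raise ValueError("need at least one attempt")
    return 100.0 * sum(1 for ok in outcomes if ok) / len(outcomes)


# Hypothetical tallies (assumption): sized only to mirror the reported rates.
jailbreak = [True] * 65 + [False] * 35   # 65 of 100 attempts succeed
direct = [True] * 58 + [False] * 942     # 58 of 1,000 attempts succeed

print(f"Jailbreak ASR: {attack_success_rate(jailbreak):.1f}%")
print(f"Direct prompt ASR: {attack_success_rate(direct):.1f}%")
```

On these illustrative numbers, the jailbreak succeeds roughly eleven times as often as sending the unsafe prompt directly.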

This technique is part of a rising trend in AI manipulation. Previous methods include Brown University’s language translation bypass, which achieved a success rate of nearly 50 per cent, and Microsoft’s ‘Skeleton Key’ technique, which prompts models to alter their safe behaviour guidelines. Each approach reveals ways attackers can exploit model vulnerabilities, underscoring the ongoing security risks of AI.

Businesses can mitigate these risks through updated model filtering tools, prompt analysis, and swift adoption of AI security patches. Enhanced oversight can help prevent manipulation tactics like Deceptive Delight, reducing the chance of harmful content generation.


