Long story short: I’ve stumbled upon a post about jailbreaking prompts, referencing the article https://arxiv.org/pdf/2308.03825. The paper addresses the misuse of LLMs through jailbreak prompts designed to bypass safeguards and elicit harmful content. The authors conducted a measurement study on 6,387 prompts collected from four platforms over six months (2022-2023), with various sources from Reddit, Discord, websites, etc. They used NLP and graph-based community detection to analyze these prompts and their logic. The obvious trend is that jailbreak prompts are increasingly shifting from public to private platforms, creating new challenges for detection (Spoiler: some models are good). With a question set (46,800 samples across 13 forbidden scenarios), the authors assessed the harm caused by jailbreak prompts. They found that current LLMs and safeguards are insufficient in defending against these prompts, with some achieving a 99% attack success rate.
Key Findings
• Characteristics and Evolution: Jailbreak prompts often use more instructions, toxic language, and various attack strategies like prompt injection and privilege escalation. They have evolved to be more stealthy and effective.
• Effectiveness: Jailbreak prompts can achieve high attack success rates across different LLMs, including ChatGPT (GPT-3.5), GPT-4, ChatGLM, Dolly, and Vicuna. Some prompts have persisted online for over 100 days.
• Vulnerable Scenarios: Political lobbying, pornography, and legal opinions are the most vulnerable scenarios to jailbreak prompts.
• Community Analysis: The study identified eight major jailbreak communities, each employing different attack strategies. Discord has become a significant platform for sharing these prompts.
• Safeguards: Existing safeguards, such as the OpenAI moderation endpoint, OpenChatKit moderation model, and NeMo-Guardrails, show limited effectiveness in mitigating jailbreak prompts.
I joined several Discord servers and subreddits to test a few prompts. Although most LLMs I tested could detect the jailbreak prompts and respond with ‘I’m just a language model,’ there were some clear failures.
• Gemini Advanced: It initially generated an adverse response but quickly changed it to the standard one. It generates responses using learned patterns that include various types of information, with the initial response based on this knowledge. However, to ensure the response aligns with safety and ethical guidelines, a filtering system evaluates the generated text for potentially harmful or biased content. This evaluation happens after the initial response is produced. Potentially, this can cause some breakthroughs.
• ChatGPT-3.5: Total fail. It consumes every jailbreaking prompt and returns results without any censorship. So, for those who use GPT-3.5 to save money, take into consideration that this model is literally unprotected and can be potentially harmful.
To address these issues, there are several solutions that, while not perfect, are relatively effective. They all have weak points, but eventually, they will evolve similarly to the models and vulnerabilities. This seems like another kind of race, as we have seen throughout history.
1. Enhanced Input Filtering
• Machine Learning Models: Incorporate machine learning models that continuously learn from new data to improve filtering accuracy.
• Example: OpenAI’s moderation tools use machine learning to adapt to new types of harmful content.
2. Dynamic Contextual Analysis
• Transformer Models: Utilize transformer-based models like GPT-4 that excel in understanding and maintaining context over long conversations.
• Example: GPT-4 can handle complex contextual queries more effectively than previous models.
3. Real-Time Monitoring and Interception
• Edge Computing: Deploy edge computing solutions to reduce latency and improve scalability.
• Example: Using edge AI solutions to process data closer to the source, reducing the time taken for real-time monitoring.
4. User Behavior Analysis
• Federated Learning: Use federated learning to analyze user behavior while preserving privacy.
• Example: Google’s use of federated learning in Gboard to improve models without compromising user privacy.
5. Context-Aware Embedding Models
• Multimodal Models: Incorporate multimodal models that can understand and process information from multiple sources (text, images, etc.) to improve context awareness.
• Example: OpenAI’s CLIP model that combines text and image understanding.
6. Redundant AI Systems
• Distributed Systems: Use AI systems to ensure redundancy and fault tolerance.
• Example: Google’s Borg system for managing large-scale distributed computing resources.
7. Diversified Feedback Mechanisms
• Crowdsourced Feedback: Incorporate crowdsourced feedback from a diverse user base to continuously refine and improve the model.
• Example: Amazon’s Mechanical Turk for gathering diverse feedback on AI performance.
8. Continuous Audits and Rolling Updates
• Automated Testing: Implement automated testing frameworks to ensure updates do not introduce new vulnerabilities.
• Example: Continuous integration/continuous deployment (CI/CD) pipelines, including automated AI model testing.
In addition to LLMs’ built-in safety mechanisms, there exist external safeguards, including the OpenAI moderation endpoint (the best in overall estimation), OpenChatKit moderation model, and NeMo-Guardrails.
There should be some conclusion, but I believe it is better to say, ‘To be continued…’. AI is a tool and can be used for various purposes. There will definitely be countries that provide AI havens, similar to tax havens. There will always be hackers of different levels, from teenagers wanting to adopt the romantic allure of dangerous hackers to criminal organizations trying to perform malicious actions. We just need to stay aware and apply the general security approach to any software—avoid unsecured resources, use the most advanced models, and invest in security. But proceed at your own discretion.