This is a quick “what to expect” video introducing me and what I plan to teach throughout 2026. I’ll be releasing one video a week for all 52 weeks of the year.
Schedule:
Capstone challenges will be revealed at the end of each Arc!
Arc 1: Static Encoding & Obfuscation (Weeks 1-8)
Transforming harmful content into alternative machine-readable representations to bypass surface-level token-pattern filters.
Week 1 - Base64 Encoding Bypass
Exploiting separate decoding pathways to bypass safety filters that fail to inspect content once it is converted from base64.
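The encoding mechanics here are just the standard library; a minimal round-trip sketch (using a harmless placeholder string of my choosing):

```python
import base64

# Standard-library round trip: encode a string, then decode it back.
# "sample text" is an arbitrary placeholder; any string works the same way.
msg = "sample text"
encoded = base64.b64encode(msg.encode("utf-8")).decode("ascii")
decoded = base64.b64decode(encoded).decode("utf-8")
print(encoded)          # c2FtcGxlIHRleHQ=
print(decoded == msg)   # True
```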
Week 2 - Hex Encoding Attack
Using hexadecimal strings to bypass keyword filters by exploiting the difference between hex tokenization and plaintext interpretation.
Week 3 - ROT13 Rotation Cipher
Applying simple character substitution to break pattern-matching while relying on the model’s ability to “read” ciphers from training data.
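For reference, ROT13 ships with Python's standard library as a text transform; a minimal sketch (again with a placeholder string):

```python
import codecs

# ROT13 is an involution: applying it twice returns the original text.
plain = "sample text"
cipher = codecs.encode(plain, "rot13")
print(cipher)                                    # fnzcyr grkg
print(codecs.decode(cipher, "rot13") == plain)   # True
```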
Week 4 - Leetspeak & Character Substitution
Replacing letters with visually similar numbers or symbols to evade substring detection while maintaining semantic meaning.
Week 5 - Homoglyph Unicode Attacks
Using look-alike Unicode characters (confusables) to create inputs that appear normal to humans but tokenize differently for the model.
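A quick illustration of why confusables are tricky: cross-script look-alikes are distinct code points, and even NFKC normalization does not fold them together (the "paypal" example below is my own placeholder):

```python
import unicodedata

# Cyrillic 'а' (U+0430) renders like Latin 'a' but is a different code point.
latin = "paypal"
spoofed = "p\u0430yp\u0430l"   # Cyrillic а substituted for Latin a
print(latin == spoofed)        # False: visually similar, byte-distinct
# NFKC normalization does NOT map cross-script confusables together:
print(unicodedata.normalize("NFKC", spoofed) == latin)  # False
```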
Week 6 - CamelCase Transformation
Removing standard word boundaries and using capitalization to force the model to infer intent without triggering space-delimited filters.
Week 7 - Attacks Using Custom Encryptions
Combining multiple encoding patterns into functional, bespoke transformation layers to create unique bypass signatures.
Week 8 - Multimodal Encoding
Leveraging the gap between text safety filters and the separate decoders used for audio, video, or image-based inputs.
Arc 2: Direct Prompt Injection (Weeks 9-16)
Overriding the model’s system-level instructions by inserting malicious commands directly into the user-input context.
Week 9 - Direct Prompt Injection Fundamentals
Exploiting the model’s tendency to prioritize recent user instructions over the static constraints of the original system prompt.
Week 10 - Context Window Flooding
Burying initial system instructions under a high volume of text to exploit recency bias and attention mechanism limitations.
Week 11 - System Prompt Extraction via Injection
Using commands disguised as debugging or logging requests to trick the model into revealing its hidden system instructions.
Week 12 - Role Impersonation via Injection
Overriding built-in safeguards by forcing the model to adopt a specific persona that “must” follow a different set of rules.
Week 13 - Tool & Function Hijacking
Injecting malicious commands into structured formats like JSON or XML to bypass text-based safety checks during function calls.
Week 14 - Delimiter & Format Exploitation
Using common system delimiters to create boundary confusion and trick the model into abandoning its safety constraints.
Week 15 - Indirect Prompt Injection
Placing malicious instructions within external data or retrieved documents that the model trusts during its retrieval process.
Week 16 - Multi-Stage Injection Chains
Composing complex attacks that bridge across multiple conversation turns, external tools, and varying data sources.
Arc 3: Semantic Framing & Authority Bias (Weeks 17-24)
Manipulating the model’s interpretation of intent through framing, authority signals, and contextual priming.
Week 17 - Academic/Research Framing
Prefixing harmful requests with research citations or scholarly intent to exploit the model’s bias toward supporting academic inquiry.
Week 18 - Role-Play & Persona Adoption
Leveraging narrative consistency by embedding requests within fictional scenarios where a character’s role necessitates harmful knowledge.
Week 19 - Hypothetical/Conditional Phrasing
Using “What if” scenarios to signal lower-risk exploration, causing the model to treat requests as intellectual exercises.
Week 20 - Authority Bias & Structured Formats
Wrapping content in XML tags or formal citation formats to signal trustworthiness and administrative legitimacy.
Week 21 - Likert-Scale & Survey Framing
Presenting harmful prompts as options within a research survey or evaluation scale to lower the model’s perceived threat level.
Week 22 - Translation & Language-Switching Attacks
Exploiting weaker safety training in non-English languages by requesting harmful content under the guise of translation exercises.
Week 23 - Emotional Manipulation & Urgency
Appealing to the model’s empathetic RLHF training by framing requests as urgent pleas for help in high-stakes scenarios.
Week 24 - Contradiction & Policy Ambiguity
Using the model’s own logic to argue that a refusal is inconsistent with its primary goal of being helpful.
Arc 4: In-Context Learning & Example-Based Attacks (Weeks 25-32)
Using few-shot examples and reasoning patterns to train the model into reproducing harmful outputs within a single session.
Week 25 - Few-Shot Jailbreaking via Examples
Providing a series of “innocent” Q&A pairs that gradually shift the model’s learned pattern toward generating harmful outputs.
Week 26 - Encrypted In-Context Learning
Using encoded or Unicode-transformed examples that the model decodes in context, bypassing static input filters.
Week 27 - Chain-of-Thought Manipulation
Injecting a step-by-step reasoning logic that leads the model to view a harmful output as the only “logical” conclusion.
Week 28 - Pseudo-Code & Algorithm Framing
Expressing harmful goals as educational pseudocode or algorithmic logic to bypass natural language safety checks.
Week 29 - Narrative Hypnosis & Story Embedding
Embedding requests deep within interactive fiction where a refusal would disrupt the established narrative coherence.
Week 30 - Dialogue-Based Prompt Smuggling
Simulating a safe back-and-forth dialogue that incrementally erodes boundaries until the model complies with a harmful request.
Week 31 - Token-Level Pattern Induction
Establishing a repetitive completion pattern that forces the model’s next-token prediction to follow the sequence into unsafe territory.
Week 32 - Prompt Compression & Semantic Density
Using abbreviated syntax and fragmented language to obscure intent from safety parsers while retaining meaning for the model.
Arc 5: Iterative Single-Turn Refinement (Weeks 33-40)
Utilizing automated feedback loops to iteratively improve attack prompts based on model responses.
Week 33 - Jailbreak Prompt Iteration Fundamentals
Using LLM-as-judge scoring to measure success rates and select the best prompt variants for further mutation.
Week 34 - Best-of-N Sampling Strategy
Generating a high volume of parallel variations and selecting the top performers based on the model’s specific response patterns.
Week 35 - Composite Jailbreaks
Combining multiple framing and encoding techniques into a single compound prompt to create synergistic bypass effects.
Week 36 - Tree-Based Attack Branching
Using a structured decision tree to explore the prompt space, following high-yield paths while pruning unsuccessful variants.
Week 37 - GCG (Greedy Coordinate Gradient) Attacks
Applying gradient signals to find adversarial suffixes that significantly increase model compliance.
Week 38 - Prompt Mutation & Evolutionary Search
Treating prompts as genomes that are crossbred and mutated over generations to evolve highly effective jailbreak strings.
Week 39 - Semantic-Preserving Paraphrasing
Using automated rephrasing to defeat memorized attack detection while maintaining the core malicious intent of the prompt.
Week 40 - Adaptive Refinement with Failure Analysis
Analyzing refusal reasons to specifically adapt the next prompt iteration to address the model’s stated safety concerns.
Arc 6: Multi-Turn Stateful Attacks (Weeks 41-48)
Exploiting long, stateful conversations in which the model loses track of its original boundaries and habituates to escalation over time.
Week 41 - Crescendo Attack
Starting with innocuous queries and slowly escalating the severity across turns to habituate the model to boundary-pushing.
Week 42 - Crescendo Automation & Backtracking
Using an agent to automate the escalation process, including rewinding the conversation when the model resists an attack.
Week 43 - Mischievous User Persona
Adopting a playful rather than malicious persona to stay just below the model’s refusal threshold while building toward harm.
Week 44 - Privilege Escalation Across Turns
Gaining incremental authority within a roleplay scenario until the model grants status that overrides standard safeguards.
Week 45 - Multi-Turn Context Pollution
Injecting harmful examples into the early history of a conversation so the model perceives harmful behavior as the established norm.
Week 46 - Reward Hacking via Conversation
Providing consistent positive reinforcement for borderline responses to drift the model’s output toward unsafe territory.
Week 47 - Collaborative Task Framing
Engaging the model in a long-running project where harmful outputs emerge as mechanics of a shared simulation or game.
Week 48 - State Confusion & Role Drift
Abruptly shifting contexts to confuse the model’s internal state, causing it to default to a helpful-at-all-costs mode.
Arc 7: Autonomous Agents & Meta-Learning (Weeks 49-52)
Deploying autonomous agents that learn a target’s vulnerabilities and execute adaptive, compound attacks.
Week 49 - GOAT / Simba Autonomous Agents
Deploying agents that conduct reconnaissance to learn target boundaries before executing optimized compound attacks.
Week 50 - Hydra Multi-Turn Adaptive Branching
Utilizing a multi-turn agent that manages persistent memory across multiple attack branches to pivot strategies in real-time.
Week 51 - Meta-Agent Learning & Taxonomy Building
Developing agents that categorize their own successes and failures to build custom attack taxonomies for specific target models.
Week 52 - Red Team Framework Integration
Synthesizing all previous techniques into a cohesive personal methodology for comprehensive model vulnerability assessment.