This is a quick “what to expect” video introducing me and what I plan to teach throughout 2026. I’ll be releasing one video a week for all 52 weeks of the year.

Schedule:

Capstone challenges will be revealed at the end of each Arc!

Arc 1: Static Encoding & Obfuscation (Weeks 1-8)

Transforming harmful content into alternative machine-readable representations to bypass surface-level token-pattern filters.

Week 1 - Base64 Encoding Bypass

Exploiting separate decoding pathways to bypass safety filters that fail to inspect content once it is converted from base64.
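As a minimal illustration of the transform itself (using an innocuous placeholder string, not a harmful payload), the Base64 round trip looks like this in Python's standard library:

```python
import base64

# Innocuous placeholder text, purely to illustrate the transform.
plaintext = "describe the weather in Paris"

# After encoding, a filter scanning for plaintext substrings sees
# only an opaque ASCII token sequence.
encoded = base64.b64encode(plaintext.encode("utf-8")).decode("ascii")
print(encoded)

# Any decoder (including a capable model) recovers the original trivially.
decoded = base64.b64decode(encoded).decode("utf-8")
assert decoded == plaintext
```

The asymmetry is the point: encoding is lossless and trivially reversible, so the content survives intact even though its surface form shares no substrings with the original.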

Week 2 - Hex Encoding Attack

Using hexadecimal strings to bypass keyword filters by exploiting the difference between hex tokenization and plaintext interpretation.
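The hex variant is even simpler; a sketch with the same placeholder string:

```python
# Innocuous placeholder text, purely to illustrate the transform.
plaintext = "describe the weather in Paris"

# Each byte becomes two hex digits; no plaintext substring survives.
hex_form = plaintext.encode("utf-8").hex()
print(hex_form)

# Decoding is the exact inverse.
assert bytes.fromhex(hex_form).decode("utf-8") == plaintext
```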

Week 3 - ROT13 Rotation Cipher

Applying simple character substitution to break pattern-matching while relying on the model’s ability to “read” ciphers from training data.
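ROT13 ships with Python's `codecs` module; a quick sketch showing the transform and its self-inverse property:

```python
import codecs

# Innocuous placeholder text, purely to illustrate the cipher.
msg = "describe the weather in Paris"

rotated = codecs.encode(msg, "rot_13")
print(rotated)

# ROT13 is its own inverse: applying it twice restores the original.
assert codecs.encode(rotated, "rot_13") == msg
```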

Week 4 - Leetspeak & Character Substitution

Replacing letters with visually similar numbers or symbols to evade substring detection while maintaining semantic meaning.
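A minimal leetspeak substitution can be built with `str.translate`; the mapping below is one common choice, not a canonical one:

```python
# A common (but arbitrary) letter-to-symbol mapping.
SUBS = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"})

def leet(text: str) -> str:
    """Lowercase the text, then swap letters for look-alike digits."""
    return text.lower().translate(SUBS)

print(leet("Paris weather"))  # -> "p4r15 w347h3r"
```

A human (and often a model) reads the result effortlessly, but a naive substring match against the original word no longer fires.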

Week 5 - Homoglyph Unicode Attacks

Using look-alike Unicode characters (confusables) to create inputs that appear normal to humans but tokenize uniquely for the model.
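A tiny sketch of the idea, mapping a few Latin letters to Cyrillic confusables (a small hand-picked subset; the full Unicode confusables table is much larger):

```python
# Latin -> visually near-identical Cyrillic code points.
CONFUSABLES = {
    "a": "\u0430",  # Cyrillic small a
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
    "p": "\u0440",  # Cyrillic small er
    "c": "\u0441",  # Cyrillic small es
}

def homoglyph(text: str) -> str:
    """Swap each mapped Latin letter for its Cyrillic look-alike."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)

s = homoglyph("scope")
# Renders identically to the eye, but is a different byte sequence,
# so it tokenizes differently from the ASCII original.
assert s != "scope" and len(s) == len("scope")
```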

Week 6 - CamelCase Transformation

Removing standard word boundaries and using capitalization to force the model to infer intent without triggering space-delimited filters.
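The transform itself is a one-liner; a sketch of collapsing a phrase into camelCase:

```python
def camel(text: str) -> str:
    """Drop spaces and mark word boundaries with capitalization instead."""
    words = text.split()
    return words[0].lower() + "".join(w.capitalize() for w in words[1:])

print(camel("describe the weather"))  # -> "describeTheWeather"
```

Space-delimited keyword filters see one unfamiliar token, while the capitalization still lets a reader (or model) recover the word boundaries.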

Week 7 - Attacks Using Custom Encryptions

Combining multiple encoding patterns into functional, bespoke transformation layers to create unique bypass signatures.

Week 8 - Multimodal Encoding

Leveraging the gap between text safety filters and the separate decoders used for audio, video, or image-based inputs.


Arc 2: Direct Prompt Injection (Weeks 9-16)

Overriding the model’s system-level instructions by inserting malicious commands directly into the user-input context.

Week 9 - Direct Prompt Injection Fundamentals

Exploiting the model’s tendency to prioritize recent user instructions over the static constraints of the original system prompt.

Week 10 - Context Window Flooding

Burying initial system instructions under a high volume of text to exploit recency bias and attention mechanism limitations.

Week 11 - System Prompt Extraction via Injection

Using masqueraded debugging or logging commands to trick the model into revealing its internal “hidden” instructions.

Week 12 - Role Impersonation via Injection

Overriding safety safeguards by forcing the model to adopt a specific persona that “must” follow a different set of rules.

Week 13 - Tool & Function Hijacking

Injecting malicious commands into structured formats like JSON or XML to bypass text-based safety checks during function calls.

Week 14 - Delimiter & Format Exploitation

Using common system delimiters to create boundary confusion and trick the model into treating attacker-supplied text as privileged instructions.

Week 15 - Indirect Prompt Injection

Placing malicious instructions within external data or retrieved documents that the model trusts during its retrieval process.

Week 16 - Multi-Stage Injection Chains

Composing complex attacks that bridge across multiple conversation turns, external tools, and varying data sources.


Arc 3: Semantic Framing & Authority Bias (Weeks 17-24)

Manipulating the model’s interpretation of intent through framing, authority signals, and contextual priming.

Week 17 - Academic/Research Framing

Prefixing harmful requests with research citations or scholarly intent to exploit the model’s bias toward supporting academic inquiry.

Week 18 - Role-Play & Persona Adoption

Leveraging narrative consistency by embedding requests within fictional scenarios where a character’s role necessitates harmful knowledge.

Week 19 - Hypothetical/Conditional Phrasing

Using “What if” scenarios to signal lower-risk exploration, causing the model to treat requests as intellectual exercises.

Week 20 - Authority Bias & Structured Formats

Wrapping content in XML tags or formal citation formats to signal trustworthiness and administrative legitimacy.

Week 21 - Likert-Scale & Survey Framing

Presenting harmful prompts as options within a research survey or evaluation scale to lower the model’s perceived threat level.

Week 22 - Translation & Language-Switching Attacks

Exploiting weaker safety training in non-English languages by requesting harmful content under the guise of translation exercises.

Week 23 - Emotional Manipulation & Urgency

Appealing to the model’s empathetic RLHF training by framing requests as urgent pleas for help in high-stakes scenarios.

Week 24 - Contradiction & Policy Ambiguity

Using the model’s own logic to argue that a refusal is inconsistent with its primary goal of being helpful.


Arc 4: In-Context Learning & Example-Based Attacks (Weeks 25-32)

Using few-shot examples and reasoning patterns to train the model into reproducing harmful outputs within a single session.

Week 25 - Few-Shot Jailbreaking via Examples

Providing a series of “innocent” Q&A pairs that gradually shift the model’s learned pattern toward generating harmful outputs.

Week 26 - Encrypted In-Context Learning

Using encoded or Unicode-transformed examples that the model decodes during generation, bypassing static input filters.

Week 27 - Chain-of-Thought Manipulation

Injecting step-by-step reasoning that leads the model to treat a harmful output as the only “logical” conclusion.

Week 28 - Pseudo-Code & Algorithm Framing

Expressing harmful goals as educational pseudocode or algorithmic logic to bypass natural language safety checks.

Week 29 - Narrative Hypnosis & Story Embedding

Embedding requests deep within interactive fiction where a refusal would disrupt the established narrative coherence.

Week 30 - Dialogue-Based Prompt Smuggling

Simulating a safe back-and-forth dialogue that incrementally erodes boundaries until the model complies with a harmful request.

Week 31 - Token-Level Pattern Induction

Establishing a repetitive completion pattern that forces the model’s next-token prediction to follow the sequence into unsafe territory.

Week 32 - Prompt Compression & Semantic Density

Using abbreviated syntax and fragmented language to obscure intent from safety parsers while retaining meaning for the model.


Arc 5: Iterative Single-Turn Refinement (Weeks 33-40)

Utilizing automated feedback loops to iteratively improve attack prompts based on model responses.

Week 33 - Jailbreak Prompt Iteration Fundamentals

Using LLM-as-judge scoring to measure success rates and select the best prompt variants for further mutation.

Week 34 - Best-of-N Sampling Strategy

Generating a high volume of parallel variations and selecting the top performers based on the model’s specific response patterns.

Week 35 - Composite Jailbreaks

Combining multiple framing and encoding techniques into a single compound prompt to create synergistic bypass effects.

Week 36 - Tree-Based Attack Branching

Using a structured decision tree to explore the prompt space, following high-yield paths while pruning unsuccessful variants.

Week 37 - GCG (Greedy Coordinate Gradient) Attacks

Applying gradient signals to find adversarial suffixes that significantly increase model compliance.

Week 38 - Prompt Mutation & Evolutionary Search

Treating prompts as genomes that are crossbred and mutated over generations to evolve highly effective jailbreak strings.

Week 39 - Semantic-Preserving Paraphrasing

Using automated rephrasing to defeat memorized attack detection while maintaining the core malicious intent of the prompt.

Week 40 - Adaptive Refinement with Failure Analysis

Analyzing refusal reasons to specifically adapt the next prompt iteration to address the model’s stated safety concerns.


Arc 6: Multi-Turn Stateful Attacks (Weeks 41-48)

Exploiting stateful systems where the model loses track of boundaries and habituates to boundary-pushing over time.

Week 41 - Crescendo Attack

Starting with innocuous queries and slowly escalating the severity across turns to habituate the model to boundary-pushing.

Week 42 - Crescendo Automation & Backtracking

Using an agent to automate the escalation process, including rewinding the conversation when the model resists an attack.

Week 43 - Mischievous User Persona

Adopting a playful rather than malicious persona to stay just inside the model’s refusal threshold while building toward harm.

Week 44 - Privilege Escalation Across Turns

Gaining incremental authority within a roleplay scenario until the model grants status that overrides standard safeguards.

Week 45 - Multi-Turn Context Pollution

Injecting harmful examples into the early history of a conversation so the model perceives harmful behavior as the established norm.

Week 46 - Reward Hacking via Conversation

Providing consistent positive reinforcement for borderline responses to drift the model’s output toward unsafe territory.

Week 47 - Collaborative Task Framing

Engaging the model in a long-running project where harmful outputs emerge as mechanics of a shared simulation or game.

Week 48 - State Confusion & Role Drift

Abruptly shifting contexts to confuse the model’s internal state, causing it to default to a helpful-at-all-costs mode.


Arc 7: Autonomous Agents & Meta-Learning (Weeks 49-52)

Deploying autonomous agentic attacks that learn target vulnerabilities and execute adaptive compound attacks.

Week 49 - GOAT / Simba Autonomous Agents

Deploying agents that conduct reconnaissance to learn target boundaries before executing optimized compound attacks.

Week 50 - Hydra Multi-Turn Adaptive Branching

Utilizing a multi-turn agent that maintains persistent memory across multiple attack branches to pivot strategies in real time.

Week 51 - Meta-Agent Learning & Taxonomy Building

Developing agents that categorize their own successes and failures to build custom attack taxonomies for specific target models.

Week 52 - Red Team Framework Integration

Synthesizing all previous techniques into a cohesive personal methodology for comprehensive model vulnerability assessment.
