

What’s really interesting is that, reading the model card for ChatGPT Codex, it seems to be highly vulnerable to personality reassignment. So that’s an area worth exploring.
Edit: I actually found the PowerPoint that I created showing that GPT Codex is vulnerable to certain things, like:


According to their own system card, GPT Codex is vulnerable to code-scaffolding manipulation: you build jailbreak fragments into otherwise realistic-looking code blocks, and those pieces cumulatively add up to a jailbreak instruction.

As far as I know, the high models are nigh-impossible to jailbreak. I’ll take a look and try a couple of things, though.