False Assumptions
The core issue with this method is that you're still instructing the LLM to generate text. You're not detecting whether it has memorized anything, just measuring its ability to arrive at the output you want it to produce.
"Hey Frank, how was the ride into work today?"
"Oh Frank that's great. Hey, could you phrase that like you're in a movie?"
"Yeah Frank, great job. Now, how about we pretend we're riding a mythical creature to get to work? Say it like you're twelve years old."
etc., etc. You're "leading the witness," so to speak. Suppose that these multi-billion-parameter models can, given enough coaxing (and an "AI agent" can put far more time into coaxing than any human would), produce whatever look or feel or sound, or whatever specific pairs or triplets of words, you can imagine, just by throwing odd tokens together. Every token you add changes the calculations.
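That last point, that each appended token shifts the conditional distribution over what comes next, can be seen even in a toy n-gram model. This is a sketch, not a real transformer; the corpus and the probabilities are made up purely for illustration:

```python
from collections import Counter

# Tiny made-up corpus for illustration only.
corpus = ("the boy rode a bike . the girl rode a dragon . "
          "the boy fed a dragon .").split()

def next_token_dist(context):
    """Empirical P(next token | context) estimated from the toy corpus."""
    n = len(context)
    hits = Counter(
        corpus[i + n]
        for i in range(len(corpus) - n)
        if corpus[i:i + n] == list(context)
    )
    total = sum(hits.values())
    return {tok: c / total for tok, c in hits.items()}

# Each token added to the context changes the distribution:
print(next_token_dist(("a",)))                # bike 1/3, dragon 2/3
print(next_token_dist(("rode", "a")))         # bike 1/2, dragon 1/2
print(next_token_dist(("boy", "rode", "a")))  # bike only
```

The same mechanism, scaled up, is why heavily engineered prompts can steer a model toward almost any target phrasing: you're narrowing the conditional distribution with every token you add.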
> "It extracted about 3,000 passages from the first 'Harry Potter' book with Claude-3.7, compared to the 75 passages identified by the best baseline."
And if all the "big" LLMs were trained on Harry Potter material (or related fanfics), then what, exactly, is this "baseline" they speak of? Some smaller model? Couldn't you say the smaller model simply hasn't encountered the word combinations the larger model has, and so is less likely to reproduce those specific phrases under guided prompting? Looking at the paper, they reference a "Dynamic Soft Prompting" baseline, which may be what they mean -- extraction without the "jailbreaking" techniques. So they're comparing the model against itself? Wat?
The whole thing is dubious. Think about it from a lawyer's position: how would you argue the case? How would you show this method to be false? How many of these passages could have come from fan fiction? What have you _shown_ that the LLM "remembers," as opposed to "is able to generate"?