Re: Here we go again
We've been circling this drain for over 50 years.
Until I see multiple detailed write-ups of what Mythos is being given and producing, with concrete PoCs that actually expose the issue, I'm still taking this with multiple container ships of salt. But this does expose the rather awkward hypothetical: at best (assuming any of this is true), we can build idiots that can kick down barns, but appear to be drawing a blank on systems that can write and maintain secure code.
"Please pay attention to us! I know we wrote a C compiler that doesn't do anything when you enable optimizations for $20,000, and think ncurses is a game engine and that terminals need to refresh at 60hz, but we're gonna hack you all because we have made an elite hacker ai now this time for realisies! only you can't see it, because it's too powerful. and we haven't actually got any exploits to show you that it actually made, but... but..."
Unfortunately the media haven't figured out that AI companies are shit-talkers every bit as compulsive as The Donald, so we've got yet another week of END OF COMPUTER? EXPERT SAYS YES articles shoved in everyone's faces to look forward to.
Hooray.
The future is fucking stupid.
I believe the issue is that until the PIN is entered, iOS doesn't initialize any connected USB devices. The fact that Apple themselves are actually addressing this bug implies that even their support engineers fell off the flowchart into "Buy a new phone" territory.
The problem is the AI industry has been crying wolf for the last three years. If this is a serious threat, then we need a tranche of evidence. This is not too much of an ask: if it is as good as claimed, there will be actual auditable information that can verify it soon enough, and the 50-odd companies with access should be able to flush out a statistically significant number of new bugs.
Alternatively, it sure is getting closer to Anthropic's IPO, so expect sudden downpours of bullshit, and don't forget your umbrellas...
That's an interesting analysis, for sure. But it's worth noting it's measuring penetration testing ability, rather than locally sourced, organic zero day exploit capability. This is the wild, unproven 'the sky is falling' claim from Anthropic that has the less astute outlets clutching their pearls this week.
This is forever the problem. LLMs can only look at your code through a toilet roll, and rely entirely on local tactical decisions. Once your codebase no longer fits on a napkin, correctness deteriorates exponentially, and token costs shoot through the roof.
Once you hit this problem, the sheer effort involved in trying to trick the thing into focusing on the problem becomes silly. It is false laziness. Unless something dramatically changes, I can't see it moving beyond this point.
For throwaway scripts, "plot this", "perform this transformation on this data", "write a quick and dirty rss/atom aggregator" it can be useful- but if it doesn't get it right first time, you must immediately stop and take over yourself, otherwise you're effectively glued to a roulette wheel that is shotgun debugging your entire codebase.
RAG is perhaps the biggest disappointment, because having a reliable data librarian tool could genuinely be useful for most industries. TreeRAG doesn't work; GraphRAG also doesn't work. Every quarter there's a new proposal that is 'definitely' going to fix it this time around. But this technology is purely a research curiosity at the moment, and people just don't seem to grasp how long it can take for a new idea to make realistic headway.
Something fundamental is missing, and 'throw more data at it' and 'make the model bigger' are not cutting it.
I think you've still added a lot of semantics to this that aren't apparent in the surface syntax.
Strategic: Could equally be a strategic disaster.
Enabler: Could just enable said disastrous strategy.
Transformation: Could transform said business from 'profitable' into 'bankrupt', due to aforementioned disaster enablement.
In total: We're going to fire everyone across the entire enterprise with any clue, and will be liquidated by Q4.
Come on Jelly, you aren't so silly as to have forgotten that the Republicans and Democrats switched sides after the civil rights movement, are you? The Democrats of the civil war became the Republicans of today.
This is a low-effort troll, even for you. I expect better even from Musketeers.
I always found it unsettling when I lived in the US that not only can bullets go through doors, but indeed, all of the walls too. As a Brit, I'd not given brickwork much thought until it occurred to me that some dunce messing about could accidentally discharge a rifle into the living room from a couple of doors down.
"And, to return to my opening argument re. the Turing Test, this means that if we cannot tell the difference, then what, exactly, is the difference, and how does it matter, if it matters at all?"
Well. It's the difference between a criminal damage charge and first degree murder, for a start. Are you suggesting this distinction needs revisiting?
Wow. Better keep those charts hidden. Even from psychologists. They'd beat the living shit out of someone for something that fluffy.
Don't tell me. We'll have a whole new set of benchmarks along each dimension to show how awesome 'AGI' is, and yet weirdly when we try to apply it to anything other than running benchmarks, it'll just drool down its chin.
So says Birgitta Böckeler, 'global lead for AI-assisted software delivery at Thoughtworks'.
But with that title, her entire job's existence is predicated on this being true. So she would say that, wouldn't she?
I bet she's not the one on call on Christmas eve when it all goes titsup.
Well. It has a 31.9% chance to fiddle with Coq and, after 16 attempts, port some simple operational semantics such that it can verify that 3 + 2 == 5.
No data structures? How about closures? Type theorists need not apply.
I don't mean to put a downer on this, but it feels like there's probably an easier way, and this is maybe not most people's use case. This is all pretty fundamental (but niche outside a first course on semantics) stuff. It's also yet another pointless benchmark to add to the 'marking our own homework' drawer.
It's going to be a PITA to go from this to proving that append is associative. And then... who cares anyway?
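For a sense of scale, here's what those two milestones amount to when a human writes them out. This is a sketch in Lean 4 rather than Coq (Lean used purely as an analogous proof assistant; the theorem name is mine), and both are near one-liners:

```lean
-- 3 + 2 = 5 closes by computation alone.
example : 3 + 2 = 5 := rfl

-- Append associativity: one induction on the first list.
-- (Named with a prime because Lean's core already ships List.append_assoc.)
theorem append_assoc' (xs ys zs : List Nat) :
    (xs ++ ys) ++ zs = xs ++ (ys ++ zs) := by
  induction xs with
  | nil => rfl
  | cons x xs ih => simp [ih]
```

If this is the benchmark ceiling after 16 attempts, the 'who cares' question rather answers itself.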
That's neat, but...
LLMs regularly give you an O(n^2) solution and claim it's O(log n). They're really good at implementing the naive algorithms, because there are lots of repos to copy them from, while the O(log n) implementations are fiddly to write, and considerably beyond the abilities of the damn things.
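A made-up toy example of the shape this usually takes (both functions are mine, for illustration only): the first is exactly the sort of thing that gets confidently labelled logarithmic in the accompanying prose.

```python
def has_duplicates_naive(xs: list[int]) -> bool:
    # O(n^2): compares every pair. This is the version that shows up,
    # often with a cheerful comment claiming O(log n).
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            if xs[i] == xs[j]:
                return True
    return False

def has_duplicates(xs: list[int]) -> bool:
    # O(n) expected: let a set do the bookkeeping.
    return len(set(xs)) != len(xs)
```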
Why not just use SQLite? Not only did they spend the time figuring out how to implement it correctly by studying the theory and algorithms, they benchmarked it, ensured it didn't randomly truncate the database file on Tuesdays every third month, and have spent years fielding bugs and fixing minor edge cases. It's also free, rather than spending $10k on tokens over the course of the project that you will, in the end, have to ditch anyway, once the true horrors of the eldritch monstrosity you have created become apparent.
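For reference, this is roughly the entire integration cost of the battle-tested option, via Python's bundled sqlite3 module (table and column names invented for illustration):

```python
import sqlite3

# Connect (creates the file if absent) and define a table.
con = sqlite3.connect("app.db")
con.execute("CREATE TABLE IF NOT EXISTS events (ts REAL, payload TEXT)")

# Parameterised insert; the library handles escaping and the file format.
con.execute("INSERT INTO events VALUES (?, ?)", (1700000000.0, "hello"))
con.commit()

for ts, payload in con.execute("SELECT ts, payload FROM events ORDER BY ts"):
    print(ts, payload)
con.close()
```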
I think, given the sheer amount of opaque/insecure/buggy code these things generate, and the fact they tend to want to change every single line in every single file per commit, if you tip over your stack with this stuff the only solution will be to burn it to the ground and start again. You might be able to roll back to a working commit, but trying to rationalize all the data it'll have mangled will be a nightmare from hell.
Sounds expensive. I wouldn't like to be THAT company.
This. Plus, when I am pair programming, or bringing a stuck server back up, I expect some rational pushback, or at least to be asked to clarify what I'm doing. Humans (generally) do this when they're not 100% sure of your next move. When an LLM does it, it means it's obviously way out of its depth. Which is always. The only time they ever appear to push back is when they are being stupid, at which point you're now arguing with a text generator, which is insane behaviour.
But the models _aren't_ getting better. That's the whole point.
On Memorization of Large Language Models in Logical Reasoning
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
It's really easy to have impressive results when you train your model on the benchmark itself. For example, MMLU scores have steadily climbed to saturation point (~95% accuracy). Release a fresh, comparable problem set the models haven't seen (HLE), and suddenly accuracy drops to ~10%. If they were solving MMLU because they understood the questions, you wouldn't see anywhere near this brittle a performance cliff.
Back in the 60s, psychologists applied the scientific method to psychics, and found positive correlations with things like ESP and telekinesis, yet we don't discuss this anymore. Is it a CIA coverup? No! When you are studying a chemical or physical phenomenon, you're studying something that isn't ACTIVELY trying to deceive you. The research methods simply weren't equipped to deal with charlatans. As soon as they tightened up the experimental conditions, the evidence of psychic phenomena suddenly vaporised.
So too with AI benchmarks. When you have private hold-out tests that the AI companies cannot get at, the performance results tank back to 'random chance'.
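A toy illustration of why the private hold-out matters (the 'model' and data here are entirely made up): a system that has memorised the public benchmark verbatim looks superhuman on it, and collapses on anything it hasn't seen.

```python
# Invented data: a public benchmark the "model" trained on, and a
# private hold-out it never saw.
public = {"Q1": "A", "Q2": "C", "Q3": "B"}
holdout = {"Q4": "D", "Q5": "A", "Q6": "B"}

memorised = dict(public)  # "training" ingested the public set verbatim

def accuracy(testset: dict[str, str]) -> float:
    return sum(memorised.get(q) == a for q, a in testset.items()) / len(testset)

print(accuracy(public))   # 1.0 -- benchmark saturated, press release ready
print(accuracy(holdout))  # 0.0 -- back to chance the moment the questions are new
```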
It's kind of embarrassing that tenured professors are being hoodwinked like this, but hey, it happened back in the 60s, and it's happening now. Fortunately, like before, people are figuring out ways to expose the fraud. There is no intelligence here, it was just the Eliza effect, after all.
Why bet? Just look at the knee joints around the legs. In both shots the robot is facing away from the camera. In the first shot, the joint below the knee points backwards (like an elevated calf); in the second, it points forwards, except now they're asymmetric and it appears to have some cancerous growth on its left leg.
The final shot's head is completely different from the first. Did they replace it, or put a helmet on it to escort it away? Why? Where's the owner? Why are they perp-walking it? Turn it off and place it in a van; it's a defective piece of machinery that at bare minimum will break your finger if you get it caught in a servo.
What about the dude on the left filming? Where's his footage? Why is he the only one filming? Surely just seeing a bipedal robot would draw a larger crowd, or at least cause more people to produce their phones, even BEFORE it gets into an altercation.
Nothing in this stupidystopia video makes sense.
In order for me to want my very own caged imbecile, there would have to be something that the dolt can do that is remotely useful in the first place.
Just because all the unemployed postgrads on LinkedIn are screaming about AI does not, in fact, mean that it works at all yet.
We have had three years of this nonsense now; where are the results? I only see loads of unreproducible arXiv papers and corporate whitepapers, and Amazon slamming the brakes because the entire fucking platform is leaking transactions everywhere. If they can't manage to do it at their scale, how on earth is pissing around with a docker image going to pan out?
Show us the money already.
C'mon vultures, I'm starting to think you've stopped being objective at all. Can't we keep the puff pieces to the more wafflesome sister sites?
This is not technology, this is just a cult that demands a tithe to bring you to the promised land.
Though there are plenty of boring other ways to do this that are apparently too useful to pursue. Take AlphaGeometry: it uses a transformer, but one trained ENTIRELY on path-optimal runs from its symbolic theorem prover. All the neural part is doing is acting as a heuristic function in your good old-fashioned state-space search. If you do it this way, you get a system that is sound, i.e. IF it produces an answer THEN it is guaranteed correct. You can run that pretraining from scratch on a 3090 in a weekend; this isn't even an energy hog.
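A hedged sketch of that division of labour (this is not AlphaGeometry's actual code; every name here is invented): the learned model only decides what to try next, while the symbolic side decides what is legal and what counts as done, so a bad heuristic can cost you time but never correctness.

```python
import heapq
import itertools
from typing import Callable, Hashable, Iterable, Optional

def guided_search(
    start: Hashable,
    expand: Callable[[Hashable], Iterable[Hashable]],  # symbolic: legal successor states only
    is_goal: Callable[[Hashable], bool],               # symbolic: verified complete proof
    score: Callable[[Hashable], float],                # neural: lower = looks more promising
    budget: int = 100_000,
) -> Optional[Hashable]:
    tie = itertools.count()  # stable tiebreak so states never get compared directly
    frontier = [(score(start), next(tie), start)]
    seen = {start}
    while frontier and budget:
        budget -= 1
        _, _, state = heapq.heappop(frontier)
        if is_goal(state):
            return state  # sound: only the symbolic checker can say "goal"
        for nxt in expand(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (score(nxt), next(tie), nxt))
    return None  # budget exhausted: no answer, and no claim made
```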
However this is all too uncool for anyone, because it's domain-specific and "NOT AGI ENOUGH", and all the large academic institutions have devolved into dicking around with prompting GPT5, and then being unable to reproduce each other's results on systems they don't even get to look inside or know how they're trained.