Remember, it's cheaper to outsource everything to the cloud!
Web archive user's $14k BigQuery bill shock after running queries on 'free' dataset
A user left with a surprise bill for thousands of dollars after running queries on Google's BigQuery data warehouse has sparked a debate about how vendors should place limits on the use of their tools. One user of HTTP Archive – a project that aims to track how the web is built – was recently horrified to get a $14,000 bill …
COMMENTS
Monday 26th February 2024 14:26 GMT Charlie Clark
Well, you'd have a bit more for the setup and regular data import, which is a real issue.
For a while I ran a local fork of the old httparchive code, which was very poorly modelled (it used manually created hashes for indexing!) and thus very inefficient. Even then, I only ever worked with the pages tables. Once I'd optimised the import scripts, imports took about 30 minutes for each data set on my laptop. Nearly all subsequent queries (anything except full-text domain searches) were very fast, which is what you'd expect from a properly modelled DB.
But the current engine uses map/reduce on the reports, which gets very expensive if you run queries that cover any period of time, because you quickly start analysing terabytes of non-normalised JSON. I think exports to CSV are still possible, and these are the way to go for any kind of extensive research. The "free" units per month are a good way to get a feel for the service and for writing the very-nearly-but-not-quite SQL of BigQuery.
Google is effectively sponsoring the project, which I still consider a great resource for getting an idea of how websites were built at any given moment over the last 15 years or so, but GCP really is a challenge for new users, especially when it comes to budget management. It would be nice to have some kind of rate limiter that new users explicitly have to deactivate for serious work.
It's also worth noting that this is the first time this has come up since the reports switched to BigQuery.
This post has been deleted by its author
Sunday 25th February 2024 13:12 GMT Len
Slappable jerk
Do you know the 'Slappable jerk' character 'the average Redditor'? He is scarily accurate, and very slappable indeed.
Monday 26th February 2024 11:15 GMT Necrohamster
From the linked discussion:
"Do you know about the dry_run option in the Python client? Granted, the estimate is provided in bytes, but it should give you an idea of the costs."
Thursday 22nd February 2024 21:44 GMT Tron
Hypothetical SmallQuery.
I guess these services could store your query. If someone else wanted to run the same query on the same data, they could pay a portion to share your results and you would get some of your original cash back. Interesting.
If you were the only person who searched for something really obscure on Google, and it was charged rather than free, would you get a much bigger bill for the processing involved than if you were one of the zillion who searched for 'Big Boobs' (whilst researching unorthodox uses of calculators in schools)? Because all that processing was only utilised by you, rather than being split a zillion ways. So don't lightly dismiss free ad-funded services. The alternative would see the internet used about as much as private viewdata services were in the 1980s.
In general, digital, online and adult services are usually only supplied after you click the 'Pay' button or hand over the banknotes. In this case, perhaps a little pop-up was in order in the corner of the screen with a running total in Benjamins and an 'OMG stop this madness now' button. Please make it so, Google. We don't all have your cash.
Friday 23rd February 2024 07:18 GMT Bebu
The downside is?
《So don't lightly dismiss free ad-funded services. The alternative would see the internet used about as much as private viewdata services were in the 1980s.》
At this point this alternative is really quite attractive.
The internet circa 1996 wasn't so bad - you had proper trolls (who probably owned a bridge or two) who could compose a grammatical, if demented, sentence, even stretching to a paragraph, without (immediately) descending into outright abuse.
With AltaVista etc. you had a fair chance of locating relevant material; now, with lashings of AI/LLM mixed into the current search engine mess, you will be lucky to get only "... defining loads of buttered grin..."*.
A world without Alphabet/Google/Youtube etc, Amazon, Meta/Facebook etc, X, Tiktok, etc etc doesn't look too bad to me.
* The output of a query asking whether dogs can safely eat Fruit Loops, or some such.
Friday 23rd February 2024 22:29 GMT the spectacularly refined chap
Re: The downside is?
A world without Alphabet/Google/Youtube etc, Amazon, Meta/Facebook etc, X, Tiktok, etc etc doesn't look too bad to me.
Amazon weren't too bad when they stuck to books. I still have nightmares about reading the "Computer Books" catalogue, printed on the same paper and in a similar-size font to the phone book, only the prices in the catalogue made telephone numbers look cheap.
Friday 23rd February 2024 08:24 GMT Lee D
Cloud is just a way to go back to charging for computing resources per byte, per cycle, per second.
I don't understand it, and I don't see how people have managed to convince business (who like fixed determinable costs) to go down that path.
For instance, if I was to migrate our in-house AD and VM structure to Azure, how much would my bill be next month? I can tell you *most* of it (reserved instances and all that), but I can't tell you at all what it would actually say on the invoice, nor can I guarantee that tomorrow it won't spike massively through our own ordinary use of the same systems.
Thin client, fat client, thin client, fat client...
Distributed, consolidated, distributed, consolidated...
In-house, outsourced, in-house, outsourced...
And now computing is
Purchased, rented, purchased, rented...
Sorry, but I don't want a system that is even CAPABLE of running up an unexpected $12,000 in a year, let alone from one query. It just shouldn't be possible. And when you consider the majority of the clients of such services, surely jumping out of a free tier into a $12k bill is something none of them want, something there should be guards against: the query should be denied outright, and you should have to go in and authorise it individually rather than it "just happening". I'd really rather my servers just stopped, for instance, than issued me a $12k bill for carrying on. And yet I spent many times more than $12k on my in-house servers that do the same.
This isn't about dumb users, or about how much you can run up a bill. This is about profiteering at the expense of having a set credit limit and a separate authorise button for anything over a user-controlled limit that has no default and has to be explicitly set by each customer before they are able to use the system.
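For what it's worth, BigQuery does offer a per-query guardrail of roughly that shape: the client can set maximum_bytes_billed, and a query that would bill more is rejected outright instead of running. A minimal sketch with the official Python client, where the 1 TB cap and the query are arbitrary examples:

```python
# Refuse to run any query that would bill more than a hard cap.
# Needs the google-cloud-bigquery package and GCP credentials configured.
from google.cloud import bigquery
from google.api_core.exceptions import GoogleAPICallError

ONE_TB = 10**12  # arbitrary example cap, in bytes

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=ONE_TB)

try:
    job = client.query(
        "SELECT url FROM `my_project.my_dataset.my_table`",  # placeholder query
        job_config=job_config,
    )
    rows = list(job.result())  # the job errors out here if it would exceed the cap
except GoogleAPICallError as err:
    print(f"Query refused before running up a bill: {err}")
```

The catch, as the thread points out, is that nothing forces a new user to set it.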
Friday 23rd February 2024 09:11 GMT Mike007
Concur
If I genuinely have the budget for £100,000/month worth of resources and am willing to pay extra rather than have downtime, then I am not going to make it to production without being aware that I need to change the default £100 limit... Whereas if I DON'T have that kind of budget, I would probably rather the system went down.
Saturday 24th February 2024 12:22 GMT Roland6
>” This is about profiteering at the expense of having a set credit limit and a separate authorise button for anything over a user-controlled limit that has no default and has to be explicitly set by each customer before they are able to use the system.”
Trouble is this seems to be a lesson that isn’t being learnt or sinking in.
We had exactly the same with mobile phones - remember:
The 2010 eruptions of Eyjafjallajökull and the disruption to air travel? They resulted in stranded travellers unknowingly running up phone/data bills of many thousands of pounds.
The issue with automatic debits from bank cards registered with the iStore, resulting in parents getting massive bills from children's in-game purchases.
The issue with the AWS "free tier", which also allowed uncapped chargeable usage.
…
At least the (UK) mobile phone companies reacted, and now you can set additional cost limits and receive a text message when the limit is approached.
I suspect neither the iStore nor AWS has changed…
Monday 26th February 2024 11:10 GMT Necrohamster
"I don't understand it, and I don't see how people have managed to convince business (who like fixed determinable costs) to go down that path."
OpEx versus CapEx I suppose.
Beancounters like OpEx because they can fully deduct expenses in the same year they're incurred, and the infrastructure costs are someone else's problem. Subscription costs are pay-as-you-go (to Tim's detriment in this case).
Disclaimer: I'm not a beancounter, and the above information doesn't constitute financial advice and may be completely incorrect. :D
But back to the present case, it seems the company didn't enforce limits in its GCP billing settings.
Friday 23rd February 2024 17:14 GMT Not Yb
Reminds me of the time I accidentally printed out the PostScript (a page description language) source code of a page, instead of telling the printer to RUN that source code and print the results. Luckily the local system operator realised what was happening and cancelled the job before it printed out all several hundred pages. Saved me a lot of money on my self-funded printing account.
"Are you sure you want to process 2.5 petabytes of data?" seems like a good question to ask whenever something like this gets run.
Sunday 25th February 2024 09:08 GMT StrangerHereMyself
Reason
This is the reason why I deleted my AWS account as soon as I had used it for its purpose. The last thing I want is for someone to steal my credentials and rack up a $100K bill with crypto-mining.
I once read an article about some guy using AWS who made a programming mistake and ended up with a $30K bill. Fortunately for him, AWS forgave the bill, but it made me extremely hesitant to use AWS or any other cloud service. Even if you're using the "free" tier you can still rack up a huge bill.
No cloud for me.
Monday 26th February 2024 09:40 GMT Necrohamster
It's always the "power users" doing dumb stuff
...However, the user, Tim, came back into the conversation. He said he was running queries from a Python script with the official GCP libraries, which, unlike the web UI, does not have a mechanism to show costs for a query, he said.
Sounds like Tim the power user was aware that queries cost money. How could anyone expect 2.5 PB of processing to be free? Just because you use a script, not the web interface? lol
Tough luck. Sounds like a classic case of someone-didn't-RTFM.
Monday 26th February 2024 12:35 GMT Necrohamster
Re: It's always the "power users" doing dumb stuff
From the post 'TIM' seems quite aware it cost money...
Indeed.
...racking up $14K in just 2 hours seem a bit above his expectation
Never ASSUME... it makes an ASS out of U and ME.. Well, it made an ass out of him anyway.
He was using a python library that did not present the warning as in the webgui.
That's why you make a dry run query before running a script that can cost you (or your employer) money.
Monday 26th February 2024 14:51 GMT Chris Evans
Re: It's always the "power users" doing dumb stuff
Dry run would be helpful but not the full answer. "queryDryRun demonstrates issuing a dry run query to validate query structure and provide an estimate of the bytes scanned."
No mention of cost estimate! Though AIUI it could be calculated.
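Indeed it can: the dry-run byte count converts to a rough figure with one multiplication. A hypothetical back-of-the-envelope helper, assuming the on-demand list price of about $6.25 per TiB scanned (check current GCP pricing, and remember the first 1 TiB per month is free):

```python
# Rough cost estimate from a BigQuery dry run's bytes-scanned figure.
ON_DEMAND_USD_PER_TIB = 6.25  # assumed on-demand list price; subject to change

def estimated_query_cost(total_bytes_processed: int) -> float:
    """Convert a dry-run byte count into an approximate on-demand cost in USD."""
    tib_scanned = total_bytes_processed / 2**40
    return tib_scanned * ON_DEMAND_USD_PER_TIB

# Example: a 2.5 PB scan, as in the article, works out to roughly $14,000.
print(f"${estimated_query_cost(int(2.5e15)):,.0f}")
```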