Web archive user's $14k BigQuery bill shock after running queries on 'free' dataset

A user left with a surprise bill for thousands of dollars after running queries on Google's BigQuery data warehouse has sparked a debate about how vendors should place limits on the use of their tools. One user of HTTP Archive – a project that aims to track how the web is built – was recently horrified to get a $14,000 bill …

  1. Peter2 Silver badge

    Remember, it's cheaper to outsource everything to the cloud!

    1. BitGin

      Good point, the cloud is much cheaper. If he'd decided to build his own server to run a query on 2.5PB of data he'd have had to spend 60K on hard drives alone ;)

    2. giin

      So how would I go about creating a system that allows me to query petabytes in seconds/minutes for less than 14k in on-prem?

      1. An_Old_Dog Silver badge
        Joke

        On-Prem Equivalent System

        OPR MSG: PLEASE MOUNT TAPE VSN ZX738S ON UNIT 02

        "Days" can be expressed in seconds or minutes, you know!

      2. Kevin McMurtrie Silver badge

        According to my calcs, you could build this for $14000. You'd have a much higher drive:CPU ratio than Google, but it would work with patience.

        1. Charlie Clark Silver badge

          Well, you'd have a bit more for the setup and regular data import, which is a real issue.

          For a while I ran a local fork of the old httparchive code, which was very poorly modelled (used manually created hashes for indexing!) and thus very inefficient. Even then, I only ever worked with the pages tables. Once I'd optimised the import scripts, imports took about 30 minutes for each data set on my laptop. Nearly all subsequent queries (anything except full-text domain searches) were very fast, which is what you'd expect for a properly modelled DB.

          But the current engine uses map/reduce on the reports, which gets very expensive if you run queries that cover any period of time, because you quickly start analysing terabytes of non-normalised JSON. I think exports to CSV are still possible and these are the way to go for any kind of extensive research. The "free" units per month are a good way to get a feel for the service and for writing the very-nearly-but-not-quite SQL of BigQuery.

          Google is effectively sponsoring the project, which I still consider a great resource for getting an idea of how websites were built at any one moment in time over the last 15 years or so, but GCP really is a challenge for new users, especially budget management. For new users, it would be nice to have some kind of rate limiter that you explicitly have to deactivate for work.

          It's also worth noting that this is the first time this has come up since reports switched to BigQuery.

      3. This post has been deleted by its author

  2. Anonymous Coward

    "running a query without understanding the volume of data it might address"

    That's rather ironic, seeing as you might not necessarily *know* the volume of data until you run the query....

    1. Gene Cash Silver badge

      "One respondent logged on to say that the complainant was an idiot ...Others may see this as unhelpful"

      They sound like Reddit Experts. Common sense isn't part of the equation.

      1. The Dogs Meevonks Silver badge
        Trollface

        Pretty sure some of those 'Reddit' experts have El Reg accounts too.

        1. Anonymous Coward

          At the time of your post, the 2 downvotes on the OP would suggest both of them have an El-Reg account.

      2. Len
        Devil

        Slappable jerk

        Do you know the 'Slappable jerk' character 'the average Redditor'? He is scarily accurate, and very slappable indeed.

    2. An_Old_Dog Silver badge
      Windows

      Job Limits

      Old-time operating systems had limits you could set per-job for CPU time consumed, lines printed, cards punched, permanent-file disk blocks saved, etc.

      Seriously, though, isn't the Linux "cgroups" feature supposed to help people limit process resource consumption?

    3. Necrohamster Silver badge
      Stop

      From the linked discussion:

      "Do you know about the dry_run option in the Python client? Granted, the estimate is provided in bytes, but it should give you an idea of the costs."
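For anyone wanting to try the dry run mentioned in that quote, here is a minimal sketch. It assumes the `google-cloud-bigquery` package is installed and credentials are already configured; the helper names `dry_run_bytes` and `as_tib` are made up for illustration. A dry run asks BigQuery how many bytes a query would scan without actually running it.

```python
def dry_run_bytes(client, sql):
    """Return the number of bytes BigQuery estimates the query would scan.

    `client` is expected to be a google.cloud.bigquery.Client.
    """
    # Imported lazily so the rest of this sketch works without the package.
    from google.cloud import bigquery

    # dry_run=True validates the query and estimates its size without running it;
    # the cache is disabled so the estimate reflects a full scan.
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


def as_tib(num_bytes):
    """Format a byte count in tebibytes so it is easier to eyeball."""
    return f"{num_bytes / 2**40:,.2f} TiB"
```

Typical use would be something like `as_tib(dry_run_bytes(bigquery.Client(), sql))` before the real query is submitted.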

  3. JoeCool Bronze badge

    I am sympathetic to "Tim"

    as his complaint seems to be about not being made aware of the $ cost of his data pull.

    But shouldn't the act of turning over a credit card trigger something?

    Or did Google mislead on something there?

    1. Phones Sheridan Silver badge

      Re: I am sympathetic to "Tim"

      It’s probably tied to a Google Workspace account, and helps itself to the credit card details registered to pay for that.

    2. Anonymous Coward

      Re: Or did google mislead on something there?

      Yes. Next question....

      Google is, from my POV, akin to a drug dealer. Lure you in with freebies, then hit you for $$$$$, saying... 'You would not like to lose all that lovely data now, would you?'

    3. Necrohamster Silver badge

      Re: I am sympathetic to "Tim"

      First thing Tim probably knew about it was when his manager kicked in his door wanting to know why his credit card was maxed out.

  4. Tron Silver badge

    Hypothetical SmallQuery.

    I guess these services could store your query. If someone else wanted to run the same query on the same data, they could pay a portion to share your results and you would get some of your original cash back. Interesting.

    If you were the only person who searched for something really obscure on Google, and it was charged rather than free, would you get a much bigger bill for the processing involved than if you were one of the zillion who searched for 'Big Boobs' (whilst researching unorthodox uses of calculators in schools)? Because all that processing was only utilised by you, rather than being split a zillion ways. So don't lightly dismiss free ad-funded services. The alternative would see the internet used about as much as private viewdata services were in the 1980s.

    In general, digital, online and adult services are usually only supplied after you click the 'Pay' button or hand over the banknotes. In this case, perhaps a little pop-up was in order in the corner of the screen with a running total in Benjamins and an 'OMG stop this madness now' button. Please make it so, Google. We don't all have your cash.

  5. Bebu Silver badge
    Windows

    The downside is?

    《So don't lightly dismiss free ad-funded services. The alternative would see the internet used about as much as private viewdata services were in the 1980s.》

    At this point this alternative is really quite attractive.

    Internet ca 1996 wasn't so bad - you had proper trolls (who probably owned a bridge or two) and could compose a grammatical, if demented, sentence even stretching to a paragraph without (immediately) descending to outright abuse.

    With AltaVista etc you had a fair chance of locating relevant material. Now, with the lashings of AI/LLM mixed into the current search engine mess, you will be lucky to get only "... defining loads of buttered grin..."*.

    A world without Alphabet/Google/Youtube etc, Amazon, Meta/Facebook etc, X, Tiktok, etc etc doesn't look too bad to me.

    * The output to a query whether dogs can safely eat Fruit Loops or some such.

    1. the spectacularly refined chap

      Re: The downside is?

      A world without Alphabet/Google/Youtube etc, Amazon, Meta/Facebook etc, X, Tiktok, etc etc doesn't look too bad to me.

      Amazon weren't too bad when they stayed on books. I still have nightmares of reading the "Computer Books" catalogue, printed on the same paper and in a similar size font to the phone book, only the prices in the catalogue made telephone numbers look cheap.

    2. Anonymous Coward

      Re: The downside is?

      I agree. I miss the information superhighway, surfing the net in cyberspace.

  6. Lee D Silver badge

    Cloud is just a way to go back to charging for computing resources per byte, per cycle, per second.

    I don't understand it, and I don't see how people have managed to convince business (who like fixed determinable costs) to go down that path.

    For instance, if I was to migrate our in-house AD and VM structure to Azure, how much would my bill be next month? I can tell you *most* of it (reserved instances and all that), but I can't tell you at all what it would actually say on the invoice, nor can I guarantee that tomorrow it won't spike massively through our own ordinary use of the same systems.

    Thin client, fat client, thin client, fat client...

    Distributed, consolidated, distributed, consolidated...

    In-house, outsourced, in-house, outsourced...

    And now computing is

    Purchased, rented, purchased, rented...

    Sorry, but I don't want a system where it's even CAPABLE of running up an unexpected $12,000 in a year, let alone one query. It just shouldn't be possible. And when you consider the majority of the clients of such services, surely jumping out of free tiers into $12k bills is something that none of them want, that there should be guards against, and that the query should be denied outright and you have to go in and authorise it individually rather than it "just happens". I'd really rather my servers just stopped, for instance, than issued me a $12k bill for carrying on. And yet I spent many times more than $12k on my in-house servers that do the same.

    This isn't about dumb users, or about how much you can run up a bill. This is about profiteering at the expense of having a set credit limit and a separate authorise button for anything over a user-controlled limit that has no default and has to be explicitly set by each customer before they are able to use the system.

    1. Mike007 Bronze badge

      Concur

      If I genuinely have the budget for £100,000/month worth of resources and am willing to pay extra rather than have downtime, then I am not going to make it to production without being aware that I need to change the default £100 limit... Whereas if I DON'T have that kind of budget, I would probably rather the system went down.
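The "low default limit that you must explicitly raise" idea is easy to sketch on the client side. The names below (`BudgetExceeded`, `authorise`) and the default of 100 are hypothetical, purely to illustrate the policy described above; this is not how any provider implements it. For what it's worth, the BigQuery Python client does expose a `maximum_bytes_billed` setting on `QueryJobConfig`, which is meant to make a query fail rather than bill past a cap.

```python
class BudgetExceeded(Exception):
    """Raised when an estimated charge is above the configured limit."""


def authorise(estimated_cost, limit=100.0, override=False):
    """Refuse by default: proceed only under the limit or with an explicit override.

    The default limit of 100 is an illustrative assumption, not a real provider default.
    """
    if estimated_cost > limit and not override:
        raise BudgetExceeded(
            f"estimated {estimated_cost:,.2f} exceeds limit {limit:,.2f}"
        )
    return True
```

Someone with a real six-figure budget passes `override=True` (or a higher `limit`) once, deliberately; everyone else gets an error instead of an invoice.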

    2. Roland6 Silver badge

      >” This is about profiteering at the expense of having a set credit limit and a separate authorise button for anything over a user-controlled limit that has no default and has to be explicitly set by each customer before they are able to use the system.”

      Trouble is this seems to be a lesson that isn’t being learnt or sinking in.

      We had exactly the same with mobile phones - remember:

      The 2010 eruptions of Eyjafjallajökull and the disruption to air travel? Which resulted in travellers unknowingly running up massive £,000’s phone/data bills.

      The issue with automatic debits to bank cards used on iStore, resulting in parents getting massive bills from children’s in-game purchases.

      The issue with AWS “free tier” which also was uncapped chargeable usage.

      At least the (uk) mobile phone companies reacted and now you can set additional cost limits and receive a text message when the limit is approached.

      I suspect neither iStore nor AWS have changed…

    3. Necrohamster Silver badge

      "I don't understand it, and I don't see how people have managed to convince business (who like fixed determinable costs) to go down that path."

      OpEx versus CapEx I suppose.

      Beancounters like OpEx because they can fully deduct expenses in the same year they're incurred, and the infrastructure costs are someone else's problem. Subscription costs are pay-as-you-go (to Tim's detriment in this case).

      Disclaimer: I'm not a beancounter, and the above information doesn't constitute financial advice or may be completely incorrect. :D

      But back to the present case, it seems the company didn't enforce limits in its GCP billing settings.

  7. Not Yb Bronze badge

    Reminds me of the time I accidentally printed out the PostScript (a page description language) source code of a page, instead of telling the printer to RUN that source code and print the results. Luckily the local system operator realized what was happening and cancelled the job before it printed out all several hundred pages. Saved me a lot of money on my self-funded printing account.

    "Are you sure you want to process 2.5 petabytes of data?" seems like a good question to ask whenever something like this gets run.

    1. HMcG

      >"Are you sure you want to process 2.5 petabytes of data?" seems like a good question to ask whenever something like this gets run.

      Not if you are making £14,000 by not asking the question...

    2. Dave314159ggggdffsdds Silver badge

      I'm sure that is the default setting, but for whatever reason this idiot was running queries on an account where the default settings had been deliberately overridden.

  8. StrangerHereMyself Silver badge

    Reason

    This is the reason why I deleted my AWS account as soon as I had used it for some purpose. The last thing I want is for someone to steal my credentials and rack up a $100K bill with crypto-mining.

    I once read an article about some guy using AWS, making a programming mistake and ending up with a $30K bill. Fortunately for him AWS forgave him the bill, but it made me extremely hesitant to use AWS or any other cloud service. Even if you're using the "free" tier you can still rack up a huge bill.

    No cloud for me.

  9. mrGecko

    Big Query - Big Bill

    Tough luck loser!

    Hope Google will forgive the bill. Pretty unfortunate.

  10. well meaning but ultimately self defeating

    Idiot....

    I don't know why kids today expect everything to be free and act all surprised when it's not..... Where do they think stuff comes from????

  11. spireite Silver badge

    ..but, but....

    cloud is cheap and it's bottomless........

    ...a bit like your corporate wallet...

  12. Sceptic Tank Silver badge
    Thumb Up

    42

    What was the answer in the end?

  13. Necrohamster Silver badge
    Facepalm

    It's always the "power users" doing dumb stuff

    ...However, the user, Tim, came back into the conversation. He said he was running queries from a Python script with the official GCP libraries, which, unlike the web UI, do not have a mechanism to show costs for a query.

    Sounds like Tim the power user was aware that queries cost money. How could anyone expect 2.5 PB of processing to be free? Just because you use a script, not the web interface? lol

    Tough luck. Sounds like a classic case of someone-didn't-RTFM.

    1. Morten Bjoernsvik

      Re: It's always the "power users" doing dumb stuff

      From the post 'TIM' seems quite aware it cost money, suggesting a $5K limit. But racking up $14K in just 2 hours seems a bit above his expectation. He was using a Python library that did not present the warning as in the web GUI.

      1. Necrohamster Silver badge

        Re: It's always the "power users" doing dumb stuff

        From the post 'TIM' seems quite aware it cost money...

        Indeed.

        ...racking up $14K in just 2 hours seems a bit above his expectation

        Never ASSUME... it makes an ASS out of U and ME.. Well, it made an ass out of him anyway.

        He was using a Python library that did not present the warning as in the web GUI.

        That's why you make a dry run query before running a script that can cost you (or your employer) money.

        1. Chris Evans

          Re: It's always the "power users" doing dumb stuff

          Dry run would be helpful but not the full answer. "queryDryRun demonstrates issuing a dry run query to validate query structure and provide an estimate of the bytes scanned."

          No mention of cost estimate! Though AIUI it could be calculated.
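The calculation from bytes to money is short enough to sketch. The function name `estimated_cost_usd` is made up, and the rate of 6.25 USD per TiB used in the example is an assumption about on-demand pricing that should be checked against the current price list; the 2.5 PB figure is the one quoted in this thread.

```python
TIB = 2**40  # bytes in one tebibyte


def estimated_cost_usd(bytes_scanned, usd_per_tib, free_tib=0.0):
    """Turn a dry-run byte estimate into an approximate on-demand charge.

    `usd_per_tib` and `free_tib` are inputs the caller must look up;
    nothing here queries real pricing.
    """
    billable_tib = max(bytes_scanned / TIB - free_tib, 0.0)
    return billable_tib * usd_per_tib


# Assumed inputs: 2.5 PB scanned, at an assumed 6.25 USD per TiB.
print(f"${estimated_cost_usd(2.5e15, usd_per_tib=6.25):,.0f}")
```

Under those assumptions the result lands in the same region as the bill reported in the article, which suggests a dry run plus one multiplication would have flagged the problem in advance.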

          1. Necrohamster Silver badge
            FAIL

            Re: It's always the "power users" doing dumb stuff

            The web UI doesn't give a cost estimate either, so I'm not sure what you're on about.

            The dry run can give a size estimate, from which someone can calculate the cost.

            But thinking is hard for some people I guess...
