Re: Sounds awful because it is
Go is highly deterministic. And it wasn't a brainless 'LLM' doing the learning. Algorithmic (self) optimization is of course a thing and has been for 50 years. Chess engines are likewise NOT LLMs.
Hmm, so thousands of low-talent, low-skill, low-IQ H-1Bs taking away American jobs doesn't result in the best infra and software products? Whodathunk!
(Yes, I've rubbed shoulders with plenty of the H-1Bs in the Seattle offices, thanks, so my observation is first-friggin' hand, K?)
IMO Amazon should jettison all the "value-add" and focus on the essential services. Let someone else offer fancy-pants services that are transparently utilizing the underlying SQS/EC2/Lambda et al.
AWS ops/devs are spread too thin. They can't/won't kick customers out of us-east or close the damn door: sorry, it's no longer a permitted region in which to deploy. We saw this being a giant problem over 10 years ago, but Sales was more interested in not ruffling customer feathers than facing the damn music. Seriously, fk Sales people. They're 2nd in line behind the lawyers.
I ran a service that used Textract for financial forms. This year the powers that be decided to use Google's Gemini AI to do OCR. It's so bad at it, my chicken could do better. And yet we're supposed to soldier on, even though it's 3x the cost of Textract. Huh?
GCP and Azure like to think they have parity with AWS on the essential building blocks, but they really don't, IMO. GCP is just fickle and obtuse. Azure, well, their VPC networking is just atrocious to manage.
The problem here is 2-fold:
1 - too damn big of a single zone. "We" (internal staff) were harping about this back in 2012, warning that it would be the ruin of us.
2 - too damn cute by half.
The footprint of these services is massive, and DNS in particular is LOUSY at 'real time' data updates, because every piece of client software is STUPID about caching answers and won't move off and try the next one. But they chose to abuse the shit out of DNS so they could steer traffic and "rapidly" enter and exit resources that were coming online or going out/dead. IT IS NOT A SIN IF THE ENDPOINT IS NOT THERE! Since we KNOW DNS clients are total shite, you're supposed to do all your *magic* where people (clients) can't see it. Furthermore, they could and should have rolled out a custom resolver across the fleet that was smarter than the typical dumb-ass implementation and did things like honoring priority tags on A records.
They tried to force DNS into "instant" updates and it rightfully bit them in the ASS. It obviously doesn't help when your coders are too friggin' stupid to write software that is DEFENSIVE in what it accepts. Partially wrong (old) answers are preferable to no answer. The internal rate-limiting is also mostly a manual intervention; there is/was very little AGGRESSIVE rate-limit logic built in. It is all written with the assumption "it just works".
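To make the point concrete, here's a minimal sketch (mine, not AWS's) of what a defensive client looks like in Python, stdlib only: cache the last known-good answer, prefer a stale answer to no answer, and try every endpoint before giving up:

```python
import socket
import time

# Last-known-good cache: hostname -> (timestamp, [ip, ...])
_lkg_cache = {}
LKG_MAX_AGE = 3600  # tolerate answers up to an hour stale

def resolve_defensively(hostname, port=443):
    """Resolve a name, but fall back to a stale cached answer
    rather than failing outright when DNS misbehaves."""
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
        ips = [info[4][0] for info in infos]
        _lkg_cache[hostname] = (time.time(), ips)
        return ips
    except socket.gaierror:
        cached = _lkg_cache.get(hostname)
        if cached and time.time() - cached[0] < LKG_MAX_AGE:
            return cached[1]   # an old answer beats no answer
        raise

def connect_any(hostname, port=443, timeout=3.0):
    """Try every resolved endpoint instead of giving up on the first."""
    last_err = None
    for ip in resolve_defensively(hostname, port):
        try:
            return socket.create_connection((ip, port), timeout=timeout)
        except OSError as err:
            last_err = err     # a dead endpoint is not a sin; try the next one
    raise last_err
```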
Akamai did this correctly: the IPs are fixed as far as the 'front door' is concerned. All the machinations and endpoint add/dead discovery happen on the BACKSIDE, with KNOWN FIXED last-ditch endpoints in their CDN pods. So yes, user traffic may overrun the capacity of these fixed points, but that's relatively minor compared to a service outage. Furthermore, internal cross-talk should have utter priority, if not its own sub-division of resources that are impervious to end-user load.
the "skill level" at Amazon is highly uneven. DynamoDB being one of the original services it used to be run by top-shelf talent. Similarly S3. But even 10+ years ago the brain drain was in full swing, it was reduced to monkeys from India (H1b natch) desperately trying to understand how the system worked and how to keep all the plates in the air and spinning. Said monkeys had never run a complex system, let alone seen one, and were entirely inadequate to the task. (This was when Jasse was head of AWS, several years before his promotion to CEO)
The S3 service, for example, relies on the notion of eventual consistency for customer data. So it has large leeway in how things are presented to the user. DynamoDB follows similar "eventual consistency", but on much shorter timescales. People FORGET or IGNORE these realities of the 2 services and treat them as ATOMIC with immediate READ-AFTER-WRITE consistency. From there all kinds of hilarity ensues.
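If you don't believe me, the knob is right there in the API. A minimal boto3 sketch (table and key names are made up) showing that read-after-write consistency in DynamoDB is strictly opt-in:

```python
import boto3

ddb = boto3.client("dynamodb")

# Default reads are eventually consistent and may return stale data.
stale_ok = ddb.get_item(
    TableName="orders",                      # hypothetical table
    Key={"order_id": {"S": "o-12345"}},      # hypothetical key
)

# If you genuinely need read-after-write semantics, you must ask for them
# (at double the read-capacity cost, and not available on global indexes).
fresh = ddb.get_item(
    TableName="orders",
    Key={"order_id": {"S": "o-12345"}},
    ConsistentRead=True,
)
```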
It took 3 days for the S3 ecosystem (back then a mere 300,000 nodes) to coalesce and assemble a 'system image' as distinct pods or spheres gradually merged and updated their maps of "who is my neighbor". DynamoDB works similarly. BUT their code-base may be VERY outdated. It is common to find teams in AWS using a dozen different versions of the main tooling because they can't be arsed to stay current, or even moderately so. So the race condition between the 2 update/execute paths *may* have had divergent code-bases, and thus behaviors, at its root.
IMO the constant fudging with DNS is an outgrowth of not understanding that the system was never intended to be contorted into what AWS uses it for, and I suspect they could have used SLOW (relatively speaking) state-map updates to keep the map churn at a reasonable level. People who use AWS, and frankly internal staff too, don't seem to understand that INVALID (stale) DATA is FINE. Wait a bit and try again, or try the alternate answers. But no, they keep trying to achieve "perfection" in accuracy and you get these bolted-on "improvements" which wreak havoc when some of the fundamental assumptions prove to be incorrect.
> AWS has disabled the DynamoDB DNS Planner and DNS Enactor automation worldwide until safeguards can be put in place to prevent the race condition reoccurring.
Which comes back to trying to get cute with DNS resolution. Clients have a NASTY tendency to hold onto records forever, which causes highly lumpy inbound access to services. What's hilarious is Amazon didn't copy what Akamai does with their load-balancers. The whole "let's be cute" business of screwing with the tables and jiggering with weights can be reduced to something much simpler. In the meantime your DNS records can be STATIC, and endpoint alive/dead detection can be slow-moving and localized to the LB in question that owns the range.
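Something like this is all the 'alive/dead' logic needs to be: a slow probe loop with hysteresis, local to the LB, while the DNS out front never changes. A sketch (my own, obviously not Akamai's code):

```python
import socket
import time

# Slow-moving alive/dead tracking, local to the LB that owns these backends.
# The DNS records out front never change; only this table does.
UP_THRESHOLD = 3    # consecutive successes before a dead node is readmitted
DOWN_THRESHOLD = 3  # consecutive failures before a live node is ejected

class Backend:
    def __init__(self, ip, port):
        self.ip, self.port = ip, port
        self.alive = True
        self.streak = 0  # consecutive probes disagreeing with current state

    def probe(self, timeout=2.0):
        try:
            socket.create_connection((self.ip, self.port), timeout=timeout).close()
            ok = True
        except OSError:
            ok = False
        # Hysteresis: one odd probe never flips state, so the map churns slowly.
        if ok == self.alive:
            self.streak = 0
            return
        self.streak += 1
        if (ok and self.streak >= UP_THRESHOLD) or \
           (not ok and self.streak >= DOWN_THRESHOLD):
            self.alive, self.streak = ok, 0

def health_loop(backends, interval=10.0):
    while True:
        for b in backends:
            b.probe()
        time.sleep(interval)  # deliberately slow; no "instant" updates
```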
Every ec2 instance has 1 NIC port. ONE. This includes all instances used internally for the backplane. There is also only 1 top-of-rack switch. Every S3 JBOD (45+ drives) has ONE power supply. Even better, they had to replace 75,000 power supplies because they shaved the wattage so close that all it took was 2 drives going into click-of-death mode to blow the fuse and wipe out the entire node.
The 'application' was throwing datasets at it and having it curve-fit. Only thing is, there's this software called MATLAB, and another called R, that have been around for 30+ years and exist solely to do such work. The individual license might be expensive, but it doesn't give you false answers EVER! AI conveniently invents answers some portion of the time.
The other institutional use-case was trawling through 30 years of corporate documents, research papers, 3D model scans, failure data of aeronautical components, etc. Except I distinctly recall a 2U Dell 2940 server with a bit of disk and RAM and a yellow faceplate that said "Google" on it, back in 1998, that did a fantastic job of indexing and cross-referencing keywords to documents when you turned it loose against all the NFS and CIFS fileshares across the enterprise. It *too* didn't know how to lie and make random things up. It cost what, 20 grand back then?
And yes there's some cadre of "coders" who think they can vibe their way to productivity if only they could learn how to cajole the infinite monkeys contained within the box to generate code that actually worked.
The people in power and the breathless users really ARE this friggin' stupid.
A decently useful rig costs that much. If you can get, say, 5 devs to use it cooperatively, that's 200/mo/dev, assuming no special power or HVAC needs and you're willing to amortize over 3 loooooong years. I've built rigs for 5k that did a decent enough job for piddly-ass usage.
Running your own infra, or renting it at a straight rate of $4/hr (1000/mo), is the only sane way forward. OpenAI et al. are terminally f'ked.
A B200 is 250k. At 2-yr depreciation and 50 active users, it's about $1/hr/user. Power and cooling is about $1,500/mo.
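The back-of-envelope, in case anyone wants to check my arithmetic (the active-hours figure is my assumption, roughly 40 hr/wk of utilization):

```python
# Back-of-envelope for the B200 numbers above.
hw_cost = 250_000                    # B200 rig, $
depreciation_years = 2
users = 50
active_hours_per_user_year = 2_000   # assumption: ~40 hr/wk utilization

hw_per_user_year = hw_cost / depreciation_years / users        # $2,500
per_active_hour = hw_per_user_year / active_hours_per_user_year
print(f"${per_active_hour:.2f}/hr/user")                       # ~$1.25/hr/user

# Power and cooling at ~$1,500/mo adds:
pc_per_user_hour = 1_500 * 12 / users / active_hours_per_user_year
print(f"+${pc_per_user_hour:.2f}/hr/user for power/cooling")   # ~$0.18
```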
Muntz "Ha Ha" comes to mind. For people so smart they are incredibly naive, even stupid! Aside from worthless chat bits, you're not going to see prices below 1000/mo for any useful work. If I even stipulate they are capable of such. And the economic reality was bloody obvious months ago if you were paying the slightest attention.
Crazy orange has nothing to do with the incentive. This is just an already pre-built location with half of the cake already made; it just needs finishing operations. It's a cheap ante into building more AI "compute". It's probably a lot cheaper to operate than many other MS sites (including NoVA), so when the AI balloon pops in less than 2 years, old/small sites in NoVA are more likely to face the ax than this one.
The sites in Oregon and Washington have dirt cheap electricity so those will stick around.
A certain 7-building campus in Manassas VA has 3-5 security staff per building per 12-hr shift; min staffing is 2. Facility "customer service" and vendor coordination is 8 more during business hours. M$ (and the other big 7) hire their own augmented and armed security staff (from one of the firms you commonly see providing shopping mall security) at 2 or 3 per shift per building where they have data halls.
So call it 6 people per shift per building. Or probably even fewer than what Foxconn ended up with. House electricians or HVAC are typically contracted out and simply on-call. So more jobs that are neither perm nor resident.
The construction of a site was probably 100 people at any given time. The mud workers are rotated out once the walls are standing, and their employment is over. Then the pipe-fitters and electricians move in. From greenfield to ready-for-customer-load takes about 8-10 months in northern VA, but we do these builds constantly and it's a well-oiled process. In bumbfk Wisconsin it might take longer because of Union horseshit and adverse weather. NTT Data dropped 4 new datacenters into Loudoun inside of 10 months.
Datacenters don't take "years" unless you don't already have the permits, and don't have the high-tension wires and sub-stations built out. Given that the MS/Foxconn site is an established industrial location, there is no lead time there.
And not bothering is frankly the point. His project has no value to the ecosystem. Only in his warped mind is the current state of affairs in need of his solution. Which is buggy as hell.
Who gives a god damn if you can't boot off of it? That's why we have /boot and root filesystems that are small and essentially read only. The myopia over zfs on root is the same mental disease. It's a stupid thing to do ALWAYS!
Fine. But it's still not remotely ready for use except as experimental. Kent is apparently desperate. Nobody loves his ugly baby, and the kernel gambit has done nothing to increase mind share. Honestly, he's just Don Quixote at this point. Nobody cares. LVM and the existing well-established filesystems get the job done. Nobody gives a shit about the friction of a non-integrated solution. Of the few that do, they use zfs.
Why waste brain cells on dumbass optimization like allowing overcommit, and then trying to manage a self-inflicted problem?
Every rack in an AWS datacenter is committed to exactly one instance type. Each server is sliced into a whole number of instances that fit the memory size and core count corresponding to the chosen type. RDS racks are just special cases of ec2 racks. If there are leftover cores or GB of RAM, nobody gives a shit.
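The slicing math is trivially simple, which is the point. Illustrative numbers only (a 96-core/768GB host carved into m5.2xlarge-shaped 8-vCPU/32GB slots):

```python
# Illustrative only: slice a host into whole instances of one type and
# shrug at the remainder, instead of playing overcommit games.
def slice_host(host_cores, host_ram_gb, inst_cores, inst_ram_gb):
    fit = min(host_cores // inst_cores, host_ram_gb // inst_ram_gb)
    leftover = (host_cores - fit * inst_cores, host_ram_gb - fit * inst_ram_gb)
    return fit, leftover

fit, (spare_cores, spare_ram) = slice_host(96, 768, 8, 32)
print(fit, spare_cores, spare_ram)  # 12 instances, 0 cores, 384GB left over
```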
Alibaba apparently isn't so clever; they are stupid for worrying about minutiae that are not remotely important.
Our little service is targeting exactly this. Most of the big boys, and also those who offer "cloud rightsizing consulting and billing management", typically have their own OpenStack or other cloud-like MSP stack which, while perhaps cheaper than AWS, isn't cheaper by a whole lot.
Going back to mixed or on-prem is a bit of a PITA when you have to buy 1/2 a rack at minimum and then pay for bandwidth, interconnect and so on. And it also means you have to go source the IT talent to build it, if you were dumb enough to toss those guys overboard.
If a "garage band" can offer you instead for a song, resilient storage (nee SAN), clustered nodes for seamless workload balancing across hardware, and has written lambda functions that can piggy back off of AWS/Azure auto-scaling triggers to scale out the on-prem to a point and then invoke cloud instances once a threshold is hit, what's not to like? Especially if you can take advantage of unmetered 10Gbps bidirectional circuits.
$300/mo for 24+ cores and 256GB of RAM, with triple the typical storage quota, to slice and dice any way you want, with AWS Direct Connect or IPSec access, is cheap even by rent-a-baremetal-server standards.
The problem with moving back to on-prem is the startup costs, and if you do find someone who will offer it as OPEX, they overcharge severely while still being less than AWS.
There's simply no point competing with some cloud services (e.g. S3). But if there is an EC2 instance tacked onto the offering (e.g. EKS), then the always-powered-on workload gets expensive right quick!
ClownStrike, obviously.
This same DoD mentality, that security is just a set of checkboxes, is precisely WHY we had that glorious demonstration of abject stupidity. I commend the prof for knowing this BS software is useless. The same DoD requires anti-virus on every last damn server. Why? Because some moron who can't type 'cat' at a shell prompt said so.
Red Hat's support org is mostly useless. The value of RHEL is the quality of the release engineering (don't laugh too hard) and its stability. With any other platform, packaged software would change willy-nilly, e.g. Apache 2.2 would suddenly become v2.4 because they felt like it. With RHEL it was still 2.2 with certain things backported. Which is its own problem, but not normally a showstopper.
Red Hat got GREEDY many years ago! $200 one-time became $79/yr/host. Then it was $199/yr/host. Then $400 and even $1200+.
There is no frickin way you can justify that kind of price gouging.
THAT is why CentOS etc. became so big. Nobody could afford RHEL everywhere any more. Hell, some people went back to friggin' Windows because it was cheaper, and by a lot.
Red Hat killed the golden goose a good 10 years before IBM came sniffing around. Their own stupidity MADE Canonical and SUSE the players they are today.
I worked for Nielsen Media. We had ONE copy of RHEL for the purposes of scraping the patch repo, and 1,000 machines installed. That used to be legal. Rather than start paying Red Hat millions, we pivoted to CentOS. Had Red Hat offered us $50/host/yr we would have taken it in the blink of an eye.
Red Hat mgmt has fundamentally been just stupid. They wanted to get rich. They sold out to get their payday, KNOWING the cancer that is IBM would destroy the company soon enough. Now the devil is here to collect.
Yes, this is ALL about screwing Oracle as hard as they can. These 2 hate each other with white-hot hatred. They really should just go get pistols at dawn, shoot each other dead, and be done with it.
The trivial answer to this tightfisted money-grubbing is incredibly simple: a $50/host/yr subscription to the repos, 3-yr quantity discounts, org-level bulk discounts, and completely separate support contracts for actual support.
When running a rhel box is as cheap as bottled water, nobody is going to bother making clones.
Red Hat is perfectly within their rights to prohibit further distribution of their binary and src RPMs. If Oracle wants to get into that game, they can give Red Hat $100 million a year.
The top-of-rack UPS needs to have enough power for about 30 seconds' worth of load; they might have as much as 2 minutes. Buffers can be flushed and checkpoints written before everything goes black. The gensets fire immediately on loss of mains. But if one were to fail, and the N+1 also, then yes, part of the DC goes magically silent.
The S3 Index tier is a close analog. When we had to mass-bootstrap the tier, the nodes fetched their config from a 'static' source before they fell back to 'chatter/gab' mode to converge. I can't remember how we partitioned node sets, but the respective 'masters' eventually got their immediate peers all registered and sent updates upstream to other cells, and eventually every cell got wind of all the other cells. But we sure as HELL did not maintain N^2 active connections!!
This was SOLVED 10 years ago by the S3 team (and probably the EBS team). Kinesis apparently didn't bother to avail themselves of the existing codebase.
This is not an uncommon occurrence at AWS: the teams don't talk, and apparently Jassy and his minions haven't beaten the individual service teams with the "REUSE THE GOD DAMN CODE!" hammer enough.
That's who they hire by the bucket-load, and most of them are H1B at that. Did you honestly think they had actual CS degrees and wouldn't design something so stupid?
The only way forward is to use key-partitioning (like S3 does) and stop being so damn cheap about refusing to use load-balancers, since they have their own in-house design, for Pete's sake, and don't have to pay Citrix for NetScalers anymore.
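By key-partitioning I mean the textbook technique, not whatever S3 actually runs internally. A consistent-hash ring sketch: keys spread across partitions, and adding a node moves only ~1/N of the keyspace instead of reshuffling everything:

```python
import bisect
import hashlib

class Ring:
    """Textbook consistent-hash ring with virtual nodes."""
    def __init__(self, nodes, vnodes=64):
        self._points = []
        for node in nodes:
            for i in range(vnodes):
                h = self._hash(f"{node}#{i}")
                bisect.insort(self._points, (h, node))

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def owner(self, key):
        h = self._hash(key)
        # First ring point at or after the key's hash, wrapping at the end.
        i = bisect.bisect_right(self._points, (h, "")) % len(self._points)
        return self._points[i][1]

ring = Ring(["shard-a", "shard-b", "shard-c"])
print(ring.owner("customer/12345/object.dat"))
```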
I don't remember how fast the S3 infra converges to a single-system-image, but S3 has 3 distinct tiers for starters, and about 350,000 servers globally that need to eventually register and share 'knowledge' about their peers. If Kinesis is not using the correct/latest 'chatter' protocol to discover its swarm, they are fracking idiots.
I can do that now, and nothing special is needed. Find a local-storage node type, put it in a subnet that doesn't have a routing entry for a NAT gateway. Or enforce no traffic via security group rules or network ACLs.
And lose the PEM key. If you don't have the SSM agent installed, you can't get into the 'console' via SSM either. What's so hard about this?
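Nothing special, as I said. In boto3 terms (the group ID is a placeholder), killing the default allow-all egress is one call:

```python
import boto3

ec2 = boto3.client("ec2")
SG = "sg-0123456789abcdef0"  # placeholder group ID

# Every new security group ships with an allow-all egress rule.
# Strip it, and the instance can't initiate traffic to anywhere.
ec2.revoke_security_group_egress(
    GroupId=SG,
    IpPermissions=[{
        "IpProtocol": "-1",
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
)
# Add nothing back, and combined with a subnet that has no NAT/IGW route,
# the box is effectively airgapped from the internet.
```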
Wordpress has a couple of 'publish as static' plug-ins, which is what 98% of sites should be using. That way WP sits behind the firewall and can run as badly as it wants. People who build WP sites can barely use a browser anyway, so programming is completely out of their skill set.
Funny thing, I remember a few tools back in the early '90s that were 'fancy' WYSIWYG site editors that 'printed' your website in a static manner. Blue-something? The generated HTML was unreadable, though.
And no, nobody gives a damn about comment boards on your dumb-ass website. Embed a MAILTO tag in the 'contact us' graphic and the job is done.
It's time to follow the lead of AWS: go CLI and forget the slow, cumbersome UI. Let pissed-off users or sufficiently motivated consultancies write Python interfaces to the CLI calls. You don't need a dashboard anyway; that's Nagios' job. Write wrapper scripts in Perl or Bash. UI, and UI portability, is a fool's errand.
Or just go back to the old .NET windows-only client. Nobody cares if they have to keep a Winblows box around to admin via VCenter.
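The wrapper layer really is that thin. A sketch of the Python-over-CLI approach, where 'vendorctl' and its JSON output flag are stand-ins for whatever the real binary actually offers:

```python
import json
import subprocess

def cli(*args):
    """Thin Python shim over a vendor CLI that can emit JSON.
    'vendorctl' is a stand-in for the real binary's name."""
    out = subprocess.run(
        ["vendorctl", *args, "--output", "json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)

# From here, "dashboards" are just scripts:
for vm in cli("vm", "list"):
    if vm.get("state") != "running":
        print(f"{vm['name']}: {vm['state']}")
```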
Had a job interview there about a year ago. Massage consultant and dog walker onsite, and offices in Georgetown. By silly-con valley standards the offices were stark and bare. They were looking for devops help to try to stop doing everything by hand and one-off. I was rather surprised they were so far behind the curve.
What on earth do 2400 employees do exactly? Well same could be said for the ungodly number at FB, Google and Amazon too.
Typical silly-con HR nonsense about how well you work with teams in a nurturing fashion and affirm others' inputs. I told them I don't "do bullshit": if your idea is stupid or you screwed something up, I will tell you to your face. Needless to say, they were aghast.
First, Fk Uber. But the solution to this isn't employee-vs-contractor per se; that's just a tax grab. Deregulate the entire 'for hire' landscape where it concerns fares. Publish MAX fares, which are what they are today, posted on every legitimate taxi. Every car must be equipped with an official GPS device with a running meter display (plenty of commercial solutions available); the taxi regulator could publish an app that does the same. If Uber wants to commit to a lower 'fixed' rate at time of booking, that's their prerogative. And so can the regulated taxi companies.
Every registered driver must be charged a flat-rate 'medallion' fee calculated on distance traveled while under fare. They must also be charged commercial insurance rates per mile. Uber et al. already collect the VIN of every vehicle and the DL of every participant. This will be reported to the DMV in near-realtime, i.e. the previous day's active bookings are reported by VIN. As a driver you can supply proof of commercial liability insurance to the DMV and brand your title as 'for-hire'. You can also bulk-pay your medallion fee to the DMV, say $5000/yr, non-refundable. People who want to make gypsy taxi their source of income will elect to do the needful for some savings in fees. Casual drivers will pay the per-mile rates out of convenience.
This makes the playing field completely level and will probably kill 90% of the 'casual' drivers and good riddance.
The family, of course. People don't get paid just because they are mouth breathers. They get compensated based on the economic worth of that particular person and expected lifetime earnings, which is peanuts compared to even the 1st-world working poor. There's some extra ladled on top for 'pain and suffering', but at the end of the day, if you're from Ethiopia or Indonesia your life amounts to very little. Sure, there are the odd doctor and oil or mining exec who bring in big coin, so their lifetime earnings are not a rounding error.
Life is short and brutish in most places in the world. What's a life worth in China, in Cambodia, Burma, India, most of Africa or pretty much any Muslim state? Damn close to zero. That's just the facts.
The S3 operations team monitors a zillion metrics but, as the article noted, it's almost entirely "inside" activity. No doubt customers started complaining about bucket availability, and there might have been metrics that showed a downward move in request rate that didn't jibe with historical levels.
The answer will be to piggy-back off of specialist anti-DDoS providers, but also to stand up arms-length availability metrics viewed from 'outside' as it relates to geographically distributed name resolution. Or, more likely, write their own anti-DDoS implementation and embed it into the Route53 infrastructure. The S3 front-end itself has request rate-limiting already, but I couldn't say how hardened it might be against a flood of malicious payloads.
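The front-end primitive here is nothing exotic, just a bog-standard token bucket. A sketch (not S3's actual implementation, into which I no longer have visibility):

```python
import time

class TokenBucket:
    """Classic token bucket: refill at a steady rate, allow bounded bursts."""
    def __init__(self, rate, burst):
        self.rate = rate          # tokens refilled per second
        self.capacity = burst     # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False              # caller should answer 503 Slow Down

bucket = TokenBucket(rate=100, burst=200)   # 100 req/s, bursts to 200
if not bucket.allow():
    print("503 Slow Down")
```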
> Someone wasn't running Cloudwatch/Cloudtrail properly/at all then.
Oh come on. The number of outfits that even know what those are is small, and the accounts that have it set up CORRECTLY to detect 'bad things' is vanishingly minuscule. Not to mention the people on the receiving end of the messages (assuming they're sent by email or piped to the federally mandated Splunk) don't know what to do with them. The 'security' staff in most places are incredibly bad at their job. I swear, when you fail as a developer/ops and don't want to be a cat herder, you go into security, if middle-management is not available.
Now a so-called financial institution in a highly regulated industry should be a cut above the normal cesspool. And yet their failings are as bad if not WORSE than other orgs who don't labor under "compliance" mandates.
> This dependence on IAM for everything needs to fucking die
Well, using session creds *is* IAM. One of the insecure-by-default settings in AWS is the allow-all outbound (OUT=*) firewall rule that has to be explicitly removed from every security group when created. If you want to beat AWS over the head, start there.
Using session creds in user-space is FAR, FAR more likely to engender pathologically lazy and stupid behavior on the part of developers, let alone sysadmins, and to lead to credential theft. Not to mention your app will have to 'refresh' its creds every hour or so. IAM roles are actually the best, most correct answer.
Where IAM roles fail is not the fault of IAM as such; it's the meat-space that can't write a policy worth a damn, because the topic is opaque, convoluted, and tedious. STUPID people need not apply. However, the world is primarily populated by stupid people, and a lot of them have jobs in IT. So instead of actually identifying the specific S3 operation, S3 bucket and/or key path, they just slap in s3:* and Resource=* and go on their merry way.
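For contrast, here's what a responsible policy statement looks like next to the lazy s3:*/Resource=* version; the role, policy, bucket, and path names are all hypothetical:

```python
import json
import boto3

iam = boto3.client("iam")

# The lazy version is Action: "s3:*", Resource: "*". The responsible
# version names the operations, the bucket, and the key path.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": "arn:aws:s3:::example-reports/invoices/*",
    }],
}

iam.put_role_policy(
    RoleName="invoice-processor",        # hypothetical role
    PolicyName="invoice-bucket-access",
    PolicyDocument=json.dumps(policy),
)
```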
I've found just incredible security gaffes in AWS Professional Services' code and publicly shared solutions and sample code. What's that tell you?
Why do we have the plague of public S3 buckets? First, Amazon had buckets marked public by default way back when (as I recall), but more to the point, people can't figure out what "public" actually means and can't write a bucket policy to save their life. Only recently has Amazon written a system that traverses the ecosystem of all buckets and sends the account owner an email asking "did you really mean to do that?" I got mine a couple of days ago, but the buckets have been public for well over a year. How often does the check fire? Within, say, a couple of hours of a bucket perm change?
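The belt-and-braces knob does now exist, and it's one boto3 call (bucket name hypothetical) to make "just click the public button" impossible on a bucket:

```python
import boto3

s3 = boto3.client("s3")

# Refuse public ACLs and public bucket policies outright, so nobody can
# "just click the public button" on this bucket.
s3.put_public_access_block(
    Bucket="example-reports",            # hypothetical bucket
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```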
AWS is sufficiently complicated and obtuse even people with good IQ and a rigorous approach are easily tempted to take shortcuts. Disaster follows as expected.
When designing nuclear power plants (Ukrainian test program aside), it's done by very SERIOUS people, who take their time and have their work checked meticulously by other very serious people who are looking for mistakes. Clearly that pattern doesn't hold for the FAA and Boeing, but that's a separate topic.
Now let's look at the typical 'Dev' pretending to be Ops; hell, look at your typical IT bod, be they helpdesk or sysadmin. They are some combination of incredibly dumb, lazy, and sloppy. How many times has Microsoft f*cked millions of machines because they didn't test their software patches? And they are supposed to be 'smart'.
Security is HARD. AWS does its users no favors by designing a system even experts shy away from. The world would be a vastly worse place if IAM roles were not being used. The trick now is to somehow get people to write policy statements in a responsible fashion.
Certain employees have out-of-band access to objects because they can walk the metadata and hit objects "in the raw", bypassing the ACL mechanism. At some level this criminal had 'privileged' knowledge, like which VPC or IP was defined in the bucket policy ACL, and had abused/retained sufficient access to scrape the data.
> Yes - You can have S3 Buckets behind two firewalls using S3 VPC Endpoint within your private subnet without exposing onto the internet.
utter rubbish!
A VPC service endpoint simply means your request traffic that originates from within the VPC doesn't need a NAT or Internet Gateway to hit the S3 web-tier servers that sit in publicly reachable IP space. The traffic stays "internal" to AWS datacenter routing.
The information disclosure has nothing to do with someone sniffing the request/response traffic. Rather, it's about the security permissions on EACH object stored in the S3 bucket. If you mark an object as public, or have a bucket policy that makes some/all of the paths public, the object WILL be served no matter how you requested it.
I can sit here in my own VPC with an S3 endpoint and grab every public bucket object I want, and it's totally legit. The fault lies exclusively with the bucket/object owner, the idiot who allowed people to enumerate and GET the objects.
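And if you actually want what the quoted poster thinks he has, a bucket that answers ONLY via your VPC endpoint, the bucket policy has to say so explicitly. A sketch, with placeholder IDs:

```python
import json
import boto3

s3 = boto3.client("s3")

# The endpoint on its own restricts nothing. To make the bucket reachable
# ONLY through your VPC endpoint, the policy must deny everything else.
# NB: get this wrong and you lock yourself (and the console) out too.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyAllButMyEndpoint",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [
            "arn:aws:s3:::example-reports",
            "arn:aws:s3:::example-reports/*",
        ],
        "Condition": {
            "StringNotEquals": {"aws:SourceVpce": "vpce-0123456789abcdef0"},
        },
    }],
}

s3.put_bucket_policy(Bucket="example-reports", Policy=json.dumps(policy))
```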
The reason there are so many cases of public objects which weren't intended to be so is 2-fold:
1) early on the defaults made it quite easy to pick 'public' without clear warnings as to what that meant
2) Bucket policies are 'hard' to write, and most people are simply too STUPID or lazy to tackle the topic; out of frustration they just click the 'public' button to shut up the developer (who also can't be bothered to use an IAM role) bleating about how his delivery is being impacted because he can't see the data.
IT is not for stupid people, but there are many millions involved because they are cheap.