Multi-vector issue
Having spent most of my working life at service providers that manage servers and DBs for customers, I recognize valid points in almost every point of view here.
In any case, a trend I have seen over the years is that developers, DB admins, and now the new DevOps generation with its automation requirements work more and more at a high level, with little understanding of how a computer actually works. Welcome to the cloud era.
The examples and issues are so many I would have to write a book, but I will give a few.
When speccing a server, DevOps and DB admins brute-force resources 80% of the time and almost always ask for far more than needed, and in a virtual-by-default era that brings its own issues. For example, in VMware environments a very common and well-known problem with DB servers is CPU scheduling on monster VMs in general. The more CPU cores that sit unused in reality, the worse the performance can get, and I have lost count of the app and DB owner complaints about CPU performance I resolved simply by cutting the vCPU count in half or reorganizing the cores to align better with the physical server's topology.
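If you want to see this effect in numbers, the usual symptom is CPU ready time, the share of each interval a vCPU spends waiting for a physical core. A minimal sketch of the standard conversion, assuming real-time stats with a 20-second sample interval and hypothetical values:

```python
# Rough sanity check for CPU ready time, the classic symptom of over-sized
# "monster" VMs fighting the hypervisor scheduler for cores.
# Values below are hypothetical, read off a real-time performance chart.

SAMPLE_INTERVAL_MS = 20_000  # real-time stats sample every 20 seconds

def cpu_ready_pct(ready_summation_ms: float, num_vcpus: int) -> float:
    """Average % of the sample interval each vCPU spent waiting for a core."""
    return ready_summation_ms / (SAMPLE_INTERVAL_MS * num_vcpus) * 100

# Hypothetical 16-vCPU VM reporting 16,000 ms of ready time per sample:
print(cpu_ready_pct(16_000, num_vcpus=16))  # 5.0 -> users start noticing
# The same workload right-sized to 8 vCPUs typically schedules more cleanly:
print(cpu_ready_pct(4_000, num_vcpus=8))    # 2.5
```

A common rule of thumb is that anything consistently above roughly 5% ready per vCPU is where the complaints start, and halving the vCPU count is often the cheapest fix.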
Another common problem, already mentioned, is the number of copies DB admins keep. They often think the best way to back up is to do a local dump every now and then and then replicate it to a second or third location, usually another DB server they control elsewhere. This of course sits on top of the DB replication, and some even use VM snapshots on top of that. This practice leads to an uncontrolled sprawl of data, and when you actually need those copies, like after a ransomware attack, they are all gone. On top of that the data is barely dedupable or compressible for storage systems, and, interestingly, it creates a constant abusive load of sequential writes that often affects SSD performance, and SSD is now the default for DBs. When I say it affects performance, I don't mean to a point DB admins notice, because many DBs aren't that intensive and the brute force sometimes compensates. But to illustrate: as of this week I have a case in hand of a customer complaining about performance during a certain period of the night that lasts for 8 hours, and when the teams checked, they were generating loads exceeding 400 MB/s, which is damn good for SSD storage. For perspective, an AWS EBS gp2 SSD volume won't give you more than 250. The ticket still says they have a performance problem. Don't ask me what tasks they are running to generate such traffic for a DB that isn't bigger than 300 GB; the traffic they generate is enough to rewrite their full DB every few minutes, in a loop, for hours.
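Just to show how absurd those numbers are, here is the back-of-the-envelope arithmetic from that ticket (figures as described above, rounded):

```python
# Back-of-the-envelope numbers for the nightly 400 MB/s case described above.

throughput_mb_s = 400   # sustained write load observed
duration_h = 8          # length of the nightly window
db_size_gb = 300        # total database size

total_written_tb = throughput_mb_s * 3600 * duration_h / 1_000_000
rewrite_minutes = db_size_gb * 1000 / throughput_mb_s / 60
full_rewrites = duration_h * 60 / rewrite_minutes

print(f"{total_written_tb:.1f} TB written per night")        # ~11.5 TB
print(f"full DB rewritten every {rewrite_minutes:.1f} min")   # ~12.5 min
print(f"~{full_rewrites:.0f} complete rewrites per night")    # ~38
```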
The example above leads to another problem, related to automation and reporting. The number of jobs now constantly competing for DB resources without any planning, between backups, replication, indexing, DB updates, and reads for endless automation flows and reporting, often drives DB servers to unexplainable loads. There are two sides to this problem. The first is the abusive use of programmatic queries on the SQL DB side, with excessive joins, rather than just taking the raw data and doing the work elsewhere. The number of long-running queries I see these days, with queries as long as entire poems, is becoming surreal. The PostgreSQL folks are a particularly good illustration of this, for some reason.
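What I mean by "take the raw data and do the work elsewhere" is nothing exotic. A minimal sketch, with a hypothetical connection string, table names, and columns, of a recurring report that pulls two plain result sets once and does the join and aggregation in application code instead of re-running a heavy multi-join query against the production DB every few minutes:

```python
# Minimal sketch: two cheap scans plus aggregation in application code,
# instead of a many-way JOIN hammering the production DB on a schedule.
# DSN, tables and columns are hypothetical.

from collections import defaultdict
import psycopg2

conn = psycopg2.connect("dbname=reporting user=report_ro")  # hypothetical DSN

with conn, conn.cursor() as cur:
    cur.execute("SELECT customer_id, region FROM customers")
    region_by_customer = dict(cur.fetchall())

    cur.execute(
        "SELECT customer_id, amount FROM orders "
        "WHERE created_at >= now() - interval '1 day'"
    )
    totals = defaultdict(float)
    for customer_id, amount in cur.fetchall():
        totals[region_by_customer.get(customer_id, "unknown")] += float(amount)

print(dict(totals))  # daily revenue per region, computed off the hot path
```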
App designers and DB architects are becoming increasingly bad at using cached architectures; they don't care at all, and they push businesses to brute-force resources to the point that, when I get involved, I can on average cut compute and storage by 50% while still improving performance by a few percent. There are several reasons for this. One, often unaccounted for, is poor code documentation combined with high turnover among developers, who will rarely bother to look at existing code and tables and instead do everything again from scratch, because of course they know better. The number of duplicated workloads caused by this effect is mind-blowing, and the amount of deprecated code and scripts still running all over the place is reaching quite worrying levels in some organizations.
The other limiting factor on resource management is, again, cloud-related: the procurement process has been mostly eliminated, and so has the installation process. This leads to a notion of limitless capacity and a loss of cost control; the DevOps folks can spin up another 100 servers by changing a value in an API call, so why bother to think about whether they really need 100.
I could keep listing examples and issues I see daily, but in general the cloud generation and all the abstraction layers in the middle have produced a generation of IT professionals who waste resources, and when you cross that with the well-known ego problem most IT professionals have, that they know better and whatever was done before is crap, it helps even less. And of course plain human laziness ensures there are no cleanups being done on old scripts, apps, and whatever else is no longer used.
The sad thing is that today we have so many tools and open-source projects that can deliver great optimizations, from caching to queuing, but the professionals who implement them don't really understand their capabilities and install them with default configs, not even bothering to understand how those tools actually work and how they could contribute to improving things. Many applications have the right technology stack, but because of this they become just yet another wasted resource. Redis and the like are the prime example: very good tools that people think will do miracles just because they are there, out of the box, and then the outcome is devs saying that Redis creates more problems with caching. That isn't true; they just blame it for every piece of bad code they write. This is a real phenomenon I see: Redis is very often used to explain strange behaviors when in reality it is just bad code, or an app that was never adapted to run with Redis or in distributed environments. Again, because the people who write and design the code don't understand the basics of computers, which are still there as they were in the 80s and still pose the same challenges, just at a much bigger scale.
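To make the Redis point concrete: dropping Redis next to an app does nothing by itself; the code has to implement the caching pattern and the invalidation. A minimal cache-aside sketch with redis-py, assuming a local Redis and a hypothetical load_report() standing in for the expensive DB work:

```python
# Cache-aside sketch: the app, not Redis, decides what to cache, for how
# long, and when to invalidate. "Redis creates problems" usually means the
# app never actually uses the cache correctly.

import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def load_report(report_id: str) -> dict:
    # Hypothetical placeholder for the expensive query/aggregation.
    return {"id": report_id, "rows": 1234}

def get_report(report_id: str, ttl_seconds: int = 300) -> dict:
    key = f"report:{report_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)             # cache hit: no DB work at all
    report = load_report(report_id)           # cache miss: do the work once
    r.set(key, json.dumps(report), ex=ttl_seconds)
    return report

def invalidate_report(report_id: str) -> None:
    # Explicit invalidation when the underlying data changes, instead of
    # blaming the cache for serving exactly what it was given.
    r.delete(f"report:{report_id}")
```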