Data retention, compliance and archiving are often the forgotten areas, as they follow the same principle as backup: no one cares about them until they need to access or restore something from some point in the past.
Often the three topics are driven by business requirements, so by the time they hit the IT infrastructure support groups they are already segmented into different solutions driven by specific application requirements. The result is multiple unaligned tools & infrastructure components that become unmanageable & costly as they grow larger & older.
On top of that come constantly increasing compliance requirements.
The way to handle it is first of all to define, top down, a company policy/directive about what has to be kept, how & for how long, but just as important: what has to be deleted, how & at what age. In the end this is Information Lifecycle Management (ILM).
To define a proper ILM policy, data classification has to be defined and enforced.
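As a rough illustration of how a policy and a classification scheme fit together, the sketch below maps each data class to exactly one retention rule. The class names, the `RetentionRule` fields and all numbers are hypothetical and chosen only for illustration; a real scheme would come from the company directive and the applicable regulations.

```python
from dataclasses import dataclass
from enum import Enum


class DataClass(Enum):
    """Hypothetical classification labels; a real scheme comes from the directive."""
    FINANCIAL_RECORD = "financial_record"
    PROJECT_DOCUMENT = "project_document"
    SCRATCH = "scratch"


@dataclass(frozen=True)
class RetentionRule:
    keep_days: int          # minimum age before deletion is allowed
    delete_after_days: int  # age at which deletion becomes mandatory

    def __post_init__(self) -> None:
        if not 0 <= self.keep_days <= self.delete_after_days:
            raise ValueError("need 0 <= keep_days <= delete_after_days")


# Illustrative numbers only, not legal or regulatory guidance.
ILM_POLICY = {
    DataClass.FINANCIAL_RECORD: RetentionRule(keep_days=3650, delete_after_days=3740),
    DataClass.PROJECT_DOCUMENT: RetentionRule(keep_days=1095, delete_after_days=1185),
    DataClass.SCRATCH: RetentionRule(keep_days=0, delete_after_days=30),
}


def rule_for(data_class: DataClass) -> RetentionRule:
    """Look up the single rule that applies to a classification."""
    return ILM_POLICY[data_class]


def unclassified(policy: dict) -> list:
    """Return classes that have no rule yet, i.e. gaps in the policy."""
    return [c for c in DataClass if c not in policy]
```

The point of the sketch is that every class has one rule that states both how long to keep the data and when to delete it, and that gaps in the policy can be detected mechanically.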
Once the ILM policy/directive is defined and the data classification is in place, the next step is the process.
If no process is defined, each business unit will interpret the policy their own way and define a process that fits them, thus again ending up with a zoo of solutions in the IT infrastructure support.
The key here is to make the choice easy for those having to follow it. If they can choose only one way to do it, the choice is fairly easy.
When the process is defined, the next step is the technology. There are many tools out there. Some are purely software based and some are more deeply integrated with hardware solutions.
Depending on which compliance rules apply, one often has limited options for which software & hardware will fit.
Application integration is also tricky here, as there are few standards for it, XAM being one of the few common ones. However, with the right policy & process in place and with proper data classification, a simpler infrastructure setup can often be used, as the metadata definition has already been covered by following the policy & process.
The trend is that storage solutions in this space are moving towards scale-out, object-based platforms with a lot of logic built into the solution and with support for common interfaces (NFS, CIFS, WebDAV / REST, XAM). A lot of data classification & compliance can be defined when setting up each data container (how many copies to keep, geo-dispersed or not, how long to keep the data, etc.).
It is key to classify the expected data before generating it. If one starts generating data first and tries to classify it later on, it's typically too late.
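To make the per-container settings mentioned above concrete, here is a minimal sketch of what such a container definition could look like. `ContainerPolicy` and its fields are hypothetical and do not correspond to any vendor's API; the validation rules (for example, that geo-dispersal needs at least two copies) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerPolicy:
    """Hypothetical per-container settings, fixed when the container is created."""
    name: str
    copies: int           # how many copies of each object to keep
    geo_dispersed: bool   # whether copies are spread across sites
    retention_days: int   # how long objects must be kept

    def __post_init__(self) -> None:
        if self.copies < 1:
            raise ValueError("at least one copy is required")
        if self.geo_dispersed and self.copies < 2:
            # Assumption: spreading across sites needs more than one copy.
            raise ValueError("geo-dispersal needs at least two copies")
        if self.retention_days < 0:
            raise ValueError("retention_days cannot be negative")


# Example container for long-lived records (illustrative values).
archive = ContainerPolicy(
    name="finance-archive", copies=3, geo_dispersed=True, retention_days=3650
)
```

Because the settings are attached to the container rather than to each application, every object written to it inherits the same handling.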
What keeps the cost & management under control is the ILM directive/policy combined with the data classification & an enforced process to follow. This ensures that only the data that needs to be kept is kept, and all the rest is deleted after a pre-defined grace period. Typically such an enforced deletion policy will get rid of more than 80% of the data volume.
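The enforced deletion described above can be sketched as a simple date check: an object becomes due for deletion once its retention period plus the grace period has passed. The function names and the tuple layout are hypothetical, and a real implementation would also have to handle things like legal holds, which are left out here.

```python
from datetime import date, timedelta


def deletion_date(created: date, retention_days: int, grace_days: int) -> date:
    """First day on which the object is due for deletion."""
    return created + timedelta(days=retention_days + grace_days)


def split_by_deadline(objects, today: date):
    """Split objects into (keep, delete) lists of names.

    objects: iterable of (name, created, retention_days, grace_days) tuples.
    """
    keep, delete = [], []
    for name, created, retention_days, grace_days in objects:
        if today >= deletion_date(created, retention_days, grace_days):
            delete.append(name)
        else:
            keep.append(name)
    return keep, delete


# Example inventory with illustrative values.
inventory = [
    ("invoice", date(2020, 1, 1), 365, 30),
    ("tmp", date(2021, 1, 1), 0, 30),
]
keep, delete = split_by_deadline(inventory, today=date(2021, 1, 30))
```

Running such a check on a schedule is what turns the policy from a document into something that actually removes data.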
What kills the storage budget in relation to compliance & data retention is typically not knowing what data is important and what is not, and thus ending up having to keep it all, just in case.
On top of that, keeping data that didn't need to be kept could even be a risk and an audit finding.