Good step - but
I applaud the licenses, but I wonder about the standards.
One of the biggest problems with language training datasets, beyond which languages they include, is the lack of dialect coverage. They all tend towards a 'proper' (i.e. newsreader) dialect. Southern and Northern English differ from each other on either side of the pond, just as American and British English do.
Some way of classifying dialects needs to be built into the standards. And I would hope that diversity of dialects is deemed a community goal.