It would be good if someone could show that I'm reading the paper incorrectly...
The article from Nature, as then quoted here, seems to be confusing the word "explain" with "correlate".
Unless they really believe that the secondary data sources, such as geo-located tweets, are the explanation for hunger? Then again, they only mention tweets the one time, so they probably don't believe that (but then why mention tweets at all?).
I'd be very glad if an ML model really could help, but unless I'm reading the paper and its graphs wrong (see title), their model is really not doing a good job. Their predictions generally follow straight lines or show only shallow changes, whilst the near-real-time data they are meant to be predicting bounces up and down over an 8-week period.
Their own error analysis shows wide variance in the accuracy of their models' results.
"Results indicate that, as one might expect, errors happen when the independent variables take on values that differ the most from those most frequently seen by the models during training."
So, like all of today's ML models, these are just crudely matching on the inputs from training, without any consideration as to whether the model was even allowed to see the full range of values that each input *could* take. Instead of flagging "out of expected range", as you would in a properly built model, it just spits out values that aren't seen as errors until they are compared against the new ground truth.
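To be concrete about what I mean by flagging "out of expected range", here is a minimal sketch in Python. It is not anything from the paper: the feature names and numbers are made up for illustration, and a simple per-input min/max check is only the crudest form of such a guard (it says nothing about unusual combinations of inputs).

```python
from dataclasses import dataclass


@dataclass
class FeatureRange:
    low: float
    high: float


def fit_ranges(training_rows):
    """Record the min and max seen for each input during training."""
    ranges = {}
    for row in training_rows:
        for name, value in row.items():
            if name in ranges:
                r = ranges[name]
                r.low = min(r.low, value)
                r.high = max(r.high, value)
            else:
                ranges[name] = FeatureRange(value, value)
    return ranges


def out_of_range(row, ranges):
    """Return the inputs whose value lies outside what training covered."""
    return sorted(
        name
        for name, value in row.items()
        if name not in ranges
        or not (ranges[name].low <= value <= ranges[name].high)
    )


# Hypothetical training data: these names and values are invented.
training = [
    {"rainfall_mm": 40.0, "food_price_index": 1.10},
    {"rainfall_mm": 55.0, "food_price_index": 1.25},
    {"rainfall_mm": 62.0, "food_price_index": 1.18},
]
ranges = fit_ranges(training)

# A new observation with rainfall far below anything seen in training.
new_row = {"rainfall_mm": 5.0, "food_price_index": 1.20}
flagged = out_of_range(new_row, ranges)
if flagged:
    print("Prediction unreliable, inputs outside training range:", flagged)
```

The point is only that the check happens before the prediction is reported, rather than the error turning up weeks later when the ground truth arrives.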
There is also a swing partway through. The start of the "Main" section discusses using secondary data sources (including the above-mentioned tweets) to feed the predictive models, but the discussion and graphs that follow are of models fed more substantive data, such as population density and GDP per capita, as base-line models: "however, because they are annual national-level figures, they serve as a fundamental baseline but cannot help in predicting the sub-national and rapidly changing dynamics characterizing food insecurity, which is the objective of this study".
But it is these base-line models whose outputs are explained (via the SHAP method) in terms of which of the inputs caused the change.
What I'm totally missing is the connection between the apparently reasonable (as in, what their inputs are) base-line models and the other "secondary inputs" model that is purported to track the "changing dynamics" that are "the objective of this study".
Unless that early part of the "Main" section is just irrelevant to their work?
Or if you can point out what I've misunderstood...