Re: Were previous medical reports wrong?
That paper is not about unreliability in brain scans; it's about the unreliability you get from doing your statistics badly. It was a fun illustration of the need for multiple comparisons correction when doing mass univariate statistics. That correction had been around for years when they did that work; however, because it reduces your power, people had either not been doing it, or had taken to inventing home-brewed "corrections" which weren't worthy of the name (such as using some arbitrary lower p threshold and arbitrary cluster sizes to establish 'significance').
The problem in a nutshell: the outcome of an experiment rests on statistical comparisons, standard hypothesis testing. In investigations where you want to know *where* something is occurring, that means statistical comparisons between images. As usual in hypothesis testing, you consider the probability of the measured result given the hypothesis of no difference or effect between groups. However, with images, such as MRI, but also PET and SPECT, you're making that group comparison for each voxel in the image. There are lots of these, while your p-values are calculated for single independent comparisons, so https://xkcd.com/882/ sets in. With a vengeance: https://twitter.com/ibmalone/status/1168847805584224256

To counter that, you take your original per-voxel p-values and either apply a stricter threshold, or (equivalently) adjust them for multiple comparisons and apply your original threshold. But what's the appropriate correction? Bonferroni correction says you should divide your significance threshold by the number of independent comparisons (equivalently, multiply the p-values by it), but the more comparisons you correct for, the more statistical power you lose, and adjacent voxels aren't independent, so just using the number of voxels over-corrects.
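To see the scale of the problem, here's a minimal Python simulation (the voxel count and group sizes are made up purely for illustration): pure null data, one t-test per voxel, then a count of how many voxels come out "significant" before and after Bonferroni correction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical dimensions: 50,000 "voxels", 20 subjects per group,
# and *no* real group difference anywhere (pure null data).
n_voxels, n_per_group = 50_000, 20
group_a = rng.normal(size=(n_per_group, n_voxels))
group_b = rng.normal(size=(n_per_group, n_voxels))

# One two-sample t-test per voxel (mass univariate testing).
t, p = stats.ttest_ind(group_a, group_b, axis=0)

alpha = 0.05
# Roughly alpha * n_voxels (~2,500) false positives without correction:
print("uncorrected 'significant' voxels:", np.sum(p < alpha))
# Bonferroni: divide the threshold by the number of comparisons (~0 survive):
print("Bonferroni  'significant' voxels:", np.sum(p < alpha / n_voxels))
```

Note the Bonferroni line also shows the over-correction problem: with real, spatially smooth data the voxels aren't 50,000 independent tests, so dividing by the full voxel count is harsher than it needs to be.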
Rules of thumb:
PET and fMRI: look for p<0.05 family-wise error (FWE) correction voxelwise. FWE p<0.05 on cluster sizes can also be used, but cluster-forming thresholds should then be p<0.001 or stricter (based on Eklund et al. 2016, https://doi.org/10.1073/pnas.1602413113, which established false positive rates using randomised null data); permutation testing is preferred. (There's a toy sketch after these rules of why the cluster-forming threshold matters.)
Volumetric voxel-based morphometry (VBM): only voxelwise FWE p<0.05; cluster-based results should not be used. Ashburner and Friston 2000, https://doi.org/10.1006/nimg.2000.0582 (afraid this one isn't open access), but here's the relevant bit from the "Testing the rate of false positives using randomization" section: "Approximately 5 significant clusters would be expected from the 100 SPMs if the smoothness was stationary. Eighteen significant clusters were found when the total amount of gray matter was not modeled as a confound, and 14 significant clusters were obtained when it was. These tests confirmed that the voxel-based extent statistic should not be used in VBM." (As in the quote, the reason is non-stationary smoothness, which is exactly what it sounds like: cluster size inference assumes the smoothness of the signal is constant across the image, which works for fMRI, but not for VBM / non-linear registration.)
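The promised toy sketch of why the cluster-forming threshold matters (simulated 2D noise, not a substitute for the proper methods in the papers above): smooth some null data so it's spatially correlated like real images, then count the supra-threshold clusters you get purely by chance at two different cluster-forming thresholds.

```python
import numpy as np
from scipy import ndimage, stats

rng = np.random.default_rng(1)

# A hypothetical 128x128 "slice" of pure noise, smoothed to mimic
# spatially correlated imaging data, then rescaled to unit variance
# so values behave like z-scores under the null.
noise = rng.normal(size=(128, 128))
smooth = ndimage.gaussian_filter(noise, sigma=3)
z = (smooth - smooth.mean()) / smooth.std()

for p_thresh in (0.01, 0.001):
    z_thresh = stats.norm.isf(p_thresh)      # one-sided cluster-forming threshold
    mask = z > z_thresh
    labels, n = ndimage.label(mask)          # contiguous supra-threshold clusters
    sizes = ndimage.sum(mask, labels, index=range(1, n + 1)) if n else []
    print(f"p<{p_thresh}: {n} chance clusters, largest = {int(max(sizes, default=0))} voxels")
```

Lenient cluster-forming thresholds hand you large chance clusters even on pure noise, which is part of why the inference then leans so heavily on its smoothness assumptions.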
In general, permutation testing is a pretty good alternative to the parametric methods, and can be done in an FWE manner without the assumptions that parametric methods need. It's much harder computationally, though, and can't necessarily be applied to more complicated experimental designs. Open access and fairly accessible review by Nichols that I was previously unaware of: https://doi.org/10.1016/j.neuroimage.2012.04.014
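For the curious, a rough sketch of the idea (simulated data, nothing to do with any particular study): a max-statistic permutation test, which controls FWE by comparing each voxel's statistic against the null distribution of the image-wide maximum, built by shuffling group labels.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Toy data: 1,000 "voxels", 15 subjects per group (all numbers made up),
# and again no real group difference.
n_voxels, n_a, n_b = 1_000, 15, 15
data = rng.normal(size=(n_a + n_b, n_voxels))
labels = np.array([0] * n_a + [1] * n_b)

def t_and_max(d, lab):
    """Per-voxel two-sample t statistics and the image-wide max |t|."""
    t, _ = stats.ttest_ind(d[lab == 0], d[lab == 1], axis=0)
    return t, np.max(np.abs(t))

t_obs, _ = t_and_max(data, labels)

# Build the null distribution of the *maximum* |t| across voxels by
# shuffling group labels. Taking the maximum in each permutation is the
# trick: that one distribution automatically accounts for the number of
# (correlated) comparisons, with no random-field assumptions.
n_perm = 1000
max_null = np.empty(n_perm)
for i in range(n_perm):
    _, max_null[i] = t_and_max(data, rng.permutation(labels))

# FWE-corrected p-value per voxel: fraction of permutations whose max |t|
# meets or beats that voxel's observed |t|.
p_fwe = (max_null[None, :] >= np.abs(t_obs)[:, None]).mean(axis=1)
print("voxels with FWE p < 0.05:", np.sum(p_fwe < 0.05))  # expect ~0 on null data
```

The computational cost is obvious from the loop: every permutation repeats the full mass-univariate analysis, which is why this gets expensive for big images and complex models.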
There are alternatives to FWE, such as the False Discovery Rate (FDR). While FWE attempts to control the overall probability of any false positive, FDR, as the name suggests, controls the rate: the expected proportion of your significant results that are false positives (remember, we're talking about per-voxel comparisons). It provides more statistical power (i.e. reduces false negatives), but people can be wary of it, because it doesn't give you the same 'guarantee' of significance (which of course is not a guarantee anyway).
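If you want to see the power difference for yourself, statsmodels implements both; the mix of null and "real" p-values below is fabricated purely for illustration.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(3)

# Hypothetical mix: 950 null voxels (uniform p-values) and 50 voxels
# with a genuine effect (crudely simulated as very small p-values).
p_null = rng.uniform(size=950)
p_real = rng.uniform(size=50) * 1e-4
p = np.concatenate([p_null, p_real])

# Benjamini-Hochberg controls the expected *proportion* of false
# discoveries among the voxels declared significant; Bonferroni controls
# the probability of any false positive at all, at a cost in power.
rej_fdr, _, _, _ = multipletests(p, alpha=0.05, method='fdr_bh')
rej_bon, _, _, _ = multipletests(p, alpha=0.05, method='bonferroni')
print("FDR (B-H)  significant voxels:", rej_fdr.sum())
print("Bonferroni significant voxels:", rej_bon.sum())
```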
And so to the JAMA paper on these embassy workers: https://jamanetwork.com/journals/jama/fullarticle/2738552 The secondary analyses are not corrected for multiple comparisons, so ignore them. For the primary analyses they did a number of comparisons, one being whole brain volume. Whole brain volume is not a statistical image comparison, it's comparing a single number from each image, which on its own does not have the multiple comparisons issue discussed above. Multiple comparisons reappear, though, if you compare more things... they do, and use FDR correction on their primary outcomes.

What do they find? Significantly smaller white matter volume, but not significantly smaller grey matter volume, which actually trends in the other direction. Sex is controlled for, just as well, as the control group has a slightly higher proportion of males to females than the patient group. For the rest, they pretty much throw the kitchen sink at it, and some of those differences are probably real. Some of the subjects had head injury histories (former military?); they claim that excluding these did not change the results, but while they show some of this in the supplementary material (one of the longest I've seen), they don't show it for the white matter volume change.

Conclusions? "Significantly lower functional connectivity was observed in the auditory and visuospatial networks in patients compared with controls." You don't have to believe a sonic weapon was involved to believe there are differences here; we're looking at a cohort that reported some phenomenon and comparing it to a control group(s). You could equally believe that the volume difference (and connectivity differences) might be associated with suffering long-term anxiety, or with a higher vulnerability to mass hysteria or hallucination, or with insecticide exposure as suggested by other posters.