It didn't decide to do anything. The input photo is all the data collected about you. The output photo might be a single pixel describing your credit rating. And the filter is the entirety of the program.
One of us isn't getting this, and I don't think it's me!
Sorry, but afraid not. You are indeed missing it. There are no decision trees, or state machines in machine learning / neural nets (in they way you appear to be thinking).
Your retina and brain is comprised of neurons. How do we ask it how you decided you just saw a cat? Big hint, it's not the way you might think. We can slightly describe how it actually happens in terms of layers that look for vertical edges, horizontal edges, motion, image convolutions, and so on. That's all you can get out of machine learning.
Even better comparison - how do you recognize someone's voice? Can you describe an average friends voice well enough so that someone who has never met them will uniquely identify them the first time they hear it? If you magically tracked all the neural activity, you would have worthless information about relative weights of harmonics and frequencies and time delays, but it still results in either recognition, familiarity or not. Even with all the details, it doesn't tell us what voices might be easily misidentified, or who could do a good impression of that person.