I think it worked correctly.
If a different picture appeared every x frames, then I would argue there were effectively two videos here: one of the lion and one of the car, both being played back to the AI. That it classified the lower-frequency one correctly suggests it worked somewhat as intended, given that it should give one answer. A human might not see the car because of the persistence-of-vision effect and the eye's low bandwidth, but the AI would. And since the picture of the car was identical in every inserted frame, the AI would see it more clearly than something that continually moved around and changed shape like the lion. Effectively, the lion was variable noise; the car, a constant if infrequent feature.
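To make the "two interleaved videos" framing concrete, here's a minimal sketch. Everything in it is hypothetical (the frame labels, the stand-in classifier, the interleaving interval x); it just shows how a constant frame repeated every x frames produces a small but perfectly consistent vote alongside the dominant, varying footage:

```python
from collections import Counter

def build_stream(num_frames: int, x: int) -> list[str]:
    """Interleave one constant 'car' frame every x frames into lion footage."""
    return ["car" if i % x == 0 else f"lion_{i}" for i in range(num_frames)]

def classify_frame(frame: str) -> str:
    # Stand-in for a real per-frame classifier: the car frame is identical
    # every time, so it classifies consistently; the lion frames vary
    # from frame to frame but still read as "lion".
    return "car" if frame == "car" else "lion"

stream = build_stream(num_frames=60, x=10)
votes = Counter(classify_frame(f) for f in stream)
# The lion dominates by frame count (54 vs 6 here), but the car shows up
# as a perfectly repeated signal -- "a constant if infrequent feature".
```

Whether the model answers "lion" or "car" then comes down to how it aggregates those per-frame signals: a simple majority vote favors the lion, while anything that rewards consistency across frames favors the car.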