Visual symbols or events can provide predictive information about upcoming sound events. When the perceived sound does not confirm the visual prediction, the incongruency response (IR), a prediction error signal in the event-related brain potential, is elicited. It is unclear whether these predictions are derived from lower-level local contingencies (e.g., recent events or repetitions) or from higher-level global rules applied top-down. In a recent study in which a preceding note symbol predicted sound pitch, the IR was elicited only when one of the two sounds was presented more frequently; it was absent when both sounds were equally probable. These findings suggest that local repetitions support predictive cross-modal processing. On the other hand, the IR has also been observed with equal stimulus probabilities when visual patterns predicted the upcoming sound sequence, which suggests the application of global rules. Here, we investigated the influence of stimulus repetition on the elicitation of the IR by presenting trains of identical trials in which a particular visual note symbol cued a particular sound, resulting in either a congruent or an incongruent pair. Trains of four different lengths (1, 2, 4, or 7 trials) were presented. The IR was already observed after a single presentation of a congruent visual-cue-sound combination and did not change in amplitude as train length increased. We conclude that higher-level associations applied in a top-down manner are involved in eliciting the prediction error signal reflected by the IR, independent of local contingencies.