Sequential Max-Min Bilevel Programming with Incomplete Information and Learning



We present a framework for a class of sequential decision-making problems in the context of max-min bilevel programming, a class of hierarchical optimization problems with applications in network interdiction, defense, vulnerability analysis, among others. At each period, the leader allocates resources to disrupt the performance of the follower (e.g., as in the defender-attacker, attacker-defender or interdiction problems), who in turn minimizes some cost function over a set of activities that depend on the leader's decision. While the follower has complete knowledge of his/her problem, the leader has only partial information, and needs to learn about the cost parameters, available resources, and the follower's activities from the feedback  generated by the follower's actions. We propose a class of greedy and robust policies and show that these policies are weakly optimal, eventually match the oracle's actions, and provide a real-time certificate of optimality.  In particular, we measure policy performance in terms of its time-stability, defined as the number of periods it takes for the leader to match the actions of an oracle with complete information. We also study a lower bound on any policy performance based on the notion of a semi-oracle. Our numerical experiments demonstrate that the proposed policies consistently outperform several other reasonable benchmarks, and perform fairly close to the semi-oracle.