Dynamic Online Pricing with Incomplete Information Using Multi-Armed Bandit Experiments



Consider the pricing decision facing a manager at a large online retailer that sells millions of products. The manager must set real-time prices for each of these products, yet it is infeasible to have complete knowledge of the demand curve for each one. The manager can run price experiments to learn about demand and maximize long-run profits. Two aspects distinguish this setting from traditional brick-and-mortar settings. First, because of the sheer number of products, pricing must be automated. Second, an online retailer can change prices frequently. In this paper, we propose a dynamic price experimentation policy for a firm with incomplete demand information. For this general setting, we derive a pricing algorithm that balances earning profit immediately against learning for future profits. The proposed approach combines multi-armed bandit (MAB) algorithms from statistical machine learning with partial identification of consumer demand from economic theory. Our automated policy solves this problem using a scalable, distribution-free algorithm. We show that our method converges to the optimal price faster than standard machine learning MAB solutions to the problem. In a series of Monte Carlo simulations, we show that the proposed approach performs favourably compared to methods in computer science and revenue management.
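
To make the earning-versus-learning trade-off concrete, the sketch below runs the textbook UCB1 bandit rule over a fixed grid of candidate prices. It is an illustration of the kind of standard MAB baseline mentioned above, not the algorithm proposed in this paper: it does not use partial identification of demand. The function names, the price grid, and the linear purchase-probability curve are all assumptions made for the example.

```python
import math
import random


def ucb1_pricing(prices, buy_prob, rounds, seed=0):
    """Plain UCB1 over a fixed price grid (illustrative baseline only).

    Each round one price is offered to one simulated customer; the reward
    is the revenue from that customer, scaled to lie in [0, 1].
    Returns how many times each price was offered.
    """
    rng = random.Random(seed)
    max_price = max(prices)
    pulls = [0] * len(prices)
    total_reward = [0.0] * len(prices)

    for t in range(1, rounds + 1):
        if t <= len(prices):
            # Offer every candidate price once before using the index.
            arm = t - 1
        else:
            # Empirical mean reward (earning) plus a confidence bonus
            # that is larger for rarely tried prices (learning).
            arm = max(
                range(len(prices)),
                key=lambda k: total_reward[k] / pulls[k]
                + math.sqrt(2.0 * math.log(t) / pulls[k]),
            )
        sold = rng.random() < buy_prob(prices[arm])
        reward = prices[arm] / max_price if sold else 0.0
        pulls[arm] += 1
        total_reward[arm] += reward
    return pulls


def linear_demand(price):
    # Hypothetical purchase probability that falls linearly in price.
    return max(0.0, 1.0 - price / 10.0)


pulls = ucb1_pricing([1.0, 3.0, 5.0, 7.0, 9.0], linear_demand, rounds=5000)
```

Here `pulls` records how often each candidate price was offered. Under these assumed settings the rule keeps sampling every price occasionally while shifting offers toward prices with higher observed revenue.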