Introduction

Overview

Determining a winner among a set of items using active pairwise comparisons under a limited budget is a challenging problem in preference-based learning. The Dueling Bandit Toolkit implements the PARWiS algorithm, proposed by Sheth and Rajkumar [17], which leverages spectral ranking and disruptive pair selection to identify the best item under shoestring budgets. The toolkit extends PARWiS with two novel variants: Contextual PARWiS, incorporating item features, and RL PARWiS, using reinforcement learning for pair selection. These are compared against baselines like Double Thompson Sampling [19] and a random selection strategy.

The toolkit evaluates performance on synthetic and real-world datasets (Jester [5] and MovieLens [6]) using budgets of 40, 60, and 80 comparisons for 20 items. Metrics include recovery fraction, true rank of reported winner, reported rank of true winner, cumulative regret, and the separation metric \(\Delta_{1,2}\). Results show that PARWiS and RL PARWiS outperform baselines, particularly on the Jester dataset with a higher \(\Delta_{1,2}\), while Contextual PARWiS shows comparable performance, indicating potential for further feature optimization.

Background

Preference-based learning through pairwise comparisons is widely used in recommender systems, social choice, and information retrieval [8, 18]. In scenarios with limited comparison budgets (shoestring budgets), efficient algorithms are critical. PARWiS addresses this by using spectral ranking [14] and active pair selection, building on the Bradley-Terry-Luce (BTL) model where the probability of item \(i\) beating item \(j\) is:

\[P_{i,j} = \frac{w_i}{w_i + w_j}\]

This toolkit provides a practical implementation of PARWiS and its extensions, enabling researchers to explore winner determination under constrained budgets.