Feature clustering! | (mini)Book Genome Project

To evaluate the quality of suggestions from mBG, we manually created a complete shelf for the concept ‘Victorian literature’. All 49 books in our dataset whose authors are broadly considered to be Victorian-era British novelists were coded as positive. Books that a user might mistake for Victorian British novels were coded as neutral, since a user might genuinely be unsure whether they belong on the shelf. All remaining books were coded as negative.

Armed with this fully classified example shelf, the first question we set out to answer, before even assessing the quality of mBG’s suggestions, was whether distance in our feature space is a useful metric for evaluating book relatedness; that is, whether shelf concepts form clusters in our feature space.

To do this, we computed the average number of positive books in two samples of $k$ books, for $k$ from 1 to 300. One sample was drawn at random from the full dataset; the other consisted of the nearest neighbors of shelf members. As can be seen in the figure, the number of positives found by random sampling is, on average, the value one would expect when drawing $k$ books from a pool of $N=2165$ total books, where $p = \frac{49}{2165}$ is the probability of drawing a positive book. In other words, its average is $kp$, the mean of a binomial distribution with $k$ trials and success probability $p$. (If the $k$ books are drawn without replacement, the count is strictly hypergeometric, which has the same mean.)
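The random baseline can be checked with a short simulation. The following is a minimal sketch, not our actual analysis code: it assumes books are drawn without replacement, and the labelling of books 0 to 48 as positives, the function names, and the trial count are all hypothetical choices made for illustration.

```python
import random

N = 2165          # total books in the dataset (from the text)
N_POSITIVE = 49   # books coded as positive (from the text)
p = N_POSITIVE / N


def expected_positives(k):
    """Expected number of positives in a uniform random sample of k books."""
    return k * p


def simulated_positives(k, trials=2000, seed=0):
    """Monte Carlo estimate: draw k distinct books, count positives, average."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # Hypothetical labelling: book indices 0..48 stand in for the positives.
        sample = rng.sample(range(N), k)
        total += sum(1 for book in sample if book < N_POSITIVE)
    return total / trials


if __name__ == "__main__":
    for k in (1, 50, 100, 300):
        print(k, round(expected_positives(k), 2), round(simulated_positives(k), 2))
```

For each $k$, the simulated average should land close to $kp$, which is the baseline the nearest-neighbor sample is compared against.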

The sample formed by choosing $k$ books near books on the shelf shows a substantially higher incidence of other shelf members. This suggests that, at least for our test shelf, the concept the shelf represents does form a cluster in our feature space.
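To illustrate what such a comparison looks like, here is a rough sketch on synthetic data. Everything in it is an assumption made for illustration: the 5-dimensional Gaussian features, the choice of 10 positives as known shelf members ("seeds"), Euclidean distance, and ranking candidates by distance to their closest seed. It is not the mBG feature space or our exact neighbor-selection procedure.

```python
import math
import random

rng = random.Random(0)
N, N_POS, DIM = 2165, 49, 5  # N and N_POS from the text; DIM is hypothetical


def make_point(centre):
    """Synthetic feature vector: independent Gaussians around `centre`."""
    return [rng.gauss(centre, 1.0) for _ in range(DIM)]


# Assumption: positives cluster around a shifted centre, negatives around 0.
books = [make_point(3.0) for _ in range(N_POS)] + \
        [make_point(0.0) for _ in range(N - N_POS)]
is_positive = [i < N_POS for i in range(N)]

seeds = set(range(10))  # hypothetical: 10 positives already on the shelf
candidates = [i for i in range(N) if i not in seeds]


def dist_to_shelf(i):
    """Euclidean distance from book i to its closest seed book."""
    return min(math.dist(books[i], books[s]) for s in seeds)


# Candidates ordered from nearest to farthest from the shelf.
ranked = sorted(candidates, key=dist_to_shelf)


def positives_near_shelf(k):
    """Held-out positives among the k candidates nearest to the shelf."""
    return sum(is_positive[i] for i in ranked[:k])


def positives_random_expected(k):
    """Expected held-out positives in k candidates drawn uniformly at random."""
    return k * (N_POS - len(seeds)) / len(candidates)


if __name__ == "__main__":
    for k in (10, 50, 100, 300):
        print(k, positives_near_shelf(k), round(positives_random_expected(k), 2))
```

When the positives really do cluster, as they do by construction here, the nearest-neighbor count sits well above the random expectation for every $k$; if the two curves coincided, distance in the feature space would carry no information about shelf membership.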