andrew w. moore | blog

Guessing a book's page length from the audiobook's duration

Over the past year, I’ve been gradually rebuilding the features I used on Goodreads under the /reading route of my site. I’ve made dedicated subpages for each book that I’ve finished, a timeline, and a display of books I’m currently reading. Most recently, I’ve been working on a /reading/stats page that summarizes my reading activity each year.

Aside from the book covers, all of the metadata I add about each book is recorded by hand. For example, I’ll write down the number of pages in the edition I’m reading. The newest data point I’ve started tracking is the medium I used when reading a book (e.g., ebook, audiobook, or print). As I went back to note which books I’d listened to, I started to feel itchy: are audiobook lengths and page-lengths really comparable? I’d like to have a general sense of how “large” the books in my collection are, and it’d be nice if I could use a standard measure.

Books are printed in various sizes, which means the number of words on a page varies by book. However, most e-readers can track progress based on the print edition’s page numbers, so there’s at least some correspondence. But what about audiobooks? From personal experience, it does seem like some narrators are a bit faster or slower. Perhaps the book’s subject matter or genre also influences how fast the book is read. These hunches suggest that we’ll see variability in audio-length at different page-lengths.

Comparing these lengths directly is probably the best starting point. At the time of writing this post, I’ve recorded data for 39 books. For each one, I went back and wrote down Amazon’s “listening length” value for its audiobook edition. First, some univariate statistics:

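| N | Pages (mean) | Pages (SD) | Minutes (mean) | Minutes (SD) | Hours (mean) | Hours (SD) |
|---|---|---|---|---|---|---|
| 39 | 374.4 | 157.4 | 687.7 | 332.5 | 10.9 | 6.0 |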
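For anyone who wants to compute the same kind of summary for their own list, here is a minimal Python sketch using only the standard library. The `books` records are made-up placeholders rather than the data behind the table, `summarize` is a hypothetical helper name, and using the sample standard deviation is an assumption (the post doesn't say which SD was used).

```python
from statistics import mean, stdev

# Hypothetical (pages, minutes) records; placeholder values, NOT the post's data.
books = [(320, 540), (410, 780), (250, 450)]


def summarize(values):
    """Return (mean, sample standard deviation), each rounded to 1 decimal."""
    # Assumption: sample SD (n - 1 denominator); swap in statistics.pstdev
    # if the population SD is wanted instead.
    return round(mean(values), 1), round(stdev(values), 1)


pages = [p for p, _ in books]
minutes = [m for _, m in books]
hours = [m / 60 for m in minutes]  # derive hours from the recorded minutes

summary = {
    "pages": summarize(pages),
    "minutes": summarize(minutes),
    "hours": summarize(hours),
}
print(len(books), summary)
```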

Here’s what things look like when plotted. Herbert P. Bix’s Hirohito and the Making of Modern Japan is a bit of an outlier, but the relationship looks fairly linear. The equation for the regression line is pages = 0.4 × minutes + 113.5. Across the whole collection, the model makes prediction errors of about 65.3 pages (RMSE).
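A regression line and RMSE of this kind can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not necessarily how the numbers above were produced: it uses ordinary least squares, defines RMSE with an n denominator, and runs on made-up data. The names `fit_line` and `rmse` are hypothetical.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(xs, ys, slope, intercept):
    """Root-mean-square error of the line's predictions (n denominator)."""
    squared_errors = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Made-up example data (minutes, pages); NOT the 39 books from the post.
minutes = [300, 600, 900, 1200]
pages = [230, 360, 470, 600]

slope, intercept = fit_line(minutes, pages)
print(f"pages = {slope:.2f} * minutes + {intercept:.1f}")
print(f"RMSE = {rmse(minutes, pages, slope, intercept):.1f} pages")
```

Because RMSE is in the same units as the outcome, it reads directly as a typical miss in pages.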

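[Figure: scatter plot of pages against minutes for each book, with the fitted regression line. Hovering over a point shows the book’s pages, minutes, and title.]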
So, for an audiobook that’s 11 hours and 30 minutes long (690 min.), we’d estimate 389 pages.
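The estimate above is just the regression equation applied to a duration. As a small sketch, using the rounded coefficients quoted in the post (`estimate_pages` is a hypothetical helper name):

```python
SLOPE = 0.4        # pages per minute, as rounded in the post
INTERCEPT = 113.5  # pages, as rounded in the post


def estimate_pages(hours, minutes=0):
    """Estimate a book's page count from its audiobook duration."""
    total_minutes = hours * 60 + minutes
    return SLOPE * total_minutes + INTERCEPT


# 11 h 30 min is 690 minutes, the example duration used above.
print(estimate_pages(11, 30))
```

With the rounded coefficients this comes out at roughly 389.5 pages, in line with the 389 quoted above; the small gap is presumably down to the coefficients being rounded for display.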

The fit isn’t great for some books close to 600 pages in length; the four points with large residuals are all sci-fi epics. Genre might actually be relevant for prediction, but this could easily be noise due to the small sample size.

In practice, I think I feel good about only recording page-lengths for each book (rather than also recording listening times). It looks like the model is doing an okay job of predicting pages for audiobooks 5 to 14 hours in duration. Maybe I’ll revisit this in a year, and we can see what’s happening with longer books.