Ask HN: Extracting book titles from comments
40 points by _6cj7 on April 26, 2017 | 21 comments
Using named-entity recognition, how can I extract book titles from HN comments? Should I train an NER chunker on HN data?


There is one project[0] that accomplishes the same thing, but instead of searching for the titles, it searches for links to online stores (like Amazon) and then checks whether those links are indeed books. Might be relevant to you.

[0] http://hackernewsbooks.com/


Not shilling: I recommend signing up for the weekly newsletter. I've picked up around nine very interesting books from it. The best recommendation I got from it (subjectively) was Zen and the Art of Motorcycle Maintenance.

I really enjoy the service. I mostly go through all the top books and see if anything calls out to me, and so far the site has done really well at keeping the informal contract of "I give you my email, you don't spam me".


I am aware of this website. Initially, I thought of doing the same: extract all links from comments and then figure out which ones are "book-related". However, after browsing several threads I noticed that lots of book recommendations are in plain text.


Ha, that's hilarious, because I created one that searches all Reddit comments for Amazon products[0]. I'd always been thinking of moving toward searching for all proper nouns, but obviously that's more difficult. So if anyone has suggestions about whether (and how) to expand it, let me know!

[0] http://www.productmentions.com/


I believe this approach yields a large enough sample to give a good overview of "most commonly linked-to items." If 100 people link to the same product across many domains, chances are that most of those will be Amazon links. You don't get all of the links, but you get most of them.

The exception will be products not available on Amazon. And some communities on Reddit may be more loyal to another e-commerce platform you would miss (crafters to Etsy, antique collectors to eBay, etc.).


I did this a while ago (www.hnreads.com, bookbot.io). I manually labelled ~2k comments, trained an NER system, then validated the titles via Amazon's API. It was pretty easy once I had the manually labelled comments. I don't recall my F1/precision/recall scores - they were OK, but lower than the state of the art reported in papers.


That's interesting. Which labels were you assigning to them?


I had a macro in Emacs that wrapped the highlighted text in an XML tag (say <book></book>). When processing that, I could convert it to whatever labelling scheme I wanted - e.g. IOB or whatever you fancy. The labelling didn't really take that long, maybe a few hours over a couple of days.
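A minimal sketch of that conversion in Python (the <book> tag and whitespace tokenization follow the description above; everything else is an assumption):

    import re

    def xml_to_iob(text):
        # Convert <book>...</book> annotations to (token, IOB-label) pairs
        pairs = []
        # re.split with a capture group keeps the tagged chunks in the output
        for chunk in re.split(r"(<book>.*?</book>)", text):
            if chunk.startswith("<book>"):
                inner = chunk[len("<book>"):-len("</book>")].split()
                pairs += [(tok, "B-BOOK" if i == 0 else "I-BOOK")
                          for i, tok in enumerate(inner)]
            else:
                pairs += [(tok, "O") for tok in chunk.split()]
        return pairs

    print(xml_to_iob("I loved <book>Snow Crash</book> back then"))
    # [('I', 'O'), ('loved', 'O'), ('Snow', 'B-BOOK'),
    #  ('Crash', 'I-BOOK'), ('back', 'O'), ('then', 'O')]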


I would do something like this (a rough sketch of steps 1-2 follows the list):

1. Get a few thousand book titles that are "long enough" that the probability of one appearing in a sentence without referring to the book is low

2. Search for these spans in HN comments and use the resulting corpus to train Stanford CoreNLP's NER

3. Run the NER on all comments

4. Check against Open Library or another book DB that the extracted spans are real book titles
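A rough sketch of steps 1-2 (the title list and example comment are placeholders; the tab-separated token/label output is the format Stanford's CRF-based NER trainer consumes):

    import re

    def iob_tag(comment, titles):
        # Mark known book titles with B-BOOK/I-BOOK, everything else O
        spans = []
        for title in titles:
            for m in re.finditer(re.escape(title), comment):
                spans.append((m.start(), m.end()))
        tagged, pos = [], 0
        for token in comment.split():
            start = comment.index(token, pos)
            pos = start + len(token)
            label = "O"
            for s, e in spans:
                if s <= start < e:
                    label = "B-BOOK" if start == s else "I-BOOK"
            tagged.append((token, label))
        return tagged

    # Titles would come from Open Library; comments from the HN API
    titles = ["Zen and the Art of Motorcycle Maintenance"]
    comment = "Try Zen and the Art of Motorcycle Maintenance sometime."
    for token, label in iob_tag(comment, titles):
        print(token + "\t" + label)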


This is known as bootstrapping or semi-supervised learning, just in case you want to look up some theory behind it.


How so? I don't see a reference to sampling with replacement or a suggestion to re-use the unsupervised results to further improve the model. Seems like I'm missing something...


You're creating a small labeled dataset without manually labeling it, in order to train a supervised learning model.

I'm not an expert, but I think a bootstrapping technique doesn't imply continual improvement of the model.


Ahh, I see. I was confused about whether semi-supervised approaches rely on using predictions on the unlabeled data to improve model performance. Wikipedia seems to suggest this is a key component, which isn't mentioned in the OP:

> Semi-supervised learning may refer to either transductive learning or inductive learning. The goal of transductive learning is to infer the correct labels for the given unlabeled data.

Agreed on the bootstrap point, but in the proposed approach you're not artificially expanding your sample size by sampling with replacement.


I built something similar for Reddit using Amazon links: http://booksreddit.com

I'm in the process of adding Goodreads and other book websites to get better suggestions. (Using the Amazon API is limiting at scale.)

As for your question: I've been researching Python natural language processing to get probable named entities and check them against the Amazon API. I suggest https://spacy.io, which has reasonable named-entity extraction. However, doing it at scale might produce a lot of "books" whose titles are just common phrases.
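A minimal sketch with spaCy's pretrained pipeline (en_core_web_sm and the WORK_OF_ART label come from its stock English models; coverage of book titles is hit-or-miss):

    import spacy

    # Model install (assumed): python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("I just finished Zen and the Art of Motorcycle Maintenance "
              "by Robert Pirsig.")
    for ent in doc.ents:
        # WORK_OF_ART covers books/songs/etc.; PERSON may catch the author
        print(ent.text, ent.label_)

Each hit would still need validation against a book database before you trust it.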


I hadn't heard of spaCy - I've been struggling with Stanford's NER for a few days. Looks great, will check it out. Thanks!


I've found experimentally that searching Amazon for a book title, even if you're only close, almost always turns up that exact book. However, the Product API has really restrictive terms.

I would look at the Open Library dataset - you might be able to match titles in it to comments, or use it to validate the NER output, if you don't want to go the Amazon link-regex route.

The entire dataset is available for download, or you can build a prototype with their API - I did this to map speakers to books with https://www.findlectures.com (you can see it if you hover over a name - e.g. https://www.findlectures.com/?p=1&speaker=-Barack%20Obama).
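A minimal validation sketch against the Open Library search endpoint (the title parameter and docs field are from their public search API; the exact-match check is my own assumption):

    import requests

    def is_known_title(candidate):
        # Query Open Library's search API for the candidate span
        resp = requests.get(
            "https://openlibrary.org/search.json",
            params={"title": candidate, "limit": 5},
            timeout=10,
        )
        resp.raise_for_status()
        docs = resp.json().get("docs", [])
        # Accept only case-insensitive exact title matches, not fuzzy hits
        return any(d.get("title", "").lower() == candidate.lower()
                   for d in docs)

    print(is_known_title("Zen and the Art of Motorcycle Maintenance"))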


A pretty common structure in comments:

"name of book" by "author_first_name author_last_name"

If you could determine the most common words written before the title begins (read, liked, loved, recommend, etc.), you could probably parse out a lot of titles plus their authors.
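A rough regex along those lines (the trigger-word list and the capitalized author pattern are guesses, so expect both misses and false positives):

    import re

    # Optional trigger verb, quoted title, then 'by First Last'
    PATTERN = re.compile(
        r'(?:read|liked|loved|recommend(?:ed)?)?\s*'
        r'"(?P<title>[^"]{3,80})"\s+by\s+'
        r'(?P<author>[A-Z][a-z]+(?:\s+[A-Z][a-z]+){1,2})'
    )

    comment = 'I recommend "The Phoenix Project" by Gene Kim if you like DevOps.'
    m = PATTERN.search(comment)
    if m:
        print(m.group("title"), "/", m.group("author"))
        # -> The Phoenix Project / Gene Kim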


You would probably be better served starting with an ISBN database and just looking for spans that match book titles. I'm not sure if there's one you can download for free, but I saw one on sale for $675.

NER will give you a lot of entities that you need to resolve against something to check whether they're actually books anyway, and depending on volume you may not find a free API tier.


I was thinking about querying the Google Books API to check whether NNPs are book titles/authors; however, as you mentioned, NER produces a lot of entities. Luckily, there is the Open Library dump[0], which has around 16 million book titles (IIRC) with some metadata.

[0] https://openlibrary.org/developers/dumps
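A sketch of pulling titles out of that dump (this assumes the documented layout of the works dump: five tab-separated columns with the record JSON last):

    import gzip
    import json

    titles = set()
    # ol_dump_works_latest.txt.gz: type, key, revision, last_modified, JSON
    with gzip.open("ol_dump_works_latest.txt.gz", "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line.rstrip("\n").split("\t")[4])
            title = record.get("title")
            if title:
                titles.add(title.lower())

    print(len(titles), "titles loaded")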


I've been using the Amazon Product API for Booknshelf (www.booknshelf.com), passing "category => book" in the search to query only Amazon Books. The book search by title is quite relevant. I thought this might be useful!


One problem you will run into is that X (the phrase in the comment) might be a permutation of a book title, or even an exact title mentioned in a comment where the author did not intend it to be interpreted as such. This is mostly due to ambiguity and information density in language, as well as how we assign names/titles to things.

Ex. If I write a comment saying "The number of our clients went from zero to one instantly", then "Zero to One" in that sentence will match Peter Thiel's book.

If you are willing to put up with such "noise" in the output, then you don't need to train a thing or even use ML. Just chunk the given piece of text and look up every contiguous token run (from length 1 up to N) in your SQL entities database.
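A minimal sketch of that lookup (an in-memory set stands in for the SQL entities table):

    def ngrams(tokens, max_n):
        # Yield every contiguous token run of length 1..max_n
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                yield " ".join(tokens[i:i + n])

    # In practice this would be a SQL table of known titles
    known_titles = {"zero to one", "the lean startup"}

    comment = "The number of our clients went from zero to one instantly"
    tokens = comment.lower().split()

    print([g for g in ngrams(tokens, 4) if g in known_titles])
    # ['zero to one'] - exactly the false positive described above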

You can generally build this kind of entities database by aggregating datasets from all over the web, or even use something like the Google Books datasets: https://books.google.co.uk/

https://storage.googleapis.com/books/ngrams/books/datasetsv2... - This is an n-gram set, but there should be one with only book titles somewhere. Can't find it at the moment, but you can search for it yourself!

You could even (if you are brave enough) try to use Wikipedia's dumps to mine book titles from articles.



