Friday, March 23, 2012

How we made a social news service with machine learning from scratch


How do you find the news that interests you on Hacker News, TechCrunch, The Next Web, or ReadWriteWeb? When you go to a news website, do you read every article, or do you scan the titles and guess which ones to read based on your interests?

In my case, I noticed that I look for news that interests me, such as startup news or news about the Java and Ruby programming languages. That raises the question: can I build a system that scans all the news, classifies each article, indexes its relevant terms, and serves the result to me as a stream of news I would actually like to read?

So we (my friend and I) started building a system which, when given a link to a web page, does the following:

It tries to get the relevant content from the page.

It uses natural language processing to identify things like sentences and parts of speech. We then score the relevance of each word in the article.

We also run classification and clustering on the article, using a number of statistical methods, to find out which category it belongs to: for example, whether it is a technical, scientific, or law-related article.
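To make the three steps above concrete, here is a minimal sketch in Python. It is only an illustration, not the code ScoopSpot runs: it assumes the article body lives in `<p>` tags, swaps in plain term frequency for the word-relevance scoring, and replaces the statistical classifier with simple keyword overlap. The names `STOPWORDS`, `CATEGORY_KEYWORDS`, `extract_text`, `score_terms`, and `guess_category` are all hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical stopword list and category keywords, chosen only for illustration.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "for", "on", "with", "that", "this"}
CATEGORY_KEYWORDS = {
    "technical": {"java", "ruby", "code", "programming", "software"},
    "scientific": {"research", "experiment", "physics", "biology"},
    "law": {"court", "patent", "lawsuit", "legal"},
}


class ParagraphExtractor(HTMLParser):
    """Collects text found inside <p> tags and ignores everything else."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <p> tags we are currently inside
        self.chunks = []    # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_text(html):
    """Step 1: pull the paragraph text out of a page (assumes <p> markup)."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left over from the markup.
    return " ".join(" ".join(parser.chunks).split())


def score_terms(text, top_n=5):
    """Step 2: score words by relative frequency, skipping stopwords."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words)
    total = sum(counts.values()) or 1
    return [(word, count / total) for word, count in counts.most_common(top_n)]


def guess_category(text):
    """Step 3: pick the category whose keywords overlap the text the most."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

A real system would need far more robust content extraction and a trained classifier, but the shape of the pipeline (extract, score, categorize) stays the same.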

Once we have the meat of the document, we tag it with the terms identified in the steps above.

Next we faced the challenge of making the result available to ourselves. We needed a website that would let us see the content as a stream, and the stream had to be customizable to each user's choice. So we thought of creating a micro-blogging site; after all, we just needed to put up a post with the article title, URL, and tags. Here is how a post looks today in the stream.

If you would like to see a full stream:



As you can see here, I am seeing news tagged with startup and java, because I am following news related to java and startup.
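The idea of a personalized stream can be sketched as a filter over posts by followed tags. This is an assumption about how such a filter could work, not a description of ScoopSpot's implementation; `filter_stream` and the dictionary layout of a post are hypothetical.

```python
def filter_stream(posts, followed_tags):
    """Keep only the posts that carry at least one tag the user follows.

    Each post is assumed to be a dict with a "tags" list; tag matching is
    case-insensitive so "Java" and "java" count as the same tag.
    """
    followed = {tag.lower() for tag in followed_tags}
    return [
        post for post in posts
        if followed & {tag.lower() for tag in post["tags"]}
    ]
```

For example, a user following `startup` and `java` would see a post tagged `Java` but not one tagged only `Ruby`.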

We noticed that we needed more room than Twitter's 140 characters, so we increased the character limit to 300. One more choice we made was to add tags separately from the actual post. This allows us to add as many tags as we want. We feel tags are more like meta information, so one should not be limited to hashtags.
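One way to picture this design is a post record where the 300-character limit applies only to the text, while tags live in their own field. The sketch below is illustrative; the `Post` class, its field names, and `MAX_POST_LENGTH` are hypothetical and not taken from ScoopSpot's code.

```python
from dataclasses import dataclass, field

MAX_POST_LENGTH = 300  # the limit mentioned above; the constant name is made up


@dataclass
class Post:
    author: str
    text: str
    # Tags are stored apart from the text, so they never eat into the limit.
    tags: list = field(default_factory=list)

    def __post_init__(self):
        if len(self.text) > MAX_POST_LENGTH:
            raise ValueError(
                f"post text is {len(self.text)} characters; "
                f"the limit is {MAX_POST_LENGTH}"
            )
```

Because tags are metadata rather than part of the text, a post can sit right at the limit and still carry any number of tags.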

Once we had such a system, we needed a way to get the latest news from Hacker News, TechCrunch, The Next Web, etc. RSS feeds came to our rescue. These websites provide RSS feeds, which we read periodically. We created an account called news in ScoopSpot which does the posting on our micro-blogging site. If you would like to see news in action, visit: http://news.scoopspot.com/.
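Reading a feed periodically comes down to two things: parsing the items out of the RSS document, and remembering which links have already been posted. Here is a minimal sketch using only Python's standard library. It assumes a plain RSS 2.0 feed with un-namespaced `<item>`, `<title>`, and `<link>` elements (an Atom feed would need different handling), and the function names and sample feed are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 document standing in for a real feed.
SAMPLE_FEED = """
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First story</title><link>http://example.com/1</link></item>
    <item><title>Second story</title><link>http://example.com/2</link></item>
  </channel>
</rss>
"""


def parse_rss_items(rss_xml):
    """Return a list of {"title", "link"} dicts, one per <item> in the feed."""
    root = ET.fromstring(rss_xml.strip())
    items = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if link:  # an item without a link cannot be posted
            items.append({"title": title, "link": link})
    return items


def new_items(items, seen_links):
    """Return items whose link has not been seen yet, and record them as seen."""
    fresh = []
    for item in items:
        if item["link"] not in seen_links:
            seen_links.add(item["link"])
            fresh.append(item)
    return fresh
```

Calling `new_items(parse_rss_items(feed_text), seen)` on each polling round, with `seen` being a set kept between rounds, yields only the stories that have not been posted yet.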

So now, if a user comes to our website and posts something with an article link, we are able to auto-tag it.
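As a rough sketch of that flow: find the first link in the post text, then hand it to whatever produces the tags. The regular expression, `find_article_link`, `auto_tag`, and the `tagger` callback are all hypothetical stand-ins; the real tagging is the pipeline described earlier.

```python
import re

# Deliberately simple pattern: anything starting with http:// or https://
# up to the next whitespace. Real-world URL detection is messier.
URL_PATTERN = re.compile(r"https?://\S+")


def find_article_link(post_text):
    """Return the first URL in the post, or None if there is none."""
    match = URL_PATTERN.search(post_text)
    if match is None:
        return None
    # Drop punctuation that commonly trails a link at the end of a sentence.
    return match.group(0).rstrip(".,;:!?)")


def auto_tag(post_text, tagger):
    """Tag a post by passing its first link to `tagger` (a stand-in callback)."""
    link = find_article_link(post_text)
    return tagger(link) if link else []
```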

Here are some links to tag-based pages:

Ruby: http://www.scoopspot.com/Ruby

Startup: http://www.scoopspot.com/Startups

Steve Jobs: http://www.scoopspot.com/Steve_Jobs

If you are interested in trying it out, you can visit: www.scoopspot.com
