Starting out: The Question

For the first few posts I decided it would be a good idea to build a machine learning model. I have seen several tutorials on this topic, but they rarely show the iterative process. It will be an interesting journey for both the reader and the writer, because I have no idea whether there is a correlation between the features and the output, and therefore no idea whether the final model will be of any use.

First we have to start with a problem definition. What question are we trying to answer? One of the most interesting problems out there is predicting stock market movements. A lot of work has already been done in this area. You have day traders trying to make money on small price changes (a lot of automated trading systems work on this time scale). There are swing traders who try to feel out the momentum of the stock market. Then there are the long-term investors, trying to gauge the real value of a company and catch those that are currently undervalued.

However, these are not the kinds of prediction we will be interested in. I believe that the value of a stock market in many ways reflects the current mood of a population. If there is a crisis, stocks go down; if there is good news, stocks go up. By analysing the news we can try to work out the effect it will have on the stock market.

The analysis of the mood of a text is called sentiment analysis. There has already been a lot of work done in this field, and we will not be reinventing the wheel. It is a different and vast topic that would require a whole other project. We will use an existing library created by Stanford. They have created a command-line program, written in Java, whose models are built from movie review comments.

There has also been a lot of work done on combining sentiment analysis with stock market prediction. There are companies that monitor Twitter for information and notify traders of any relevant events. We will not be interested in Twitter in our case. We don’t have free access to its historical data, and I would like to try something a little different. We will use the Reddit comments on the /r/news subreddit. Hopefully comments on the news will be a little more relevant to our problem space than 140-character tweets about Bieber's new album. I know it is possible to define filters on Twitter, but it would be a lot of work to specify all the keywords we would like to monitor. Besides, maybe there IS a correlation between Bieber fans and the stock market! We will leave that for another day. There are other possible subreddits that are even more finance-specific, but /r/news has by far the most activity, so we will start there. I will use PRAW to access the Reddit API.

Finally, we have to decide which machine learning toolkit we will use. I will not go into a comparison of all the options here, since this is highly dependent on personal preference. I like Python with sklearn, and this is what I will use.

To sum up:

Question: Can we predict stock market movements (in our case, the NASDAQ) using sentiment analysis of /r/news comments?
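
To make the question a little more concrete, here is a rough sketch of how it could be framed as a supervised learning problem. Everything in it is an assumption made for illustration: the numbers are made up, the 0 to 4 sentiment scale is assumed, the names daily_sentiment, daily_close and build_dataset are hypothetical, and choosing "did the index close higher on the next trading day" as the target is just one possible reading of "movements". The real features and target may well end up looking different.

```python
from datetime import date

# Made-up numbers for illustration only; real values would come from the
# sentiment tool and from historical index data.
# Assumed scale: 0 (very negative) to 4 (very positive).
daily_sentiment = {
    date(2015, 1, 5): 1.4,
    date(2015, 1, 6): 2.3,
    date(2015, 1, 7): 2.1,
    date(2015, 1, 8): 1.9,
}

# Made-up closing values of the index on the same days.
daily_close = {
    date(2015, 1, 5): 100.0,
    date(2015, 1, 6): 98.0,
    date(2015, 1, 7): 101.0,
    date(2015, 1, 8): 103.0,
}


def build_dataset(sentiment, close):
    """Pair each day's mean sentiment with the next trading day's direction.

    Returns (X, y): X is a list of one-element feature rows, and y holds
    1 if the index closed higher on the following trading day, else 0.
    """
    trading_days = sorted(close)
    X, y = [], []
    # The last trading day has no "next day", so zip() drops it.
    for today, next_day in zip(trading_days, trading_days[1:]):
        if today not in sentiment:
            continue  # no comments scored for this day
        X.append([sentiment[today]])
        y.append(1 if close[next_day] > close[today] else 0)
    return X, y


X, y = build_dataset(daily_sentiment, daily_close)
print(X)
print(y)
```

A list of feature rows plus a list of labels is the shape that sklearn estimators accept in fit(X, y), so a table like this is what the later posts would need to produce from the real data.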

In the next post we will implement a way to download the needed data.
