Loading large CSV file using Pandas

I recently ran into a problem when trying to load a large CSV file into Python using Pandas. When I tried to load the file, I got a message saying that the computer had run out of memory. This was surprising because the file was just over 2 GB and my machine has well over 10 GB of RAM. I never figured out which part of pandas was causing this (and, to be fair, I didn't want to waste my time on it), but I found a workaround that might be helpful to others who don't want to split the file into smaller chunks and still want to load it with pandas. Simply put: load the file several rows at a time, i.e. instead of splitting the file, split the loading. Here is an example:

[sourcecode language="python"]
import pandas as pd

# load the first 50000 rows, then skip the header plus those rows and append the rest;
# the names given here should match the columns of the first chunk
data = pd.read_csv("train_data.csv", nrows=50000)
data = data.append(pd.read_csv("train_data.csv", skiprows=50001, header=None, names=['a', 'b', 'c']))
[/sourcecode]

The file we load is called "train_data.csv". I noticed that the computer could handle loads of around 1 GB at a time, which in this case corresponded to around 500K records. We load the first 500K records, then load the rest and append the two DataFrames. If your CSV is even larger, just put the read in a loop like this:

[sourcecode language="python"]
chunks = []
for i in xrange((NUMBER_OF_RECORDS / MAX_RECORD_NUMBER) + (1 if NUMBER_OF_RECORDS % MAX_RECORD_NUMBER != 0 else 0)):
    # skip the header plus the chunks already read, then read one full chunk
    chunks.append(pd.read_csv("your_file.csv", skiprows=1 + MAX_RECORD_NUMBER * i,
                              nrows=MAX_RECORD_NUMBER, header=None, names=['a', 'b', 'c']))
data = pd.concat(chunks)
[/sourcecode]

Where NUMBER_OF_RECORDS is the number of records in the file and MAX_RECORD_NUMBER is the number of records your computer can handle loading at a time.

You also need to remember that if you skip lines, the headers will be wrong. You can't point the header parameter at a row index, because it refers to the index within the already-cut slice, not the original file. Just specify the headers yourself. This might be tedious if you have a large number of columns, but then you can simply save the column names from the first slice and reuse them in the following loads.
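For example, here is a small sketch of that last idea, reusing the column names from the first slice (same placeholder file name and MAX_RECORD_NUMBER as above):

[sourcecode language="python"]
# read the first slice normally so pandas picks up the real header row
first = pd.read_csv("train_data.csv", nrows=MAX_RECORD_NUMBER)
cols = list(first.columns)

# later slices skip the header plus everything already read and reuse those names
rest = pd.read_csv("train_data.csv", skiprows=1 + MAX_RECORD_NUMBER,
                   header=None, names=cols)
data = pd.concat([first, rest])
[/sourcecode]

Depending on your pandas version, passing chunksize to read_csv and concatenating the resulting chunks achieves much the same thing while keeping the header handling automatic.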


Searching for a car

Introduction

Over the past year I have been searching for a new car. My first car ever, actually. I am a huge supporter of public transport and a good bike, but the long commute and low bus frequency have started to get to me. This meant it was time for research. Normally I am quick to make a decision. It took me around a month between deciding to buy a flat and signing the deal. However, buying a car scares me a little. Maintenance costs can add up, especially with newer cars. I had a talk with a friend of mine who is a car mechanic and he recommended looking into the Toyota Corolla, especially the pre-2005 models, or even pre-2000. Apparently they are as failure-free as you can get. I am totally green when it comes to cars, so I took his word for it and got cracking on the search. While doing the research I quickly got lost in the sheer amount of data that is available and came to the realisation that this would be a great data mining exercise! It was also a good opportunity to learn some R. I have heard that it has some very nice visualisation tools and wanted to check them out.

Data Collection

For data gathering I picked otomoto.pl, one of the bigger sites listing used cars. I live in Poland and that is where I am searching for the car. For website scraping I like to use a Python library called Scrapy. It is simple to use and has good documentation. I will not get into the details of the implementation, since the program was relatively simple. I filtered the listing to show Toyota Corollas built prior to 2005, looped through the pages and accessed each offer. From each offer page I retrieved the information most important to me and stored the results in a csv file. Since there were only about 400 offers, there really was no reason to use anything more complicated for storage than a simple csv file. Of course, the data comes from running the script only once; if we were to collect data over a longer period, there would be more.

CODE
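The full scraper is behind the CODE link above; as a rough illustration of the approach (written against a recent version of Scrapy, and with a start URL, CSS selectors and field names that are placeholders rather than the real otomoto.pl markup), a spider along these lines would do:

[sourcecode language="python"]
import scrapy

class CorollaSpider(scrapy.Spider):
    name = "corolla"
    # illustrative listing URL, filtered to pre-2005 Corollas
    start_urls = ["https://www.otomoto.pl/osobowe/toyota/corolla/do-2005"]

    def parse(self, response):
        # open every offer on the current listing page
        for href in response.css("a.offer-link::attr(href)").getall():
            yield response.follow(href, self.parse_offer)
        # then move on to the next page of results, if any
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_offer(self, response):
        # `scrapy crawl corolla -o offers.csv` dumps these dicts straight to csv
        yield {
            "price": response.css(".price::text").get(),
            "year": response.css(".year::text").get(),
            "mileage": response.css(".mileage::text").get(),
            "location": response.css(".location::text").get(),
        }
[/sourcecode]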

Data Analysis

As mentioned in the introduction, I will be using R to analyse the data. Be ready for a lot of graphs. I like graphs. Let's first find out where the cars come from:

country_of_origin

As expected, most cars come from Poland. The fact that a lot of cars come from Germany is also to be expected, since they are our neighbours and usually have better-quality cars. However, I did not expect them to be almost as numerous as the Polish ones. What could explain it is the large portion of cars whose origin is unknown; I would guess that most of them (if not all) come from Poland.

While we are on the subject of locations, let's check out where the offers are located:

location_vs_price

Again, no surprises. Most of them are around big cities like Warsaw or Krakow, but quite a few are spread around. On a side note, I must say creating this map with R was pleasantly quick and easy.

Next, let's look at the production year.

production_year

Hmm… I wonder what happened in 2002. The number of offers skyrocketed that year. I was expecting a more gradual rise. Now the question is: are the cars from before 2002 better, so people don't want to sell them, or worse, so no one wants to buy them?

Another important car feature is its mileage.

mileage

There is quite a spread; there was even a car with over 2 000 000 km on the clock! I decided to cut the histogram to show only the mileage that would potentially be interesting to me, i.e. less than 500 000 km. Still, the result is interesting, especially the cars with 1 km mileage. I chose to collect information on used cars, so I highly doubt that those cars really have 1 km on them! It would not be the first time people have fiddled with mileage numbers, but that is quite bold. Apart from that, it seems that most of the cars fall around the 100 000 km mark.

Now let's look at the most important aspect: the price (in PLN).

price_histogram

It seems that the price is spread out very evenly. I would say that I should be looking at a 10K – 15K price range. Not much more to say here.

Here is another one: the price vs the production year.

price_vs_production_year

I was hoping that some indication of the quality of the cars would be reflected in the mean price for a given production year, however the trend seems consistent and there are no anomalies, except for a small spike in 1999 (most probably caused by the small sample size).

I was curious whether there was a way to predict the price of a car based on the location where it is being sold. It would allow me to see how some machine learning tools work in R, as well as to create an interesting map. I didn't think there was a need to create an exact heatmap within the Polish borders, so I decided to just fit it as well as possible. The code is HERE, while the result is below:

location_vs_price_predicted

From the looks of it, there are some areas, where the cars could potentially be cheaper. The interesting fact is that most white areas are next to big cities but not precisely within them.
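The fit itself is in the R code linked above; purely as an illustration of the idea, a rough sketch in Python with sklearn (with placeholder column names rather than the actual layout of the collected csv) could simply average the prices of nearby offers:

[sourcecode language="python"]
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# placeholder file and column names; the real csv in the repo may differ
offers = pd.read_csv("offers.csv")
X = offers[["latitude", "longitude"]]
y = offers["price"]

# average the prices of the nearest offers, weighted by distance,
# to smooth out single listings
model = KNeighborsRegressor(n_neighbors=10, weights="distance")
model.fit(X.values, y.values)

# expected price at an arbitrary point, e.g. roughly central Warsaw
print(model.predict([[52.23, 21.01]]))
[/sourcecode]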

Here is the full REPO with R code and the collected data.

Conclusion

Overall I think this was a very fun exercise. I will definitely need to look around more. The final decision is always made after you actually see the car, but I now have some idea of what price and mileage range, as well as what seller locations, to look at. The most fun was of course playing with R. I will definitely come back to it, since its map tools are truly great.


Starting out: The Question

For the first few posts I decided it would be a good idea to build a machine learning model. I have seen several tutorials on this topic, but rarely do they show the iterative process. It will be an interesting journey for both the reader and the writer, because I have no idea whether there is a correlation between the features and the output, and therefore no idea whether the final model will be of any use.

First we have to start with a problem definition: what question are we trying to answer? One of the most interesting problems out there is predicting stock market movements. A lot of work has already been done in this area. You have day traders trying to make money on small price changes (a lot of automated trading systems work on this time scale). There are swing traders who try to feel out the momentum of the market. Then there are the long-term investors, trying to gauge the real value of a company and catch those that are currently undervalued.

However, this is not the type of prediction we will be interested in. I believe that the value of the stock market in many ways reflects the current mood of the population. If there is a crisis, stocks go down; if there is good news, stocks go up. By analysing the news we can try to work out the effect it will have on the stock market.

The analysis of the mood of a text is called sentiment analysis. There has already been a lot of work done in this field, and we will not be reinventing the wheel; it is a different and vast topic that would require a whole other project. We will use an existing library created by Stanford. They have created a command-line program written in Java whose models are trained on movie review data.
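As a rough sketch, one way to call that tool from Python might look like the following; the jar location, class name and flags here are assumptions based on the standard CoreNLP distribution, so check them against the version you download:

[sourcecode language="python"]
import subprocess

# assumed layout: the CoreNLP jars unpacked into ./stanford-corenlp/
# and one comment per line in comments.txt
output = subprocess.check_output([
    "java", "-cp", "stanford-corenlp/*",
    "edu.stanford.nlp.sentiment.SentimentPipeline",
    "-file", "comments.txt",
])
print(output)
[/sourcecode]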

There has also been a lot of work done on combining sentiment analysis with stock market prediction. There are companies that monitor Twitter and notify traders of any relevant events. We will not be interested in Twitter in our case: we don't have free access to the historical data, and I would like to try something a little different. We will use the comments on the /r/news subreddit. Hopefully the comments on the news will be a little more relevant to our problem space than 140-character tweets about Bieber's new album. I know it is possible to define filters on Twitter, but it would be a lot of work to specify all the keywords we would like to monitor. Besides, maybe there IS a correlation between Bieber fans and the stock market! We will leave that for another day. There are other possible subreddits that are even more financially specific, but /r/news has by far the biggest activity, so we will start there. I will use PRAW to access the Reddit API.
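As a small sketch of what fetching those comments with PRAW could look like (this follows the current PRAW API; the credentials are placeholders you would register with Reddit yourself):

[sourcecode language="python"]
import praw

# placeholder credentials; register an application with Reddit to get real ones
reddit = praw.Reddit(client_id="YOUR_ID",
                     client_secret="YOUR_SECRET",
                     user_agent="news-sentiment-experiment")

for submission in reddit.subreddit("news").hot(limit=25):
    # flatten the comment tree, skipping the "load more comments" stubs
    submission.comments.replace_more(limit=0)
    for comment in submission.comments.list():
        print(comment.body)
[/sourcecode]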

Finally, we have to decide which machine learning toolkit we will use. I will not go into a comparison of all the options here, since this is highly dependent on personal preference. I like Python with sklearn, and that is what I will use.

To sum up:

Question: can we predict stock market movements (in our case, the NASDAQ) using sentiment analysis of /r/news comments?

In the next post we will implement a way to download the needed data.
