September 2015 - jpintelli

I have recently encountered a problem when I tried to load a large CSV file into Python using Pandas. When I tried to load the file I was getting a message stating that the computer ran out of the memory. This was surprising because the file was just over 2 GB and I have well over 10 GB RAM machine. I was never able to figure out which part of pandas was causing this (and to be fair I didn’t want to waste my time on it) but I found a workaround that might be helpful to others not wanting to split the file in to smaller chunks and still load it in using pandas. Simply put – load the file several rows at a time, i.e. instead of splitting the file, slit the loading. Here is an example:

[sourcecode language=”python”]
data = pd.read_csv("train_data.csv", nrows = 50000)
data = data.append(pd.read_csv("train_data.csv", skiprows = 50001, header=None, names=[‘a’,’b’,’c’]))
[/sourcecode]

The file we load is called “train_data.csv”. I have noticed that the computer can handle around 1 GB loads at a time, which in this case corresponded to around 500K records. We load the first 500K records, then load the rest and append the created DataFrames. If your CSV is even larger then just put the read in a loop this way:

[sourcecode language=”python”]
for i in xrange((NUMBER_OF_RECORDS/MAX_RECORD_NUMBER)+(1 if NUMBER_OF_RECORDS%MAX_RECORD_NUMBER != 0 else 0)):
data1=pd.read_csv("your_file.csv", skiprows = MAX_RECORD_NUMBER*i,nrows=MAX_RECORD_NUMBER-1, header=None, names=[‘a’,’b’,’c’])
[/sourcecode]

Where NUMBER_OF_RECORDS is the number of records in the file and MAX_RECORD_NUMBER is the number of records that your computer can handle loading one at a time.

You also need to remember that if you skip lines the headers will be wrong. You can’t specify the header as index, because it takes the index of the already cut slice, not the original file. Just specify headers your self. This might, however, be problematic if you have a large number of columns but then you can simply save the column names from the first slice and use them in the following loads.

Introduction

In a past year I have been searching for a new car. Actually my first car ever. I am a huge supporter of public transport and a good bike but the long commute and low bus frequency has started to get to me. This meant time for research. Normally I am quick to make a decision. It took me around a month between making a decision to buy a flat and signing the deal. However buying a car scares me a little. The number of maintenance costs can get large, especially with the newer cars. I had a talk with my friend who is a car mechanic and he recommended to me to look into Toyota Corolla, especially the pre 2005 models, or even as far as pre 2000. Apparently they are as failure-free as you can get. I am totally green when it comes to cars, so I took his word for it and got cracking on the search. While doing the research I quickly got lost in the sheer number of data that is available and came to realisation that this would be a great data mining exercise! This was also a good opportunity to learn some R. I have heard that it has some very nice visualisation tools and wanted to check them out.

Data Collection

I have picked one of the bigger sites that lists used cars for data gathering called otomoto.pl. I live in Poland and that is where I am searching for the car. For website scraping I like to use a Python library called Scrapy. It is simple to use and has a good documentation. I will not get into the details of the implementation, since the program was relatively simple. I filtered the list to show the Toyota Corolla cars build prior to 2005, looped through pages and accessed each offer. From the offer page I have retrieved information most important to me and stored the results in a csv file. Since there was only about 400 offers, there really was no reason to use anything more complicated for storage than a simple csv file. Of course data is from running the script only once. If we where to collect the data over a longer period, there would be more.

CODE

Data Analysis

As mentioned in the introduction, I will be using R to analyse the data. Be ready for a lot of graphs. I like graphs. Lets first find out where the cars are coming from:

As expected, most cars come from Poland. The fact that a lot of cars come from Germany is also to be expected, since they are our neighbours and usual have better quality cars. However I did not expect them to be almost as much as from Poland. What could explain it, is a huge portion of cars whose origin is unknown. I would guess that most of them (if not all) come from Poland.

Since we are at locations, lets check out where the offers are located:

Again no surprises. Most of them are around big cities like Warsaw or Krakow, but quite a few are spread around. On a side note, I must say creating this map with R was pleasantly quick and easy.

Next lets look at the production year.

production_year

Hmm… I wonder what happened in 2002. The number of offers sky-rocked in that year. I was expecting a more gradual rise. Now, the question is, are the cars prior to 2002 better and people don’t want to sell them or worse and no one wants to buy them.

Another important car feature is its mileage.

mileage

There was quite a spread. There was even a car with over 2 000 000 km mileage! I have decided to cut histogram to show only the mileage that would potentially be interesting to me, i.e. less than 500 000 km. Still the result is interesting. Especially, the 1 km mileage cars. I have chose to collect information on used cars so I highly doubt that those cars really have 1 km mileage! Not the first time people tried to scam on the mileage numbers but that is quite bold. Apart from that it seems that most of the cars fall in the 100 000 km category.

Now lets look at the most important aspect – price (PLN).

It seems that the price is spread out very evenly. I would say that I should be looking at a 10K – 15K price range. Not much more to say here.

Here is another, the price vs the production year.

I was hoping that there will be some indication as to the quality of the car reflected in the mean price for a given production year, however the trend seems to be consistent and no anomalies are found, except for a small spike in the 1999 year (most probably caused by small sample size).

I was curious if there was a way to predict the price of a car based on the location where it is being sold. It would allow me to see how some machine learning tools work in R, as well as create an interesting map. I didn’t think there was a need to create an exact heatmap within the Polish contours, so I decided to just fit it as best as possible. The code is HERE, while the result is below:

From the looks of it, there are some areas, where the cars could potentially be cheaper. The interesting fact is that most white areas are next to big cities but not precisely within them.

Here is the full REPO with R code and the collected data.

Conclusion

Overall I think this was a very fun exercise. I will definitely need look around more. The final decision is always made after you actually see the car but I have some idea in what price and mileage range, as well the location of the seller to look at. The most fun was of course playing with R. I will definitely came back to it, since the map tools I truly great.

jpintelli

Menu

Month: September 2015

Loading large CSV file using Pandas

Searching for a car