I recently ran into a problem when trying to load a large CSV file into Python using pandas. Loading the file failed with a message saying that the computer had run out of memory. This was surprising, because the file was just over 2 GB and my machine has well over 10 GB of RAM. I never figured out which part of pandas was causing this (and, to be fair, I didn't want to waste my time on it), but I found a workaround that might help others who don't want to split the file into smaller chunks and still want to load it with pandas. Simply put: load the file several rows at a time, i.e. instead of splitting the file, split the loading. Here is an example:
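import pandas as pd
data = pd.read_csv("train_data.csv", nrows=500000)  # header line plus the first 500K records
# Skip the header line and the 500K records already loaded, then combine both parts.
data = pd.concat([data, pd.read_csv("train_data.csv", skiprows=500001, header=None, names=['a', 'b', 'c'])], ignore_index=True)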
The file we load is called “train_data.csv”. I noticed that the computer could handle loads of around 1 GB at a time, which in this case corresponded to around 500K records. We load the first 500K records, then load the rest and combine the resulting DataFrames. If your CSV is even larger, just put the read in a loop like this:
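chunks = []  # one DataFrame per slice
for i in range(NUMBER_OF_RECORDS // MAX_RECORD_NUMBER + (1 if NUMBER_OF_RECORDS % MAX_RECORD_NUMBER != 0 else 0)):
    chunks.append(pd.read_csv("your_file.csv", skiprows=1 + MAX_RECORD_NUMBER * i, nrows=MAX_RECORD_NUMBER, header=None, names=['a', 'b', 'c']))  # the 1 skips the header line
data = pd.concat(chunks, ignore_index=True)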
Here NUMBER_OF_RECORDS is the number of records in the file (not counting the header line) and MAX_RECORD_NUMBER is the number of records that your computer can handle loading at a time.
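If you don't know the number of records up front, you can count the lines of the file without parsing it. The following is only a minimal sketch: the function name `count_records` and its `has_header` parameter are made up for illustration, and it assumes that every record sits on exactly one line (no quoted fields with embedded line breaks).

```python
def count_records(path, has_header=True):
    """Count the data rows of a CSV file by counting its lines.

    Assumption: one record per line, i.e. no quoted field contains a
    line break. The file is streamed, so memory use stays small.
    """
    with open(path, "rb") as f:
        n_lines = sum(1 for _ in f)
    # Do not count the header line as a record.
    if has_header and n_lines > 0:
        return n_lines - 1
    return n_lines
```

The result of something like count_records("your_file.csv") could then serve as NUMBER_OF_RECORDS.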
You also need to remember that if you skip lines, the header line gets skipped too, so the column names will be wrong. You can’t specify the header by its index, because pandas takes the index within the already cut slice, not the original file. Just specify the column names yourself. This might be tedious if you have a large number of columns, but then you can simply save the column names from the first slice and reuse them in the following loads.
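Reusing the column names from the first slice could look roughly like this. It is a sketch, not a tested recipe for every file: the function name `load_in_slices` and its parameters are invented for illustration, and it assumes the file has exactly one header line and that you already know the number of records.

```python
import pandas as pd


def load_in_slices(path, n_records, max_records):
    """Load a CSV with one header line in slices of at most max_records rows."""
    # First slice: let pandas read the header and remember the column names.
    first = pd.read_csv(path, nrows=max_records)
    columns = list(first.columns)
    slices = [first]

    # Remaining slices: skip the header line plus the records already
    # loaded, and pass the saved column names explicitly.
    for start in range(max_records, n_records, max_records):
        slices.append(
            pd.read_csv(
                path,
                skiprows=start + 1,
                nrows=max_records,
                header=None,
                names=columns,
            )
        )

    # Combine the slices into one DataFrame with a fresh 0..n-1 index.
    return pd.concat(slices, ignore_index=True)
```

As a side note, pd.read_csv also accepts a chunksize argument that returns an iterator of DataFrames; I have not checked whether it avoids the memory problem described above.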