Monday, December 6, 2010

PHP - Managing Memory When Processing Large Files

Here's something interesting I found out about PHP while processing huge files: the garbage collector doesn't always work as intended, and the common tricks don't always work either.

What's even more frustrating is that the common method of reading a file line by line causes huge memory leaks.
Here are my findings and solutions (feel free to correct me if I'm wrong, even though this worked for me):

Common method (fopen + fgets) fails:
Using fopen() with fgets() to read, line by line, a file that contains almost a million lines causes a crazy memory leak. It takes only 10 seconds before the script consumes pretty much 100% of the system memory and goes into swap.
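For reference, the pattern in question is roughly this (a minimal sketch; "blah.xml" is just the example file name used further down):

$fh = fopen("blah.xml", "r");
while (($line = fgets($fh)) !== false) {
    //parse the line
}
fclose($fh);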
   My solution:
   Using "head -1000 {file} | tail -1000" to read the file in chunks is much less memory intensive. The exact number of lines per chunk varies depending on the system's speed; I had it set to 2000 and it ran very smoothly.

Garbage Collector fails:
PHP's garbage collector fails to clean up memory after each loop iteration, even if unset() is used (or the variables are set to null). The memory just keeps piling up. Unfortunately gc_collect_cycles(), which forces a garbage collection cycle to run, is only available in the PHP 5.3 branch.

Example Code:
for ($i=2000; $i<=1000000; $i+=2000) {
    // grab the next 2000-line chunk: lines ($i - 1999) through $i
    $data = explode("\n", shell_exec("head -$i blah.xml | tail -2000"));
    //parse using simplexml
    unset($data); // memory is NOT actually released here; usage keeps climbing
}
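For what it's worth, if you are on the 5.3 branch you could try forcing a collection cycle inside the loop; a minimal sketch, assuming gc_collect_cycles() is available:

for ($i=2000; $i<=1000000; $i+=2000) {
    $data = explode("\n", shell_exec("head -$i blah.xml | tail -2000"));
    //parse using simplexml
    unset($data);
    gc_collect_cycles(); // force a garbage collection cycle (PHP 5.3+ only)
}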

My Solution
You can effectively FORCE PHP to free the memory by wrapping the processing in a function, since PHP does clean up a function's local variables after each call returns. With the above example re-written this way, memory usage happily hovers around 0.5% constantly.

Example Code:
for ($i=2000; $i<=1000000; $i+=2000) {
    $data = shell_exec("head -$i blah.xml | tail -2000");
    process($data);
    unset($data);
}

function process($data) {
    // everything local to this function is freed when it returns
    $data = explode("\n", $data);
    //parse using simplexml
    unset($data);
}
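If you want to watch the numbers yourself, PHP's memory_get_usage() and memory_get_peak_usage() are handy for printing memory consumption from inside the loop; a quick sketch (the echo format is just an example):

for ($i=2000; $i<=1000000; $i+=2000) {
    $data = shell_exec("head -$i blah.xml | tail -2000");
    process($data);
    unset($data);
    echo "after line $i: " . memory_get_usage() . " bytes in use, peak " . memory_get_peak_usage() . "\n";
}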