I've repurposed this from my personal blog, where it was languishing. That blog has fallen into disrepair since I started putting more and more of my stuff onto Needcoffee, just like I urge everybody on our staff to do--because I'm a total nutjob. I remembered this post after HTQ4 asked me a question about his logs and how to deal with them. Anyway, when I tried to look at my own access logs, there was a boatload of crap in there and I couldn't even answer the question: what are the hogs that I need to deal with, and what can I change to keep the overhead of the site down?
The access log I was getting looked like this:
x.x.x.x - - [26/Apr/2007:00:36:50 -0700] "GET /wp-content/plugins/podpress/podpress_js.php HTTP/1.1" 200 2311 "http://www.needcoffee.com/2006/03/08/power-rangers-dino-thunder-vol-3-dvd-review/" "Mozilla/5.0 (Windows; U; Win98; en-US; rv:184.108.40.206) Gecko/20070309 Firefox/220.127.116.11"
Now, setting aside for a moment that someone is actually viewing a Power Rangers review and we must find them and stop them from breeding, imagine 10MB of that. That's how much I've got for a full day's access log, and that's after I've spent a few days optimizing my robots.txt file.
Well, the obvious thing would be to sort the log by the size of the file being requested. I've seen some sites promising perl scripts or whatever, and there are even analysis tools that tend to cost money, but I thought there had to be an easier way.
And here it is.
1. Take your access.log and open it in a text editor. Now, granted, if you're looking to open a 10MB access log, Wordpad will cough up a lung, so grab something like Editpad or the like, or just use a subset of the log.
2. Do a find and replace. You want to find a space, i.e. " ", and replace it with a comma, i.e. ",". Since we don't care about any data that would get screwed up by doing this, go for it--replace all.
3. Save the file with the suffix of .csv.
4. Open the file in Excel (or equivalent) as a text .csv file.
5. This should put the info into a spreadsheet where you should have a column for size. On my version, it's column H. Sort by H and take a look.
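If you'd rather skip the spreadsheet, the same sort can be sketched in a few lines of Python. This is only a rough sketch, not a proper log analyzer: it assumes your lines follow the same layout as the sample above (IP, two dashes, timestamp, quoted request, status, size), and the names `LOG_LINE`, `parse_line` and `largest_hits` are ones I made up for illustration. Lines that don't fit the pattern, such as requests containing escaped quotes, are simply skipped.

```python
import re
from collections import namedtuple

# Assumed layout, taken from the sample line in this post:
# ip ident user [timestamp] "request" status size ...
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

Hit = namedtuple("Hit", "ip request status size")


def parse_line(line):
    """Return a Hit for one log line, or None if the line doesn't fit."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    # A "-" in the size field means no body was sent; count it as zero.
    size = 0 if match.group("size") == "-" else int(match.group("size"))
    return Hit(match.group("ip"), match.group("request"),
               int(match.group("status")), size)


def largest_hits(lines, top=20):
    """Return the `top` biggest responses, largest first."""
    hits = [hit for hit in map(parse_line, lines) if hit is not None]
    return sorted(hits, key=lambda hit: hit.size, reverse=True)[:top]


# The sample line from this post, with the referrer and user agent
# trimmed down to placeholders.
sample = [
    'x.x.x.x - - [26/Apr/2007:00:36:50 -0700] '
    '"GET /wp-content/plugins/podpress/podpress_js.php HTTP/1.1" '
    '200 2311 "-" "-"',
]

for hit in largest_hits(sample):
    print(f"{hit.size:>10}  {hit.status}  {hit.request}")
```

To run it against a real log, pass an open file instead of `sample`, e.g. `largest_hits(open("access.log", errors="replace"))`. The file is read line by line, so a 10MB log shouldn't make anything cough up a lung.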
In my case, once I get past the podcasts and such that are supposed to be large I find...wow, holy crap: there's a JPG on here that's 73KB that flat out doesn't need to be.
Also, prototype.js, which WordPress uses for the admin panels, is about that size as well. I wish somebody would create a stripped down, no FX, just want to get the shit done WordPress admin theme, for those of us who...well, just want to get the shit done. (Update: Still haven't found this yet...pointers would be appreciated.)
Anyway, there you go. You should be able to sort by other things as well, like IP address (if you want to see how often one is hitting you, for example). Since I originally wrote this tip, I've also abandoned Podpress for being too bloated. I don't understand a plugin that loads itself for every page whether it's actually needed or not.
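For the IP address case, counting is probably more useful than sorting. Here's a minimal sketch in Python, assuming the IP is the first whitespace-separated field on each line, as it is in the sample above. The function names and the demo addresses (taken from the 192.0.2.x range reserved for documentation) are made up for illustration, not pulled from a real log.

```python
from collections import Counter


def hits_per_ip(lines):
    """Count how many log lines start with each client address."""
    counts = Counter()
    for line in lines:
        fields = line.split(None, 1)  # the first field is assumed to be the IP
        if fields:                    # skip blank lines
            counts[fields[0]] += 1
    return counts


def busiest_ips(lines, top=10):
    """Return (ip, hits) pairs for the `top` most frequent addresses."""
    return hits_per_ip(lines).most_common(top)


# Hypothetical demo lines, cut down to the fields that matter here.
demo = [
    '192.0.2.1 - - [26/Apr/2007:00:36:50 -0700] "GET / HTTP/1.1" 200 100',
    '192.0.2.2 - - [26/Apr/2007:00:36:51 -0700] "GET / HTTP/1.1" 200 100',
    '192.0.2.1 - - [26/Apr/2007:00:36:52 -0700] "GET / HTTP/1.1" 200 100',
]

for ip, hits in busiest_ips(demo):
    print(f"{hits:>6}  {ip}")
```

As with the size sort, an open log file can be passed in place of `demo`.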