Speed up bash scripts that use grep often

The other day I ran into a very odd situation, but let's start from the beginning:
I wrote a bash script that had to run several regular-expression greps over several text files.
So far so good, but ever since the first version the script had always needed several minutes. I never questioned this long execution time, because the script runs on a remote server and also runs some MySQL queries on another server.
That changed some days ago, when I wanted to debug a minor error with bash -xv to see every instruction as it is executed.
To read the huge output more easily, I logged into the remote server from the Emacs editor, launched the script, and had all the output in my Emacs window within a few seconds. Wait, it only needed a few seconds?!
What had happened? I rechecked the execution, with exactly the same parameters to the script, from my Emacs shell and from an ssh terminal.
Same result: within Emacs it took only some seconds, within the ssh terminal it took minutes.

So what's going on here?
Investigating the processes on the remote server with htop showed that the grep commands consumed almost 100% of the CPU for some time.
But when launched from emacs they did not.
What might be different?
Well, what about the environment?
Emacs might use its own set of environment variables, different from a terminal's.
Comparing the output of export in both, I could see that the locale was different: Emacs was using C, whereas the terminal was using the English en_UK.UTF8.
Issuing an export LC_ALL=C in the terminal and then launching the script ... only a few seconds.
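To check whether the locale makes a difference on your own machine, you can time the same grep twice, once with the inherited locale and once with LC_ALL=C. The following is only a minimal sketch: the test file, its contents and the pattern are made up for illustration, and it assumes bash (for the time keyword) plus the seq and mktemp utilities. If your shell already runs in the C locale, both timings will look the same.

```shell
# Build a throwaway test file (hypothetical data, only for this comparison).
tmpfile=$(mktemp)
for i in $(seq 1 2000); do
    echo "line $i: user$i@example.com"
done > "$tmpfile"

# Show the locale settings that grep inherits from the current shell.
locale

# Time the same case-insensitive regex under the inherited locale ...
time grep -ci 'USER[0-9]*5@' "$tmpfile"
# ... and again with the C locale forced for this one command only.
time LC_ALL=C grep -ci 'USER[0-9]*5@' "$tmpfile"

# Keep the match count from the C-locale run, then clean up.
c_count=$(LC_ALL=C grep -ci 'USER[0-9]*5@' "$tmpfile")
rm -f "$tmpfile"
```

With a file this small the difference may be hard to see, so for a realistic comparison point the two grep calls at one of your own large text files.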
So it seems grep is sensitive to the locale of the shell.
Searching the web confirmed this: people report speed-ups of up to 10 times when using the C locale with grep, as it doesn't have to convert anything; it just compares a plain ASCII set of characters, which in most cases is enough.
Also, grep is slightly faster when comparing fixed strings if you use the -F flag or call it as fgrep (out of laziness I always used plain grep in such cases).
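Besides speed, -F also changes what matches, which is worth knowing before switching. Here is a small sketch with a made-up log file (the file contents are hypothetical) that shows the difference between a regex pattern and a fixed string:

```shell
# Hypothetical log file, only for illustration.
logfile=$(mktemp)
printf '%s\n' 'GET /index.html 200' 'GET /indexXhtml 200' 'GET /about.html 404' > "$logfile"

# As a regex, the dot matches any character, so both index lines match.
regex_hits=$(LC_ALL=C grep -c 'index.html' "$logfile")

# With -F the pattern is a literal string, so only the real file name matches.
fixed_hits=$(LC_ALL=C grep -cF 'index.html' "$logfile")

echo "regex: $regex_hits, fixed: $fixed_hits"
rm -f "$logfile"
```

So when the search term is a plain string anyway, -F is both the faster and the more precise choice.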

Instructions for speeding up grep

At the beginning of my scripts, I now put these instructions, which force the C locale for all three flavours of grep:
shopt -s expand_aliases
for g in "" e f; do
    alias ${g}grep="LC_ALL=C ${g}grep"  # speed-up grep commands by not considering locale.
done
Simply putting export LC_ALL=C at the beginning of the script does not work.
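If you prefer not to depend on alias expansion, a shell function can do the same job. This is an alternative sketch and not what I use in my scripts: it would replace the aliases rather than sit next to them, since an existing grep alias gets expanded in the function definition and breaks it. Only grep is wrapped here; egrep and fgrep could be wrapped the same way.

```shell
# Alternative sketch: wrap grep in a function instead of an alias.
# `command` skips the function itself and runs the real grep binary,
# with LC_ALL=C set only for that single invocation.
grep() {
    LC_ALL=C command grep "$@"
}

# The call sites in the script stay unchanged.
matches=$(printf '%s\n' 'alpha' 'beta' 'gamma' | grep -c '^b')
echo "$matches"
```

Unlike aliases, functions also work in non-interactive shells without shopt -s expand_aliases.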
