Monday, August 17, 2009

GNU sed (Stream EDitor)

sed -r 's/\t+/,/g'


sed    invoke the stream editor
-r     use extended regular expressions (similar to using the -E argument for grep). This gives meaning to the '+' character in my regex.
s      tells sed that we are doing a replacement ("substitution") operation
\t+    match one or more consecutive tab characters
,      replace each match with a comma
g      do this substitution for every match on each line, not just the first
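
Here is a minimal sketch of running the command on an actual file, assuming GNU sed (the \t escape in the regex is a GNU extension). The file names and sample data are made up for illustration. It also shows one thing worth knowing about \t+: a run of tabs collapses into a single comma, so empty fields disappear. If the data can contain empty fields, dropping the '+' keeps them.

```shell
# Hypothetical scratch directory and file names, for illustration only.
workdir=$(mktemp -d)

# Sample row with an empty field between "age" and "city" (two tabs in a row).
printf 'name\tage\t\tcity\n' > "$workdir/input.tsv"

# The command from this post: each run of tabs becomes ONE comma.
sed -r 's/\t+/,/g' "$workdir/input.tsv" > "$workdir/output.csv"

# Variant without '+': every single tab becomes a comma, so empty fields survive.
sed -r 's/\t/,/g' "$workdir/input.tsv" > "$workdir/output_keep_empty.csv"

cat "$workdir/output.csv"             # name,age,city
cat "$workdir/output_keep_empty.csv"  # name,age,,city
```

Neither variant quotes fields that already contain commas, so this only works cleanly when the data itself has no commas in it.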

So, today I had a problem.  A friend needed me to convert a 10 MB data file from tab-separated format to comma-separated format.

"This should take about 2 seconds."

I wasn't on my trusty little laptop (running Ubuntu 9.04 Jaunty Jackalope since March) and was stuck using a lab computer on campus, which was, of course, running Windows XP with no useful utilities whatsoever.  To save some time, I tried to do the conversion right on my friend's computer.  We opened the document in MS Word and tried a Find and Replace to convert the tabs to commas.

Slow.  Killed the program several minutes into the operation.

Next, over to my trusty laptop.  Loaded up jEdit, a handy programming editor that has done well for me in the past.  Tried to do the find and replace.

Also slow.  Killed this about 10 minutes into the operation.  "It really shouldn't be taking this long."  What went wrong?  jEdit was out of memory.  I found that out from the command-line terminal where I had launched jEdit.  Hmmm... Maybe some kind of error box would have been nice so I didn't just sit there for 10 minutes wondering. ;)

No more of this garbage.  We're going to the command line.

Always go to the command line.

I already knew about sed, but my memory was a little rusty on the command-line arguments.  After about 10 minutes, I finally found what I was looking for.

Converted the file in about 2 seconds.

Why is it that something that should take 2 seconds always takes 30 minutes?