About regex: Perl-like shorthand character class not working inside bracket expression

Question Detail

\s does not seem to work with

sed 's/[\s]\+//' tempfile

while it is working for

sed 's/[ ]\+//' tempfile

I am trying to remove the whitespace that appears at the beginning of each line because of this command:

nl -s ') ' file > tempfile  

e.g. file:

A Storm of Swords, George R. R. Martin, 1216
The Two Towers, J. R. R. Tolkien, 352
The Alchemist, Paulo Coelho, 197
The Fellowship of the Ring, J. R. R. Tolkien, 432
The Pilgrimage, Paulo Coelho, 288
A Game of Thrones, George R. R. Martin, 864

tempfile:

     1) A Storm of Swords, George R. R. Martin, 1216
     2) The Two Towers, J. R. R. Tolkien, 352
     3) The Alchemist, Paulo Coelho, 197
     4) The Fellowship of the Ring, J. R. R. Tolkien, 432
     5) The Pilgrimage, Paulo Coelho, 288
     6) A Game of Thrones, George R. R. Martin, 864

i.e. there are spaces before the numbers.

Please explain where the whitespace comes from and why \s does not work.

Question Answer

The reason is simple: a POSIX regex engine does not treat Perl-like shorthand character classes as such inside bracket expressions.

See this reference:

One key syntactic difference is that the backslash is NOT a metacharacter in a POSIX bracket expression. So in POSIX, the regular expression [\d] matches a \ or a d.

So, [\s] in a POSIX regex matches one of two symbols: either \ or s.

Consider the following demo:

echo 'ab\sc' | sed 's/[\s]\+//'

The output is abc: the \s substring is removed because both the backslash and the s match the bracket expression.
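The same behavior explains what happens on the numbered lines: when the input contains no backslash, the bracket expression simply matches the first lowercase s on each line. A minimal sketch, assuming GNU sed (the \+ quantifier is a GNU extension in basic regular expressions) and using one sample line with a single leading space:

```shell
# [\s] is a two-character set: a literal backslash or the letter s.
line=' 1) The Two Towers, J. R. R. Tolkien, 352'

# The leading space is left alone; the first "s" (the end of "Towers") is deleted.
result=$(printf '%s\n' "$line" | sed 's/[\s]\+//')
printf '%s\n' "$result"
```

The leading space survives and "Towers" loses its final letter, which is why the command appears to do nothing about the indentation.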

Consider using POSIX character classes instead of Perl-like shorthands:

echo 'ab\s c' | sed 's/[[:space:]]\+//'

The output is ab\sc: only the space is removed. POSIX character classes are written as [:<NAME_OF_CLASS>:], and they can only be used inside bracket expressions.

NOTE: if you want to make sure the spaces at the start of the line are removed, add ^ at the pattern start:

sed 's/^[[:space:]]\+//'
       ^ 
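Putting the anchored pattern together with the nl command from the question might look like the sketch below. It assumes GNU sed for \+, and the two input lines are just a stand-in for the real file, fed through standard input so nothing is written to disk:

```shell
# nl right-aligns the number in a 6-column field by default,
# so each line starts with padding spaces; the anchored pattern strips only those.
result=$(printf '%s\n' \
    'The Alchemist, Paulo Coelho, 197' \
    'The Pilgrimage, Paulo Coelho, 288' |
  nl -s ') ' |
  sed 's/^[[:space:]]\+//')

printf '%s\n' "$result"
```

Because of the ^ anchor, whitespace further along the line (for example, after the commas) is never touched.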

MORE PATTERNS:

  • \w = [[:alnum:]_]
  • \W = [^[:alnum:]_]
  • \d = [[:digit:]] (or [0-9])
  • \D = [^[:digit:]] (or [^0-9])
  • \h = [[:blank:]]
  • \S = [^[:space:]]
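As a quick check of two of these equivalents, here is a small sketch (assuming GNU sed for \+; the input strings are made up for illustration):

```shell
# \d+ equivalent: delete the first run of digits.
digits=$(echo 'page 352 of 864' | sed 's/[[:digit:]]\+//')

# \W equivalent: replace every character that is not alphanumeric or "_" with "-".
words=$(echo 'A Game, of_Thrones' | sed 's/[^[:alnum:]_]/-/g')

printf '%s\n' "$digits" "$words"
```

The first command leaves "page  of 864" (the two spaces around the removed number remain), and the second produces "A-Game--of_Thrones".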

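The leading spaces come from nl itself: by default it right-aligns each line number in a field 6 characters wide. You could also format the numbers without a fixed width. From coreutils.info: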

‘-w NUMBER’
‘--number-width=NUMBER’
     Use NUMBER characters for line numbers (default 6).

E.g.:

nl -w 1 -s ') ' infile

Output:

1) A Storm of Swords, George R. R. Martin, 1216
2) The Two Towers, J. R. R. Tolkien, 352
3) The Alchemist, Paulo Coelho, 197
4) The Fellowship of the Ring, J. R. R. Tolkien, 432
5) The Pilgrimage, Paulo Coelho, 288
6) A Game of Thrones, George R. R. Martin, 864
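Note that the width given to -w acts as a minimum, so numbers with more digits are still printed in full. A small sketch to confirm this, assuming GNU coreutils nl and using seq to generate ten placeholder lines:

```shell
# Number ten lines with a minimum width of 1 and keep only the last one.
last=$(seq 10 | nl -w 1 -s ') ' | tail -n 1)
printf '%s\n' "$last"
```

The last line comes out as "10) 10", with no truncation and no leading padding. The trade-off is that the text no longer lines up once the numbers reach two digits.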
