About regex : grep-valid-domain-regex-duplicate

Question Detail

I’m trying to write a regex for grep that matches only valid domain names.

My version works pretty well, but it also matches the following invalid domain:

@subdom..dom.ext

Here is my regex :

echo "@dom.ext" | grep "^@[[:alnum:]]\+[[:alnum:]\-\.]\+[[:alnum:]]\+\.[[:alpha:]]\+\$"

I’m working with bash so I escaped special characters.

Samples that should match:

@subdom.dom.ext
@subsubdom.subdom.dom.ext
@subsub-dom.sub-dom.ext

Thanks for your help.

Question Answer

A truly complete solution requires more work, but here’s an approximation that may work well enough (note that a @ prefix is assumed and the input string is expected to start with it):

^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$

You can use this with egrep (or grep -E), but also with [[ ... =~ ... ]], bash’s regex-matching operator.
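As a rough illustration of the grep -E route, the sketch below filters a list of candidate names and prints only those the regex accepts. The function name filter_domains and the sample input are made up for this example. Single quotes keep the shell from touching the backslashes and the trailing $.

```shell
# Hypothetical helper: reads candidate names on stdin (one per line)
# and prints only those matching the approximate domain regex.
filter_domains() {
  grep -E '^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$'
}

# Prints the second and third names; the first has an empty label.
printf '%s\n' '@subdom..dom.ext' '@dom.ext' '@subsub-dom.sub-dom.ext' | filter_domains
```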

The regex makes the following assumptions, some of which are more permissive than actual DNS name constraints:

  • Only ASCII letters are allowed (see below for Internationalized Domain Name (IDN) considerations). Also, the Punycode (ASCII-compatible) forms of IDNs, e.g. xn--bcher-kva.ch for bücher.ch, are not matched; see below.

  • There’s no limit on the number of nested subdomains.

  • There’s no limit on the length of any label (name component), and no limit on the overall length of the name (actual DNS limits: at most 63 octets per label and 253 characters for the full name in its textual form).

  • The TLD (last component) is composed of letters only and has a length of at least 2.

  • Both subdomain and domain names must start with a letter; subdomains are allowed to be single-letter.
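
If the missing length limits matter, one possible approach is to run the regex first and check lengths separately. The sketch below assumes the usual DNS limits of 63 characters per label and 253 characters for the whole name. The function name is_valid_domain is invented for this example, and the code is not part of the regex above.

```shell
# Hypothetical helper: succeeds only if "$1" matches the regex above
# AND stays within the assumed limits (63 per label, 253 overall).
is_valid_domain() {
  name=${1#@}   # drop the assumed leading "@" (note: 'name' is a global variable)
  printf '%s\n' "$1" |
    grep -Eq '^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$' || return 1
  # Overall length check; the regex has already restricted the name to ASCII.
  [ "${#name}" -le 253 ] || return 1
  # Reject any label (run of non-dot characters) longer than 63.
  if printf '%s\n' "$name" | grep -Eq '[^.]{64,}'; then
    return 1
  fi
  return 0
}

long_label=$(printf '%064d' 0 | tr 0 a)   # a label of 64 "a" characters
for d in '@subdom.dom.ext' "@${long_label}.com"; do
  if is_valid_domain "$d"; then echo YES; else echo NO; fi
done
```

This prints YES for the first name and NO for the second, whose first label is one character too long.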

Here’s a quick test:

for d in @subdom..dom.ext @dom.ext @subdom.dom.ext @subsubdom.subdom.dom.ext @subsub-dom.sub-dom.ext @x.org; do
 [[ $d =~ \
    ^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$ \
 ]] && echo YES || echo NO
done

Support for Internationalized Domain Names (IDN) with literal Unicode characters – again, a complete solution requires more work:

A simple improvement to also match IDNs is to replace [a-zA-Z] with [[:alpha:]] and [a-zA-Z0-9] with [[:alnum:]] in the above regex; i.e.:

^@(([[:alpha:]](-?[[:alnum:]])*)\.)+[[:alpha:]]{2,}$
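
A small sketch for trying this variant with grep -E follows. The helper name matches_idn is made up for this example. Whether a name such as @bücher.ch is accepted depends on the active locale (LC_ALL, LC_CTYPE, LANG) and on the platform, so no expected output is given for it.

```shell
# Hypothetical helper using the POSIX-class variant of the regex.
# What [[:alpha:]] and [[:alnum:]] match depends on the current locale.
matches_idn() {
  printf '%s\n' "$1" |
    grep -Eq '^@(([[:alpha:]](-?[[:alnum:]])*)\.)+[[:alpha:]]{2,}$'
}

for d in '@dom.ext' '@bücher.ch' '@subdom..dom.ext'; do
  if matches_idn "$d"; then echo "$d: YES"; else echo "$d: NO"; fi
done
```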

Caveats:

  • No attempt is made to recognize Punycode-encoded versions of IDNs, which use an ASCII-based encoding with prefix xn--, and which would require decoding afterwards.

  • As Patrick Mevzek points out, the above can yield both false negatives and false positives (using his examples):

    • False positive: an invalid Punycode-encoded name such as ab--whatever
    • False positive: Invalid cross-language names; e.g., cαfe.fr, which uses a Greek letter in a French domain name – a rule that is impossible to enforce via a regex alone.
    • False negative: emoji-based names such as 💄.ws (xn--jr8h.ws)
    • False negative: பரிட்சை is a valid TLD in IANA root today, but will not match [[:alpha:]]{2,}$
    • … and many more
  • Not all Unix-like platforms fully support all Unicode letters when matching against [[:alpha:]] or [[:alnum:]]. For instance, using UTF-8-based locales, OS X 10.9.1 apparently only matches Latin diacritics (e.g., ü, á) and Cyrillic characters (in addition to ASCII), whereas Linux 3.2 laudably appears to cover all scripts, including Asian and Arabic ones.

  • I’m unclear on whether names in right-to-left writing scripts are properly matched.

  • For the sake of completeness: even though the regex above makes no attempt to enforce length limits, attempting to do so with IDNs would be much more complex, as the length limits apply to the ASCII encoding of the name (via Punycode), not the original.
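
To make the Punycode caveat concrete: the -? construct in the regex allows only single hyphens between alphanumerics, so the xn-- prefix is rejected. The sketch below compares the regex with a relaxed variant that accepts any letter-digit-hyphen label not starting or ending with a hyphen. That relaxed variant is an assumption for illustration only: it also accepts labels starting with a digit and invalid names such as ab--whatever, it does no Punycode decoding, and it still requires a letters-only TLD.

```shell
# Regex from the answer: rejects labels containing "--".
strict='^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$'
# Assumed relaxed variant (not from the answer): letter-digit-hyphen labels
# that neither start nor end with a hyphen.
relaxed='^@([a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}$'

# Hypothetical helper: matches the name "$2" against the extended regex "$1".
matches() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

for re in "$strict" "$relaxed"; do
  if matches "$re" '@xn--bcher-kva.ch'; then echo YES; else echo NO; fi
done
```

This prints NO for the original regex and YES for the relaxed one.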

Tip of the hat to @Alfe for pointing out the problem with IDNs, and to @Arka for offering a simplified version of the regex to replace the lengthier one I had initially crafted under the mistaken assumption that single-letter domain names must be ruled out.

Use

grep '@[[:alpha:]][[:alnum:]-]*\.[[:alpha:]][[:alnum:]-]*\.[[:alpha:]][[:alnum:]-]*$'

echo "@dom.ext" | grep -E '^@[a-zA-Z0-9]+([-.]?[a-zA-Z0-9]+)*\.[a-zA-Z]+$'

This did the job.
