email regex

2019-12-19

email regex - for when you need to find email addresses in a long document or validate HTML forms

This was also one of the Google Code-in tasks.

Basic regex syntax:

\s - matches a whitespace character
() - capturing group
[] - matches any single character from the set or ranges listed in the brackets
[a-zA-Z0-9] - matches one alphanumeric character
[]+ - matches characters from the given set one or more times
(L|R) - a group with alternation, matches either the left or the right side expressed inside the parentheses
[]* - matches characters from the given set zero or more times
[]? - matches a character from the given set once or not at all
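
The constructs above can be tried out quickly in Python's standard re module. This is only a small illustrative sketch; the sample strings and variable names are made up for the demonstration.

```python
import re

# \s matches one whitespace character (space, tab, newline, ...).
whitespace = re.findall(r"\s", "a b\tc")

# () captures the text matched inside the parentheses.
captured = re.search(r"([a-z]+)@", "user@host").group(1)

# [a-zA-Z0-9] matches exactly one alphanumeric character.
single = re.findall(r"[a-zA-Z0-9]", "a-1_B")

# []+ matches one or more characters from the set, as many as possible.
runs = re.findall(r"[a-zA-Z0-9]+", "ab-12_Cd")

# (L|R) matches either the left or the right alternative.
alternation = [m.group(1) for m in re.finditer(r"(cat|dog)", "cat, dog, cow")]

# []* also accepts zero characters; []? accepts zero or one.
star = re.fullmatch(r"x[0-9]*", "x") is not None
optional = [bool(re.fullmatch(r"x[0-9]?", s)) for s in ("x", "x1", "x12")]
```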

According to RFC 5322, section 3.4.1, an e-mail address is specified as follows (in ABNF):

addr-spec      =    local-part "@" domain
local-part     =    dot-atom / quoted-string
domain         =    dot-atom / domain-literal
domain-literal =    [CFWS] "[" *([FWS] dtext) [FWS] "]" [CFWS]
dtext          =    %d33-90 / %d94-126 ; printable ASCII not including "[", "]", or "\"

The obsolete alternatives (the obs- rules) are omitted.

A dot-atom may not begin with a dot. It consists of:

any alphanumeric character or any of "!#$%&'*+-/=?^_`{|}~."
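
For reference, RFC 5322 defines the dot-atom more strictly than "may not begin with a dot": its rule dot-atom-text = 1*atext *("." 1*atext) also rules out a trailing dot and two dots in a row. A minimal Python sketch of that stricter check could look like this; the names ATEXT_SPECIALS, ATEXT, DOT_ATOM and is_dot_atom are chosen here for illustration only.

```python
import re

# Special characters allowed in a dot-atom besides letters and digits.
# The dot is handled separately because it may only separate runs.
ATEXT_SPECIALS = "!#$%&'*+-/=?^_`{|}~"

# re.escape keeps characters such as "-" and "^" literal inside the class.
ATEXT = "[A-Za-z0-9" + re.escape(ATEXT_SPECIALS) + "]"

# dot-atom-text = 1*atext *("." 1*atext)
DOT_ATOM = re.compile(ATEXT + r"+(?:\." + ATEXT + r"+)*")


def is_dot_atom(text):
    """Return True if the whole string is a valid dot-atom."""
    return DOT_ATOM.fullmatch(text) is not None
```

With this check, "john.doe" and "o'neil+tag" pass, while ".john", "john." and "john..doe" are rejected.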

Domain name syntax is specified elsewhere; for simplicity, let's assume only alphanumeric characters and dashes are allowed in domain names. Domain names may also contain subdomains, so we should match dots too. A quoted-string in the local part of the address and a domain-literal in the domain part are rarely used, especially in web forms.

In practice, matching the following pattern should be sufficient (in EBNF):

email-address =    dot-atom, "@", domain
domain        =    alphanumeric, {["." | "-"], alphanumeric}, ".", alphanumeric, {alphanumeric}
dot-atom      =    (legal-char - "."), {legal-char}
legal-char    =    alphanumeric | ? any of "!#$%&'*+-/=?^_`{|}~." ?
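
One way to see how this grammar maps onto a regex is to translate it rule by rule. The Python sketch below does that literally; the names ALNUM, SPECIALS, LEGAL, DOT_ATOM, DOMAIN, EMAIL and matches are illustrative, and the result is a direct reading of the EBNF rather than a complete validator.

```python
import re

ALNUM = "[A-Za-z0-9]"
SPECIALS = re.escape("!#$%&'*+-/=?^_`{|}~")

# legal-char = alphanumeric | any of the special characters, including "."
LEGAL_NO_DOT = "[A-Za-z0-9" + SPECIALS + "]"
LEGAL = "[A-Za-z0-9." + SPECIALS + "]"

# dot-atom = (legal-char - "."), {legal-char}
DOT_ATOM = LEGAL_NO_DOT + LEGAL + "*"

# domain = alphanumeric, {["." | "-"], alphanumeric}, ".", alphanumeric, {alphanumeric}
DOMAIN = ALNUM + "(?:[.-]?" + ALNUM + ")*" + r"\." + ALNUM + "+"

# email-address = dot-atom, "@", domain
EMAIL = re.compile(DOT_ATOM + "@" + DOMAIN)


def matches(address):
    """Return True if the whole string fits the simplified grammar."""
    return EMAIL.fullmatch(address) is not None
```

Under this reading, "john.doe@example.com" and "john@sub.example.co.uk" match, while ".john@example.com", "john@example" and "john@-example.com" do not.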

Therefore the following regex should match most e-mail addresses in a practical scenario:

([#-'\/-9A-Z^-~!*+=?-][#-'--9A-Z^-~!*+=?]*@[a-zA-Z0-9]([.-]?[a-zA-Z0-9]+)*\.[a-zA-Z]+)

Some additional explanations regarding this regex:

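@ - matches a literal "@"
\. - matches a literal "."
--9 - inside a character class, the range from "-" to "9", which covers "-", ".", "/" and the digits
[a-zA-Z0-9]([.-]?[a-zA-Z0-9]+)* - matches runs of alphanumeric characters, each optionally preceded by a single "." or "-" (to get the domain name and subdomain names)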

The first part matches a dot-atom: one legal character other than a dot, followed by any number of legal characters. The two parts are separated by "@". The second part matches runs of alphanumeric characters separated by single dots or dashes. The string must end with a dot followed by a top-level domain, which the regex restricts to letters.
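
Both use cases from the introduction, searching a document and validating a form field, can be sketched in Python with this regex. The function names find_emails and is_valid_email are made up for the example. The only change to the pattern is the escaped dash in "\--9", which keeps the same range but avoids the FutureWarning that Python's re module issues for "--" inside a character class.

```python
import re

# The regex from this post, split over two lines for readability.
EMAIL_RE = re.compile(
    r"([#-'\/-9A-Z^-~!*+=?-][#-'\--9A-Z^-~!*+=?]*"
    r"@[a-zA-Z0-9]([.-]?[a-zA-Z0-9]+)*\.[a-zA-Z]+)"
)


def find_emails(text):
    """Return every substring of text that the pattern matches."""
    # finditer with group(0) is used because findall would return
    # a tuple per match (the pattern has two capture groups).
    return [m.group(0) for m in EMAIL_RE.finditer(text)]


def is_valid_email(candidate):
    """Check that the whole string, not just a part of it, matches."""
    return EMAIL_RE.fullmatch(candidate) is not None


sample = "Contact john.doe@example.com or jane_roe+news@mail.example.org."
found = find_emails(sample)
```

Note that the pattern is deliberately permissive: for example, it accepts "john..doe@example.com", although two dots in a row are not allowed in an RFC 5322 dot-atom. The nested repetition in the domain part can also backtrack heavily on long inputs that almost match, so it may be worth limiting input length when validating untrusted data.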