email regex
2019-12-19
An email regex is useful when you need to find email addresses in a long document or validate HTML forms.
This was also one of the Google Code-in tasks.
Basic regex syntax:
\s - matches a whitespace character
() - capturing group
[] - matches any single character from the set or ranges listed in the brackets
[a-zA-Z0-9] - matches a single alphanumeric character
[]+ - matches the given set one or more times
(L|R) - a group with alternation, matches either the left or the right side expressed inside the parentheses
[]* - matches the given set zero or more times
[]? - matches the given set once or not at all
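The syntax elements listed above can be tried out with a few lines of Python. This is only a small sketch using the standard `re` module; the sample strings are made up for illustration.

```python
import re

# Each entry: (pattern, a string it should fully match, a string it should reject).
examples = [
    (r"\s", " ", "x"),            # one whitespace character
    (r"[a-zA-Z0-9]", "q", "_"),   # one alphanumeric character
    (r"[a-z]+", "abc", ""),       # one or more times
    (r"[a-z]*", "", "A"),         # zero or more times
    (r"[a-z]?", "a", "ab"),       # once or not at all
    (r"(L|R)", "L", "LR"),        # alternation inside a group
]

for pattern, good, bad in examples:
    assert re.fullmatch(pattern, good) is not None
    assert re.fullmatch(pattern, bad) is None

# A group captures the text it matched, so it can be read back later.
m = re.fullmatch(r"([a-z]+)\s([a-z]+)", "hello world")
print(m.group(1), m.group(2))
```

`re.fullmatch` is used so that the whole string has to match the pattern, which makes the difference between `+`, `*` and `?` easy to see.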
According to RFC 5322, section 3.4.1, an e-mail address is specified as follows (in ABNF):
addr-spec = local-part "@" domain
local-part = dot-atom / quoted-string
domain = dot-atom / domain-literal
domain-literal = [CFWS] "[" *([FWS] dtext) [FWS] "]" [CFWS]
dtext = %d33-90 / %d94-126 ; printable ASCII not including "[", "]", or "\"
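The obsolete alternatives (the obs-* rules) are omitted.
A dot-atom may not begin with a dot (strictly, RFC 5322 also forbids a trailing dot and consecutive dots, which is ignored here for simplicity). It consists of: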
any alphanumeric character or any of "!#$%&'*+-/=?^_`{|}~."
Domain name syntax is specified elsewhere; for simplicity, let's assume only alphanumeric characters and dashes are allowed in domain names. Domain names may also contain subdomains, so we should match dots too. A quoted-string in the local part of an email address and a domain-literal in the domain part are rarely used, especially in web forms.
In practice, matching the following pattern should be sufficient (in EBNF):
email-address = dot-atom, "@", domain
domain = alphanumeric, {["." | "-"], alphanumeric}, ".", alphanumeric, {alphanumeric}
dot-atom = (legal-char - "."), {legal-char}
legal-char = alphanumeric | ? any of "!#$%&'*+-/=?^_`{|}~." ?
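To make the grammar concrete, here is a rough Python sketch that checks a string against the EBNF above without using a regex. The function names (`is_dot_atom`, `is_domain`, `is_email`) are made up for this example, and the code follows the simplified EBNF literally, not the full RFC.

```python
import string

ALNUM = set(string.ascii_letters + string.digits)
LEGAL = ALNUM | set("!#$%&'*+-/=?^_`{|}~.")


def is_dot_atom(s):
    # dot-atom = (legal-char - "."), {legal-char}
    return bool(s) and s[0] != "." and all(c in LEGAL for c in s)


def is_domain(s):
    # domain = alphanumeric, {["." | "-"], alphanumeric}, ".", alphanumeric, {alphanumeric}
    head, dot, tld = s.rpartition(".")
    # The part after the last dot must be one or more alphanumerics.
    if not dot or not tld or not all(c in ALNUM for c in tld):
        return False
    # The part before it must start and end with an alphanumeric ...
    if not head or head[0] not in ALNUM or head[-1] not in ALNUM:
        return False
    # ... and may contain single "." or "-" separators in between.
    prev_sep = False
    for c in head:
        if c in ".-":
            if prev_sep:
                return False
            prev_sep = True
        elif c in ALNUM:
            prev_sep = False
        else:
            return False
    return True


def is_email(s):
    # email-address = dot-atom, "@", domain
    local, at, domain = s.partition("@")
    return bool(at) and is_dot_atom(local) and is_domain(domain)


print(is_email("john.doe@example.com"))
print(is_email(".john@example.com"))
```

Explicit character sets are used instead of `str.isalnum()`, because `isalnum()` would also accept non-ASCII letters, which the grammar above does not allow.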
Therefore, the following regex should match most e-mail addresses in practical scenarios:
([#-'\/-9A-Z^-~!*+=?-][#-'--9A-Z^-~!*+=?]*@[a-zA-Z0-9]([.-]?[a-zA-Z0-9]+)*\.[a-zA-Z]+)
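A possible way to try this regex in Python, for both use cases (validating a single value and finding addresses in a longer text), is sketched below. The helper names `is_valid` and `find_all` and the sample strings are made up for this example. One deliberate change: the `--9` range in the second character class is written as `\--9`, because Python's `re` module raises a FutureWarning for an unescaped `--` inside a set; the set of matched characters stays the same.

```python
import re

# The regex from above, split over two lines for readability.
# Only the "--9" range is written as "\--9" (see the note before the code).
EMAIL = re.compile(
    r"([#-'\/-9A-Z^-~!*+=?-][#-'\--9A-Z^-~!*+=?]*"
    r"@[a-zA-Z0-9]([.-]?[a-zA-Z0-9]+)*\.[a-zA-Z]+)"
)


def is_valid(address):
    # The whole string has to be an address (form validation).
    return EMAIL.fullmatch(address) is not None


def find_all(text):
    # Collect every address found anywhere in a longer text.
    return [m.group(0) for m in EMAIL.finditer(text)]


print(is_valid("john.doe@example.com"))
print(is_valid(".john@example.com"))
print(find_all("Contact alice@example.com or bob.smith@mail.example.org."))
```

`finditer` with `group(0)` is used instead of `findall`, because the pattern contains two groups and `findall` would return tuples of group contents. The nested repetition in the domain part can backtrack heavily on long inputs that do not match, so limiting the input length before matching is probably a good idea.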
Some additional explanations regarding this regex:
@ - matches a literal "@"
\. - matches a literal "."
[a-zA-Z0-9]([.-]?[a-zA-Z0-9]+)* - matches alphanumeric characters separated by single "." or "-" characters (to get the domain name and subdomain names)
The first part matches a dot-atom (any legal character except a dot, followed by any number of legal characters). The two parts are separated by an "@". The second part matches alphanumeric characters separated by dots or dashes. The string must end with a top-level domain, which the regex restricts to letters (slightly stricter than the EBNF above, which allows any alphanumeric character there).
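The ranges inside the two character classes are compact but hard to read. The following Python sketch expands them by brute force over printable ASCII and compares the result with the list of legal characters given earlier. The variable names are made up for this example, and the `--9` range is written as `\--9` to avoid the FutureWarning that Python's `re` gives for `--` inside a set.

```python
import re
import string

# The two character classes of the local part, taken from the regex above.
first = re.compile(r"[#-'\/-9A-Z^-~!*+=?-]")  # first character (no dot)
rest = re.compile(r"[#-'\--9A-Z^-~!*+=?]")    # following characters

# Try every printable, non-space ASCII character against each class.
printable = [chr(c) for c in range(33, 127)]
first_chars = {c for c in printable if first.fullmatch(c)}
rest_chars = {c for c in printable if rest.fullmatch(c)}

alnum = set(string.ascii_letters + string.digits)
legal = alnum | set("!#$%&'*+-/=?^_`{|}~.")

# The second class is exactly the legal characters,
# the first class is the same set without the dot.
assert rest_chars == legal
assert first_chars == legal - {"."}

# Show the non-alphanumeric characters accepted as a first character.
print("".join(sorted(first_chars - alnum)))
```

The only difference between the two classes is the dot, which is how the regex enforces that the local part does not start with one.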