email regex
2019-12-19


An email regex is useful when you need to find email addresses in a long document or validate HTML forms.
This was also one of the <a href="https://codein.withgoogle.com/">Google Code-in</a> tasks.

Basic regex syntax:
\s - matches a whitespace character
() - a capturing group
[] - matches any one character from the set or range in the brackets
[a-zA-Z0-9] - matches one alphanumeric character
[]+ - matches the given range one or more times
(L|R) - a group with alternation, matches either the left or the right side inside the parentheses
[]* - matches the given range zero or more times
[]? - matches the given range once or not at all
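
The syntax elements listed above can be tried out quickly in Python's standard re module. This is only a small sketch; the sample string and the variable names are made up for illustration.

```python
import re

text = "left 42\tright"

# \s matches a single whitespace character (here a space and a tab).
spaces = re.findall(r"\s", text)

# [a-zA-Z0-9]+ matches runs of one or more alphanumeric characters.
runs = re.findall(r"[a-zA-Z0-9]+", text)

# (left|right) is a group with alternation; findall returns the group contents.
sides = re.findall(r"(left|right)", text)

# [0-9]* allows zero digits, so "a" alone is a full match.
zero_or_more = re.fullmatch(r"a[0-9]*", "a")

# [0-9]? allows at most one digit, so "a12" is not a full match.
once_or_not = re.fullmatch(r"a[0-9]?", "a12")

print(spaces, runs, sides, zero_or_more is not None, once_or_not is None)
```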

According to <a href="https://tools.ietf.org/html/rfc5322#section-3.4.1">RFC 5322, section 3.4.1</a>, an e-mail address is specified as follows (in ABNF):
<pre>
addr-spec	=	local-part "@" domain
local-part	=	dot-atom / quoted-string
domain	=	dot-atom / domain-literal
domain-literal	=	[CFWS] "[" *([FWS] dtext) [FWS] "]" [CFWS]
dtext		=	%d33-90 / %d94-126 ; printable ASCII not including "[", "]", or "\"
</pre>
The obsolete alternatives (obs-local-part, obs-domain, obs-dtext) are omitted.

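A dot-atom may not begin with a dot (RFC 5322 also forbids a trailing dot and two consecutive dots, which the simplified pattern below does not check). It consists of:
any alphanumeric character or any of "!#$%&'*+-/=?^_`{|}~."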

Domain name syntax is specified elsewhere; for simplicity, let's assume that only alphanumeric characters and dashes are allowed in domain names. Domain names may also contain subdomains, so we should match dots too. A quoted-string in the local part of the address and a domain-literal in the domain part are rarely used, especially in web forms, so both are ignored here.

In practice, matching the following pattern should be sufficient (in EBNF):
<pre>
email-address	=	dot-atom, "@", domain
domain	=	alphanumeric, {["." | "-"], alphanumeric}, ".", alphanumeric, {alphanumeric}
dot-atom	=	(legal-char - "."), {legal-char}
legal-char	=	alphanumeric | ? any of "!#$%&'*+-/=?^_`{|}~." ?
</pre>
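
One way to read this grammar is to translate it rule by rule into a Python pattern. The sketch below does that literally, letting re.escape build the character classes from the list of legal characters. The names SPECIALS, EMAIL and is_email are invented for this example, and the top-level domain is alphanumeric here because that is what the grammar says.

```python
import re
import string

# Characters allowed in a dot-atom besides letters, digits and the dot.
SPECIALS = "!#$%&'*+-/=?^_`{|}~"
ALNUM = string.ascii_letters + string.digits

# legal-char - "."  and  legal-char, written as explicit character classes.
first_char = "[" + re.escape(ALNUM + SPECIALS) + "]"
legal_char = "[" + re.escape(ALNUM + SPECIALS + ".") + "]"

# domain = alphanumeric, {["." | "-"], alphanumeric}, ".", alphanumeric, {alphanumeric}
alnum = "[a-zA-Z0-9]"
domain = alnum + "(?:[.-]?" + alnum + ")*" + r"\." + alnum + "+"

# email-address = dot-atom, "@", domain
EMAIL = re.compile(first_char + legal_char + "*@" + domain)


def is_email(candidate):
    """Return True if the whole string matches the simplified grammar."""
    return EMAIL.fullmatch(candidate) is not None


print(is_email("john.doe@example.com"))
print(is_email(".john@example.com"))
```

Writing the classes out in full is longer than using ASCII ranges, but it keeps the pattern easy to compare with the grammar.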

Therefore the following regex should match most e-mail addresses in practical scenarios:
<pre>
([#-'\/-9A-Z^-~!*+=?-][#-'--9A-Z^-~!*+=?]*@[a-zA-Z0-9]([.-]?[a-zA-Z0-9]+)*\.[a-zA-Z]+)
</pre>
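
As a rough sketch, this is how the regex could be used from Python for both purposes mentioned at the top: finding addresses in a longer text and validating a single value. The helper names find_emails and is_valid_email are made up here. One hyphen in the second character class is escaped, because Python's re module emits a FutureWarning for a literal "--" inside a set; the characters matched stay the same.

```python
import re

# The regex from the article; "\-" in the second class only avoids Python's
# FutureWarning about "--" inside a character set.
EMAIL_RE = re.compile(
    r"([#-'\/-9A-Z^-~!*+=?-][#-'\--9A-Z^-~!*+=?]*"
    r"@[a-zA-Z0-9]([.-]?[a-zA-Z0-9]+)*\.[a-zA-Z]+)"
)


def find_emails(text):
    """Return every substring of text that matches the regex."""
    # group(0) is the whole match; findall would return tuples because
    # the pattern contains two capturing groups.
    return [m.group(0) for m in EMAIL_RE.finditer(text)]


def is_valid_email(value):
    """Return True if the whole value matches, e.g. for a form field."""
    return EMAIL_RE.fullmatch(value) is not None


sample = "Contact alice@example.com or bob.smith@mail.example.org, not @nobody."
print(find_emails(sample))
print(is_valid_email("john.doe@example.com"))
```

For validation, fullmatch is used instead of search so that extra characters before or after the address are rejected.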

Some additional explanations regarding this regex:
@ - matches a literal "@"
\. - matches a literal "."
([.-]?[a-zA-Z0-9]+)* - matches zero or more runs of alphanumeric characters, each optionally preceded by a single "." or "-"
(to get the domain name and subdomain names)
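
The two character classes at the start of the regex are written as compact ASCII ranges, which makes them hard to check by eye. A small sketch like the one below can confirm that they cover exactly the legal dot-atom characters. The variable names are invented for this check, and one hyphen is escaped only to avoid Python's FutureWarning about "--" inside a set.

```python
import re
import string

# Legal dot-atom characters besides letters, digits and the dot.
SPECIALS = "!#$%&'*+-/=?^_`{|}~"
legal = set(string.ascii_letters + string.digits + SPECIALS)

# First class: legal characters without the dot.
first_class = re.compile(r"[#-'\/-9A-Z^-~!*+=?-]")
# Second class: legal characters including the dot ("\-" to "9" is a range).
rest_class = re.compile(r"[#-'\--9A-Z^-~!*+=?]")

# Collect every ASCII character each class accepts.
ascii_chars = [chr(code) for code in range(128)]
first_matched = {c for c in ascii_chars if first_class.fullmatch(c)}
rest_matched = {c for c in ascii_chars if rest_class.fullmatch(c)}

print(first_matched == legal)
print(rest_matched == legal | {"."})
```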

The first part matches a dot-atom (any legal character except a dot, followed by any number of legal characters). The two parts are separated by "@". The second part matches alphanumeric characters separated by single dots or dashes. The string must end with a dot and a top-level domain, which this regex restricts to letters (the grammar above allows alphanumerics).
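
The same breakdown can be written into the pattern itself with Python's re.VERBOSE flag, which allows whitespace and comments inside a regex. The version below is a sketch and not a replacement for the regex above. The group names local, domain and tld are assumptions added for readability, the inner group is made non-capturing, and one hyphen is escaped to avoid Python's FutureWarning about "--" inside a set.

```python
import re

EMAIL_VERBOSE = re.compile(r"""
    (?P<local>
        [#-'\/-9A-Z^-~!*+=?-]      # first character: any legal character except a dot
        [#-'\--9A-Z^-~!*+=?]*      # remaining characters: a dot is allowed here
    )
    @                              # literal separator between the two parts
    (?P<domain>
        [a-zA-Z0-9]                # the domain starts with an alphanumeric character
        (?:[.-]?[a-zA-Z0-9]+)*     # alphanumeric runs joined by single dots or dashes
        \.                         # the dot before the top-level domain
        (?P<tld>[a-zA-Z]+)         # top-level domain: letters only
    )
""", re.VERBOSE)

match = EMAIL_VERBOSE.fullmatch("bob.smith@mail.example.org")
if match:
    print(match.group("local"), match.group("domain"), match.group("tld"))
```

Named groups make it easy to pull the local part or the domain out of a match without counting parentheses.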