email regex
2019-12-19

email regex - for when you need to find email addresses in a long document or validate HTML forms. This was also one of the Google Code-in tasks.

Basic regex syntax:
\s - matches a whitespace character
() - a capturing group
[] - matches any single character from the set or range in the brackets
[a-zA-Z0-9] - matches an alphanumeric character
[]+ - matches the given set one or more times
(L|R) - a group with alternation, matches either the left or the right side expressed inside the parentheses
[]* - matches the given set zero or more times
[]? - matches the given set once or not at all

According to RFC 5322, section 3.4.1, the e-mail address is specified as follows (in ABNF):
addr-spec	=	local-part "@" domain
local-part	=	dot-atom / quoted-string
domain	=	dot-atom / domain-literal
domain-literal	=	[CFWS] "[" *([FWS] dtext) [FWS] "]" [CFWS]
dtext		=	%d33-90 / %d94-126 ; printable ASCII not including "[", "]", or "\"
The obsolete alternate parts were omitted. A dot-atom may not begin with a dot; it consists of any alphanumeric character or any of "!#$%&'*+-/=?^_`{|}~.". (RFC 5322 also forbids a trailing dot and two consecutive dots in a dot-atom; for simplicity, only the leading-dot rule is enforced here.)

Domain name syntax is specified elsewhere; for simplicity, let's assume only alphanumeric characters and dashes are allowed in domain names. Domain names may also consist of subdomains, so we should match dots too.

A quoted-string in the local-part of the email address and a domain-literal in the domain part are rarely used, especially in web forms. In practice, matching the following pattern should be sufficient (in EBNF):
email-address	=	dot-atom, "@", domain
domain	=	alphanumeric, {["." | "-"], alphanumeric}, ".", alphanumeric, {alphanumeric}
dot-atom	=	(legal-char - "."), {legal-char}
legal-char	=	alphanumeric | ? any of "!#$%&'*+-/=?^_`{|}~." ?
Therefore, the following regex should match most e-mail addresses in a practical scenario:
([#-'\/-9A-Z^-~!*+=?-][#-'--9A-Z^-~!*+=?]*@[a-zA-Z0-9]([.-]?[a-zA-Z0-9]+)*\.[a-zA-Z]+)
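To try the regex out on the two use cases mentioned at the top (finding addresses in a document and validating a form field), here is a minimal Python sketch. The function names `find_emails` and `is_valid_email` are made up for this illustration, and the pattern is copied from above with one cosmetic change: the hyphen that starts the range `--9` is written as `\-`, which denotes the same range.

```python
import re

# Pattern from the post. Only change: "--9" is written as "\-" + "-9"
# (same range, from "-" to "9"), so the class is easier to read in Python.
EMAIL = re.compile(
    r"([#-'\/-9A-Z^-~!*+=?-][#-'\--9A-Z^-~!*+=?]*"
    r"@[a-zA-Z0-9]([.-]?[a-zA-Z0-9]+)*\.[a-zA-Z]+)"
)


def find_emails(text):
    """Return every substring of `text` that matches the pattern."""
    return [m.group(0) for m in EMAIL.finditer(text)]


def is_valid_email(candidate):
    """Return True only if the whole string matches the pattern."""
    return EMAIL.fullmatch(candidate) is not None


if __name__ == "__main__":
    doc = "Contact alice@example.com or bob.smith@mail.example.org today."
    print(find_emails(doc))
    for candidate in ("user+tag@sub.example.co", ".user@example.com", "user@example"):
        print(candidate, is_valid_email(candidate))
```

Note the difference between the two helpers: `finditer` searches for matching substrings, so it would still pick `user@example.com` out of `.user@example.com`, while `fullmatch` rejects that string because the whole input has to match. For form validation the second behaviour is probably what you want.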
Some additional explanations regarding this regex:
@ - matches a literal "@"
\. - matches a literal "."
[a-zA-Z0-9]([.-]?[a-zA-Z0-9]+)* - matches the domain name and subdomain names: one alphanumeric character, followed by runs of alphanumeric characters that may each be preceded by a single "." or "-"
[a-zA-Z]+ - matches the top-level domain, one or more letters

The first part matches a dot-atom (any legal character excluding a dot, followed by any number of legal characters). The parts are separated by an "@". Then the second part matches alphanumeric characters divided by dots or dashes. The string must end with an alphabetic top-level domain; here the regex is slightly stricter than the EBNF above, which also allows digits in the last label.
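The compact character ranges in the regex are hard to read, so it may help to see the same pattern assembled from parts that mirror the EBNF rules. This is only a sketch: the variable names are invented for the illustration, the domain group is non-capturing, and the result is intended to accept the same strings as the regex above rather than to be a character-for-character copy of it.

```python
import re

# Building blocks named after the EBNF rules (names are illustrative).
ALNUM = "a-zA-Z0-9"
# Legal punctuation except ".", escaped so it is safe inside [...]
SPECIALS = re.escape("!#$%&'*+-/=?^_`{|}~")

FIRST_CHAR = f"[{ALNUM}{SPECIALS}]"    # legal-char - "."
LEGAL_CHAR = f"[{ALNUM}{SPECIALS}.]"   # legal-char
DOT_ATOM = f"{FIRST_CHAR}{LEGAL_CHAR}*"

# Alphanumeric runs separated by single dots or dashes.
LABELS = f"[{ALNUM}](?:[.-]?[{ALNUM}]+)*"
TLD = "[a-zA-Z]+"                      # letters only, as in the regex above
DOMAIN = rf"{LABELS}\.{TLD}"

EMAIL_FROM_PARTS = re.compile(f"({DOT_ATOM}@{DOMAIN})")


def matches(candidate):
    """Return True if the whole string is accepted by the assembled pattern."""
    return EMAIL_FROM_PARTS.fullmatch(candidate) is not None


if __name__ == "__main__":
    for candidate in (
        "john.doe+tag@mail.example.org",
        "john@my-host.example.io",
        ".john@example.org",
        "john@host..example.io",
    ):
        print(candidate, matches(candidate))
```

Keeping the punctuation in one plain string and letting `re.escape` handle the escaping avoids having to work out which ASCII ranges happen to cover the legal characters, at the cost of a longer pattern.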