email regex

2019-12-19

email regex - for when you need to find email addresses in a long document or validate HTML forms. This was also one of the Google Code-in tasks.

Basic regex syntax:

\s - matches a whitespace character
() - a matched (capturing) group
[] - matches any single character from the set or range given in the brackets
[a-zA-Z0-9] - matches an alphanumeric character
[]+ - matches the given set one or more times
(L|R) - a group with alternation, matches either the left or the right side expressed inside the parentheses
[]* - matches the given set zero or more times
[]? - matches the given set once or not at all

According to RFC 5322 section 3.4.1 [link], the e-mail address is specified as follows (in ABNF):
    addr-spec = local-part "@" domain
    local-part = dot-atom / quoted-string
    domain = dot-atom / domain-literal
    domain-literal = [CFWS] "[" *([FWS] dtext) [FWS] "]" [CFWS]
    dtext = %d33-90 / %d94-126 ; printable ASCII not including "[", "]", or "\"

The obsolete alternate parts were omitted. A dot-atom may not begin with a dot; it consists of alphanumeric characters and any of "!#$%&'*+-/=?^_`{|}~.". (Strictly, RFC 5322 also forbids a trailing dot and consecutive dots in a dot-atom; the simplified pattern used here checks only the leading dot.)

Domain name syntax is specified elsewhere; for simplicity, let's assume only alphanumeric characters and dashes are allowed in domain names. Domain names may also consist of subdomains, so we should match dots too.

A quoted-string in the local part of the email address and a domain-literal in the domain part are rarely used, especially in web forms. In practice, matching the following pattern should be sufficient (in EBNF):
    email-address = dot-atom, "@", domain
    domain = alphanumeric, {["." | "-"], alphanumeric}, ".", alphanumeric, {alphanumeric}
    dot-atom = (legal-char - "."), {legal-char}
    legal-char = alphanumeric | ? any of "!#$%&'*+-/=?^_`{|}~." ?

Therefore, the following regex should match most e-mail addresses in a practical scenario:
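    ([#-'\/-9A-Z^-~!*+=?-][#-'--9A-Z^-~!*+=?]*@[a-zA-Z0-9]([.-]?[a-zA-Z0-9]+)*\.[a-zA-Z]+)

Some additional explanations regarding this regex:

@ - matches a literal "@"
\. - matches a literal "."
[a-zA-Z0-9]([.-]?[a-zA-Z0-9]+)* - matches runs of alphanumeric characters, each optionally preceded by a single "." or "-" (to get the domain name and subdomain names)

The first part matches a dot-atom: any legal character except a dot, followed by any number of legal characters. Inside the brackets, ranges such as #-' and ^-~ are ASCII ranges that cover several legal characters at once; ^-~ also includes the lowercase letters, which is why a-z is not listed separately there. The two parts are separated by "@". The second part matches alphanumeric characters divided by dots or dashes. The string must end with a top-level domain, which the regex restricts to letters ([a-zA-Z]+), slightly stricter than the alphanumeric top-level domain allowed by the EBNF.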
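
As a minimal sketch of how the regex above could be applied to the two use cases from the introduction (finding addresses in a longer text and validating a single form value), here is a short Python example using the standard re module. The names PATTERN, EMAIL_RE, find_emails and is_valid_email are made up for this illustration and are not part of the original task. Silencing the warning is a precaution so that the regex can be kept character for character.

```python
import re
import warnings

# The regex from this post, unchanged, split over two raw strings.
PATTERN = (
    r"([#-'\/-9A-Z^-~!*+=?-][#-'--9A-Z^-~!*+=?]*"
    r"@[a-zA-Z0-9]([.-]?[a-zA-Z0-9]+)*\.[a-zA-Z]+)"
)

# Assumption: some Python versions may emit a FutureWarning for the "--"
# inside the second character class, so it is ignored while compiling.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)
    EMAIL_RE = re.compile(PATTERN)


def find_emails(text):
    """Return every substring of text that matches the pattern."""
    # finditer with group(0) is used because the pattern contains two
    # capturing groups, so findall would return tuples instead of strings.
    return [m.group(0) for m in EMAIL_RE.finditer(text)]


def is_valid_email(value):
    """Return True only if the whole value matches the pattern."""
    return EMAIL_RE.fullmatch(value) is not None


if __name__ == "__main__":
    sample = "Write to alice@example.com or bob.smith@mail.example.org."
    print(find_emails(sample))   # ['alice@example.com', 'bob.smith@mail.example.org']
    print(is_valid_email("user+tag@example.co.uk"))  # True
    print(is_valid_email(".user@example.com"))       # False (leading dot)
```

For validation, fullmatch is used rather than search, since search would accept any value that merely contains an address somewhere inside it.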