8 Regular Expressions (aka Regex) by examples for Java developers

Q1 How will you go about implementing the following validation rules for a user name?

— user name must be between 2 and 17 characters long.
— valid characters are A to Z, a to z, 0 to 9, . (full-stop), _ (underscore) and – (hyphen)
— user name must begin with an alphabetic character.
— user name must not end with a . (full stop) or _ (underscore) or – (hyphen).

A1 The above rules can be implemented with a regular expression as shown below:

Not compiling the regular expression can be costly if Pattern.matches( ) is used over and over again with the same expression in a loop or frequent method calls because the matches( ) method will re-compile the expression every time it is used. You can also re-use the Matcher object for different input strings by calling the method reset( ).

What does this pattern mean?

Screen shot 2014-10-11 at 2.33.12 PM

How will you test this? Test with JUnit. The junit.jar file must be in the classpath.

Q2 How would you go about validating a supplied password in your code that must adhere to the following conditions?

— Must contain a digit from 0-9.
— Must contain one uppercase character.
— Must contain one lowercase character.
— The length must be between 6 to 15 characters long.

A2 It can be achieved with a regular expression. But unlike the previous question, this is a bit more complicated than it looks. You might be tempted to write something like,

But the above regex is not going to meet the requirement. The above regex will only ensure that the valid characters are used and the password is of correct length (i.e. between 6 and 15 characters). But it won’t ensure that there is at least “one digit from 0-9, one lower case and one uppercase”. So, the above regex will return true for passwords like

This is where the regex with LOOK AHEAD comes to the rescue. The pattern below uses “look ahead” with the syntax ?=. The regex pattern with look ahead is shown below:

Don’t be overwhelmed by long regex patterns as shown above. Once you divide them into smaller chunks as shown below, things become much clearer.

Screen shot 2014-10-11 at 2.38.25 PM

Note: The look ahead and look behind constructs are collectively known as “look around” constructs. Unlike the greedy quantifiers discussed, the “look arounds” actually match characters, but then give up the match and only return the result with “match” or “no match”. That is why they are called “assertions”. They do not consume characters in the string like [0-9A-Za-z]{6,15}, but only assert whether a match is possible or not.

The look ahead constructs are non-capturing. If you want to capture the values matched, for example to debug or use the captured values, the above regex can be modified with additional parentheses to capture the matched values.

You can display the captured values as shown below:

Outputs:

As you can see,

(?=([A-Za-z]*)([0-9])) matches John followed by 1, i.e. presence of a digit.
(?=([0-9A-Z]*)([a-z])) matches J followed by o, i.e. presence of a lower-case alphabet.
(?=([0-9a-z]*)([A-Z])) matches nothing followed by J, i.e. presence of an uppercase letter.

Q3 What does .*, +, and ? mean with respect to regular expressions? Where do you look for the summary of Java regular expression constructs?
A3 *, +, and ? are known as (greedy) quantifiers as they quantify the preceding character(s). For example,

. → matches any character.
.* → matches any character repeated 0 or more times.
.+ → matches any character repeated 1 or more times.
.? → matches any character repeated 0 or 1 time. This means optional.

[a-zA-Z]* → Any alphabet repeated 0 or more times.
[a-zA-Z]+ → Any alphabet repeated 1 or more times.
[a-zA-Z]? → Any alphabet repeated 0 or 1.

Note: These are not wild characters. *, +, and ? are regex repetitions. The {x,y} is also used for repetition. For example, [a]{3,5} means the letter “a” repeated at least 3 times or a maximum of 5 times. In fact, internally the Pattern.java class implements,

a* as a{0, 0x7FFFFFFF}
a+ as a{1, 0x7FFFFFFF}

The values in square brackets (i.e. [ ] ) are known as character sets. The ., *, ?, etc are escaped and used as literal values within the square brackets. Here are some examples of the character sets.

[aeiou] → matches exactly one lowercase vowel.

[^aeiou] → matches a character that ISN’T a lowercase vowel (^ inside [ ] means NOT).

^[a-z&&[^aeiou]]*$ → matches any character other than a vowel anchored between start (^) and end ($). This is a character class subtraction regex. The “&&” means intersection.

[a-d[m-p]] → Matches characters a to d and m to p. This is a union regex.

[a-z&&[d-f]] → Matches only d, e, and f. This is an intersection regex.

[x-z[\\p{Digit}]] → matches x-z and 0-9. Similar to [x-z0-9] or [x-z[\\d]]. The “\p” stands for POSIX character classes.

^[aeiou] – matches a lowercase vowel anchored at the beginning of a line

[^^] → matches a character that isn’t a caret ‘^’ .

^[^^] → matches a character that isn’t a caret at the beginning of a line.

^[^.]. → matches anything but a literal period, followed by “any” character, at the beginning of a line

[.*]* → matches a contiguous sequence of optional dots (.) and asterisks (*)

[aeiou]{3} – matches 3 consecutive lowercase vowels (all not necessarily the same vowel)

\[aeiou\] → matches the string “[aeiou]”. “\” means escape.

Q4 What does grouping mean with regards to regular expressions? Would you prefer capturing or non-capturing group?
A4 A group is a pair of parentheses used to group sub patterns. For example, c(a|u)t matches cat or cut. A group also captures the matching text within the parentheses. The groups are numbered from left to right, outside to inside. For example, (x(y*))+(z*) has 3 explicit and 1 implicit groups. The implicit group 0 is the entire match.

implicit group 0: (x(y*))+(z*)
explicit group 1: (x(y*))
explicit group 2: (y*)
explicit group 3: (z*)

Capturing groups are numbered by counting their opening parentheses from left to right. The captured sub sequence may be used later in the expression, via a back reference, and may also be retrieved from the matcher once the match operation is complete. Groups beginning with ?: are pure, non-capturing groups that do not capture text and do not count towards the group total. For example, (?:x(?:y*))+(?:z*) will only have group 0 being the entire match.

Output:

The following sample code illustrates substitution, where the captured group numbers can be used for the subsequent matches. For example, the term “marketing” and “sales” can be captured and reused as these terms appear more than once. Once within the URL and once outside the URL as per the code snippet shown below.

Output:

Note: The “\1” (group 1) captures either “marketing” or “sales” when the capturing pattern is used. Since “\” has to be escaped in Java, it is represented as “\\1”. The .* matches the word “for” as it means match any character 0 or more times.

Group capturing incur a small-time penalty each time you use them. If you don’t really need to capture the text inside a group, prefer using non-capturing groups.

Q5 What do you understand by greedy, reluctant, and possessive quantifiers? What do you understand by the term “backtracking”?
A5

Screen shot 2014-10-11 at 2.51.16 PM

Output:

Q6 Why did the string “xfooxxfooxxxxfooxxxxxxxxfoofo” return no match when used with a possessive quantifier?
A6 Even if you try to match “xfooxxfooxxxxfooxxxxxxxxfoo” without the last “fo”, the possessive quantifier would match nothing because the possessive quantifiers will not give up the matches on a backtrack, and the .*+ matches your entire string including the last “foo” and then there’s nothing for “foo” to match in the pattern “.*+foo”. So use possessive quantifiers only when you know that what you’ve matched should never be backtracked. For example, “[^f]*+.*foo” or more importantly matching text in quotes as demonstrated above with “\”xfoo\”*+”.

Performance testing

Always test your regular expressions with negative and positive test cases. Generally the possessive quantifiers are more efficient than the reluctant quantifiers, and the reluctant quantifiers are more efficient than the greedy quantifiers. When a regex is used in a performance critical area of your application, it would be wise to test it first. Write a benchmark application that tests it against different inputs. Remember to test different lengths of inputs, and also inputs that almost match your pattern.

What is “backtracking”?

Like most regex engines, Java uses NFA (Non deterministic Finite Automaton) approach. This means the engine scans the regex components one by one, and progresses on the input string accordingly. However, it will go back in order to explore matching alternatives when a “dead end” is found. Alternatives result from regex structures such as quantifiers (*, +, ?) and alternation (e.g. a|b|c|d). This exploration technique is called backtracking. In a more simpler term, backtracking means return to attempt an untried option.

Optimization tips

The .* and alternation (e.g. a|b|c) regexes must be used sparingly. For example, if you want to retrieve everything between two “@” characters in an input string, instead of using “@(.*)@”, it’s much better to use “@([^@]*)@”. This regex can be further optimized by minimizing the backtracking with the help of a possessive quantifier “*+” instead of a greedy quantifier “*”.

Say you are using the pattern “@([^@]*)@” with a long input string that does not have any “@”. Because “*” is a greedy quantifier, it will grab all the characters until the end of the string, and then it will backtrack, giving back one character at a time. The expression will fail only when it can’t backtrack anymore. If you change the pattern to use “@([^@]*+)@” with a possessive quantifier “*+”, the new pattern fails faster because once it has tried to match all the characters that are not a “@”, it does not backtrack; Instead it fails right there. Take care not to compromise correctness for a small optimization.

Regular expressions like “(COLON|COMMA|COLUMN)” have a reputation for being slow, so watch out for them. Firstly, the order of alternation counts. So, place the more common options in the front to be matched faster. Secondly, extract out the more common ones. For example, CO(MMA|L(ON|UMN)). Since COMMA is more common, it is used first. The “CO” and “L” are extracted out as they are more common.

Q7 How will you use “negative look ahead” to split a string of digits say 230_000L to individual array of digits using a regex?
A7

(?!X) –> is NEGATIVE LOOK AHEAD syntax. X is the “^”. The ^ matches the beginning of a string, so the regex matches at every position that is not the beginning of the string, and inserts a split there.

Output:

Q8 How will you split the following CSV values on “,” ?

A8 Using a regex in the String.split(….) method.

\s -> space
\\s* –> first “\” is an escape character, and “*” means zero or more spaces.

Q9 What if you have “,” as a value within the quotes as in “Peter, Smith”, how will you split the CSV values with regex?

A9

Output:

(?=) –> Positive Look Ahead

[^\”]*\” –> NO quotes 0 or more times, and followed by a quote.

,([^\”]*\”[^\”]*\”)* –> , followed by zero or more of anything within quotes.

[^\”]*$ –> ends with anything other than a quote 0 or more times.


800+ Java & Big Data Interview Q&As

Top