Embedded Text Matching

Consider the task of looking through a document to locate words that contain the run of characters "truct" - for instance constructor, destructor and constructive. The regex pattern \b\w*truct\w*\b would accomplish the task with no trouble at all. How does it work?
  • The codes \b each match at a word boundary, so together they restrict the match to whole words.
  • The codes \w* are instructions to match zero or more alphanumeric characters or underscores.
  • The truct embedded between the two \w*s causes all words starting with, containing or ending with the run of characters "truct" to be matched.
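As a quick check of the points above, here is a minimal sketch using Python's re module. The sample text and the variable names are assumptions made for illustration; they do not come from the original example.

```python
import re

# Hypothetical sample text; "truck" is included as a near miss.
text = "The constructor and destructor are constructive; a truck is not."

# Whole words that contain the run of characters "truct".
pattern = re.compile(r"\b\w*truct\w*\b")

matches = pattern.findall(text)
print(matches)
```

The word "truck" is skipped because it does not contain the full run "truct".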

This is very useful, but what if we only want to match words containing no more than 11 characters? In the example this would require that we prevent the word "constructive" - 12 letters - from being matched. Limiting word length is not in itself a difficult regex task - the pattern \b\w{5,11}\b is all that is required. As before, the \bs restrict the match to whole words. \w{5,11} constrains the match to be a run of between 5 and 11 alphanumeric or underscore characters. Could we not somehow combine the two patterns, \b\w*truct\w*\b and \b\w{5,11}\b, to achieve the desired result?
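The length-only pattern can be tried on its own first. The following is a small sketch in Python, with an assumed sample text, showing that it accepts any word of 5 to 11 characters whether or not it contains "truct".

```python
import re

# Hypothetical sample text for illustration.
text = "These words: constructor, destructor, constructive"

# Whole words of between 5 and 11 word characters.
length_pattern = re.compile(r"\b\w{5,11}\b")

length_matches = length_pattern.findall(text)
print(length_matches)
```

Here "constructive" is rejected for being 12 characters long, while "These" and "words" are accepted even though they have nothing to do with "truct".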

Easily done - the pattern (?=\b\w{5,11}\b)\b\w*truct\w*\b, built as a straight combination of the two patterns with the length test wrapped in a lookahead, (?=...), is up to the task. You will note that in this pattern the lookahead pattern occurs before the main search pattern. While this pattern works, it contains redundancies.
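A minimal Python sketch of the combined pattern follows. The sample text is again an assumption chosen for illustration.

```python
import re

# Hypothetical sample text for illustration.
text = "The constructor and destructor are constructive"

# The lookahead checks the word length; the main pattern checks for "truct".
combined = re.compile(r"(?=\b\w{5,11}\b)\b\w*truct\w*\b")

combined_matches = combined.findall(text)
print(combined_matches)
```

Only the words that pass both tests survive: "constructive" contains "truct" but fails the length check in the lookahead.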

The redundancies are a direct consequence of the way the regex engine uses a search pattern. As it finds text that matches a pattern, the regex engine moves through the text. This is called consumption in regex-speak. However, consumption does not occur when the regex engine examines text for a lookahead pattern. In the present example the lookahead pattern, \b\w{5,11}\b, is tested first. Only when it has matched is the main search pattern tested. Since the lookahead did not consume any text, the regex engine performs the main search from the very position where it found the lookahead match. Consequently, it is wholly unnecessary to constrain the main search pattern to text that lies between word boundaries, \b...\b - the match already starts at a word boundary, or else the lookahead test would have failed, and the greedy trailing \w* carries it to the end of the word. In other words, we can use the simplified pattern (?=\b\w{5,11}\b)\w*truct\w*.
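Both claims can be checked with a short Python sketch: that a lookahead consumes no text, and that the simplified pattern gives the same results as the full one. The sample text and variable names are assumptions for illustration.

```python
import re

# Hypothetical sample text for illustration.
text = "constructor destructor constructive"

# A lookahead on its own produces a zero-width match: start equals end.
lookahead_only = re.compile(r"(?=\b\w{5,11}\b)")
m = lookahead_only.search(text)
print(m.start(), m.end(), repr(m.group()))

# The full and simplified patterns should find the same words.
full = re.compile(r"(?=\b\w{5,11}\b)\b\w*truct\w*\b")
simplified = re.compile(r"(?=\b\w{5,11}\b)\w*truct\w*")

full_matches = full.findall(text)
simplified_matches = simplified.findall(text)
print(full_matches == simplified_matches, simplified_matches)
```

The empty string returned by the lookahead-only search is the visible sign that no text was consumed, which is why the main pattern begins at the same position.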
