Highlighting Patterns

Writing Syntax Highlighting Patterns

Patterns are the mechanism by which language syntax highlighting is implemented in NEdit (see Syntax Highlighting under the heading of Features for Programming). To create syntax highlighting patterns for a new language, or to modify existing patterns, select "Recognition Patterns" from the "Syntax Highlighting" sub-section of the "Default Settings" sub-menu of the "Preferences" menu.

First, a word of caution. As with regular expression matching in general, it is quite possible to write patterns which are so inefficient that they essentially lock up the editor as they recursively re-examine the entire contents of the file thousands of times. With the multiplicity of patterns, the possibility of a lock-up is significantly increased in syntax highlighting. When working on highlighting patterns, be sure to save your work frequently.

NEdit's syntax highlighting is unusual in that it works in real-time (as you type), and yet is completely programmable using standard regular expression notation. Other syntax highlighting editors usually fall into one of two categories: fully programmable but unable to keep up in real-time, or real-time but limited in programmability. The additional burden that NEdit places on pattern writers in order to achieve this speed/flexibility mix is to force them to state self-imposed limitations on the amount of context that patterns may examine when re-parsing after a change. While the "Pattern Context Requirements" heading is near the end of this section, it is not optional, and must be understood before making any serious effort at pattern writing.

In its simplest form, a highlight pattern consists of a regular expression to match, along with a style representing the font and color for displaying any text which matches that expression. To bold the word "highlight" wherever it appears in the text, the regular expression would simply be the word "highlight". The style (selected from the menu under the heading of "Highlight Style") determines how the text will be drawn. To bold the text, either select an existing style, such as "Keyword", which bolds text, or create a new style and select it under Highlight Style.

The full range of regular expression capabilities can be applied in such a pattern, with the single caveat that the expression must conclusively match or not match, within the pre-defined context distance (as discussed below under Pattern Context Requirements).

To match longer ranges of text, particularly any constructs which exceed the requested context, you must use a pattern which highlights text between a starting and ending regular expression match. To do so, select "Highlight text between starting and ending REs" under "Matching", and enter both a starting and ending regular expression. For example, to highlight everything between double quotes, you would enter a double quote character in both the starting and ending regular expression fields. Patterns with both a beginning and ending expression span all characters between the two expressions, including newlines.

Again, the limitation for automatic parsing to operate properly is that both expressions must match within the context distance stated for the pattern set.

With the ability to span large distances, comes the responsibility to recover when things go wrong. Remember that syntax highlighting is called upon to parse incorrect or incomplete syntax as often as correct syntax. To stop a pattern short of matching its end expression, you can specify an error expression, which stops the pattern from gobbling up more than it should. For example, if the text between double quotes shouldn't contain newlines, the error expression might be "$". As with both starting and ending expressions, error expressions must also match within the requested context distance.
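The interaction of starting, ending, and error expressions can be sketched in a few lines of Python. This is only a simplified model, not NEdit's parser: the function name highlight_between is hypothetical, Python's re syntax differs from NEdit's in places, and the model assumes the starting expression never matches an empty string. A span ends after the end expression, or just before the error expression if that matches first.

```python
import re

def highlight_between(text, start_re, end_re, error_re=None):
    """Return (begin, end) spans covered by a start/end pattern.

    Simplified model: a span stops after the end expression, or just
    before the error expression if the error expression matches first.
    """
    start = re.compile(start_re)
    end = re.compile(end_re)
    # MULTILINE lets "$" match at the end of every line, not just the text.
    error = re.compile(error_re, re.MULTILINE) if error_re else None
    spans = []
    pos = 0
    while True:
        m = start.search(text, pos)
        if m is None:
            break
        end_m = end.search(text, m.end())
        err_m = error.search(text, m.end()) if error else None
        if err_m is not None and (end_m is None or err_m.start() < end_m.start()):
            stop = err_m.start()   # error expression cuts the span short
        elif end_m is not None:
            stop = end_m.end()     # normal termination
        else:
            stop = len(text)       # unterminated: runs to end of text
        spans.append((m.start(), stop))
        pos = max(stop, m.end())
    return spans

text = 'a = "ok"\nb = "oops\nc = "fine"\n'
print(highlight_between(text, '"', '"'))        # no error expression
print(highlight_between(text, '"', '"', '$'))   # error expression "$"
```

Without the error expression, the unterminated quote on the second line pairs with the opening quote on the third line, and every later string is thrown off. With "$" as the error expression, the damage is confined to the line containing the mistake.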

Coloring Sub-Expressions

It is also possible to color areas of text within a regular expression match. A pattern of this type associates a style with sub-expression references of the parent pattern (as used in regular expression substitution patterns, see the NEdit Help menu item on Regular Expressions). Sub-expressions of both the starting and ending patterns may be colored. For example, if the parent pattern has a starting expression "\<", and end expression "\>", (for highlighting all of the text contained within angle brackets), a sub-pattern using "&" in both the starting and ending expression fields could color the brackets differently from the intervening text. A quick shortcut to typing in pattern names in the Parent Pattern field is to use the middle mouse button to drag them from the Patterns list.
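As a rough illustration of the angle bracket example, the following Python sketch assigns one style to the delimiters and another to the text between them. It is an analogy only: it folds the starting and ending expressions into a single Python regular expression with capture groups, and the function name and the style names "Bracket" and "Contents" are made up for the example.

```python
import re

def color_angle_brackets(text):
    """Return (begin, end, style) triples for each <...> construct."""
    regions = []
    for m in re.finditer(r"(<)([^<>]*)(>)", text):
        regions.append((m.start(1), m.end(1), "Bracket"))   # start match
        regions.append((m.start(2), m.end(2), "Contents"))  # parent's text
        regions.append((m.start(3), m.end(3), "Bracket"))   # end match
    return regions

print(color_angle_brackets("x <tag> y"))
```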

Hierarchical Patterns

A hierarchical sub-pattern is identical to a top level pattern, but is invoked only between the beginning and ending expression matches of its parent pattern. Like the sub-expression coloring patterns discussed above, it is associated with a parent pattern using the Parent Pattern field in the pattern specification. Pattern names can be dragged from the pattern list with the middle mouse button to the Parent Pattern field.

After the start expression of the parent pattern matches, the syntax highlighting parser searches for either the parent's end pattern or a matching sub-pattern. When a sub-pattern matches, control is not returned to the parent pattern until the entire sub-pattern has been parsed, regardless of whether the parent's end pattern appears in the text matched by the sub-pattern.

The most common use for this capability is for coloring sub-structure of language constructs (smaller patterns embedded in larger patterns). Hierarchical patterns can also simplify parsing by having sub-patterns "hide" special syntax from parent patterns, such as special escape sequences or internal comments.
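The "hiding" behavior can be sketched in Python for the case of a double-quoted string whose sub-pattern matches backslash escapes. This is a simplified model under stated assumptions, not NEdit's implementation: the function name parse_string is hypothetical, and the escape syntax is assumed to be a backslash followed by any one character. At each step the loop looks for whichever comes first, the sub-pattern or the parent's end expression, so a quote consumed by the escape sub-pattern never terminates the string.

```python
import re

def parse_string(text, start):
    """Parse a double-quoted string that begins at text[start].

    Returns (end_index, escapes), where escapes lists the spans that
    were matched by the escape sub-pattern.
    """
    assert text[start] == '"'
    # Search for the sub-pattern or the parent's end, whichever is first.
    step = re.compile(r'(?P<escape>\\.)|(?P<end>")', re.DOTALL)
    escapes = []
    pos = start + 1
    while True:
        m = step.search(text, pos)
        if m is None:
            return len(text), escapes      # unterminated string
        if m.lastgroup == "end":
            return m.end(), escapes        # parent's end expression
        escapes.append((m.start(), m.end()))
        pos = m.end()                      # resume after the sub-pattern

text = r'say "a\"b" end'
print(parse_string(text, 4))
```

The escaped quote at offset 7 lies inside the span matched by the sub-pattern, so the parent only sees the closing quote at offset 9.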

There is no depth limit in nesting hierarchical sub-patterns, but beyond the third level of nesting, automatic re-parsing will sometimes have to re-parse more than the requested context distance to guarantee a correct parse (which can slow down the maximum rate at which the user can type if large sections of text are matched only by deeply nested patterns).

While this is obviously not a complete hierarchical language parser, it is still useful in many text coloring situations. As a pattern writer, your goal is not to completely cover the language syntax, but to generate colorings that are useful to the programmer. Simpler patterns are usually more efficient and also more robust when applied to incorrect code.

Deferred (Pass-2) Parsing

NEdit does pattern matching for syntax highlighting in two passes. The first pass is applied to the entire file when syntax highlighting is first turned on, and to new ranges of text when they are initially read or pasted in. The second pass is applied only as needed when text is exposed (scrolled into view).

If you have a particularly complex set of patterns, and parsing is beginning to add a noticeable delay to opening files or operations which change large regions of text, you can defer some of that parsing from startup time, to when it is actually needed for viewing the text. Deferred parsing can only be used with single expression patterns, or begin/end patterns which match entirely within the requested context distance. To defer the parsing of a pattern to when the text is exposed, click on the Pass-2 pattern type button in the highlight patterns dialog.

Sometimes a pattern can't be deferred, not because of context requirements, but because it must run concurrently with pass-1 (non-deferred) patterns. If they didn't run concurrently, a pass-1 pattern might incorrectly match some of the characters which would normally be hidden inside of a sequence matched by the deferred pattern. For example, C has character constants enclosed in single quotes. These typically do not cross line boundaries, meaning they can be parsed entirely within the context distance of the C pattern set and should be good candidates for deferred parsing. However, they can't be deferred because they can contain sequences of characters which can trigger pass-1 patterns. Specifically, the sequence, '\"', contains a double quote character, which would be matched by the string pattern and interpreted as introducing a string.
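The character constant problem is easy to reproduce outside NEdit. The Python sketch below is a rough analogy, not NEdit's two-pass machinery: the expressions STRING and CHAR are simplified stand-ins written in Python's re syntax, and string_spans is a hypothetical helper. It reports which spans a string pattern claims, first when it runs alone (as it would if character constants were deferred), then when it runs alongside the character constant pattern.

```python
import re

# Simplified single-line patterns; assumptions for illustration only.
STRING = r'"(?:\\.|[^"\\\n])*"'
CHAR = r"'(?:\\.|[^'\\\n])'"

def string_spans(code, with_char_constants):
    """Return the spans matched as strings in a line of C code."""
    if with_char_constants:
        # Both patterns compete, so a quote inside '\"' is consumed
        # by the character constant before the string pattern sees it.
        pattern = f"(?P<char>{CHAR})|(?P<string>{STRING})"
    else:
        pattern = f"(?P<string>{STRING})"
    return [m.span() for m in re.finditer(pattern, code)
            if m.lastgroup == "string"]

code = r'''c = '\"'; s = "hi";'''
print(string_spans(code, with_char_constants=False))
print(string_spans(code, with_char_constants=True))
```

Run alone, the string pattern starts at the double quote inside the character constant and swallows text up to the opening quote of the real string. Run together with the character constant pattern, it matches only "hi".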

Pattern Context Requirements

The context requirements of a pattern set state how much additional text around any change must be examined to guarantee that the patterns will match what they are intended to match. Context requirements are a promise by NEdit to the pattern writer that the regular expressions in their patterns will be matched against at least <line context> lines and <character context> characters around any modified text. Combining line and character requirements guarantees that both will be met.
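One way to picture how the two requirements combine is to compute the region that would be examined around a change. The Python sketch below is an illustration only and does not describe how NEdit computes its re-parse range. It assumes that a line context of 1 means the line containing the change, that each additional line of context adds one line on each side, and that the character context is measured in both directions from the change. The function name reparse_range is hypothetical.

```python
def reparse_range(text, change_pos, line_context=1, char_context=0):
    """Return (begin, end) of the region to examine around change_pos."""
    # Line requirement, backward: start of the changed line, then
    # (line_context - 1) further lines back.
    line_begin = text.rfind("\n", 0, change_pos) + 1
    for _ in range(line_context - 1):
        if line_begin == 0:
            break
        line_begin = text.rfind("\n", 0, line_begin - 1) + 1

    # Line requirement, forward: end of the changed line (including its
    # newline), then (line_context - 1) further lines ahead.
    nl = text.find("\n", change_pos)
    line_end = len(text) if nl == -1 else nl + 1
    for _ in range(line_context - 1):
        if line_end >= len(text):
            break
        nl = text.find("\n", line_end)
        line_end = len(text) if nl == -1 else nl + 1

    # Taking the wider bound on each side satisfies both requirements.
    begin = min(line_begin, max(change_pos - char_context, 0))
    end = max(line_end, min(change_pos + char_context, len(text)))
    return begin, end

text = "aaa\nbbb\nccc\nddd\n"
print(reparse_range(text, 9))                   # changed line only
print(reparse_range(text, 9, line_context=2))   # one extra line each side
print(reparse_range(text, 9, char_context=6))   # characters widen the range
```

Because the result takes the wider of the two bounds in each direction, adding a character requirement can only grow the region, which is why larger context settings cost more on every keystroke.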

Automatic re-parsing happens on EVERY KEYSTROKE, so the amount of context which must be examined is very critical to typing efficiency. The more complicated your patterns, the more critical the context becomes. To cover all of the keywords in a typical language, without affecting the maximum rate at which users can enter text, you may be limited to just a few lines and/or a few hundred characters of context.

The default context distance is 1 line, with no minimum character requirement. There are several benefits to sticking with this default. One is simply that it is easy to understand and to comply with. Regular expression notation is designed around single line matching. To span lines in a regular expression, you must explicitly mention the newline character "\n", and matches which are restricted to a single line are virtually immune to lock-ups. Also, if you can code your patterns to work within a single line of context, without an additional character-range context requirement, the parser can take advantage of the fact that patterns don't cross line boundaries, and nearly double its efficiency over a one-line and 1-character context requirement. (In a single line context, you are allowed to match newlines, but only as the first and/or last character.)