Highlighting Rules

The highlighting rules specify to Ace (Cloud9's editor) how to color the syntax of the language of your mode.

Defining Syntax Highlighting Rules

The Ace highlighter can be considered to be a state machine. Regular expressions define the tokens for the current state, as well as the transitions into another state. Let's define mynew_highlight_rules.js, which our mode uses.

All syntax highlighters start off looking something like this:

define(function(require, exports, module) {
"use strict";

var oop = require("../lib/oop");
var TextHighlightRules = require("ace/mode/text_highlight_rules").TextHighlightRules;

var MyNewHighlightRules = function() {
    // regexp must not have capturing parentheses. Use (?:) instead.
    // regexps are ordered -> the first match is used
    this.$rules = {
        "start" : [
            {
                token: <token>, // String, Array, or Function: the CSS token to apply
                regex: <regex>, // String or RegExp: the regexp to match
                next:  <next>   // [Optional] String: next state to enter
            }
        ]
    };
};

oop.inherits(MyNewHighlightRules, TextHighlightRules);
exports.MyNewHighlightRules = MyNewHighlightRules;
});

The token state machine operates on whatever is defined in this.$rules. The highlighter always begins at the start state, and progresses down the list, looking for a matching regex. When one is found, the resulting text is wrapped within a <span class="ace_<token>"> tag, where <token> is defined as the token property. Note that all tokens are preceded by the ace_ prefix when they're rendered on the page.
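The "first match is used" behaviour can be sketched outside of Ace. The snippet below is a simplified stand-in for a single-state tokenizer, not Ace's actual implementation; the rule list, the token names, and the tokenizeLine helper are all hypothetical.

```javascript
// Simplified single-state tokenizer sketch; NOT Ace's real code.
// startRules and tokenizeLine are hypothetical names used for illustration.
const startRules = [
    { token: "keyword", regex: /var\b/ },
    { token: "identifier", regex: /[a-zA-Z_]\w*/ },
    { token: "text", regex: /\s+/ }
];

function tokenizeLine(line, rules) {
    const tokens = [];
    let pos = 0;
    while (pos < line.length) {
        let matched = false;
        for (const rule of rules) {
            // The sticky flag anchors the match at exactly `pos`.
            const re = new RegExp(rule.regex.source, "y");
            re.lastIndex = pos;
            const m = re.exec(line);
            if (m && m[0].length > 0) {
                tokens.push({ type: rule.token, value: m[0] });
                pos += m[0].length;
                matched = true;
                break; // rules are ordered: the first match wins
            }
        }
        if (!matched) {
            // No rule matched here: emit one character as plain text.
            tokens.push({ type: "text", value: line[pos] });
            pos += 1;
        }
    }
    return tokens;
}

console.log(tokenizeLine("var x", startRules));
console.log(tokenizeLine("variable", startRules));
```

Both the keyword rule and the identifier rule could match the text var, but the keyword rule is listed first, so it wins. For variable, the keyword rule fails on the word boundary and the identifier rule applies instead.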

Once again, we're inheriting from TextHighlightRules here. If our new language builds on a syntax that is already defined, we could inherit from that language's highlight rules instead. For more information on extending languages, see Extending Highlighters below.

Defining Tokens

The Ace highlighting system is heavily inspired by the TextMate language grammar. Most tokens follow the TextMate conventions for naming grammars. A thorough (albeit incomplete) list of tokens can be found on the Ace Wiki.

For the complete list of tokens, see tool/tmtheme.js. It is possible to add new token names, but the scope of that knowledge is outside of this document.

Multiple tokens can be applied to the same text by adding dots in the token, e.g. token: support.function wraps the text in a <span class="ace_support ace_function"> tag.
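As a rough illustration of that mapping, the helper below turns a dotted token name into the class list described above. The function names tokenToClassName and wrapInSpan are hypothetical, and the sketch skips HTML escaping, so it is not how Ace itself renders tokens.

```javascript
// Hypothetical helpers showing how a dotted token becomes CSS classes.
function tokenToClassName(token) {
    // "support.function" -> "ace_support ace_function"
    return token.split(".").map(part => "ace_" + part).join(" ");
}

function wrapInSpan(token, text) {
    // No HTML escaping here; a real renderer would need it.
    return '<span class="' + tokenToClassName(token) + '">' + text + "</span>";
}

console.log(wrapInSpan("support.function", "print"));
// <span class="ace_support ace_function">print</span>
```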

Defining Regular Expressions

Regular expressions can either be a RegExp or String definition.

If you're using a RegExp literal, remember to start and end the expression with the / character, like this:

{
    token : "constant.language.escape",
    regex : /\$[\w\d]+/
}

A caveat of string-based regular expressions is that every \ character must itself be escaped. That means that even an innocuous regular expression like this:

regex: "function\s*\(\w+\)"

must actually be written like this:

regex: "function\\s*\\(\\w+\\)"
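The effect of the escaping can be checked in plain JavaScript, independently of Ace. The variable names below are only for illustration.

```javascript
// A RegExp literal and a fully escaped string describe the same pattern.
const literal = /function\s*\(\w+\)/;
const fromString = new RegExp("function\\s*\\(\\w+\\)");

// With single backslashes, the string literal silently drops them.
const underEscaped = "function\s*\(\w+\)";

console.log(fromString.source === literal.source); // true
console.log(underEscaped);                         // functions*(w+)
console.log(fromString.test("function (arg)"));    // true
console.log(new RegExp(underEscaped).test("function (arg)")); // false
```

The under-escaped string still compiles to a valid regular expression, so the mistake does not raise an error; it simply matches the wrong text.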

Groupings

The regular expression matches the part of the code that should be styled by the token. You can use flat regexps (var) or capturing groups ((a+)(b+)). There is a strict requirement that capturing groups must cover the entire matched string; thus, (hel)lo is invalid. If you want to create a non-capturing group, simply start the group with ?:; thus, (hel)(?:lo) is okay. Non-capturing groups can also be longer. For example:

{
    token : "constant.language.boolean",
    regex : /(?:true|false)\b/
},

For flat regular expression matches, token can be a String, or a Function that takes a single argument (the match) and returns a string token. For example, using a function might look like this:

var lang = require("../lib/lang");

var colors = lang.arrayToMap(
    ("aqua|black|blue|fuchsia|gray|green|lime|maroon|navy|olive|orange|" +
    "purple|red|silver|teal|white|yellow").split("|")
);

var fonts = lang.arrayToMap(
    ("arial|century|comic|courier|garamond|georgia|helvetica|impact|lucida|" +
    "symbol|system|tahoma|times|trebuchet|utopia|verdana|webdings|sans-serif|" +
    "serif|monospace").split("|")
);
...
{
    token: function(value) {
        if (colors.hasOwnProperty(value.toLowerCase())) {
            return "support.constant.color";
        }
        else if (fonts.hasOwnProperty(value.toLowerCase())) {
            return "support.constant.fonts";
        }
        else {
            return "text";
        }
    },
    regex: "\\-?[a-zA-Z_][a-zA-Z0-9_\\-]*"
}

If token is a function, it should take the same number of arguments as there are groups, and return an array of tokens.

For grouped regular expressions, token can be a String, in which case all matched groups are given that same token, like this:

{
    token: "identifier",
    regex: "(\\w+\\s*:)(\\w*)"
}

More commonly, though, token is an Array with one entry per group, and each group is given the token at the same position in the array. For a complicated regular expression, like defining a function, that might look something like this:

{
    token : ["storage.type", "text", "entity.name.function"],
    regex : "(function)(\\s+)([a-zA-Z_][a-zA-Z0-9_]*\\b)"
}
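To make the pairing between groups and tokens concrete, here is a small sketch that applies a rule's token to the captured groups, whether the token is a String, an Array, or a Function. The applyToken helper and the pairRule example are hypothetical and only approximate what the tokenizer does.

```javascript
// Hypothetical helper: pair each captured group with its token.
// A simplified sketch, not Ace's actual tokenizer code.
function applyToken(rule, text) {
    // The non-capturing wrapper leaves the group numbering unchanged.
    const match = new RegExp("^(?:" + rule.regex + ")").exec(text);
    if (!match) return [];
    const groups = match.slice(1);
    let types = rule.token;
    // A function token receives one argument per group.
    if (typeof types === "function") types = types.apply(null, groups);
    // A string token is applied to every group.
    if (typeof types === "string") types = groups.map(() => types);
    return groups
        .map((value, i) => ({ type: types[i], value: value }))
        .filter(token => token.value); // skip groups that matched nothing
}

const functionRule = {
    token: ["storage.type", "text", "entity.name.function"],
    regex: "(function)(\\s+)([a-zA-Z_][a-zA-Z0-9_]*\\b)"
};

// Hypothetical rule whose token function returns one token per group.
const pairRule = {
    token: function(key, value) {
        return ["variable", /^\d+$/.test(value) ? "constant.numeric" : "string"];
    },
    regex: "(\\w+\\s*:)(\\w*)"
};

console.log(applyToken(functionRule, "function foo"));
console.log(applyToken(pairRule, "width:100"));
```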

Defining States

The syntax highlighting state machine stays in the start state, until you define a next state for it to advance to. At that point, the tokenizer stays in that new state, until it advances to another state. Afterwards, you should return to the original start state.

Here's an example:

this.$rules = {
    "start" : [ {
        token : "text",
        regex : "<\\!\\[CDATA\\[",
        next : "cdata"
    } ],

    "cdata" : [ {
        token : "text",
        regex : "\\]\\]>",
        next : "start"
    }, {
        defaultToken : "text"
    } ]
};

In this extremely short sample, we're defining some highlighting rules for when Ace detects a <![CDATA[ tag. When one is encountered, the tokenizer moves from start into the cdata state. It remains there, applying the text token to any string it encounters. Finally, when it hits a closing ]]> symbol, it returns to the start state and continues to tokenize anything else.
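The transitions can be traced with a small stand-alone state machine. This is a simplified sketch rather than Ace's tokenizer: the tokenize helper is hypothetical, and the token names are changed to keyword and string (instead of text everywhere) purely so the state changes are visible in the output.

```javascript
// Simplified state-machine sketch; NOT Ace's real tokenizer.
// Token names differ from the sample above so the states are visible.
const rules = {
    "start": [ {
        token: "keyword",
        regex: "<\\!\\[CDATA\\[",
        next: "cdata"
    } ],
    "cdata": [ {
        token: "keyword",
        regex: "\\]\\]>",
        next: "start"
    }, {
        defaultToken: "string"
    } ]
};

function tokenize(line, rules, startState) {
    let state = startState || "start";
    const tokens = [];
    let pos = 0;
    function push(type, value) {
        const last = tokens[tokens.length - 1];
        if (last && last.type === type) last.value += value; // merge neighbours
        else tokens.push({ type: type, value: value });
    }
    while (pos < line.length) {
        let fallback = "text";
        let matched = false;
        for (const rule of rules[state]) {
            if (rule.defaultToken) { fallback = rule.defaultToken; continue; }
            const re = new RegExp(rule.regex, "y"); // match exactly at pos
            re.lastIndex = pos;
            const m = re.exec(line);
            if (m && m[0].length > 0) {
                push(rule.token, m[0]);
                pos += m[0].length;
                if (rule.next) state = rule.next; // state transition
                matched = true;
                break;
            }
        }
        if (!matched) { push(fallback, line[pos]); pos += 1; }
    }
    return { tokens: tokens, state: state };
}

const oneLine = tokenize("a <![CDATA[ x ]]> b", rules);
console.log(oneLine.tokens, oneLine.state);

// The end state of one line is the start state of the next.
const opened = tokenize("<![CDATA[ open", rules);
const continued = tokenize("still inside", rules, opened.state);
console.log(opened.state, continued.tokens);
```

Passing the end state of one line as the start state of the next is what lets a construct such as a CDATA section span several lines.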

Extending Highlighters

Suppose you're working on a LuaPage, PHP embedded in HTML, or a Django template. You'll need to create a syntax highlighter that takes all the rules from the original language (Lua, PHP, or Python) and extends it with some additional identifiers (<?lua, <?php, {%, for example). Ace allows you to easily extend a highlighter using a few helper functions.

Getting Existing Rules

To get the existing syntax highlighting rules for a particular language, use the getRules() function. For example:

var HtmlHighlightRules = require("./html_highlight_rules").HtmlHighlightRules;

this.$rules = new HtmlHighlightRules().getRules();

/*
    this.$rules == Same this.$rules as HTML highlighting
*/

Extending a Highlighter

The addRules() method does one thing, and it does one thing well: it adds new rules to an existing rule set, and prefixes any state with a given tag. For example, let's say you've got two sets of rules, defined like this:

this.$rules = {
    "start": [ /* ... */ ]
};

var newRules = {
    "start": [ /* ... */ ]
};

If you want to incorporate newRules into this.$rules, you'd do something like this:

this.addRules(newRules, "new-");

/*
    this.$rules = {
        "start": [ ... ],
        "new-start": [ ... ]
    };
*/
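The prefixing can be mimicked with a few lines of plain JavaScript. The addRules function below is a hypothetical stand-alone approximation, not the method Ace provides. It also prefixes each rule's next target so that transitions stay inside the added rule set, which is an assumption about the intended behaviour.

```javascript
// Hypothetical stand-alone approximation of addRules; not Ace's source.
function addRules(target, rules, prefix) {
    for (const name of Object.keys(rules)) {
        target[prefix + name] = rules[name].map(rule => {
            const copy = Object.assign({}, rule);
            // Assumption: transitions should point at the prefixed states.
            if (typeof copy.next === "string" && !copy.next.startsWith(prefix)) {
                copy.next = prefix + copy.next;
            }
            return copy;
        });
    }
    return target;
}

const existing = {
    "start": [ { token: "text", regex: "\\s+" } ]
};

const newRules = {
    "start": [ { token: "string", regex: '"', next: "qstring" } ],
    "qstring": [
        { token: "string", regex: '"', next: "start" },
        { defaultToken: "string" }
    ]
};

addRules(existing, newRules, "new-");
console.log(Object.keys(existing));          // [ 'start', 'new-start', 'new-qstring' ]
console.log(existing["new-start"][0].next);  // new-qstring
```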

Extending Two Highlighters

The last function available to you combines both of these concepts, and it's called embedRules. It takes three parameters:

  1. An existing rule set to embed
  2. A prefix to apply for each state in the existing rule set
  3. A set of new states to add

Like addRules, embedRules adds on to the existing this.$rules object.

To explain this visually, let's take a look at the syntax highlighter for Lua pages, which combines all of these concepts:

var HtmlHighlightRules = require("./html_highlight_rules").HtmlHighlightRules;
var LuaHighlightRules = require("./lua_highlight_rules").LuaHighlightRules;

var LuaPageHighlightRules = function() {
    this.$rules = new HtmlHighlightRules().getRules();

    for (var i in this.$rules) {
        this.$rules[i].unshift({
            token: "keyword",
            regex: "<\\%\\=?",
            next: "lua-start"
        }, {
            token: "keyword",
            regex: "<\\?lua\\=?",
            next: "lua-start"
        });
    }
    this.embedRules(LuaHighlightRules, "lua-", [
        {
            token: "keyword",
            regex: "\\%>",
            next: "start"
        },
        {
            token: "keyword",
            regex: "\\?>",
            next: "start"
        }
    ]);
};

Here, this.$rules starts off as a set of HTML highlighting rules. To every state in this set, we add two new checks, one for <% (or <%=) and one for <?lua (or <?lua=). We also specify that if one of these rules is matched, the tokenizer moves on to the lua-start state. Next, embedRules takes the already existing set of LuaHighlightRules and applies the lua- prefix to each state there. Finally, it adds two new checks for %> and ?>, allowing the state machine to return to start.
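The resulting shape of the rule set can be approximated with a stand-alone sketch. Everything below is hypothetical: embedRules here is a plain function rather than the Ace method, pageRules and luaRules are tiny stand-ins for the real HTML and Lua rule sets, and placing the escape rules first in every embedded state is an assumption.

```javascript
// Hypothetical approximation of embedding one rule set into another.
function embedRules(target, embedded, prefix, escapeRules) {
    for (const name of Object.keys(embedded)) {
        const prefixed = embedded[name].map(rule => {
            const copy = Object.assign({}, rule);
            if (typeof copy.next === "string") copy.next = prefix + copy.next;
            return copy;
        });
        // Assumption: escape rules go first so they win over embedded rules.
        const escapes = escapeRules.map(rule => Object.assign({}, rule));
        target[prefix + name] = escapes.concat(prefixed);
    }
    return target;
}

// Tiny stand-ins for the HTML and Lua rule sets.
const pageRules = {
    "start": [
        { token: "keyword", regex: "<\\%\\=?", next: "lua-start" },
        { defaultToken: "text" }
    ]
};

const luaRules = {
    "start": [
        { token: "keyword", regex: "\\b(?:local|end)\\b" },
        { token: "comment", regex: "--\\[\\[", next: "comment" }
    ],
    "comment": [
        { token: "comment", regex: "\\]\\]", next: "start" },
        { defaultToken: "comment" }
    ]
};

embedRules(pageRules, luaRules, "lua-", [
    { token: "keyword", regex: "\\%>", next: "start" }
]);

console.log(Object.keys(pageRules));          // [ 'start', 'lua-start', 'lua-comment' ]
console.log(pageRules["lua-start"][0].next);  // start
console.log(pageRules["lua-start"][2].next);  // lua-comment
```

Note that the escape rule keeps its unprefixed next value, which is what lets the tokenizer leave the embedded language and return to the outer start state.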

Testing Your Highlighter

The best way to test your tokenizer is to see it live, right? To do that you'll want to create a new Cloud9 bundle and add your highlighter to it. See this guide for more information.

Adding Automated Tests

Automated tests for a highlighter are optional, but they are simple to add and can help during development.

In lib/ace/mode/_test create a file named

text_<modeName>.txt

with some example code. (You can skip this if the document you have added in demo/docs both looks good and covers various edge cases in your language syntax).

Run node highlight_rules_test.js -gen to save the current output of your tokenizer in tokens_<modeName>.json.

After this, running highlight_rules_test.js optionalLanguageName compares the output of your tokenizer with the reference output you generated.