[TIP] About testing Pygments lexers

Wed Aug 25 10:46:22 PDT 2010

On 8/25/10 9:47 AM, Olemis Lang wrote:
> Hello , I don't recall if I already asked this but I'm looking for a
> test lib able to detect if a Pygments lexer (e.g. using Regex lexer
> rules) will be stuck

By "stuck" do you mean an infinite loop, or an error token?  Off the top
of my head, the only case where an infinite loop can happen is if a
regex can match the empty string [although empty string to enter a new
state is used by a couple of lexers, and is probably safe].

> and also able to generate state & rule coverage
> by running tests

The best place to do this would be inside RegexLexer's
get_tokens_unprocessed method.  Although the code hasn't changed since
2007, IMO the best way to add coverage would be to replace the match
function ('rexmatch' there, self._tokens[...][...][0] more generally)
with your own function using a subclass:

def __init__(self, *args, **kwargs):
    super(RegexLexer, self).__init__()

def do_patch(self):
    def record(s, i, rexmatch):
        def closure(text):
            self.covered[s][i] = True
            return rexmatch(text)
        return closure

    self.covered = {}
    for statename, regexes in self._tokens.iteritems():
        self.covered[statename] = dict.fromkeys(range(len(regexes)))
        for i in range(len(regexes)):
            regexes[i][0] = record(statename, i, regexes[i][0])

def missed_coverage(self):
    def shorten(s):
        if len(s) < 20:
            return s
        return s[:20] + '...'

    missed = 0
    for statename in sorted(self._tokens):
        print "State", statename
        regexes = self._tokens[statename]
        for i in range(len(regexes)):
            if not self.covered[statename][i]:
                print "Missed", i, shorten(self.tokens[statename][i][0])
                missed += 1

    return missed

Your test function could do something like this (assuming stdout
capture, like nose does):

def test_lexer():
    text = "lambda x: 1+1\n"
    lex = PythonLexer()
    # ... potentially monkeypatch
    x = list(lex.get_tokens_unprocessed(text))
    assert x.missed_coverage() == 0

Tim