Basic lexical analyser for C99 code.
#!/usr/bin/python
from clexer import C99Lexer
lexer = C99Lexer()
print(lexer.tokenize("static unsigned int foo = bar++;"))
You should get your tokens as a list in the following format:
[('KW_STATIC', 'static'), ('KW_UNSIGNED', 'unsigned'), ('KW_INT', 'int'), ('IDENTIFIER', 'foo'), ('OP_ASSIGN', '='), ('IDENTIFIER', 'bar'), ('OP_INC', '++'), ('SYM_SEMICOLON', ';')]
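Language symbols are prefixed with SYM_, operators with OP_, and keywords with KW_. Keep in mind that context-dependent tokens (&, *, +, -) are prefixed with SYM_ rather than OP_, since their meaning (for example, * as multiplication or as a pointer dereference) depends on context that the lexer does not resolve.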
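Because the token type encodes its category in the prefix, tokens can be sorted into categories without a lookup table. The following is a minimal sketch that works on the sample output shown above; `group_by_prefix` is a hypothetical helper written for illustration and is not part of clexer.

```python
from collections import defaultdict

# Sample output copied from the example above.
tokens = [
    ('KW_STATIC', 'static'), ('KW_UNSIGNED', 'unsigned'), ('KW_INT', 'int'),
    ('IDENTIFIER', 'foo'), ('OP_ASSIGN', '='), ('IDENTIFIER', 'bar'),
    ('OP_INC', '++'), ('SYM_SEMICOLON', ';'),
]


def group_by_prefix(tokens):
    """Hypothetical helper: bucket token values by the prefix of their type."""
    groups = defaultdict(list)
    for token_type, value in tokens:
        # 'KW_STATIC' -> 'KW'; names without an underscore (IDENTIFIER) stay whole.
        prefix = token_type.split('_', 1)[0]
        groups[prefix].append(value)
    return dict(groups)


groups = group_by_prefix(tokens)
print(groups['KW'])   # ['static', 'unsigned', 'int']
print(groups['OP'])   # ['=', '++']
print(groups['SYM'])  # [';']
```

Note that a context-dependent token such as * would land in the SYM group, so any code that needs to treat it as an operator has to decide that from the surrounding tokens.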
If you want whitespace characters preserved in the token list, pass keep_whitespaces=True to the constructor:
lexer = C99Lexer(keep_whitespaces=True)
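One reason to keep whitespace is to be able to rebuild the original source text from the tokens. The sketch below illustrates the idea on a hand-written token list. It assumes that whitespace tokens appear as (type, value) pairs like all other tokens and that nothing else (comments, for instance) is dropped; the `WHITESPACE` type name and the `untokenize` helper are hypothetical and not taken from clexer.

```python
# Hand-written token list for illustration only. The 'WHITESPACE' type name
# is an assumption; check the lexer's real output for the actual name.
source = "int foo = bar;"
tokens = [
    ('KW_INT', 'int'), ('WHITESPACE', ' '), ('IDENTIFIER', 'foo'),
    ('WHITESPACE', ' '), ('OP_ASSIGN', '='), ('WHITESPACE', ' '),
    ('IDENTIFIER', 'bar'), ('SYM_SEMICOLON', ';'),
]


def untokenize(tokens):
    """Hypothetical helper: concatenate token values back into source text."""
    return ''.join(value for _, value in tokens)


rebuilt = untokenize(tokens)
print(rebuilt == source)
```

Without the whitespace tokens, the same join would run adjacent tokens together (for example `intfoo`), so the round trip only works when whitespace is kept.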