Since this is the first article, I'll outline the project structure for the C- compiler.
The project has a series of pretty typical compiler stages: lexing, parsing, semantic analysis, and code generation.
As far as possible, I'd like to keep each of these stages separate. One benefit of this is that it greatly simplifies memory management. I plan to use an arena allocator for each stage: by making sure the only thing on the actual heap is the stage's output, with all temporary data stored in the arena, I can free all the memory a stage used by simply freeing the arena.
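To make that concrete, here's a minimal sketch of the kind of arena I have in mind (arena_t, arena_alloc, and arena_reset are illustrative names, not code from this project):

#include <stddef.h>

typedef struct {
  char *base;  // Start of the arena's backing memory
  size_t used; // Bytes handed out so far
  size_t size; // Total capacity
} arena_t;

void *arena_alloc(arena_t *a, size_t n) {
  n = (n + 7) & ~(size_t)7; // Keep allocations 8-byte aligned
  if (a->used + n > a->size)
    return NULL; // Arena exhausted
  void *p = a->base + a->used;
  a->used += n;
  return p;
}

// Freeing every temporary in the stage is a single reset.
void arena_reset(arena_t *a) { a->used = 0; }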
Here are some rules (more like guidelines) that I plan to follow for this project; they're mostly just to keep things simple and consistent.
PROGRAM LIKE IT'S 1999
640 KB ought to be enough for anybody. - Bill Gates (apocryphal)
Maybe not that little, but I'm going to try to keep the project as simple as possible. 640 KB probably won't be enough, but I'll still aim for less than 10 MB of memory usage.
This places a lot of constraints on the project, but I think it's a good exercise in minimalism.
Some consequences of this are that I'll have to use memory-efficient algorithms, be very careful about program structure, and avoid some of the bigger libraries (which will also help with making this project self-hosting in the future).
PROGRAM IN C++--
I'm not a big fan of C++, but its class system helps prevent a lot of ugly bugs. To that end, I'm going to try to keep data structures out of header files, and only expose functions that operate on those data structures, to create a sort of approximation of a class. This has a few benefits: implementation details stay hidden behind a small interface, other modules can't quietly depend on internal state, and an implementation can be swapped out without touching its callers.
DON'T GET FANCY
My goal here isn't to write the fastest compiler in the world, or the most complete. I just want to make something that works and can be understood by someone else.
That means I'm going to avoid a lot of the tricks that are used in production compilers, and focus more on simplicity and readability.
DESIGN FOR DEBUGGING
This code is going to be peppered with asserts and contain mechanisms to print out the state of the program at any point.
This might be painful, but it'll make debugging a lot simpler and let users look under the hood.
SMART DATA, STUPID CODE
A lot of the time, the right data structure can replace 50-100 lines of procedural code. I'm going to try to design data structures that make the algorithms as simple as possible.
For example, instead of writing 50-100 lines of code to match every keyword in the language, I can just use a simple hash table.
THIS IS A LITERATE PROGRAM! Go to this link to see the file that generated this HTML.
A lexical analyzer reads source code and produces tokens, which are the smallest units of meaning in a language. These tokens are then used by the parser to build an abstract syntax tree (AST), which represents the structure of the program.
For example, in the C programming language, tokens include keywords (if, else, while, etc.), identifiers (variable names), numbers, and punctuation (braces, semicolons, etc.).
Given a string like int main() { return 0; }, the lexer would produce a series of tokens like INT, IDENTIFIER(main), LPAREN, RPAREN, LBRACE, RETURN, INTCONSTANT(0), SEMICOLON, RBRACE.
I'll break the lexer into several modules:
token.c will contain the token data structure and functions to create and destroy tokens. input.c will contain the input data structure and functions to read from the input file. tokenizer.c will contain the main lexer logic.
We'll need several components to represent a token: a kind, such as TOK_IF or TOK_INTEGER_U32; the line and column where it appears; and optional data, such as an integer value or a string.
To implement a class system in C, we'll hide the token implementation details behind an opaque pointer. We'll use a forward declaration of the token type in the header file and define the token type in the implementation file.
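In other words, token.h only forward-declares the type, so callers can hold token_t pointers but can never touch the fields. A sketch of the idea:

// token.h: callers see only an opaque handle.
struct token;
typedef struct token token_t;

// token.c: the full struct definition lives here (shown below),
// invisible to every other module.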
We'll need functions to create and destroy tokens.
token_t *token_create(c_token_types kind, int lin, int col, int len);
token_t *token_create_int(c_token_types kind, int lin, int col, int64_t i, int len);
token_t *token_create_float(c_token_types kind, int lin, int col, double f, int len);
token_t *token_create_char(c_token_types kind, int lin, int col, char c, int len);
token_t *token_create_string(c_token_types kind, int lin, int col, const char *s, int len);
void token_destroy(token_t *token);
Used in section 41
We'll also need functions to access the token data.
c_token_types token_type(token_t *token);
int64_t token_int(token_t *token);
double token_float(token_t *token);
const char *token_string(token_t *token);
char token_char(token_t *token);
int token_line(token_t *token);
int token_column(token_t *token);
void print_token(token_t *tok);
We'll need types to represent the different kinds of tokens.
typedef enum {
  // Control Keywords
  TOK_IF, TOK_ELSE, TOK_SWITCH, TOK_CASE, TOK_DEFAULT, TOK_WHILE, TOK_DO,
  TOK_FOR, TOK_CONTINUE, TOK_BREAK, TOK_RETURN, TOK_GOTO,
  // Type Keywords
  TOK_VOID, TOK_CHAR, TOK_SHORT, TOK_INT, TOK_LONG, TOK_FLOAT, TOK_DOUBLE,
  TOK_SIGNED, TOK_UNSIGNED, TOK_STRUCT, TOK_UNION, TOK_ENUM, TOK_TYPEDEF,
  // Storage Class/Specifier Keywords
  TOK_AUTO, TOK_REGISTER, TOK_STATIC, TOK_EXTERN, TOK_CONST, TOK_VOLATILE,
  // Misc Keywords
  TOK_SIZEOF,
  // Operators
  TOK_ADD,            // +
  TOK_SUB,            // -
  TOK_MUL,            // *
  TOK_DIV,            // /
  TOK_MOD,            // %
  TOK_BIT_AND,        // &
  TOK_BIT_OR,         // |
  TOK_BIT_XOR,        // ^
  TOK_BIT_NOT,        // ~
  TOK_LSHIFT,         // <<
  TOK_RSHIFT,         // >>
  TOK_NOT,            // !
  TOK_ASSIGN,         // =
  TOK_LT,             // <
  TOK_GT,             // >
  TOK_INC,            // ++
  TOK_DEC,            // --
  TOK_EQ,             // ==
  TOK_NE,             // !=
  TOK_LE,             // <=
  TOK_GE,             // >=
  TOK_AND,            // &&
  TOK_OR,             // ||
  TOK_MEMBER_POINTER, // ->
  TOK_MEMBER,         // .
  TOK_COND_DECISION,  // :
  TOK_COND,           // ?
  TOK_ASSIGN_ADD,     // +=
  TOK_ASSIGN_SUB,     // -=
  TOK_ASSIGN_MUL,     // *=
  TOK_ASSIGN_DIV,     // /=
  TOK_ASSIGN_MOD,     // %=
  TOK_ASSIGN_BITAND,  // &=
  TOK_ASSIGN_BITOR,   // |=
  TOK_ASSIGN_BITXOR,  // ^=
  TOK_ASSIGN_LSHIFT,  // <<=
  TOK_ASSIGN_RSHIFT,  // >>=
  // Separators
  TOK_LEFT_PAREN,    // (
  TOK_RIGHT_PAREN,   // )
  TOK_LEFT_BRACKET,  // [
  TOK_RIGHT_BRACKET, // ]
  TOK_LEFT_BRACE,    // {
  TOK_RIGHT_BRACE,   // }
  TOK_COMMA,         // ,
  TOK_SEMICOLON,     // ;
  TOK_DOT,           // .
  TOK_ELLIPSIS,      // ...
  TOK_HASH,          // #
  // Identifiers
  TOK_ID, TOK_TYPEDEF_NAME,
  // Constants
  TOK_INTEGER_U32,  // u
  TOK_INTEGER_U64,  // ul
  TOK_INTEGER_S32,  // (no suffix)
  TOK_INTEGER_S64,  // l
  TOK_FLOAT_32,     // f
  TOK_FLOAT_64,     // (no suffix)
  TOK_CHAR_CONST,   // 'c'
  TOK_STRING_ASCII, // "string" (width of 8 bits)
  // Special
  TOK_EOF, TOK_ERROR,
} c_token_types;
We bring this all together in token.h. Line and column are exposed as global variables because skip_whitespace will need to update them.
Now that we have the interface, we can implement the token data structure. Beyond the public fields, we'll need a way to detect corruption and a place to put optional data. To verify the token isn't corrupt, we'll use a magic tag: TOK_MAGIC_1 marks a token with optional data, and TOK_MAGIC_2 marks a token without optional data.
A zero-length array is used in the token data structure. This GCC extension allows us to allocate memory for the token data structure and the token data in one allocation.
#define TOK_MAGIC_1 0x544F4B454E544F4Bul // "TOKENTOK"
#define TOK_MAGIC_2 0x544F4B544F4B454Eul // "TOKTOKEN"

struct token {
  long magic;
  int line;
  int column;
  short kind;
  long opt_data[0];
};
typedef struct token token_t;

struct token_data {
  union {
    int64_t i;
    double f;
    const char *s;
    char c;
  } data;
};
typedef struct token_data token_data_t;

int column = 1;
int line = 1;
Used in section 42
We'll implement an interface for accessing the token data and a macro for accessing optional data.
#define token_data(token) ((struct token_data *)((token)->opt_data))

c_token_types token_type(token_t *token) {
  assert(token->magic == TOK_MAGIC_1 || token->magic == TOK_MAGIC_2);
  return token->kind;
}

int64_t token_int(token_t *token) {
  assert(token->kind == TOK_INTEGER_U32 || token->kind == TOK_INTEGER_U64 ||
         token->kind == TOK_INTEGER_S32 || token->kind == TOK_INTEGER_S64);
  assert(token->magic == TOK_MAGIC_1);
  return token_data(token)->data.i;
}

double token_float(token_t *token) {
  assert(token->kind == TOK_FLOAT_32 || token->kind == TOK_FLOAT_64);
  assert(token->magic == TOK_MAGIC_1);
  return token_data(token)->data.f;
}

const char *token_string(token_t *token) {
  assert(token->kind == TOK_STRING_ASCII || token->kind == TOK_ID ||
         token->kind == TOK_TYPEDEF_NAME);
  assert(token->magic == TOK_MAGIC_1);
  return token_data(token)->data.s;
}

char token_char(token_t *token) {
  assert(token->kind == TOK_CHAR_CONST);
  assert(token->magic == TOK_MAGIC_1);
  return token_data(token)->data.c;
}

int token_line(token_t *token) {
  assert(token->magic == TOK_MAGIC_1 || token->magic == TOK_MAGIC_2);
  return token->line;
}

int token_column(token_t *token) {
  assert(token->magic == TOK_MAGIC_1 || token->magic == TOK_MAGIC_2);
  return token->column;
}
Used in section 42
For debugging, we'll add a function to print the token type.
This function returns a string with the token type name.
const char *token_name_from_type(c_token_types type) { switch (type) { case TOK_IF: return "TOK_IF"; case TOK_ELSE: return "TOK_ELSE"; case TOK_SWITCH: return "TOK_SWITCH"; case TOK_CASE: return "TOK_CASE"; case TOK_DEFAULT: return "TOK_DEFAULT"; case TOK_WHILE: return "TOK_WHILE"; case TOK_DO: return "TOK_DO"; case TOK_FOR: return "TOK_FOR"; case TOK_CONTINUE: return "TOK_CONTINUE"; case TOK_BREAK: return "TOK_BREAK"; case TOK_RETURN: return "TOK_RETURN"; case TOK_GOTO: return "TOK_GOTO"; case TOK_VOID: return "TOK_VOID"; case TOK_CHAR: return "TOK_CHAR"; case TOK_SHORT: return "TOK_SHORT"; case TOK_INT: return "TOK_INT"; case TOK_LONG: return "TOK_LONG"; case TOK_FLOAT: return "TOK_FLOAT"; case TOK_DOUBLE: return "TOK_DOUBLE"; case TOK_SIGNED: return "TOK_SIGNED"; case TOK_UNSIGNED: return "TOK_UNSIGNED"; case TOK_STRUCT: return "TOK_STRUCT"; case TOK_UNION: return "TOK_UNION"; case TOK_ENUM: return "TOK_ENUM"; case TOK_TYPEDEF: return "TOK_TYPEDEF"; case TOK_AUTO: return "TOK_AUTO"; case TOK_REGISTER: return "TOK_REGISTER"; case TOK_STATIC: return "TOK_STATIC"; case TOK_EXTERN: return "TOK_EXTERN"; case TOK_CONST: return "TOK_CONST"; case TOK_VOLATILE: return "TOK_VOLATILE"; case TOK_SIZEOF: return "TOK_SIZEOF"; case TOK_ADD: return "TOK_ADD"; case TOK_SUB: return "TOK_SUB"; case TOK_MUL: return "TOK_MUL"; case TOK_DIV: return "TOK_DIV"; case TOK_MOD: return "TOK_MOD"; case TOK_BIT_AND: return "TOK_BIT_AND"; case TOK_BIT_OR: return "TOK_BIT_OR"; case TOK_BIT_XOR: return "TOK_BIT_XOR"; case TOK_BIT_NOT: return "TOK_BIT_NOT"; case TOK_LSHIFT: return "TOK_LSHIFT"; case TOK_RSHIFT: return "TOK_RSHIFT"; case TOK_NOT: return "TOK_NOT"; case TOK_ASSIGN: return "TOK_ASSIGN"; case TOK_LT: return "TOK_LT"; case TOK_GT: return "TOK_GT"; case TOK_INC: return "TOK_INC"; case TOK_DEC: return "TOK_DEC"; case TOK_EQ: return "TOK_EQ"; case TOK_NE: return "TOK_NE"; case TOK_LE: return "TOK_LE"; case TOK_GE: return "TOK_GE"; case TOK_AND: return "TOK_AND"; case TOK_OR: return "TOK_OR"; case TOK_MEMBER_POINTER: return "TOK_MEMBER_POINTER"; case TOK_MEMBER: return "TOK_MEMBER"; case TOK_COND_DECISION: return "TOK_COND_DECISION"; case TOK_COND: return "TOK_COND"; case TOK_ASSIGN_ADD: return "TOK_ASSIGN_ADD"; case TOK_ASSIGN_SUB: return "TOK_ASSIGN_SUB"; case TOK_ASSIGN_MUL: return "TOK_ASSIGN_MUL"; case TOK_ASSIGN_DIV: return "TOK_ASSIGN_DIV"; case TOK_ASSIGN_MOD: return "TOK_ASSIGN_MOD"; case TOK_ASSIGN_BITAND: return "TOK_ASSIGN_BITAND"; case TOK_ASSIGN_BITOR: return "TOK_ASSIGN_BITOR"; case TOK_ASSIGN_BITXOR: return "TOK_ASSIGN_BITXOR"; case TOK_ASSIGN_LSHIFT: return "TOK_ASSIGN_LSHIFT"; case TOK_ASSIGN_RSHIFT: return "TOK_ASSIGN_RSHIFT"; case TOK_HASH: return "TOK_HASH"; case TOK_ID: return "TOK_ID"; case TOK_TYPEDEF_NAME: return "TOK_TYPEDEF_NAME"; case TOK_INTEGER_U32: return "TOK_INTEGER_U32"; case TOK_INTEGER_U64: return "TOK_INTEGER_U64"; case TOK_INTEGER_S32: return "TOK_INTEGER_S32"; case TOK_INTEGER_S64: return "TOK_INTEGER_S64"; case TOK_FLOAT_32: return "TOK_FLOAT_32"; case TOK_FLOAT_64: return "TOK_FLOAT_64"; case TOK_CHAR_CONST: return "TOK_CHAR_CONST"; case TOK_STRING_ASCII: return "TOK_STRING_ASCII"; case TOK_EOF: return "TOK_EOF"; case TOK_ERROR: return "TOK_ERROR"; case TOK_LEFT_PAREN: return "TOK_LEFT_PAREN"; case TOK_RIGHT_PAREN: return "TOK_RIGHT_PAREN"; case TOK_LEFT_BRACKET: return "TOK_LEFT_BRACKET"; case TOK_RIGHT_BRACKET: return "TOK_RIGHT_BRACKET"; case TOK_LEFT_BRACE: return "TOK_LEFT_BRACE"; case TOK_RIGHT_BRACE: return "TOK_RIGHT_BRACE"; case TOK_COMMA: return "TOK_COMMA"; 
case TOK_SEMICOLON: return "TOK_SEMICOLON"; case TOK_DOT: return "TOK_DOT"; case TOK_ELLIPSIS: return "TOK_ELLIPSIS"; } return "UNKNOWN"; }
Used in section 14
This function adds escape characters to a string for printing.
#define clamp(x, min, max) ((x) < (min) ? (min) : (x) > (max) ? (max) : (x))

char *re_escape_string(const char *str) {
  int len = strlen(str);
  // Worst case each byte becomes a four-character \xNN escape.
  char *buf = malloc(len * 4 + 1);
  if (!buf) {
    fprintf(stderr, "Out of memory. Cannot escape string\n");
    exit(1);
  }
  int i = 0;
  for (int j = 0; j < len; j++) {
    switch (str[j]) {
    case '\a': buf[i++] = '\\'; buf[i++] = 'a'; break;
    case '\b': buf[i++] = '\\'; buf[i++] = 'b'; break;
    case '\f': buf[i++] = '\\'; buf[i++] = 'f'; break;
    case '\n': buf[i++] = '\\'; buf[i++] = 'n'; break;
    case '\r': buf[i++] = '\\'; buf[i++] = 'r'; break;
    case '\t': buf[i++] = '\\'; buf[i++] = 't'; break;
    case '\v': buf[i++] = '\\'; buf[i++] = 'v'; break;
    case '\\': buf[i++] = '\\'; buf[i++] = '\\'; break;
    case '\'': buf[i++] = '\\'; buf[i++] = '\''; break;
    case '"': buf[i++] = '\\'; buf[i++] = '"'; break;
    default: {
      unsigned char uc = (unsigned char)str[j]; // Avoid negative shifts on signed char
      if (isprint(uc)) {
        buf[i++] = str[j];
      } else {
        buf[i++] = '\\';
        buf[i++] = 'x';
        buf[i++] = "0123456789abcdef"[clamp(uc >> 4, 0, 0xf)];
        buf[i++] = "0123456789abcdef"[clamp(uc & 0xf, 0, 0xf)];
      }
    }
    }
  }
  buf[i] = '\0';
  return buf;
}
Used in section 14
This function prints the token type and value.
void print_token(token_t *tok) { if (!tok) { printf("NULL\n"); return; } const char *name = token_name_from_type(tok->kind); switch (tok->kind) { case TOK_ID: case TOK_STRING_ASCII: { char *escaped = re_escape_string(token_string(tok)); printf("%s: \"%s\"@%d:%d\n", name, escaped, tok->line, tok->column); free(escaped); break; } case TOK_TYPEDEF_NAME: { char *escaped = re_escape_string(token_string(tok)); printf("%s: %s@%d:%d\n", name, escaped, tok->line, tok->column); free(escaped); break; } case TOK_CHAR_CONST: printf("%s: '%c'@%d:%d\n", name, token_char(tok), tok->line, tok->column); break; case TOK_INTEGER_S32: case TOK_INTEGER_U32: case TOK_INTEGER_S64: case TOK_INTEGER_U64: printf("%s: %ld@%d:%d\n", name, token_int(tok), tok->line, tok->column); break; case TOK_FLOAT_32: case TOK_FLOAT_64: printf("%s: %f@%d:%d\n", name, token_float(tok), tok->line, tok->column); break; default: printf("%s@%d:%d\n", name, tok->line, tok->column); break; } }
Used in section 14
Now we can implement functions to create and destroy tokens. We'll start with the easy ones.
token_t *token_data_create(c_token_types kind, int lin, int col, int len) { token_t *token = malloc(sizeof(token_t) + sizeof(struct token_data)); if (!token) { fputs("Out of memory\n", stderr); exit(1); } token->magic = TOK_MAGIC_1; token->line = lin; token->column = col; column += len; token->kind = kind; return token; } token_t *token_create(c_token_types kind, int lin, int col, int len) { token_t *token = malloc(sizeof(token_t)); if (!token) { fputs("Out of memory\n", stderr); exit(1); } token->magic = TOK_MAGIC_2; token->line = lin; token->column = col; column += len; token->kind = kind; return token; } token_t *token_create_int(c_token_types kind, int lin, int col, int64_t i, int len) { token_t *token = token_data_create(kind, lin, col, len); token_data(token)->data.i = i; return token; } token_t *token_create_float(c_token_types kind, int lin, int col, double f, int len) { token_t *token = token_data_create(kind, lin, col, len); token_data(token)->data.f = f; return token; } token_t *token_create_char(c_token_types kind, int lin, int col, char c, int len) { token_t *token = token_data_create(kind, lin, col, len); token_data(token)->data.c = c; return token; } void token_destroy(token_t *token) { if (token->magic == TOK_MAGIC_1 || token->magic == TOK_MAGIC_2) { free(token); } else { fputs("Corrupt token\n", stderr); exit(1); } }
token_create_string can be implemented either the easy way or the right way. Let's try the easy way first.
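The easy-way chunk isn't shown here, but it would be little more than a strdup into the token's optional data; a sketch, reusing token_data_create from above:

token_t *token_create_string(c_token_types kind, int lin, int col,
                             const char *s, int len) {
  token_t *token = token_data_create(kind, lin, col, len);
  token_data(token)->data.s = strdup(s); // One fresh copy per call
  return token;
}

Note that token_destroy never frees that copy, so in this version every string effectively lives forever.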
There's an issue with this approach. token_create_string will be called for every identifier and every string in a program. Imagine a large program, say a shell, with a bunch of user input and output. That program will likely have 20-40 calls each to fprintf, fscanf, strchr, and strtok. We create a new string for each of those calls. That's a lot of duplicates, and it can quickly add up to a lot of memory usage.
To fix this, we use a hash table to store the strings. We'll define a hash table in hash_table.h and hash_table.c.
A hash table is a data structure that maps keys to values. It's commonly used for implementing symbol tables to store variables and functions. To create a generic hash table, we'll need an array of buckets, a hash function, a key-comparison function, and an optional destructor for keys and values.
Let's start with the interface:
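The interface chunk isn't reproduced here; based on how the implementation below uses it, hash_table.h would look roughly like this:

#ifndef HASH_TABLE_H
#define HASH_TABLE_H

// Callbacks supplied by the user of the table.
typedef int (*hash_table_cmp_fn)(const void *key1, const void *key2);
typedef unsigned int (*hash_table_hash_fn)(const void *key);
typedef void (*hash_table_dtor)(void *value, int is_key);

typedef struct hash_table hash_table_t;

hash_table_t *hash_table_create(int size, hash_table_cmp_fn cmp,
                                hash_table_hash_fn hash, hash_table_dtor dtor);
void hash_table_destroy(hash_table_t *table);
void *hash_table_get(const hash_table_t *table, const void *key);
void hash_table_put(hash_table_t *table, void *key, void *value);
void hash_table_remove(hash_table_t *table, const void *key);

#endif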
Now let's implement the hash table:
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include "hash_table.h"

{Hash Table Data Structure, 27}
{Hash Table Entry Data Structure, 28}

hash_table_t *hash_table_create(int size, hash_table_cmp_fn cmp,
                                hash_table_hash_fn hash, hash_table_dtor dtor) {
  {Allocate and Initialize Hash Table, 29}
  return table;
}

void hash_table_destroy(hash_table_t *table) {
  if (!table)
    return;
  {Destroy Entries, 30}
  free(table->entries);
  free(table);
}

void *hash_table_get(const hash_table_t *table, const void *key) {
  if (!table || !key)
    return NULL;
  {Get Entry By Hash, 31}
  {Loop Through Entries and Return Value if Match, 35}
  return NULL;
}

void hash_table_put(hash_table_t *table, void *key, void *value) {
  if (!table || !key)
    return;
  {Get Entry By Hash, 31}
  {Loop Through Entries and Replace Value if Key Matches, 32}
  {Allocate New Entry if No Match, 33}
}

void hash_table_remove(hash_table_t *table, const void *key) {
  if (!table || !key)
    return;
  {Get Entry By Hash, 31}
  {Loop Through Entries and Remove Entry if Key Matches, 34}
}

#ifdef TEST_HASH_TABLE
#include <assert.h>

static int string_cmp(const void *key1, const void *key2) {
  return strcmp((const char *)key1, (const char *)key2);
}

static unsigned int string_hash(const void *key) {
  unsigned int hash = 5381;
  const unsigned char *str = key;
  int c;
  while ((c = *str++)) {
    hash = ((hash << 5) + hash) + c;
  }
  return hash;
}

int main(void) {
  hash_table_t *table = hash_table_create(16, string_cmp, string_hash, NULL);
  assert(table != NULL);
  hash_table_put(table, "foo", "bar");
  hash_table_put(table, "foo", "baz");
  assert(strcmp((const char *)hash_table_get(table, "foo"), "baz") == 0);
  hash_table_remove(table, "foo");
  assert(hash_table_get(table, "foo") == NULL);
  hash_table_destroy(table);
  printf("All tests passed!\n");
  return 0;
}
#endif
The hash table data structure contains an array of entry pointers, the size of the array, and function pointers for comparison, hashing, and destruction.
Each entry in the hash table contains a key, a value, and a pointer to the next entry in the chain for collision resolution via chaining.
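Neither structure chunk is shown here; from the way the code above uses them, they would look roughly like:

struct hash_table_entry {
  void *key;
  void *value;
  struct hash_table_entry *next; // Chain for collision resolution
};

struct hash_table {
  struct hash_table_entry **entries; // Array of bucket heads
  int size;                          // Number of buckets
  hash_table_cmp_fn cmp;
  hash_table_hash_fn hash;
  hash_table_dtor dtor;
};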
To allocate a hash table, we allocate memory for the table structure and its entries, initialize the entries to NULL, and set the function pointers.
hash_table_t *table = malloc(sizeof(struct hash_table));
if (!table) {
  fprintf(stderr, "Error: Out of memory, could not allocate hash table\n");
  exit(EXIT_FAILURE);
}
table->entries = calloc(size, sizeof(struct hash_table_entry *));
if (!table->entries) {
  fprintf(stderr, "Error: Out of memory, could not allocate hash table entries\n");
  free(table);
  exit(EXIT_FAILURE);
}
table->size = size;
table->cmp = cmp;
table->hash = hash;
table->dtor = dtor;
Used in section 26
To destroy the entries in a hash table, we iterate through all entries, free the keys and values using the destructor if provided, and free the entry itself.
for (int i = 0; i < table->size; i++) {
  struct hash_table_entry *entry = table->entries[i];
  while (entry) {
    struct hash_table_entry *next = entry->next;
    if (table->dtor) {
      table->dtor(entry->key, 1);
      table->dtor(entry->value, 0);
    }
    free(entry);
    entry = next;
  }
}
Used in section 26
To retrieve an entry's hash bucket, we apply the hash function to the key and take the modulus of the result with the table size.
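The chunk itself is tiny; a sketch consistent with how the surrounding code uses hash and entry:

unsigned int hash = table->hash(key) % table->size;
struct hash_table_entry *entry = table->entries[hash];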
When putting a new entry in the table, we first check if the key already exists. If it does, we update the value; otherwise, we create a new entry.
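The replace-if-present chunk isn't shown; a minimal sketch, assuming the old value should be destroyed when a new one takes its place:

while (entry) {
  if (table->cmp(entry->key, key) == 0) {
    // Key already present: destroy the old value and keep the new one.
    if (table->dtor)
      table->dtor(entry->value, 0);
    entry->value = value;
    return;
  }
  entry = entry->next;
}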
If no matching key is found, we create a new entry and insert it at the beginning of the hash bucket.
This can improve performance due to a property called temporal locality: an entry that was just accessed is likely to be accessed again soon, and entries at the front of the chain are the quickest to find.
struct hash_table_entry *new_entry = malloc(sizeof(struct hash_table_entry));
if (!new_entry) {
  fprintf(stderr, "Error: Out of memory, could not allocate hash table entry\n");
  exit(EXIT_FAILURE);
}
new_entry->key = key;
new_entry->value = value;
new_entry->next = table->entries[hash];
table->entries[hash] = new_entry;
Used in section 26
To remove an entry, we find its bucket, update the linked list to bypass it, then free the entry and its contents.
struct hash_table_entry *prev = NULL;
while (entry) {
  if (table->cmp(entry->key, key) == 0) {
    if (prev)
      prev->next = entry->next;
    else
      table->entries[hash] = entry->next;
    if (table->dtor) {
      table->dtor(entry->key, 1);
      table->dtor(entry->value, 0);
    }
    free(entry);
    return;
  }
  prev = entry;
  entry = entry->next;
}
Used in section 26
To retrieve a value from a given bucket, we just walk the list and return the value if a matching key is found.
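That lookup chunk would be just a few lines:

while (entry) {
  if (table->cmp(entry->key, key) == 0)
    return entry->value;
  entry = entry->next;
}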
We're now almost ready to implement token_create_string the right way. First, we'll need a good hash function.
Hash functions are a very interesting topic and there's a lot of good research on them. The hash function we use should be fast, have a low collision rate, and be able to handle strings of any length.
We can't just sum the characters in a string, because then "stop" and "pots" would have the same hash. Multiplying them together has the same problem. If we raise each character to the power of its position in the string, we get a better distribution, but it's still awful.
Using a simple Python program, I brute-forced all 456,976 possible 4-character strings and ran our power-hash over them; it produced only 3,760 unique hash values, which is terrible.
Instead of trying to come up with a new hash function, we can use one that's been well-tested and is known to work well.
The first time I wrote this, I used the hash function from the 'Red Dragon Book' (Compilers: Principles, Techniques, and Tools).
unsigned long hash_string(void *key) {
  unsigned long hash = 0, g;
  char *p = key;
  while (*p) {
    hash = (hash << 4) + *p++;
    if ((g = hash & 0xf0000000) != 0) {
      hash ^= g >> 24;
      hash ^= g;
    }
  }
  return hash;
}
Used in section 39
This is a bit slow on modern processors because of the extra masking and branching it does on every character. We can do better. Let's use its descendant, ELFHash, from glibc.
As you can see in the code below, this function avoids extra operations and should be much faster.
unsigned int hash_string(const void *key) {
  unsigned long hash = 0, hi = 0;
  const char *p = key;
  hash = *p;
  if (hash != 0 && p[1] != 0) {
    hash = (hash << 4) + p[1];
    if (p[2] != 0) {
      hash = (hash << 4) + p[2];
      if (p[3] != 0) {
        hash = (hash << 4) + p[3];
        if (p[4] != 0) {
          hash = (hash << 4) + p[4];
          p += 5;
          while (*p != 0) {
            hash = (hash << 4) + *p++;
            hi = hash & 0xf0000000l;
            hash ^= hi >> 24;
          }
          hash &= 0x0fffffffl;
        }
      }
    }
  }
  return hash;
}
Used in section 39
We also need a comparison function for strings.
Finally, we'll need a destructor for entries.
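Neither chunk is reproduced here; below are minimal sketches that match the util.h declarations. The is_key guard in the destructor is an assumption on my part: in the string table the key and value are the same pointer, so it must only be freed once.

int cmp_string(const void *key1, const void *key2) {
  return strcmp((const char *)key1, (const char *)key2);
}

void dtor_string(void *value, int is_key) {
  // Key and value alias each other in the string table;
  // free the allocation only once, when called for the key.
  if (is_key)
    free(value);
}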
These functions go in util.c
#include <string.h>
#include <stdlib.h>
#include "hash_table.h"

{String Comparison, 37}
{Hash Function, 36}
{String Destructor, 38}
#ifndef UTIL_H
#define UTIL_H
#include "hash_table.h"

int cmp_string(const void *key1, const void *key2);
unsigned int hash_string(const void *key);
void dtor_string(void *value, int is_key);

#endif
Now we can implement token_create_string the right way.
You might notice that we're using the same string as both key and value. A hash table used this way is normally called a set. We're using it to store strings, but we could use it to deduplicate anything.
hash_table_t *string_table;

token_t *token_create_string(c_token_types kind, int lin, int col,
                             const char *s, int len) {
  if (string_table == NULL) {
    string_table = hash_table_create(2048, cmp_string, hash_string, dtor_string);
  }
  token_t *token = token_data_create(kind, lin, col, len);
  char *key = hash_table_get(string_table, (void *)s);
  if (key == NULL) {
    key = strdup(s);
    hash_table_put(string_table, key, key);
  }
  token_data(token)->data.s = key;
  return token;
}
Used in section 42
We'll add an external declaration for string_table in token.h so other programs can take advantage of it.
#ifndef TOKEN_H
#define TOKEN_H
#include <stdint.h>      // We use this for int64_t
#include "hash_table.h"  // We need this for the string table

{Token Types, 10}
{Opaque Token Type, 7}
{Token Creation and Destruction Interface, 8}
{Token Interface, 9}

extern hash_table_t *string_table;
extern int column;
extern int line;

#endif
Finally, we implement the token data structure in token.c.
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <assert.h>
#include <ctype.h>
#include "token.h"
#include "hash_table.h"
#include "util.h"

{Token Data Structure, 12}
{Token Data Access, 13}
{Token Creation and Destruction, 18}
{Token Create String, 19}
{Token Debugging, 14}
Input will provide a simple interface for reading characters from a file. The stream itself is deliberately hidden from the tokenizer, so that the tokenizer doesn't have to worry about buffering or anything like that.
void input_init(const char *filename);
int input_getc(void);
void input_ungetc(int c);
void input_destroy(void);
Used in section 53
When the program wants to start reading a file, it calls input_init with the filename. It can then call input_getc to get the next character in the file. If there's no more input, input_getc will return EOF.
There's also an input_ungetc function, which allows the program to put a character back into the stream. I'll only allow a small, fixed number of characters to be put back, but that should be enough for the tokenizer. Finally, when the program is done reading the file, it should call input_destroy to clean up.
Per rule 1, we're trying to keep memory usage low. That means that instead of reading the entire file into memory, we'll need to read it in chunks. There are a couple of choices for how to do this:
Read a line at a time: this is the more natural approach, but it has two drawbacks. First, it requires a large buffer to store the line (BUFSIZ, commonly 8192 bytes). Second, if a line is longer than BUFSIZ, we have to read it in chunks anyway.
Choose some arbitrary buffer size and read that many bytes at a time: This is the approach I'm going to take. It's a little less natural, but it's more memory efficient.
Input will read chunks of 128 bytes at a time, reusing the same static buffer. This limitation is not visible to the tokenizer, which only sees the input_getc interface.
When the buffer is exhausted, input_getc will read the next chunk of the file into it.
The implementation of the input module is pretty straightforward. We have the following data structures and defines as globals:
#define CHUNK_SIZE 128
static char buffer[CHUNK_SIZE];
static int buffer_pos = 0;
static int buffer_size = 0;
static char unget_buffer_stack[8];
static int unget_buffer_stack_pos = 0;
static FILE *file = NULL;
Used in section 52
When the program calls input_init, we open the file.
void input_init(const char *filename) {
  file = fopen(filename, "r");
  if (file == NULL) {
    fprintf(stderr, "Error: Cannot open file %s\n", filename);
    exit(1);
  }
}
Used in section 52
When the program calls input_getc, we return the next character in the buffer. If the buffer is exhausted, we refill it with the next chunk of the file. (Line and column tracking happens in the tokenizer, which owns those globals.)
int input_getc(void) {
  if (unget_buffer_stack_pos > 0) {
    return unget_buffer_stack[--unget_buffer_stack_pos];
  }
  if (buffer_pos == buffer_size) {
    buffer_size = fread(buffer, 1, CHUNK_SIZE, file);
    buffer_pos = 0;
  }
  if (buffer_size == 0) {
    return EOF;
  }
  char c = buffer[buffer_pos++];
  return c;
}
Used in section 52
When the program calls input_ungetc, we push the character onto the unget stack.
void input_ungetc(int c) {
  unget_buffer_stack[unget_buffer_stack_pos++] = c;
}
Used in section 52
Since we're not using dynamic memory allocation, cleanup is pretty simple.
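The cleanup chunk just closes the stream; a sketch:

void input_destroy(void) {
  if (file != NULL) {
    fclose(file);
    file = NULL;
  }
}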
We put the whole thing together in input.c.
We'll need an external declaration for file in input.h so other programs can take advantage of it.
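A sketch of what input.h would contain (note that for the extern declaration to link, the definition of file in input.c can't stay static):

#ifndef INPUT_H
#define INPUT_H
#include <stdio.h>

void input_init(const char *filename);
int input_getc(void);
void input_ungetc(int c);
void input_destroy(void);

extern FILE *file;

#endif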
We'll implement the lexer interface in tokenizer.h. The tokenization interface will have a couple of functions: next_token will return the next token in the input stream, init_tokenizer will initialize the tokenizer, and destroy_tokenizer will clean up.
We'll also have some helper functions for lookahead and matching.
peek_token will return the next token without consuming it (it technically does advance the input stream, but it saves the token so it can be reused). consume will consume the next token if it matches a given kind; if it doesn't match, it will print an error message and exit. A sketch of the full header follows.
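Putting those together, tokenizer.h would look roughly like this (a sketch based on the functions defined below):

#ifndef TOKENIZER_H
#define TOKENIZER_H
#include "token.h"

void init_tokenizer(const char *filename);
void destroy_tokenizer(void);
token_t *next_token(void);
token_t *peek_token(void);
void reject_token(token_t *token);
void consume(c_token_types kind);
void consume_alt(c_token_types *kinds, int n);
const char *stringify_type(c_token_types type);

#endif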
Now we can finally implement the tokenizer. We define a stack for storing lookahead tokens, as well as a hash table for typedefs, which the parser will modify.
#include <ctype.h>
#include <errno.h>
#include <float.h>
#include <math.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "tokenizer.h"
#include "token.h"
#include "input.h"
#include "util.h"

token_t *left_stack[8];
int left_stack_pos = 0;
hash_table_t *typedef_table = NULL;

{Utility Functions, 57}
{Tokenization Function, 59}
Utility functions are everything that doesn't directly tokenize the input.
void init_tokenizer(const char *filename) {
input_init(filename);
typedef_table = hash_table_create(16, cmp_string, hash_string, dtor_string);
}
void destroy_tokenizer(void) {
input_destroy();
}
void reject_token(token_t *token) {
left_stack[left_stack_pos++] = token;
}
token_t *peek_token(void) {
if (left_stack_pos > 0) {
return left_stack[left_stack_pos - 1];
}
token_t *token = next_token();
reject_token(token);
return token;
}
{Stringify Type, 58}
void consume(c_token_types kind) {
token_t *token = next_token();
if (token_type(token) != kind) {
fprintf(stderr, "Error: Expected token of type \"%s\", got \"%s\"\n", stringify_type(kind), stringify_type(token_type(token)));
exit(1);
}
token_destroy(token);
}
void consume_alt(c_token_types *kinds, int n) {
token_t *token = next_token();
for (int i = 0; i < n; i++) {
if (token_type(token) == kinds[i]) {
token_destroy(token);
return;
}
}
fprintf(stderr, "Error: Expected one of the following tokens: ");
for (int i = 0; i < n; i++) {
fprintf(stderr, "\"%s\" ", stringify_type(kinds[i]));
}
fprintf(stderr, "got \"%s\"\n", stringify_type(token_type(token)));
exit(1);
}
Used in section 56
We'll need a helper function to convert token types to strings. It's pretty simple, just tedious.
const char *stringify_type(c_token_types type) { switch (type) { case TOK_IF: return "if"; case TOK_ELSE: return "else"; case TOK_SWITCH: return "switch"; case TOK_CASE: return "case"; case TOK_DEFAULT: return "default"; case TOK_WHILE: return "while"; case TOK_DO: return "do"; case TOK_FOR: return "for"; case TOK_CONTINUE: return "continue"; case TOK_BREAK: return "break"; case TOK_RETURN: return "return"; case TOK_GOTO: return "goto"; case TOK_VOID: return "void"; case TOK_CHAR: return "char"; case TOK_SHORT: return "short"; case TOK_INT: return "int"; case TOK_LONG: return "long"; case TOK_FLOAT: return "float"; case TOK_DOUBLE: return "double"; case TOK_SIGNED: return "signed"; case TOK_UNSIGNED: return "unsigned"; case TOK_STRUCT: return "struct"; case TOK_UNION: return "union"; case TOK_ENUM: return "enum"; case TOK_TYPEDEF: return "typedef"; case TOK_AUTO: return "auto"; case TOK_REGISTER: return "register"; case TOK_STATIC: return "static"; case TOK_EXTERN: return "extern"; case TOK_CONST: return "const"; case TOK_VOLATILE: return "volatile"; case TOK_SIZEOF: return "sizeof"; case TOK_ADD: return "+"; case TOK_SUB: return "-"; case TOK_MUL: return "*"; case TOK_DIV: return "/"; case TOK_MOD: return "%"; case TOK_BIT_AND: return "&"; case TOK_BIT_OR: return "|"; case TOK_BIT_XOR: return "^"; case TOK_BIT_NOT: return "~"; case TOK_LSHIFT: return "<<"; case TOK_RSHIFT: return ">>"; case TOK_NOT: return "!"; case TOK_ASSIGN: return "="; case TOK_LT: return "<"; case TOK_GT: return ">"; case TOK_INC: return "++"; case TOK_DEC: return "--"; case TOK_EQ: return "=="; case TOK_NE: return "!="; case TOK_LE: return "<="; case TOK_GE: return ">="; case TOK_AND: return "&&"; case TOK_OR: return "||"; case TOK_MEMBER_POINTER: return "->"; case TOK_MEMBER: return "."; case TOK_COND_DECISION: return ":"; case TOK_COND: return "?"; case TOK_ASSIGN_ADD: return "+="; case TOK_ASSIGN_SUB: return "-="; case TOK_ASSIGN_MUL: return "*="; case TOK_ASSIGN_DIV: return "/="; case TOK_ASSIGN_MOD: return "%="; case TOK_ASSIGN_BITAND: return "&="; case TOK_ASSIGN_BITOR: return "|="; case TOK_ASSIGN_BITXOR: return "^="; case TOK_ASSIGN_LSHIFT: return "<<="; case TOK_ASSIGN_RSHIFT: return ">>="; case TOK_HASH: return "#"; case TOK_ID: return "identifier"; case TOK_TYPEDEF_NAME: return "typedef name"; case TOK_INTEGER_U32: case TOK_INTEGER_U64: case TOK_INTEGER_S32: case TOK_INTEGER_S64: return "integer constant"; case TOK_FLOAT_32: case TOK_FLOAT_64: return "floating constant"; case TOK_CHAR_CONST: return "character constant"; case TOK_STRING_ASCII: return "string constant"; case TOK_EOF: return "EOF"; case TOK_ERROR: return "error"; case TOK_LEFT_PAREN: return "("; case TOK_RIGHT_PAREN: return ")"; case TOK_LEFT_BRACKET: return "["; case TOK_RIGHT_BRACKET: return "]"; case TOK_LEFT_BRACE: return "{"; case TOK_RIGHT_BRACE: return "}"; case TOK_COMMA: return ","; case TOK_SEMICOLON: return ";"; case TOK_DOT: return "."; case TOK_ELLIPSIS: return "..."; } return "UNKNOWN"; }
Used in section 57
Now we can implement the tokenization function. The pattern is pretty simple: we call each of the tokenization functions in turn until we find a match. If we don't find a match, we print an error message and exit. You might wonder why skip_whitespace can return a token: it makes handling the divide operator easier, since comments also start with a slash.
char file_name[1024];

{Warning/Error Functions, 60}
{Skip Whitespace, 61}
{Tokenize Identifier, 62}
{Tokenize Number, 65}
{Tokenize String, 74}
{Tokenize Character, 73}
{Tokenize Operator, 64}

token_t *next_token(void) {
  if (left_stack_pos > 0) {
    return left_stack[--left_stack_pos];
  }
  token_t *tok = skip_whitespace();
  if (tok != NULL) {
    return tok;
  }
  tok = read_identifier();
  if (tok != NULL) {
    return tok;
  }
  tok = read_number();
  if (tok != NULL) {
    return tok;
  }
  tok = read_char_constant();
  if (tok != NULL) {
    return tok;
  }
  tok = read_string_literal();
  if (tok != NULL) {
    return tok;
  }
  tok = read_operator();
  if (tok != NULL) {
    return tok;
  }
  int c = input_getc();
  if (c == EOF) {
    return NULL;
  }
  tok_warn("Warning: Ignoring unexpected character '%c' at line %d, column %d\n",
           c, line, column);
  return next_token();
}

#ifdef TEST_TOKENIZER
{Run Test, 76}
#endif
Used in section 56
We'll need a couple of helper functions to skip whitespace and print warnings/errors.
void tok_error(const char *fmt, ...) {
  va_list args;
  va_start(args, fmt);
  fprintf(stderr, "Error in file %s at line %d, column %d: ", file_name, line, column);
  vfprintf(stderr, fmt, args);
  va_end(args);
}

void tok_warn(const char *fmt, ...) {
  va_list args;
  va_start(args, fmt);
  fprintf(stderr, "Warning in file %s at line %d, column %d: ", file_name, line, column);
  vfprintf(stderr, fmt, args);
  va_end(args);
}
Used in section 59
The skip_whitespace function is pretty simple. It just skips over comments, whitespace, and line directives.
static token_t *skip_whitespace(void) {
  int c;
  while ((c = input_getc()) != EOF) {
    if (isspace(c)) { // Whitespace
      if (c == '\n') {
        line++;
        column = 1;
      } else {
        column++;
      }
    } else if (c == '#') { // GCC preprocessor line control directive.
      char buf[512];
      int i = 0;
      while ((c = input_getc()) != EOF && c != '\n') {
        buf[i++] = c;
        column++;
      }
      buf[i] = '\0';
      if (sscanf(buf, "%d \"%[^\"]\"", &line, file_name) == 2) {
        column = 1;
      } else {
        tok_error("Invalid #line directive\n");
      }
      if (c == EOF) {
        return NULL;
      }
    } else if (c == '/') { // Comment
      c = input_getc();
      if (c == '/') {
        while ((c = input_getc()) != EOF && c != '\n') {
          column++;
        }
        if (c == EOF) {
          return NULL;
        }
        line++;
        column = 1;
      } else if (c == '*') { // Multiline comment
        while ((c = input_getc()) != EOF) {
          if (c == '*') {
            c = input_getc();
            if (c == '/') {
              break;
            }
          } else if (c == '\n') {
            line++;
            column = 1;
          } else {
            column++;
          }
        }
        if (c == EOF) {
          return NULL;
        }
      } else { // Handled here to simplify the code.
        if (c == '=')
          return token_create(TOK_ASSIGN_DIV, line, column, 2);
        input_ungetc(c);
        return token_create(TOK_DIV, line, column, 1);
      }
    } else {
      input_ungetc(c);
      return NULL;
    }
  }
  return NULL;
}
Used in section 59
The read_identifier function reads an identifier from the input stream. C identifiers can contain letters, digits, and underscores, but they can't start with a digit.
{Get Keyword, 63}
static token_t *read_identifier(void) {
int c;
char buf[1024];
int i = 0;
c = input_getc();
if (!isalpha(c) && c != '_') {
input_ungetc(c);
return NULL;
}
buf[i++] = c;
while ((c = input_getc()) != EOF) {
if (!isalnum(c) && c != '_') {
input_ungetc(c);
break;
}
buf[i++] = c;
if (i >= 1008) {
tok_error("Identifier too long\n");
exit(1);
}
}
buf[i] = '\0';
// Note: token creation advances the global column by the token's
// length, so we don't add it again here.
// Check if it's a keyword
c_token_types kind = get_keyword(buf, i);
if (kind != TOK_ID) {
return token_create(kind, line, column, i);
}
// Check if it's a typedef
if (hash_table_get(typedef_table, buf) != NULL) {
return token_create(TOK_TYPEDEF_NAME, line, column, i);
}
return token_create_string(kind, line, column, buf, i);
}
Used in section 59
The get_keyword function is a simple decision tree for identifying keywords. The code is pretty tedious, but it works.
c_token_types get_keyword(const char *buf, int len) { switch (buf[0]) { case 'a': if (len == 4 && buf[1] == 'u' && buf[2] == 't' && buf[3] == 'o') return TOK_AUTO; break; case 'b': if (len == 5 && buf[1] == 'r' && buf[2] == 'e' && buf[3] == 'a' && buf[4] == 'k') return TOK_BREAK; break; case 'c': switch (buf[1]) { case 'a': if (len == 4 && buf[2] == 's' && buf[3] == 'e') return TOK_CASE; break; case 'h': if (len == 4 && buf[2] == 'a' && buf[3] == 'r') return TOK_CHAR; break; case 'o': if (len == 5 && buf[2] == 'n' && buf[3] == 's' && buf[4] == 't') return TOK_CONST; if (len == 8 && buf[2] == 'n' && buf[3] == 't' && buf[4] == 'i' && buf[5] == 'n' && buf[6] == 'u' && buf[7] == 'e') return TOK_CONTINUE; break; } break; case 'd': switch (buf[1]) { case 'e': if (len == 7 && buf[2] == 'f' && buf[3] == 'a' && buf[4] == 'u' && buf[5] == 'l' && buf[6] == 't') return TOK_DEFAULT; break; case 'o': if (len == 2 && buf[2] == '\0') return TOK_DO; if (len == 6 && buf[2] == 'u' && buf[3] == 'b' && buf[4] == 'l' && buf[5] == 'e') return TOK_DOUBLE; break; } break; case 'e': switch (buf[1]) { case 'l': if (len == 4 && buf[2] == 's' && buf[3] == 'e') return TOK_ELSE; break; case 'n': if (len == 4 && buf[2] == 'u' && buf[3] == 'm') return TOK_ENUM; break; case 'x': if (len == 6 && buf[2] == 't' && buf[3] == 'e' && buf[4] == 'r' && buf[5] == 'n') return TOK_EXTERN; break; } break; case 'f': switch (buf[1]) { case 'l': if (len == 5 && buf[2] == 'o' && buf[3] == 'a' && buf[4] == 't') return TOK_FLOAT; break; case 'o': if (len == 3 && buf[2] == 'r') return TOK_FOR; break; } break; case 'g': if (len == 4 && buf[1] == 'o' && buf[2] == 't' && buf[3] == 'o') return TOK_GOTO; break; case 'i': switch (buf[1]) { case 'f': if (len == 2 && buf[2] == '\0') return TOK_IF; break; case 'n': if (len == 3 && buf[2] == 't') return TOK_INT; break; } break; case 'l': if (len == 4 && buf[1] == 'o' && buf[2] == 'n' && buf[3] == 'g') return TOK_LONG; break; case 'r': switch (buf[1]) { case 'e': if (len == 8 && buf[2] == 'g' && buf[3] == 'i' && buf[4] == 's' && buf[5] == 't' && buf[6] == 'e' && buf[7] == 'r') return TOK_REGISTER; if (len == 6 && buf[2] == 't' && buf[3] == 'u' && buf[4] == 'r' && buf[5] == 'n') return TOK_RETURN; break; } break; case 's': switch (buf[1]) { case 'h': if (len == 5 && buf[2] == 'o' && buf[3] == 'r' && buf[4] == 't') return TOK_SHORT; break; case 't': if (len == 6 && buf[2] == 'a' && buf[3] == 't' && buf[4] == 'i' && buf[5] == 'c') return TOK_STATIC; break; case 'i': if (len == 6 && buf[2] == 'g' && buf[3] == 'n' && buf[4] == 'e' && buf[5] == 'd') return TOK_SIGNED; if (len == 6 && buf[2] == 'z' && buf[3] == 'e' && buf[4] == 'o' && buf[5] == 'f') return TOK_SIZEOF; break; case 'r': if (len == 6 && buf[2] == 'u' && buf[3] == 'c' && buf[4] == 't') return TOK_STRUCT; break; case 'w': if (len == 6 && buf[2] == 'i' && buf[3] == 't' && buf[4] == 'c' && buf[5] == 'h') return TOK_SWITCH; break; } break; case 't': if (len == 7 && buf[1] == 'y' && buf[2] == 'p' && buf[3] == 'e' && buf[4] == 'd' && buf[5] == 'e' && buf[6] == 'f') return TOK_TYPEDEF; break; case 'u': switch (buf[1]) { case 'n': if (len == 5 && buf[2] == 'i' && buf[3] == 'o' && buf[4] == 'n') return TOK_UNION; if (len == 8 && buf[2] == 's' && buf[3] == 'i' && buf[4] == 'g' && buf[5] == 'n' && buf[6] == 'e' && buf[7] == 'd') return TOK_UNSIGNED; break; } break; case 'v': switch (buf[1]) { case 'o': if (len == 4 && buf[2] == 'i' && buf[3] == 'd') return TOK_VOID; if (len == 8 && buf[2] == 'l' && buf[3] == 'a' && buf[4] == 't' && buf[5] == 'i' && buf[6] 
== 'l' && buf[7] == 'e') return TOK_VOLATILE; break; } break; case 'w': if (len == 5 && buf[1] == 'h' && buf[2] == 'i' && buf[3] == 'l' && buf[4] == 'e') return TOK_WHILE; break; default: return TOK_ID; } return TOK_ID; }
Used in section 62
The read_operator function works similarly to the read_identifier function. It uses a decision tree to identify operators.
token_t *read_operator(void) {
  int c;
  c = input_getc();
  switch (c) {
  case '!': {
    c = input_getc();
    if (c == '=')
      return token_create(TOK_NE, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_NOT, line, column, 1);
  }
  case '%': {
    c = input_getc();
    if (c == '=')
      return token_create(TOK_ASSIGN_MOD, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_MOD, line, column, 1);
  }
  case '&': {
    c = input_getc();
    if (c == '&')
      return token_create(TOK_AND, line, column, 2);
    if (c == '=')
      return token_create(TOK_ASSIGN_BITAND, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_BIT_AND, line, column, 1);
  }
  case '(':
    return token_create(TOK_LEFT_PAREN, line, column, 1);
  case ')':
    return token_create(TOK_RIGHT_PAREN, line, column, 1);
  case '*': {
    c = input_getc();
    if (c == '=')
      return token_create(TOK_ASSIGN_MUL, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_MUL, line, column, 1);
  }
  case '+': {
    c = input_getc();
    if (c == '+')
      return token_create(TOK_INC, line, column, 2);
    if (c == '=')
      return token_create(TOK_ASSIGN_ADD, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_ADD, line, column, 1);
  }
  case ',':
    return token_create(TOK_COMMA, line, column, 1);
  case '-': {
    c = input_getc();
    if (c == '-')
      return token_create(TOK_DEC, line, column, 2);
    if (c == '=')
      return token_create(TOK_ASSIGN_SUB, line, column, 2);
    if (c == '>')
      return token_create(TOK_MEMBER_POINTER, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_SUB, line, column, 1);
  }
  case '.': {
    c = input_getc();
    if (c == '.') {
      c = input_getc();
      if (c == '.') {
        return token_create(TOK_ELLIPSIS, line, column, 3);
      } else {
        // Bail out; we can't restore two characters of lookahead here.
        tok_error("Unexpected character '.' at line %d, column %d\n", line, column);
        exit(1);
      }
    }
    input_ungetc(c);
    return token_create(TOK_MEMBER, line, column, 1);
  }
  case '/': {
    c = input_getc();
    if (c == '=')
      return token_create(TOK_ASSIGN_DIV, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_DIV, line, column, 1);
  }
  case ':':
    return token_create(TOK_COND_DECISION, line, column, 1);
  case ';':
    return token_create(TOK_SEMICOLON, line, column, 1);
  case '<': {
    c = input_getc();
    if (c == '<') {
      c = input_getc();
      if (c == '=')
        return token_create(TOK_ASSIGN_LSHIFT, line, column, 3);
      input_ungetc(c);
      return token_create(TOK_LSHIFT, line, column, 2);
    }
    if (c == '=')
      return token_create(TOK_LE, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_LT, line, column, 1);
  }
  case '=': {
    c = input_getc();
    if (c == '=')
      return token_create(TOK_EQ, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_ASSIGN, line, column, 1);
  }
  case '>': {
    c = input_getc();
    if (c == '>') {
      c = input_getc();
      if (c == '=')
        return token_create(TOK_ASSIGN_RSHIFT, line, column, 3);
      input_ungetc(c);
      return token_create(TOK_RSHIFT, line, column, 2);
    }
    if (c == '=')
      return token_create(TOK_GE, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_GT, line, column, 1);
  }
  case '?':
    return token_create(TOK_COND, line, column, 1);
  case '[':
    return token_create(TOK_LEFT_BRACKET, line, column, 1);
  case ']':
    return token_create(TOK_RIGHT_BRACKET, line, column, 1);
  case '^': {
    c = input_getc();
    if (c == '=')
      return token_create(TOK_ASSIGN_BITXOR, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_BIT_XOR, line, column, 1);
  }
  case '{':
    return token_create(TOK_LEFT_BRACE, line, column, 1);
  case '|': {
    c = input_getc();
    if (c == '|')
      return token_create(TOK_OR, line, column, 2);
    if (c == '=')
      return token_create(TOK_ASSIGN_BITOR, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_BIT_OR, line, column, 1);
  }
  case '}':
    return token_create(TOK_RIGHT_BRACE, line, column, 1);
  case '~':
    return token_create(TOK_BIT_NOT, line, column, 1);
  default:
    input_ungetc(c);
    return NULL;
  }
  return NULL;
}
Used in section 59
The read_number function reads a number from the input stream. It can be an integer or a floating-point number. I've broken it up a bit to make it easier to read.
static token_t *read_number(void) {
  int c;
  char buf[1024];
  int i = 0;
  c = input_getc();
  {Check for valid prefix, 66}
  int radix = 10;
  int is_float = 0;
  {Process Radix, 67}
  {Read Number Loop, 68}
  buf[i] = '\0';
  {Process Suffixes, 69}
  {Check for conflicting suffixes, 70}
  if (is_float) {
    {Convert to float, 71}
  } else {
    {Convert to integer, 72}
  }
  return NULL;
}
Used in section 59
To determine whether a character is a valid prefix for a number, we need to check whether it's a digit, or a period followed by a digit.
// If we don't have a digit or decimal point, it's not a number
if (!isdigit(c) && c != '.') {
  input_ungetc(c);
  return NULL;
}
// Decimal point followed by a non-digit is a struct member access
if (c == '.') {
  char cnext = input_getc();
  if (!isdigit(cnext)) {
    input_ungetc(cnext);
    return token_create(TOK_MEMBER, line, column, 1);
  }
  input_ungetc(cnext);
}
Used in section 65
A C constant starting with a zero is either an octal or hexadecimal constant. We need to check the next character to determine which one it is.
// Check for hex and octal.
if (c == '0') {
  char cnext = input_getc();
  if (cnext == 'x' || cnext == 'X') {
    // Hex, discard the 0x
    radix = 16;
  } else {
    // Octal, discard the 0
    input_ungetc(cnext);
    radix = 8;
  }
} else {
  // Decimal, append the first character (a digit or a leading '.')
  if (c == '.')
    is_float = 1;
  buf[i++] = c;
}
Used in section 65
while ((c = input_getc()) != EOF) {
  // Since there can be multiple writes to the buffer per iteration, leave a
  // four-byte pad so we can't overflow
  if (i > 1020) {
    tok_error("Number too long\n");
    return NULL;
  }
  // Valid digits for the radix: 0-9 for decimal, 0-7 for octal, 0-9 and
  // a-f/A-F for hex
  if ((radix == 10 && isdigit(c)) || (radix == 16 && isxdigit(c)) ||
      (radix == 8 && c >= '0' && c <= '7')) {
    buf[i++] = c;
  } else if (c == '.' && !is_float) {
    // Decimal point and not a float yet; this must be a float
    is_float = 1;
    if (radix == 16) {
      tok_error("Invalid floating point constant, expected decimal, got hexadecimal\n");
      return NULL;
    }
    radix = 10; // A leading 0 was just a zero, not an octal prefix
    buf[i++] = c;
  } else if (c == 'e' || c == 'E') {
    // Exponent on the end of a constant. (By standard this forces it to be a float)
    buf[i++] = c;
    c = input_getc();
    // Sign on the exponent
    if (c == '+' || c == '-') {
      buf[i++] = c;
      c = input_getc();
    }
    // Exponent must be a digit, i.e. no 1e1.2
    if (!isdigit(c)) {
      tok_error("Invalid floating point exponent\n");
      return NULL;
    }
    buf[i++] = c;
    is_float = 1;
  } else {
    // Reached the end; unget the character so other functions can read it
    input_ungetc(c);
    break;
  }
}
Used in section 65
C constants can have suffixes to indicate their type. We need to check for these suffixes and set the appropriate flags.
int is_unsigned = 0;
int is_long = 0;
int is_single = 0;
while (1) {
  c = input_getc();
  if (c == 'u' || c == 'U') {
    if (is_unsigned) {
      tok_warn("Warning: Duplicate suffix 'u' for integer constant ignored\n");
    }
    is_unsigned = 1;
  } else if (c == 'l' || c == 'L') {
    if (is_long) {
      tok_warn("Warning: Duplicate suffix 'l' for integer constant ignored\n");
    }
    is_long = 1;
  } else if (c == 'f' || c == 'F') {
    if (is_single) {
      tok_warn("Warning: Duplicate suffix 'f' for floating point constant ignored\n");
    }
    is_single = 1;
  } else {
    input_ungetc(c);
    break;
  }
}
Used in section 65
If we find conflicting suffixes, we print a warning and ignore the suffixes.
if (is_single && is_long) {
  tok_warn("Warning: Invalid suffixes 'l' and 'f' for floating point "
           "constant. Ignoring 'l'\n");
  is_long = 0;
}
if (is_single && is_unsigned) {
  tok_warn("Warning: Invalid suffixes 'u' and 'f' for floating point "
           "constant. Ignoring 'u'\n");
  is_unsigned = 0;
}
if (is_single && !is_float) {
  tok_warn("Warning: Invalid suffix 'f' for integer constant. Ignoring 'f'\n");
  is_single = 0;
}
Used in section 65
If the string contains a float, we pass it to strtod. We need to make sure that the number is in range for the given type and check for errors from strtod.
errno = 0;
// strtod generates a unix-style error when it's given something out of
// range, so we want to get on top of that quickly instead of ignoring it.
// That way we can avoid some nasty NaN-propagation in the constant folder.
double f = strtod(buf, NULL);
if (errno == ERANGE) {
  tok_error("Floating point constant out of range\n");
  return NULL;
}
// Warn if the constant is out of range for a float, i.e. it's too big or too small
if (is_single && (f < -FLT_MAX || f > FLT_MAX)) {
  tok_warn("Warning: Floating point constant %f is out of range for float\n", f);
}
// Warn if the constant is too precise for a float
if (is_single && fabs((double)((float)f) - f) >= FLT_EPSILON) {
  tok_warn("Warning: Converting double precision floating point constant "
           "%f to float loses precision\n", f);
}
return token_create_float(is_single ? TOK_FLOAT_32 : TOK_FLOAT_64, line,
                          column, f, i);
Used in section 65
If the string contains an integer, we pass it to strtoull. We need to make sure that the number is in range for the given type and check for errors from strtoull.
errno = 0;
uint64_t int_ = strtoull(buf, NULL, radix);
// Same as above, but for integers
if (errno == ERANGE) {
  tok_error("Integer constant out of range\n");
  return NULL;
}
if (is_unsigned) {
  if (is_long) {
    return token_create_int(TOK_INTEGER_U64, line, column, int_, i);
  } else {
    if (int_ > UINT32_MAX) {
      tok_warn("Warning: Integer constant %llu is out of range for unsigned int\n",
               (unsigned long long)int_);
    }
    return token_create_int(TOK_INTEGER_U32, line, column, int_, i);
  }
} else {
  if (is_long) {
    // If the highest bit is set, this will overflow a signed
    // long (due to two's complement)
    if (int_ & (1UL << 63)) {
      tok_warn("Warning: Integer constant %llu is out of range for long long\n",
               (unsigned long long)int_);
    }
    return token_create_int(TOK_INTEGER_S64, line, column, int_, i);
  } else {
    if (int_ & (1UL << 31)) {
      tok_warn("Warning: Integer constant %llu is out of range for int\n",
               (unsigned long long)int_);
    }
    return token_create_int(TOK_INTEGER_S32, line, column, int_, i);
  }
}
Used in section 65
The read_char_constant function reads a character constant from the input stream. It can be a single character or a multi-character escape sequence.
static token_t *read_char_constant(void) {
  int c;
  int len = 0;
  c = input_getc();
  if (c != '\'') {
    input_ungetc(c);
    return NULL;
  }
  len++;
  c = input_getc();
  if (c == '\'') {
    tok_error("Empty character constant\n");
    return NULL;
  }
  if (c == '\\') {
    c = read_escape_sequence(&len);
  }
  int val = c;
  c = input_getc();
  if (c != '\'') {
    tok_error("Expected closing quote for character constant\n");
    return NULL;
  }
  len++;
  return token_create_char(TOK_CHAR_CONST, line, column, val, len);
}
Used in section 59
The read_string_literal function reads a string literal from the input stream.
For this function, an automatic-lifetime buffer is used to store the string until it becomes too large; at that point, a heap-allocated buffer takes over. This way we avoid unnecessary heap allocations for small strings.
{Read Escape Sequence, 75}
static token_t *read_string_literal(void) {
int c;
c = input_getc();
if (c != '"') {
input_ungetc(c);
return NULL;
}
int i = 0;
char s_buf[512];
char *buf = s_buf;
int len = 512;
int esc_pad = 0;
while ((c = input_getc()) != EOF) {
if (c == '"') {
// Implicit skip of closing quote
break;
}
if (c == '\\') {
c = read_escape_sequence(&esc_pad);
if (c == 0) {
return NULL;
}
}
if (i + 1 >= len) { // +1 leaves room for the terminating NUL
if (buf == s_buf) {
buf = malloc(1024);
if (buf == NULL) {
fputs("Out of memory. Could not parse string literal.\n", stderr);
exit(1);
}
memcpy(buf, s_buf, 512);
len *= 2;
} else {
len *= 2;
char *new_buf = realloc(buf, len);
if (new_buf == NULL) {
fputs("Out of memory. Could not parse string literal.\n", stderr);
exit(1);
}
buf = new_buf;
}
}
buf[i++] = c;
}
buf[i] = '\0';
if (c == EOF) {
tok_error("Unterminated string literal\n");
if (buf != s_buf) {
free(buf);
}
return NULL;
}
token_t *tok = token_create_string(TOK_STRING_ASCII, line, column, buf,
i + esc_pad + 2);
if (buf != s_buf) {
free(buf);
}
return tok;
}
Used in section 59
Escape sequences in C can either be single characters or octal/hexadecimal values. We need to handle both cases.
static char read_escape_sequence(int *len) { int c = input_getc(); *len += 1; switch (c) { case 'a': return '\a'; case 'b': return '\b'; case 'f': return '\f'; case 'n': return '\n'; case 'r': return '\r'; case 't': return '\t'; case 'v': return '\v'; case '\'': return '\''; case '"': return '"'; case '?': return '?'; case '\\': return '\\'; case '0': return '\0'; case 'x': { c = input_getc(); if (!isxdigit(c)) { tok_error("Invalid hexadecimal escape sequence\n"); return 0; } int val = 0; while (isxdigit(c)) { *len += 1; val = val * 16 + (isdigit(c) ? c - '0' : tolower(c) - 'a' + 10); c = input_getc(); } input_ungetc(c); return (char)val; } default: if (!isdigit(c)) { tok_error("Invalid escape sequence\n"); return 0; } int val = 0; while (isdigit(c)) { *len += 1; val = val * 8 + c - '0'; c = input_getc(); } input_ungetc(c); return (char)val; } }
Used in section 74
Finally, I'll add some code for running the tokenizer as its own program. This way we can test it out.
char *preprocess(char *in) {
  char *output_name = malloc(1024);
  snprintf(output_name, 1024, "%s.preprocessed", in);
  char *command = malloc(2048);
  snprintf(command, 2048, "gcc -E -xc %s -o %s", in, output_name);
  system(command);
  free(command);
  return output_name;
}

// Tokenize the input file
int main(int argc, char **argv) {
  if (argc != 2) {
    fprintf(stderr, "Usage: %s <input.c>\n", argv[0]);
    return 1;
  }
  char *input_name = argv[1];
  char *preprocessed = preprocess(input_name);
  init_tokenizer(preprocessed);
  token_t *tok;
  while ((tok = next_token()) != NULL) {
    print_token(tok);
    token_destroy(tok);
  }
  destroy_tokenizer();
  remove(preprocessed);
  free(preprocessed);
  hash_table_destroy(string_table);
  return 0;
}
Used in section 59
I wrote this code in a single sitting, so there are bound to be bugs. I'll list them here as I find them. The code you see here is the final version, with all bugs fixed.
buffer_pos == buffer_size - 1: left in from trying to plug in some lookahead code. It didn't work out, but I forgot to remove it, causing fallthrough to the buffer_size == 0 check, which (if true) returns EOF, preventing input initialization. Fixed by changing it to buffer_pos == buffer_size.
token->kind == TOK_STRING_ASCII failed in token_string: I forgot to expand the check for identifiers, which also use token_string. Fixed by changing it to token->kind == TOK_STRING_ASCII || token->kind == TOK_ID || token->kind == TOK_TYPEDEF_NAME.
hash_table_get was called with a freed key. Fixed by moving the call to free after the call to hash_table_get.
hash_table_get in token_create_string created a double free. Fixed by rewriting part of the function.
int_ > INT32_MAX did not work due to some weird casting. Added an explicit cast.
That's it! The C Minus tokenizer is complete. It's hopefully pretty understandable, and given the testing I've put it through, it should be pretty robust.
Next time, we'll start on the parser.
Source code for the tokenizer is available here, header file is available here.
Source code for the input module is available here, header file is available here.
Source code for the hash table is available here, header file is available here.
Source code for the token module is available here, header file is available here.
A lot of the logic for this project is from either the Dragon Book, Engineering a Compiler, or LCC: A Retargetable Compiler for ANSI C. Grammars are from The C Reference Manual.
I got the idea for using zero-length arrays for optional content (the struct hack) from hacking on some weird virtio drivers (they seem to love it).
Crafting Interpreters by Bob Nystrom inspired me to do this project, so if you see any similarities, there's probably some unintentional influence there.
The code for the ELF hash function is from the glibc source code.
The idea for decision trees came from LCC.
Literate programming rendered using literate.