Since this is the first article, I'll outline the project structure for the C- compiler.
The project has a series of pretty typical compiler stages: lexing, parsing, semantic analysis, and code generation.
As far as possible, I'd like to keep each of these stages separate. One benefit of this is that it greatly simplifies memory management. I plan to use an arena allocator for each stage: by making sure the only thing on the actual heap is the stage's output, with all temporary data stored in the arena, I can free all the memory a stage used by simply freeing the arena.
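To make that concrete, here's a minimal sketch of the kind of arena I have in mind (arena_t, arena_alloc, and arena_reset are illustrative names, not code from this project):

#include <stddef.h>

typedef struct {
  char *base;  // Start of the arena's backing memory
  size_t used; // Bytes handed out so far
  size_t size; // Total capacity
} arena_t;

void *arena_alloc(arena_t *a, size_t n) {
  n = (n + 7) & ~(size_t)7; // Keep allocations 8-byte aligned
  if (a->used + n > a->size)
    return NULL; // Arena exhausted
  void *p = a->base + a->used;
  a->used += n;
  return p;
}

// Freeing every temporary in the stage is a single reset.
void arena_reset(arena_t *a) { a->used = 0; }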
Here are some rules (more like guidelines) that I plan to follow for this project; they're mostly just to keep things simple and consistent.
PROGRAM LIKE IT'S 1999
640 KB ought to be enough for anybody. - Bill Gates (apocryphal)
Maybe not that little, but I'm going to try to keep the project as simple as possible. 640 KB probably won't be enough, but I'll still aim for less than 10 MB of memory usage.
This places a lot of constraints on the project, but I think it's a good exercise in minimalism.
Some consequences of this are that I'll have to use memory-efficient algorithms, be very careful about program structure, and avoid some of the bigger libraries (which will also help with making this project self-hosting in the future).
PROGRAM IN C++--
I'm not a big fan of C++, but its class system helps prevent a lot of ugly bugs. To that end, I'm going to try to keep data structures out of header files, and only expose functions that operate on those data structures, to create a sort of approximation of a class. This has a few benefits: implementation details stay hidden behind a small interface, other modules can't quietly depend on internal state, and an implementation can be swapped out without touching its callers.
DON'T GET FANCY
My goal here isn't to write the fastest compiler in the world, or the most complete. I just want to make something that works and can be understood by someone else.
That means I'm going to avoid a lot of the tricks that are used in production compilers, and focus more on simplicity and readability.
DESIGN FOR DEBUGGING
This code is going to be peppered with asserts and contain mechanisms to print out the state of the program at any point.
This might be painful, but it'll make debugging a lot simpler and let users look under the hood.
SMART DATA, STUPID CODE
A lot of the time, the right data structure can replace 50-100 lines of procedural code. I'm going to try to design data structures that make the algorithms as simple as possible.
For example, instead of writing 50-100 lines of code to match every keyword in the language, I can just use a simple hash table.
THIS IS A LITERATE PROGRAM! Go to this link to see the file that generated this HTML.
A lexical analyzer reads source code and produces tokens, which are the smallest units of meaning in a language. These tokens are then used by the parser to build an abstract syntax tree (AST), which represents the structure of the program.
For example, in the C programming language, tokens include keywords (if, else, while, etc.), identifiers (variable names), numbers, and punctuation (braces, semicolons, etc.).
Given a string like int main() { return 0; }, the lexer would produce a series of tokens like INT, IDENTIFIER(main), LPAREN, RPAREN, LBRACE, RETURN, INTCONSTANT(0), SEMICOLON, RBRACE.
I'll break the lexer into several modules:
token.c will contain the token data structure and functions to create and destroy tokens. input.c will contain the input data structure and functions to read from the input file. tokenizer.c will contain the main lexer logic.
We'll need several components to represent a token: a kind, such as TOK_IF or TOK_INTEGER_U32; the line and column where it appears; and optional data, such as an integer value or a string.
To implement a class system in C, we'll hide the token implementation details behind an opaque pointer. We'll use a forward declaration of the token type in the header file and define the token type in the implementation file.
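In other words, token.h only forward-declares the type, so callers can hold token_t pointers but can never touch the fields. A sketch of the idea:

// token.h: callers see only an opaque handle.
struct token;
typedef struct token token_t;

// token.c: the full struct definition lives here (shown below),
// invisible to every other module.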
We'll need functions to create and destroy tokens.
token_t *token_create(c_token_types kind, int lin, int col, int len);
token_t *token_create_int(c_token_types kind, int lin, int col, int64_t i, int len);
token_t *token_create_float(c_token_types kind, int lin, int col, double f, int len);
token_t *token_create_char(c_token_types kind, int lin, int col, char c, int len);
token_t *token_create_string(c_token_types kind, int lin, int col, const char *s, int len);
void token_destroy(token_t *token);
Used in section 41
We'll also need functions to access the token data.
c_token_types token_type(token_t *token);
int64_t token_int(token_t *token);
double token_float(token_t *token);
const char *token_string(token_t *token);
char token_char(token_t *token);
int token_line(token_t *token);
int token_column(token_t *token);
void print_token(token_t *tok);
We'll need types to represent the different kinds of tokens.
typedef enum {
  // Control Keywords
  TOK_IF, TOK_ELSE, TOK_SWITCH, TOK_CASE, TOK_DEFAULT, TOK_WHILE, TOK_DO,
  TOK_FOR, TOK_CONTINUE, TOK_BREAK, TOK_RETURN, TOK_GOTO,
  // Type Keywords
  TOK_VOID, TOK_CHAR, TOK_SHORT, TOK_INT, TOK_LONG, TOK_FLOAT, TOK_DOUBLE,
  TOK_SIGNED, TOK_UNSIGNED, TOK_STRUCT, TOK_UNION, TOK_ENUM, TOK_TYPEDEF,
  // Storage Class/Specifier Keywords
  TOK_AUTO, TOK_REGISTER, TOK_STATIC, TOK_EXTERN, TOK_CONST, TOK_VOLATILE,
  // Misc Keywords
  TOK_SIZEOF,
  // Operators
  TOK_ADD,            // +
  TOK_SUB,            // -
  TOK_MUL,            // *
  TOK_DIV,            // /
  TOK_MOD,            // %
  TOK_BIT_AND,        // &
  TOK_BIT_OR,         // |
  TOK_BIT_XOR,        // ^
  TOK_BIT_NOT,        // ~
  TOK_LSHIFT,         // <<
  TOK_RSHIFT,         // >>
  TOK_NOT,            // !
  TOK_ASSIGN,         // =
  TOK_LT,             // <
  TOK_GT,             // >
  TOK_INC,            // ++
  TOK_DEC,            // --
  TOK_EQ,             // ==
  TOK_NE,             // !=
  TOK_LE,             // <=
  TOK_GE,             // >=
  TOK_AND,            // &&
  TOK_OR,             // ||
  TOK_MEMBER_POINTER, // ->
  TOK_MEMBER,         // .
  TOK_COND_DECISION,  // :
  TOK_COND,           // ?
  TOK_ASSIGN_ADD,     // +=
  TOK_ASSIGN_SUB,     // -=
  TOK_ASSIGN_MUL,     // *=
  TOK_ASSIGN_DIV,     // /=
  TOK_ASSIGN_MOD,     // %=
  TOK_ASSIGN_BITAND,  // &=
  TOK_ASSIGN_BITOR,   // |=
  TOK_ASSIGN_BITXOR,  // ^=
  TOK_ASSIGN_LSHIFT,  // <<=
  TOK_ASSIGN_RSHIFT,  // >>=
  // Separators
  TOK_LEFT_PAREN,    // (
  TOK_RIGHT_PAREN,   // )
  TOK_LEFT_BRACKET,  // [
  TOK_RIGHT_BRACKET, // ]
  TOK_LEFT_BRACE,    // {
  TOK_RIGHT_BRACE,   // }
  TOK_COMMA,         // ,
  TOK_SEMICOLON,     // ;
  TOK_DOT,           // .
  TOK_ELLIPSIS,      // ...
  TOK_HASH,          // #
  // Identifiers
  TOK_ID, TOK_TYPEDEF_NAME,
  // Constants
  TOK_INTEGER_U32,  // u
  TOK_INTEGER_U64,  // ul
  TOK_INTEGER_S32,  // (no suffix)
  TOK_INTEGER_S64,  // l
  TOK_FLOAT_32,     // f
  TOK_FLOAT_64,     // (no suffix)
  TOK_CHAR_CONST,   // 'c'
  TOK_STRING_ASCII, // "string" (width of 8 bits)
  // Special
  TOK_EOF, TOK_ERROR,
} c_token_types;
We bring this all together in token.h. Line and column are exposed as global variables because skip_whitespace will need to update them.
Now that we have the interface, we can implement the token data structure. Beyond the public fields, we'll need a way to detect corruption and a place to put optional data. To verify the token isn't corrupt, we'll use a magic tag: TOK_MAGIC_1 marks a token with optional data, and TOK_MAGIC_2 marks a token without optional data.
A zero-length array is used in the token data structure. This GCC extension allows us to allocate memory for the token data structure and the token data in one allocation.
#define TOK_MAGIC_1 0x544F4B454E544F4Bul // "TOKENTOK"
#define TOK_MAGIC_2 0x544F4B544F4B454Eul // "TOKTOKEN"

struct token {
  long magic;
  int line;
  int column;
  short kind;
  long opt_data[0];
};
typedef struct token token_t;

struct token_data {
  union {
    int64_t i;
    double f;
    const char *s;
    char c;
  } data;
};
typedef struct token_data token_data_t;

int column = 1;
int line = 1;
Used in section 42
We'll implement an interface for accessing the token data and a macro for accessing optional data.
#define token_data(token) ((struct token_data *)((token)->opt_data))

c_token_types token_type(token_t *token) {
  assert(token->magic == TOK_MAGIC_1 || token->magic == TOK_MAGIC_2);
  return token->kind;
}

int64_t token_int(token_t *token) {
  assert(token->kind == TOK_INTEGER_U32 || token->kind == TOK_INTEGER_U64 ||
         token->kind == TOK_INTEGER_S32 || token->kind == TOK_INTEGER_S64);
  assert(token->magic == TOK_MAGIC_1);
  return token_data(token)->data.i;
}

double token_float(token_t *token) {
  assert(token->kind == TOK_FLOAT_32 || token->kind == TOK_FLOAT_64);
  assert(token->magic == TOK_MAGIC_1);
  return token_data(token)->data.f;
}

const char *token_string(token_t *token) {
  assert(token->kind == TOK_STRING_ASCII || token->kind == TOK_ID ||
         token->kind == TOK_TYPEDEF_NAME);
  assert(token->magic == TOK_MAGIC_1);
  return token_data(token)->data.s;
}

char token_char(token_t *token) {
  assert(token->kind == TOK_CHAR_CONST);
  assert(token->magic == TOK_MAGIC_1);
  return token_data(token)->data.c;
}

int token_line(token_t *token) {
  assert(token->magic == TOK_MAGIC_1 || token->magic == TOK_MAGIC_2);
  return token->line;
}

int token_column(token_t *token) {
  assert(token->magic == TOK_MAGIC_1 || token->magic == TOK_MAGIC_2);
  return token->column;
}
Used in section 42
For debugging, we'll add a function to print the token type.
This function returns a string with the token type name.
const char *token_name_from_type(c_token_types type) { switch (type) { case TOK_IF: return "TOK_IF"; case TOK_ELSE: return "TOK_ELSE"; case TOK_SWITCH: return "TOK_SWITCH"; case TOK_CASE: return "TOK_CASE"; case TOK_DEFAULT: return "TOK_DEFAULT"; case TOK_WHILE: return "TOK_WHILE"; case TOK_DO: return "TOK_DO"; case TOK_FOR: return "TOK_FOR"; case TOK_CONTINUE: return "TOK_CONTINUE"; case TOK_BREAK: return "TOK_BREAK"; case TOK_RETURN: return "TOK_RETURN"; case TOK_GOTO: return "TOK_GOTO"; case TOK_VOID: return "TOK_VOID"; case TOK_CHAR: return "TOK_CHAR"; case TOK_SHORT: return "TOK_SHORT"; case TOK_INT: return "TOK_INT"; case TOK_LONG: return "TOK_LONG"; case TOK_FLOAT: return "TOK_FLOAT"; case TOK_DOUBLE: return "TOK_DOUBLE"; case TOK_SIGNED: return "TOK_SIGNED"; case TOK_UNSIGNED: return "TOK_UNSIGNED"; case TOK_STRUCT: return "TOK_STRUCT"; case TOK_UNION: return "TOK_UNION"; case TOK_ENUM: return "TOK_ENUM"; case TOK_TYPEDEF: return "TOK_TYPEDEF"; case TOK_AUTO: return "TOK_AUTO"; case TOK_REGISTER: return "TOK_REGISTER"; case TOK_STATIC: return "TOK_STATIC"; case TOK_EXTERN: return "TOK_EXTERN"; case TOK_CONST: return "TOK_CONST"; case TOK_VOLATILE: return "TOK_VOLATILE"; case TOK_SIZEOF: return "TOK_SIZEOF"; case TOK_ADD: return "TOK_ADD"; case TOK_SUB: return "TOK_SUB"; case TOK_MUL: return "TOK_MUL"; case TOK_DIV: return "TOK_DIV"; case TOK_MOD: return "TOK_MOD"; case TOK_BIT_AND: return "TOK_BIT_AND"; case TOK_BIT_OR: return "TOK_BIT_OR"; case TOK_BIT_XOR: return "TOK_BIT_XOR"; case TOK_BIT_NOT: return "TOK_BIT_NOT"; case TOK_LSHIFT: return "TOK_LSHIFT"; case TOK_RSHIFT: return "TOK_RSHIFT"; case TOK_NOT: return "TOK_NOT"; case TOK_ASSIGN: return "TOK_ASSIGN"; case TOK_LT: return "TOK_LT"; case TOK_GT: return "TOK_GT"; case TOK_INC: return "TOK_INC"; case TOK_DEC: return "TOK_DEC"; case TOK_EQ: return "TOK_EQ"; case TOK_NE: return "TOK_NE"; case TOK_LE: return "TOK_LE"; case TOK_GE: return "TOK_GE"; case TOK_AND: return "TOK_AND"; case TOK_OR: return "TOK_OR"; case TOK_MEMBER_POINTER: return "TOK_MEMBER_POINTER"; case TOK_MEMBER: return "TOK_MEMBER"; case TOK_COND_DECISION: return "TOK_COND_DECISION"; case TOK_COND: return "TOK_COND"; case TOK_ASSIGN_ADD: return "TOK_ASSIGN_ADD"; case TOK_ASSIGN_SUB: return "TOK_ASSIGN_SUB"; case TOK_ASSIGN_MUL: return "TOK_ASSIGN_MUL"; case TOK_ASSIGN_DIV: return "TOK_ASSIGN_DIV"; case TOK_ASSIGN_MOD: return "TOK_ASSIGN_MOD"; case TOK_ASSIGN_BITAND: return "TOK_ASSIGN_BITAND"; case TOK_ASSIGN_BITOR: return "TOK_ASSIGN_BITOR"; case TOK_ASSIGN_BITXOR: return "TOK_ASSIGN_BITXOR"; case TOK_ASSIGN_LSHIFT: return "TOK_ASSIGN_LSHIFT"; case TOK_ASSIGN_RSHIFT: return "TOK_ASSIGN_RSHIFT"; case TOK_HASH: return "TOK_HASH"; case TOK_ID: return "TOK_ID"; case TOK_TYPEDEF_NAME: return "TOK_TYPEDEF_NAME"; case TOK_INTEGER_U32: return "TOK_INTEGER_U32"; case TOK_INTEGER_U64: return "TOK_INTEGER_U64"; case TOK_INTEGER_S32: return "TOK_INTEGER_S32"; case TOK_INTEGER_S64: return "TOK_INTEGER_S64"; case TOK_FLOAT_32: return "TOK_FLOAT_32"; case TOK_FLOAT_64: return "TOK_FLOAT_64"; case TOK_CHAR_CONST: return "TOK_CHAR_CONST"; case TOK_STRING_ASCII: return "TOK_STRING_ASCII"; case TOK_EOF: return "TOK_EOF"; case TOK_ERROR: return "TOK_ERROR"; case TOK_LEFT_PAREN: return "TOK_LEFT_PAREN"; case TOK_RIGHT_PAREN: return "TOK_RIGHT_PAREN"; case TOK_LEFT_BRACKET: return "TOK_LEFT_BRACKET"; case TOK_RIGHT_BRACKET: return "TOK_RIGHT_BRACKET"; case TOK_LEFT_BRACE: return "TOK_LEFT_BRACE"; case TOK_RIGHT_BRACE: return "TOK_RIGHT_BRACE"; case TOK_COMMA: return "TOK_COMMA"; 
case TOK_SEMICOLON: return "TOK_SEMICOLON"; case TOK_DOT: return "TOK_DOT"; case TOK_ELLIPSIS: return "TOK_ELLIPSIS"; } return "UNKNOWN"; }
Used in section 14
This function adds escape characters to a string for printing.
#define clamp(x, min, max) ((x) < (min) ? (min) : (x) > (max) ? (max) : (x))

char *re_escape_string(const char *str) {
  int len = strlen(str);
  // Worst case each byte becomes a four-character \xNN escape.
  char *buf = malloc(len * 4 + 1);
  if (!buf) {
    fprintf(stderr, "Out of memory. Cannot escape string\n");
    exit(1);
  }
  int i = 0;
  for (int j = 0; j < len; j++) {
    switch (str[j]) {
    case '\a': buf[i++] = '\\'; buf[i++] = 'a'; break;
    case '\b': buf[i++] = '\\'; buf[i++] = 'b'; break;
    case '\f': buf[i++] = '\\'; buf[i++] = 'f'; break;
    case '\n': buf[i++] = '\\'; buf[i++] = 'n'; break;
    case '\r': buf[i++] = '\\'; buf[i++] = 'r'; break;
    case '\t': buf[i++] = '\\'; buf[i++] = 't'; break;
    case '\v': buf[i++] = '\\'; buf[i++] = 'v'; break;
    case '\\': buf[i++] = '\\'; buf[i++] = '\\'; break;
    case '\'': buf[i++] = '\\'; buf[i++] = '\''; break;
    case '"': buf[i++] = '\\'; buf[i++] = '"'; break;
    default: {
      unsigned char uc = (unsigned char)str[j]; // Avoid negative shifts on signed char
      if (isprint(uc)) {
        buf[i++] = str[j];
      } else {
        buf[i++] = '\\';
        buf[i++] = 'x';
        buf[i++] = "0123456789abcdef"[clamp(uc >> 4, 0, 0xf)];
        buf[i++] = "0123456789abcdef"[clamp(uc & 0xf, 0, 0xf)];
      }
    }
    }
  }
  buf[i] = '\0';
  return buf;
}
Used in section 14
This function prints the token type and value.
void print_token(token_t *tok) { if (!tok) { printf("NULL\n"); return; } const char *name = token_name_from_type(tok->kind); switch (tok->kind) { case TOK_ID: case TOK_STRING_ASCII: { char *escaped = re_escape_string(token_string(tok)); printf("%s: \"%s\"@%d:%d\n", name, escaped, tok->line, tok->column); free(escaped); break; } case TOK_TYPEDEF_NAME: { char *escaped = re_escape_string(token_string(tok)); printf("%s: %s@%d:%d\n", name, escaped, tok->line, tok->column); free(escaped); break; } case TOK_CHAR_CONST: printf("%s: '%c'@%d:%d\n", name, token_char(tok), tok->line, tok->column); break; case TOK_INTEGER_S32: case TOK_INTEGER_U32: case TOK_INTEGER_S64: case TOK_INTEGER_U64: printf("%s: %ld@%d:%d\n", name, token_int(tok), tok->line, tok->column); break; case TOK_FLOAT_32: case TOK_FLOAT_64: printf("%s: %f@%d:%d\n", name, token_float(tok), tok->line, tok->column); break; default: printf("%s@%d:%d\n", name, tok->line, tok->column); break; } }
Used in section 14
Now we can implement functions to create and destroy tokens. We'll start with the easy ones.
token_t *token_data_create(c_token_types kind, int lin, int col, int len) { token_t *token = malloc(sizeof(token_t) + sizeof(struct token_data)); if (!token) { fputs("Out of memory\n", stderr); exit(1); } token->magic = TOK_MAGIC_1; token->line = lin; token->column = col; column += len; token->kind = kind; return token; } token_t *token_create(c_token_types kind, int lin, int col, int len) { token_t *token = malloc(sizeof(token_t)); if (!token) { fputs("Out of memory\n", stderr); exit(1); } token->magic = TOK_MAGIC_2; token->line = lin; token->column = col; column += len; token->kind = kind; return token; } token_t *token_create_int(c_token_types kind, int lin, int col, int64_t i, int len) { token_t *token = token_data_create(kind, lin, col, len); token_data(token)->data.i = i; return token; } token_t *token_create_float(c_token_types kind, int lin, int col, double f, int len) { token_t *token = token_data_create(kind, lin, col, len); token_data(token)->data.f = f; return token; } token_t *token_create_char(c_token_types kind, int lin, int col, char c, int len) { token_t *token = token_data_create(kind, lin, col, len); token_data(token)->data.c = c; return token; } void token_destroy(token_t *token) { if (token->magic == TOK_MAGIC_1 || token->magic == TOK_MAGIC_2) { free(token); } else { fputs("Corrupt token\n", stderr); exit(1); } }
token_create_string can be implemented either the easy way or the right way. Let's try the easy way first.
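The easy-way chunk isn't shown here, but it would be little more than a strdup into the token's optional data; a sketch, reusing token_data_create from above:

token_t *token_create_string(c_token_types kind, int lin, int col,
                             const char *s, int len) {
  token_t *token = token_data_create(kind, lin, col, len);
  token_data(token)->data.s = strdup(s); // One fresh copy per call
  return token;
}

Note that token_destroy never frees that copy, so in this version every string effectively lives forever.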
There's an issue with this approach. token_create_string will be called for every identifier and every string in a program. Imagine a large program, say a shell, with a bunch of user input and output. That program will likely have 20-40 calls each to fprintf, fscanf, strchr, and strtok. We create a new string for each of those calls. That's a lot of duplicates, and it can quickly add up to a lot of memory usage.
To fix this, we use a hash table to store the strings. We'll define a hash table in hash_table.h and hash_table.c.
A hash table is a data structure that maps keys to values. It's commonly used for implementing symbol tables to store variables and functions. To create a generic hash table, we'll need an array of buckets, a hash function, a key-comparison function, and an optional destructor for keys and values.
Let's start with the interface:
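The interface chunk isn't reproduced here; based on how the implementation below uses it, hash_table.h would look roughly like this:

#ifndef HASH_TABLE_H
#define HASH_TABLE_H

// Callbacks supplied by the user of the table.
typedef int (*hash_table_cmp_fn)(const void *key1, const void *key2);
typedef unsigned int (*hash_table_hash_fn)(const void *key);
typedef void (*hash_table_dtor)(void *value, int is_key);

typedef struct hash_table hash_table_t;

hash_table_t *hash_table_create(int size, hash_table_cmp_fn cmp,
                                hash_table_hash_fn hash, hash_table_dtor dtor);
void hash_table_destroy(hash_table_t *table);
void *hash_table_get(const hash_table_t *table, const void *key);
void hash_table_put(hash_table_t *table, void *key, void *value);
void hash_table_remove(hash_table_t *table, const void *key);

#endif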
Now let's implement the hash table:
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include "hash_table.h"

{Hash Table Data Structure, 27}
{Hash Table Entry Data Structure, 28}

hash_table_t *hash_table_create(int size, hash_table_cmp_fn cmp,
                                hash_table_hash_fn hash, hash_table_dtor dtor) {
  {Allocate and Initialize Hash Table, 29}
  return table;
}

void hash_table_destroy(hash_table_t *table) {
  if (!table)
    return;
  {Destroy Entries, 30}
  free(table->entries);
  free(table);
}

void *hash_table_get(const hash_table_t *table, const void *key) {
  if (!table || !key)
    return NULL;
  {Get Entry By Hash, 31}
  {Loop Through Entries and Return Value if Match, 35}
  return NULL;
}

void hash_table_put(hash_table_t *table, void *key, void *value) {
  if (!table || !key)
    return;
  {Get Entry By Hash, 31}
  {Loop Through Entries and Replace Value if Key Matches, 32}
  {Allocate New Entry if No Match, 33}
}

void hash_table_remove(hash_table_t *table, const void *key) {
  if (!table || !key)
    return;
  {Get Entry By Hash, 31}
  {Loop Through Entries and Remove Entry if Key Matches, 34}
}

#ifdef TEST_HASH_TABLE
#include <assert.h>

static int string_cmp(const void *key1, const void *key2) {
  return strcmp((const char *)key1, (const char *)key2);
}

static unsigned int string_hash(const void *key) {
  unsigned int hash = 5381;
  const unsigned char *str = key;
  int c;
  while ((c = *str++)) {
    hash = ((hash << 5) + hash) + c;
  }
  return hash;
}

int main(void) {
  hash_table_t *table = hash_table_create(16, string_cmp, string_hash, NULL);
  assert(table != NULL);
  hash_table_put(table, "foo", "bar");
  hash_table_put(table, "foo", "baz");
  assert(strcmp((const char *)hash_table_get(table, "foo"), "baz") == 0);
  hash_table_remove(table, "foo");
  assert(hash_table_get(table, "foo") == NULL);
  hash_table_destroy(table);
  printf("All tests passed!\n");
  return 0;
}
#endif
The hash table data structure contains an array of entry pointers, the size of the array, and function pointers for comparison, hashing, and destruction.
Each entry in the hash table contains a key, a value, and a pointer to the next entry in the chain for collision resolution via chaining.
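Neither structure chunk is shown here; from the way the code above uses them, they would look roughly like:

struct hash_table_entry {
  void *key;
  void *value;
  struct hash_table_entry *next; // Chain for collision resolution
};

struct hash_table {
  struct hash_table_entry **entries; // Array of bucket heads
  int size;                          // Number of buckets
  hash_table_cmp_fn cmp;
  hash_table_hash_fn hash;
  hash_table_dtor dtor;
};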
To allocate a hash table, we allocate memory for the table structure and its entries, initialize the entries to NULL, and set the function pointers.
hash_table_t *table = malloc(sizeof(struct hash_table));
if (!table) {
  fprintf(stderr, "Error: Out of memory, could not allocate hash table\n");
  exit(EXIT_FAILURE);
}
table->entries = calloc(size, sizeof(struct hash_table_entry *));
if (!table->entries) {
  fprintf(stderr, "Error: Out of memory, could not allocate hash table entries\n");
  free(table);
  exit(EXIT_FAILURE);
}
table->size = size;
table->cmp = cmp;
table->hash = hash;
table->dtor = dtor;
Used in section 26
To destroy the entries in a hash table, we iterate through all entries, free the keys and values using the destructor if provided, and free the entry itself.
for (int i = 0; i < table->size; i++) {
  struct hash_table_entry *entry = table->entries[i];
  while (entry) {
    struct hash_table_entry *next = entry->next;
    if (table->dtor) {
      table->dtor(entry->key, 1);
      table->dtor(entry->value, 0);
    }
    free(entry);
    entry = next;
  }
}
Used in section 26
To retrieve an entry's hash bucket, we apply the hash function to the key and take the modulus of the result with the table size.
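The chunk itself is tiny; a sketch consistent with how the surrounding code uses hash and entry:

unsigned int hash = table->hash(key) % table->size;
struct hash_table_entry *entry = table->entries[hash];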
When putting a new entry in the table, we first check if the key already exists. If it does, we update the value; otherwise, we create a new entry.
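The replace-if-present chunk isn't shown; a minimal sketch, assuming the old value should be destroyed when a new one takes its place:

while (entry) {
  if (table->cmp(entry->key, key) == 0) {
    // Key already present: destroy the old value and keep the new one.
    if (table->dtor)
      table->dtor(entry->value, 0);
    entry->value = value;
    return;
  }
  entry = entry->next;
}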
If no matching key is found, we create a new entry and insert it at the beginning of the hash bucket.
This can improve performance due to a property called temporal locality: an entry that was just accessed is likely to be accessed again soon, and entries at the front of the chain are the quickest to find.
struct hash_table_entry *new_entry = malloc(sizeof(struct hash_table_entry));
if (!new_entry) {
  fprintf(stderr, "Error: Out of memory, could not allocate hash table entry\n");
  exit(EXIT_FAILURE);
}
new_entry->key = key;
new_entry->value = value;
new_entry->next = table->entries[hash];
table->entries[hash] = new_entry;
Used in section 26
To remove an entry, we find its bucket, update the linked list to bypass it, then free the entry and its contents.
struct hash_table_entry *prev = NULL;
while (entry) {
  if (table->cmp(entry->key, key) == 0) {
    if (prev)
      prev->next = entry->next;
    else
      table->entries[hash] = entry->next;
    if (table->dtor) {
      table->dtor(entry->key, 1);
      table->dtor(entry->value, 0);
    }
    free(entry);
    return;
  }
  prev = entry;
  entry = entry->next;
}
Used in section 26
To retrieve a value from a given bucket, we just walk the list and return the value if a matching key is found.
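That lookup chunk would be just a few lines:

while (entry) {
  if (table->cmp(entry->key, key) == 0)
    return entry->value;
  entry = entry->next;
}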
We're now almost ready to implement token_create_string the right way. First, we'll need a good hash function.
Hash functions are a very interesting topic and there's a lot of good research on them. The hash function we use should be fast, have a low collision rate, and be able to handle strings of any length.
We can't just sum the characters in a string, because then "stop" and "pots" would have the same hash. Multiplying them together has the same problem. If we raise each character to the power of its position in the string, we get a better distribution, but it's still awful.
Using a simple Python program, I brute-forced all 456,976 possible 4-character strings and ran our power-hash over them; it produced only 3,760 unique hash values, which is terrible.
Instead of trying to come up with a new hash function, we can use one that's been well-tested and is known to work well.
The first time I wrote this, I used the hash function from the 'Red Dragon Book' (Compilers: Principles, Techniques, and Tools).
unsigned long hash_string(void *key) {
  unsigned long hash = 0, g;
  char *p = key;
  while (*p) {
    hash = (hash << 4) + *p++;
    if ((g = hash & 0xf0000000) != 0) {
      hash ^= g >> 24;
      hash ^= g;
    }
  }
  return hash;
}
Used in section 39
This is a bit slow on modern processors because of the extra masking and branching it does on every character. We can do better. Let's use its descendant, ELFHash, from glibc.
As you can see in the code below, this function avoids extra operations and should be much faster.
unsigned int hash_string(const void *key) {
  unsigned long hash = 0, hi = 0;
  const char *p = key;
  hash = *p;
  if (hash != 0 && p[1] != 0) {
    hash = (hash << 4) + p[1];
    if (p[2] != 0) {
      hash = (hash << 4) + p[2];
      if (p[3] != 0) {
        hash = (hash << 4) + p[3];
        if (p[4] != 0) {
          hash = (hash << 4) + p[4];
          p += 5;
          while (*p != 0) {
            hash = (hash << 4) + *p++;
            hi = hash & 0xf0000000l;
            hash ^= hi >> 24;
          }
          hash &= 0x0fffffffl;
        }
      }
    }
  }
  return hash;
}
Used in section 39
We also need a comparison function for strings.
Finally, we'll need a destructor for entries.
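Neither chunk is reproduced here; below are minimal sketches that match the util.h declarations. The is_key guard in the destructor is an assumption on my part: in the string table the key and value are the same pointer, so it must only be freed once.

int cmp_string(const void *key1, const void *key2) {
  return strcmp((const char *)key1, (const char *)key2);
}

void dtor_string(void *value, int is_key) {
  // Key and value alias each other in the string table;
  // free the allocation only once, when called for the key.
  if (is_key)
    free(value);
}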
These functions go in util.c
#include <string.h>
#include <stdlib.h>
#include "hash_table.h"

{String Comparison, 37}
{Hash Function, 36}
{String Destructor, 38}
#ifndef UTIL_H
#define UTIL_H
#include "hash_table.h"

int cmp_string(const void *key1, const void *key2);
unsigned int hash_string(const void *key);
void dtor_string(void *value, int is_key);

#endif
Now we can implement token_create_string the right way.
You might notice that we're using the same string as both key and value. A hash table used this way is normally called a set. We're using it to store strings, but we could use it to deduplicate anything.
hash_table_t *string_table;

token_t *token_create_string(c_token_types kind, int lin, int col,
                             const char *s, int len) {
  if (string_table == NULL) {
    string_table = hash_table_create(2048, cmp_string, hash_string, dtor_string);
  }
  token_t *token = token_data_create(kind, lin, col, len);
  char *key = hash_table_get(string_table, (void *)s);
  if (key == NULL) {
    key = strdup(s);
    hash_table_put(string_table, key, key);
  }
  token_data(token)->data.s = key;
  return token;
}
Used in section 42
We'll add an external declaration for string_table in token.h so other programs can take advantage of it.
#ifndef TOKEN_H
#define TOKEN_H
#include <stdint.h>      // We use this for int64_t
#include "hash_table.h"  // We need this for the string table

{Token Types, 10}
{Opaque Token Type, 7}
{Token Creation and Destruction Interface, 8}
{Token Interface, 9}

extern hash_table_t *string_table;
extern int column;
extern int line;

#endif
Finally, we implement the token data structure in token.c.
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <assert.h>
#include <ctype.h>
#include "token.h"
#include "hash_table.h"
#include "util.h"

{Token Data Structure, 12}
{Token Data Access, 13}
{Token Creation and Destruction, 18}
{Token Create String, 19}
{Token Debugging, 14}
Input will provide a simple interface for reading characters from a file. The stream itself is deliberately hidden from the tokenizer, so that the tokenizer doesn't have to worry about buffering or anything like that.
void input_init(const char *filename);
int input_getc(void);
void input_ungetc(int c);
void input_destroy(void);
Used in section 53
When the program wants to start reading a file, it calls input_init with the filename. It can then call input_getc to get the next character in the file. If there's no more input, input_getc will return EOF.
There's also an input_ungetc function, which allows the program to put a character back into the stream. I'll only allow a small, fixed number of characters to be put back, but that should be enough for the tokenizer. Finally, when the program is done reading the file, it should call input_destroy to clean up.
Per rule 1, we're trying to keep memory usage low. That means that instead of reading the entire file into memory, we'll need to read it in chunks. There are a couple of choices for how to do this:
Read a line at a time: this is the more natural approach, but it has two drawbacks. First, it requires a large buffer to store the line (BUFSIZ, commonly 8192 bytes). Second, if a line is longer than BUFSIZ, we have to read it in chunks anyway.
Choose some arbitrary buffer size and read that many bytes at a time: This is the approach I'm going to take. It's a little less natural, but it's more memory efficient.
Input will read chunks of 128 bytes at a time, reusing the same static buffer. This limitation is not visible to the tokenizer, which only sees the input_getc interface.
When the buffer is exhausted, input_getc will read the next chunk of the file into it.
The implementation of the input module is pretty straightforward. We have the following data structures and defines as globals:
#define CHUNK_SIZE 128
static char buffer[CHUNK_SIZE];
static int buffer_pos = 0;
static int buffer_size = 0;
static char unget_buffer_stack[8];
static int unget_buffer_stack_pos = 0;
static FILE *file = NULL;
Used in section 52
When the program calls input_init, we open the file.
void input_init(const char *filename) {
  file = fopen(filename, "r");
  if (file == NULL) {
    fprintf(stderr, "Error: Cannot open file %s\n", filename);
    exit(1);
  }
}
Used in section 52
When the program calls input_getc, we return the next character in the buffer. If the buffer is exhausted, we refill it with the next chunk of the file. (Line and column tracking happens in the tokenizer, which owns those globals.)
int input_getc(void) {
  if (unget_buffer_stack_pos > 0) {
    return unget_buffer_stack[--unget_buffer_stack_pos];
  }
  if (buffer_pos == buffer_size) {
    buffer_size = fread(buffer, 1, CHUNK_SIZE, file);
    buffer_pos = 0;
  }
  if (buffer_size == 0) {
    return EOF;
  }
  char c = buffer[buffer_pos++];
  return c;
}
Used in section 52
When the program calls input_ungetc, we push the character onto the unget stack.
void input_ungetc(int c) {
  unget_buffer_stack[unget_buffer_stack_pos++] = c;
}
Used in section 52
Since we're not using dynamic memory allocation, cleanup is pretty simple.
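The cleanup chunk just closes the stream; a sketch:

void input_destroy(void) {
  if (file != NULL) {
    fclose(file);
    file = NULL;
  }
}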
We put the whole thing together in input.c.
We'll need an external declaration for file in input.h so other programs can take advantage of it.
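A sketch of what input.h would contain (note that for the extern declaration to link, the definition of file in input.c can't stay static):

#ifndef INPUT_H
#define INPUT_H
#include <stdio.h>

void input_init(const char *filename);
int input_getc(void);
void input_ungetc(int c);
void input_destroy(void);

extern FILE *file;

#endif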
We'll implement the lexer interface in tokenizer.h. The tokenization interface will have a couple of functions: next_token will return the next token in the input stream, init_tokenizer will initialize the tokenizer, and destroy_tokenizer will clean up.
We'll also have some helper functions for lookahead and matching.
peek_token will return the next token without consuming it (it technically does advance the input stream, but it saves the token so it can be reused). consume will consume the next token if it matches a given kind; if it doesn't match, it will print an error message and exit. A sketch of the full header follows.
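Putting those together, tokenizer.h would look roughly like this (a sketch based on the functions defined below):

#ifndef TOKENIZER_H
#define TOKENIZER_H
#include "token.h"

void init_tokenizer(const char *filename);
void destroy_tokenizer(void);
token_t *next_token(void);
token_t *peek_token(void);
void reject_token(token_t *token);
void consume(c_token_types kind);
void consume_alt(c_token_types *kinds, int n);
const char *stringify_type(c_token_types type);

#endif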
Now we can finally implement the tokenizer. We define a stack for storing lookahead tokens, as well as a hash table for typedefs, which the parser will modify.
#include <ctype.h>
#include <errno.h>
#include <float.h>
#include <math.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "tokenizer.h"
#include "token.h"
#include "input.h"
#include "util.h"

token_t *left_stack[8];
int left_stack_pos = 0;
hash_table_t *typedef_table = NULL;

{Utility Functions, 57}
{Tokenization Function, 59}
Utility functions are everything that doesn't directly tokenize the input.
void init_tokenizer(const char *filename) {
input_init(filename);
typedef_table = hash_table_create(16, cmp_string, hash_string, dtor_string);
}
void destroy_tokenizer(void) {
input_destroy();
}
void reject_token(token_t *token) {
left_stack[left_stack_pos++] = token;
}
token_t *peek_token(void) {
if (left_stack_pos > 0) {
return left_stack[left_stack_pos - 1];
}
token_t *token = next_token();
reject_token(token);
return token;
}
{Stringify Type, 58}
void consume(c_token_types kind) {
token_t *token = next_token();
if (token_type(token) != kind) {
fprintf(stderr, "Error: Expected token of type \"%s\", got \"%s\"\n", stringify_type(kind), stringify_type(token_type(token)));
exit(1);
}
token_destroy(token);
}
void consume_alt(c_token_types *kinds, int n) {
token_t *token = next_token();
for (int i = 0; i < n; i++) {
if (token_type(token) == kinds[i]) {
token_destroy(token);
return;
}
}
fprintf(stderr, "Error: Expected one of the following tokens: ");
for (int i = 0; i < n; i++) {
fprintf(stderr, "\"%s\" ", stringify_type(kinds[i]));
}
fprintf(stderr, "got \"%s\"\n", stringify_type(token_type(token)));
exit(1);
}
Used in section 56
We'll need a helper function to convert token types to strings. It's pretty simple, just tedious.
const char *stringify_type(c_token_types type) { switch (type) { case TOK_IF: return "if"; case TOK_ELSE: return "else"; case TOK_SWITCH: return "switch"; case TOK_CASE: return "case"; case TOK_DEFAULT: return "default"; case TOK_WHILE: return "while"; case TOK_DO: return "do"; case TOK_FOR: return "for"; case TOK_CONTINUE: return "continue"; case TOK_BREAK: return "break"; case TOK_RETURN: return "return"; case TOK_GOTO: return "goto"; case TOK_VOID: return "void"; case TOK_CHAR: return "char"; case TOK_SHORT: return "short"; case TOK_INT: return "int"; case TOK_LONG: return "long"; case TOK_FLOAT: return "float"; case TOK_DOUBLE: return "double"; case TOK_SIGNED: return "signed"; case TOK_UNSIGNED: return "unsigned"; case TOK_STRUCT: return "struct"; case TOK_UNION: return "union"; case TOK_ENUM: return "enum"; case TOK_TYPEDEF: return "typedef"; case TOK_AUTO: return "auto"; case TOK_REGISTER: return "register"; case TOK_STATIC: return "static"; case TOK_EXTERN: return "extern"; case TOK_CONST: return "const"; case TOK_VOLATILE: return "volatile"; case TOK_SIZEOF: return "sizeof"; case TOK_ADD: return "+"; case TOK_SUB: return "-"; case TOK_MUL: return "*"; case TOK_DIV: return "/"; case TOK_MOD: return "%"; case TOK_BIT_AND: return "&"; case TOK_BIT_OR: return "|"; case TOK_BIT_XOR: return "^"; case TOK_BIT_NOT: return "~"; case TOK_LSHIFT: return "<<"; case TOK_RSHIFT: return ">>"; case TOK_NOT: return "!"; case TOK_ASSIGN: return "="; case TOK_LT: return "<"; case TOK_GT: return ">"; case TOK_INC: return "++"; case TOK_DEC: return "--"; case TOK_EQ: return "=="; case TOK_NE: return "!="; case TOK_LE: return "<="; case TOK_GE: return ">="; case TOK_AND: return "&&"; case TOK_OR: return "||"; case TOK_MEMBER_POINTER: return "->"; case TOK_MEMBER: return "."; case TOK_COND_DECISION: return ":"; case TOK_COND: return "?"; case TOK_ASSIGN_ADD: return "+="; case TOK_ASSIGN_SUB: return "-="; case TOK_ASSIGN_MUL: return "*="; case TOK_ASSIGN_DIV: return "/="; case TOK_ASSIGN_MOD: return "%="; case TOK_ASSIGN_BITAND: return "&="; case TOK_ASSIGN_BITOR: return "|="; case TOK_ASSIGN_BITXOR: return "^="; case TOK_ASSIGN_LSHIFT: return "<<="; case TOK_ASSIGN_RSHIFT: return ">>="; case TOK_HASH: return "#"; case TOK_ID: return "identifier"; case TOK_TYPEDEF_NAME: return "typedef name"; case TOK_INTEGER_U32: case TOK_INTEGER_U64: case TOK_INTEGER_S32: case TOK_INTEGER_S64: return "integer constant"; case TOK_FLOAT_32: case TOK_FLOAT_64: return "floating constant"; case TOK_CHAR_CONST: return "character constant"; case TOK_STRING_ASCII: return "string constant"; case TOK_EOF: return "EOF"; case TOK_ERROR: return "error"; case TOK_LEFT_PAREN: return "("; case TOK_RIGHT_PAREN: return ")"; case TOK_LEFT_BRACKET: return "["; case TOK_RIGHT_BRACKET: return "]"; case TOK_LEFT_BRACE: return "{"; case TOK_RIGHT_BRACE: return "}"; case TOK_COMMA: return ","; case TOK_SEMICOLON: return ";"; case TOK_DOT: return "."; case TOK_ELLIPSIS: return "..."; } return "UNKNOWN"; }
Used in section 57
Now we can implement the tokenization function. The pattern is pretty simple: we call each of the tokenization functions in turn until we find a match. If we don't find a match, we print an error message and exit. You might wonder why skip_whitespace can return a token: it makes handling the divide operator easier, since comments also start with a slash.
char file_name[1024];

{Warning/Error Functions, 60}
{Skip Whitespace, 61}
{Tokenize Identifier, 62}
{Tokenize Number, 65}
{Tokenize String, 74}
{Tokenize Character, 73}
{Tokenize Operator, 64}

token_t *next_token(void) {
  if (left_stack_pos > 0) {
    return left_stack[--left_stack_pos];
  }
  token_t *tok = skip_whitespace();
  if (tok != NULL) {
    return tok;
  }
  tok = read_identifier();
  if (tok != NULL) {
    return tok;
  }
  tok = read_number();
  if (tok != NULL) {
    return tok;
  }
  tok = read_char_constant();
  if (tok != NULL) {
    return tok;
  }
  tok = read_string_literal();
  if (tok != NULL) {
    return tok;
  }
  tok = read_operator();
  if (tok != NULL) {
    return tok;
  }
  int c = input_getc();
  if (c == EOF) {
    return NULL;
  }
  tok_warn("Warning: Ignoring unexpected character '%c' at line %d, column %d\n",
           c, line, column);
  return next_token();
}

#ifdef TEST_TOKENIZER
{Run Test, 76}
#endif
Used in section 56
We'll need a couple of helper functions to skip whitespace and print warnings/errors.
void tok_error(const char *fmt, ...) {
  va_list args;
  va_start(args, fmt);
  fprintf(stderr, "Error in file %s at line %d, column %d: ", file_name, line, column);
  vfprintf(stderr, fmt, args);
  va_end(args);
}

void tok_warn(const char *fmt, ...) {
  va_list args;
  va_start(args, fmt);
  fprintf(stderr, "Warning in file %s at line %d, column %d: ", file_name, line, column);
  vfprintf(stderr, fmt, args);
  va_end(args);
}
Used in section 59
The skip_whitespace function is pretty simple. It just skips over comments, whitespace, and line directives.
static token_t *skip_whitespace(void) {
  int c;
  while ((c = input_getc()) != EOF) {
    if (isspace(c)) { // Whitespace
      if (c == '\n') {
        line++;
        column = 1;
      } else {
        column++;
      }
    } else if (c == '#') { // GCC preprocessor line control directive.
      char buf[512];
      int i = 0;
      while ((c = input_getc()) != EOF && c != '\n') {
        buf[i++] = c;
        column++;
      }
      buf[i] = '\0';
      if (sscanf(buf, "%d \"%[^\"]\"", &line, file_name) == 2) {
        column = 1;
      } else {
        tok_error("Invalid #line directive\n");
      }
      if (c == EOF) {
        return NULL;
      }
    } else if (c == '/') { // Comment
      c = input_getc();
      if (c == '/') {
        while ((c = input_getc()) != EOF && c != '\n') {
          column++;
        }
        if (c == EOF) {
          return NULL;
        }
        line++;
        column = 1;
      } else if (c == '*') { // Multiline comment
        while ((c = input_getc()) != EOF) {
          if (c == '*') {
            c = input_getc();
            if (c == '/') {
              break;
            }
          } else if (c == '\n') {
            line++;
            column = 1;
          } else {
            column++;
          }
        }
        if (c == EOF) {
          return NULL;
        }
      } else { // Handled here to simplify the code.
        if (c == '=')
          return token_create(TOK_ASSIGN_DIV, line, column, 2);
        input_ungetc(c);
        return token_create(TOK_DIV, line, column, 1);
      }
    } else {
      input_ungetc(c);
      return NULL;
    }
  }
  return NULL;
}
Used in section 59
The read_identifier function reads an identifier from the input stream. C identifiers can contain letters, digits, and underscores, but they can't start with a digit.
{Get Keyword, 63}
static token_t *read_identifier(void) {
int c;
char buf[1024];
int i = 0;
c = input_getc();
if (!isalpha(c) && c != '_') {
input_ungetc(c);
return NULL;
}
buf[i++] = c;
while ((c = input_getc()) != EOF) {
if (!isalnum(c) && c != '_') {
input_ungetc(c);
break;
}
buf[i++] = c;
if (i >= 1008) {
tok_error("Identifier too long\n");
exit(1);
}
}
buf[i] = '\0';
// Note: token creation advances the global column by the token's
// length, so we don't add it again here.
// Check if it's a keyword
c_token_types kind = get_keyword(buf, i);
if (kind != TOK_ID) {
return token_create(kind, line, column, i);
}
// Check if it's a typedef
if (hash_table_get(typedef_table, buf) != NULL) {
return token_create(TOK_TYPEDEF_NAME, line, column, i);
}
return token_create_string(kind, line, column, buf, i);
}
Used in section 59
The get_keyword function is a simple decision tree for identifying keywords. The code is pretty tedious, but it works.
c_token_types get_keyword(const char *buf, int len) { switch (buf[0]) { case 'a': if (len == 4 && buf[1] == 'u' && buf[2] == 't' && buf[3] == 'o') return TOK_AUTO; break; case 'b': if (len == 5 && buf[1] == 'r' && buf[2] == 'e' && buf[3] == 'a' && buf[4] == 'k') return TOK_BREAK; break; case 'c': switch (buf[1]) { case 'a': if (len == 4 && buf[2] == 's' && buf[3] == 'e') return TOK_CASE; break; case 'h': if (len == 4 && buf[2] == 'a' && buf[3] == 'r') return TOK_CHAR; break; case 'o': if (len == 5 && buf[2] == 'n' && buf[3] == 's' && buf[4] == 't') return TOK_CONST; if (len == 8 && buf[2] == 'n' && buf[3] == 't' && buf[4] == 'i' && buf[5] == 'n' && buf[6] == 'u' && buf[7] == 'e') return TOK_CONTINUE; break; } break; case 'd': switch (buf[1]) { case 'e': if (len == 7 && buf[2] == 'f' && buf[3] == 'a' && buf[4] == 'u' && buf[5] == 'l' && buf[6] == 't') return TOK_DEFAULT; break; case 'o': if (len == 2 && buf[2] == '\0') return TOK_DO; if (len == 6 && buf[2] == 'u' && buf[3] == 'b' && buf[4] == 'l' && buf[5] == 'e') return TOK_DOUBLE; break; } break; case 'e': switch (buf[1]) { case 'l': if (len == 4 && buf[2] == 's' && buf[3] == 'e') return TOK_ELSE; break; case 'n': if (len == 4 && buf[2] == 'u' && buf[3] == 'm') return TOK_ENUM; break; case 'x': if (len == 6 && buf[2] == 't' && buf[3] == 'e' && buf[4] == 'r' && buf[5] == 'n') return TOK_EXTERN; break; } break; case 'f': switch (buf[1]) { case 'l': if (len == 5 && buf[2] == 'o' && buf[3] == 'a' && buf[4] == 't') return TOK_FLOAT; break; case 'o': if (len == 3 && buf[2] == 'r') return TOK_FOR; break; } break; case 'g': if (len == 4 && buf[1] == 'o' && buf[2] == 't' && buf[3] == 'o') return TOK_GOTO; break; case 'i': switch (buf[1]) { case 'f': if (len == 2 && buf[2] == '\0') return TOK_IF; break; case 'n': if (len == 3 && buf[2] == 't') return TOK_INT; break; } break; case 'l': if (len == 4 && buf[1] == 'o' && buf[2] == 'n' && buf[3] == 'g') return TOK_LONG; break; case 'r': switch (buf[1]) { case 'e': if (len == 8 && buf[2] == 'g' && buf[3] == 'i' && buf[4] == 's' && buf[5] == 't' && buf[6] == 'e' && buf[7] == 'r') return TOK_REGISTER; if (len == 6 && buf[2] == 't' && buf[3] == 'u' && buf[4] == 'r' && buf[5] == 'n') return TOK_RETURN; break; } break; case 's': switch (buf[1]) { case 'h': if (len == 5 && buf[2] == 'o' && buf[3] == 'r' && buf[4] == 't') return TOK_SHORT; break; case 't': if (len == 6 && buf[2] == 'a' && buf[3] == 't' && buf[4] == 'i' && buf[5] == 'c') return TOK_STATIC; break; case 'i': if (len == 6 && buf[2] == 'g' && buf[3] == 'n' && buf[4] == 'e' && buf[5] == 'd') return TOK_SIGNED; if (len == 6 && buf[2] == 'z' && buf[3] == 'e' && buf[4] == 'o' && buf[5] == 'f') return TOK_SIZEOF; break; case 'r': if (len == 6 && buf[2] == 'u' && buf[3] == 'c' && buf[4] == 't') return TOK_STRUCT; break; case 'w': if (len == 6 && buf[2] == 'i' && buf[3] == 't' && buf[4] == 'c' && buf[5] == 'h') return TOK_SWITCH; break; } break; case 't': if (len == 7 && buf[1] == 'y' && buf[2] == 'p' && buf[3] == 'e' && buf[4] == 'd' && buf[5] == 'e' && buf[6] == 'f') return TOK_TYPEDEF; break; case 'u': switch (buf[1]) { case 'n': if (len == 5 && buf[2] == 'i' && buf[3] == 'o' && buf[4] == 'n') return TOK_UNION; if (len == 8 && buf[2] == 's' && buf[3] == 'i' && buf[4] == 'g' && buf[5] == 'n' && buf[6] == 'e' && buf[7] == 'd') return TOK_UNSIGNED; break; } break; case 'v': switch (buf[1]) { case 'o': if (len == 4 && buf[2] == 'i' && buf[3] == 'd') return TOK_VOID; if (len == 8 && buf[2] == 'l' && buf[3] == 'a' && buf[4] == 't' && buf[5] == 'i' && buf[6] 
== 'l' && buf[7] == 'e') return TOK_VOLATILE; break; } break; case 'w': if (len == 5 && buf[1] == 'h' && buf[2] == 'i' && buf[3] == 'l' && buf[4] == 'e') return TOK_WHILE; break; default: return TOK_ID; } return TOK_ID; }
Used in section 62
The read_operator function works similarly to the read_identifier function. It uses a decision tree to identify operators.
token_t *read_operator(void) {
  int c;
  c = input_getc();
  switch (c) {
  case '!': {
    c = input_getc();
    if (c == '=')
      return token_create(TOK_NE, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_NOT, line, column, 1);
  }
  case '%': {
    c = input_getc();
    if (c == '=')
      return token_create(TOK_ASSIGN_MOD, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_MOD, line, column, 1);
  }
  case '&': {
    c = input_getc();
    if (c == '&')
      return token_create(TOK_AND, line, column, 2);
    if (c == '=')
      return token_create(TOK_ASSIGN_BITAND, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_BIT_AND, line, column, 1);
  }
  case '(':
    return token_create(TOK_LEFT_PAREN, line, column, 1);
  case ')':
    return token_create(TOK_RIGHT_PAREN, line, column, 1);
  case '*': {
    c = input_getc();
    if (c == '=')
      return token_create(TOK_ASSIGN_MUL, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_MUL, line, column, 1);
  }
  case '+': {
    c = input_getc();
    if (c == '+')
      return token_create(TOK_INC, line, column, 2);
    if (c == '=')
      return token_create(TOK_ASSIGN_ADD, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_ADD, line, column, 1);
  }
  case ',':
    return token_create(TOK_COMMA, line, column, 1);
  case '-': {
    c = input_getc();
    if (c == '-')
      return token_create(TOK_DEC, line, column, 2);
    if (c == '=')
      return token_create(TOK_ASSIGN_SUB, line, column, 2);
    if (c == '>')
      return token_create(TOK_MEMBER_POINTER, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_SUB, line, column, 1);
  }
  case '.': {
    c = input_getc();
    if (c == '.') {
      c = input_getc();
      if (c == '.') {
        return token_create(TOK_ELLIPSIS, line, column, 3);
      } else {
        // Bail out; we can't restore two characters of lookahead here.
        tok_error("Unexpected character '.' at line %d, column %d\n", line, column);
        exit(1);
      }
    }
    input_ungetc(c);
    return token_create(TOK_MEMBER, line, column, 1);
  }
  case '/': {
    c = input_getc();
    if (c == '=')
      return token_create(TOK_ASSIGN_DIV, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_DIV, line, column, 1);
  }
  case ':':
    return token_create(TOK_COND_DECISION, line, column, 1);
  case ';':
    return token_create(TOK_SEMICOLON, line, column, 1);
  case '<': {
    c = input_getc();
    if (c == '<') {
      c = input_getc();
      if (c == '=')
        return token_create(TOK_ASSIGN_LSHIFT, line, column, 3);
      input_ungetc(c);
      return token_create(TOK_LSHIFT, line, column, 2);
    }
    if (c == '=')
      return token_create(TOK_LE, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_LT, line, column, 1);
  }
  case '=': {
    c = input_getc();
    if (c == '=')
      return token_create(TOK_EQ, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_ASSIGN, line, column, 1);
  }
  case '>': {
    c = input_getc();
    if (c == '>') {
      c = input_getc();
      if (c == '=')
        return token_create(TOK_ASSIGN_RSHIFT, line, column, 3);
      input_ungetc(c);
      return token_create(TOK_RSHIFT, line, column, 2);
    }
    if (c == '=')
      return token_create(TOK_GE, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_GT, line, column, 1);
  }
  case '?':
    return token_create(TOK_COND, line, column, 1);
  case '[':
    return token_create(TOK_LEFT_BRACKET, line, column, 1);
  case ']':
    return token_create(TOK_RIGHT_BRACKET, line, column, 1);
  case '^': {
    c = input_getc();
    if (c == '=')
      return token_create(TOK_ASSIGN_BITXOR, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_BIT_XOR, line, column, 1);
  }
  case '{':
    return token_create(TOK_LEFT_BRACE, line, column, 1);
  case '|': {
    c = input_getc();
    if (c == '|')
      return token_create(TOK_OR, line, column, 2);
    if (c == '=')
      return token_create(TOK_ASSIGN_BITOR, line, column, 2);
    input_ungetc(c);
    return token_create(TOK_BIT_OR, line, column, 1);
  }
  case '}':
    return token_create(TOK_RIGHT_BRACE, line, column, 1);
  case '~':
    return token_create(TOK_BIT_NOT, line, column, 1);
  default:
    input_ungetc(c);
    return NULL;
  }
  return NULL;
}
Used in section 59
The read_number function reads a number from the input stream. It can be an integer or a floating-point number. I've broken it up a bit to make it easier to read.
static token_t *read_number(void) {
  int c;
  char buf[1024];
  int i = 0;
  c = input_getc();
  {Check for valid prefix, 66}
  int radix = 10;
  int is_float = 0;
  {Process Radix, 67}
  {Read Number Loop, 68}
  buf[i] = '\0';
  {Process Suffixes, 69}
  {Check for conflicting suffixes, 70}
  if (is_float) {
    {Convert to float, 71}
  } else {
    {Convert to integer, 72}
  }
  return NULL;
}
Used in section 59
To determine whether a character is a valid prefix for a number, we need to check whether it's a digit, or a period followed by a digit.
// If we don't have a digit or decimal point, it's not a number
if (!isdigit(c) && c != '.') {
  input_ungetc(c);
  return NULL;
}
// Decimal point followed by a non-digit is a struct member access
if (c == '.') {
  char cnext = input_getc();
  if (!isdigit(cnext)) {
    input_ungetc(cnext);
    return token_create(TOK_MEMBER, line, column, 1);
  }
  input_ungetc(cnext);
}
Used in section 65
A C constant starting with a zero is either an octal or hexadecimal constant. We need to check the next character to determine which one it is.
// Check for hex and octal.
if (c == '0') {
  char cnext = input_getc();
  if (cnext == 'x' || cnext == 'X') {
    // Hex, discard the 0x
    radix = 16;
  } else {
    // Octal, discard the 0
    input_ungetc(cnext);
    radix = 8;
  }
} else {
  // Decimal, append the first character (a digit or a leading '.')
  if (c == '.')
    is_float = 1;
  buf[i++] = c;
}
Used in section 65
while ((c = input_getc()) != EOF) {
  // Since there can be multiple writes to the buffer per iteration, leave a
  // four-byte pad so we can't overflow
  if (i > 1020) {
    tok_error("Number too long\n");
    return NULL;
  }
  // Valid digits for the radix: 0-9 for decimal, 0-7 for octal, 0-9 and
  // a-f/A-F for hex
  if ((radix == 10 && isdigit(c)) || (radix == 16 && isxdigit(c)) ||
      (radix == 8 && c >= '0' && c <= '7')) {
    buf[i++] = c;
  } else if (c == '.' && !is_float) {
    // Decimal point and not a float yet; this must be a float
    is_float = 1;
    if (radix == 16) {
      tok_error("Invalid floating point constant, expected decimal, got hexadecimal\n");
      return NULL;
    }
    radix = 10; // A leading 0 was just a zero, not an octal prefix
    buf[i++] = c;
  } else if (c == 'e' || c == 'E') {
    // Exponent on the end of a constant. (By standard this forces it to be a float)
    buf[i++] = c;
    c = input_getc();
    // Sign on the exponent
    if (c == '+' || c == '-') {
      buf[i++] = c;
      c = input_getc();
    }
    // Exponent must be a digit, i.e. no 1e1.2
    if (!isdigit(c)) {
      tok_error("Invalid floating point exponent\n");
      return NULL;
    }
    buf[i++] = c;
    is_float = 1;
  } else {
    // Reached the end; unget the character so other functions can read it
    input_ungetc(c);
    break;
  }
}
Used in section 65
C constants can have suffixes to indicate their type. We need to check for these suffixes and set the appropriate flags.
int is_unsigned = 0;
int is_long = 0;
int is_single = 0;
while (1) {
  c = input_getc();
  if (c == 'u' || c == 'U') {
    if (is_unsigned) {
      tok_warn("Warning: Duplicate suffix 'u' for integer constant ignored\n");
    }
    is_unsigned = 1;
  } else if (c == 'l' || c == 'L') {
    if (is_long) {
      tok_warn("Warning: Duplicate suffix 'l' for integer constant ignored\n");
    }
    is_long = 1;
  } else if (c == 'f' || c == 'F') {
    if (is_single) {
      tok_warn("Warning: Duplicate suffix 'f' for floating point constant ignored\n");
    }
    is_single = 1;
  } else {
    input_ungetc(c);
    break;
  }
}
Used in section 65
If we find conflicting suffixes, we print a warning and ignore the suffixes.
if (is_single && is_long) {
  tok_warn("Warning: Invalid suffixes 'l' and 'f' for floating point "
           "constant. Ignoring 'l'\n");
  is_long = 0;
}
if (is_single && is_unsigned) {
  tok_warn("Warning: Invalid suffixes 'u' and 'f' for floating point "
           "constant. Ignoring 'u'\n");
  is_unsigned = 0;
}
if (is_single && !is_float) {
  tok_warn("Warning: Invalid suffix 'f' for integer constant. Ignoring 'f'\n");
  is_single = 0;
}
Used in section 65
If the string contains a float, we pass it to strtod. We need to make sure that the number is in range for the given type and check for errors from strtod.
errno = 0;
// strtod generates a unix-style error when it's given something out of
// range, so we want to get on top of that quickly instead of ignoring it.
// That way we can avoid some nasty NaN-propagation in the constant folder.
double f = strtod(buf, NULL);
if (errno == ERANGE) {
  tok_error("Floating point constant out of range\n");
  return NULL;
}
// Warn if the constant is out of range for a float, i.e. it's too big or too small
if (is_single && (f < -FLT_MAX || f > FLT_MAX)) {
  tok_warn("Warning: Floating point constant %f is out of range for float\n", f);
}
// Warn if the constant is too precise for a float
if (is_single && fabs((double)((float)f) - f) >= FLT_EPSILON) {
  tok_warn("Warning: Converting double precision floating point constant "
           "%f to float loses precision\n", f);
}
return token_create_float(is_single ? TOK_FLOAT_32 : TOK_FLOAT_64, line,
                          column, f, i);
Used in section 65
If the string contains an integer, we pass it to strtoull. We need to make sure that the number is in range for the given type and check for errors from strtoull.
errno = 0;
uint64_t int_ = strtoull(buf, NULL, radix);
// Same as above, but for integers
if (errno == ERANGE) {
  tok_error("Integer constant out of range\n");
  return NULL;
}
if (is_unsigned) {
  if (is_long) {
    return token_create_int(TOK_INTEGER_U64, line, column, int_, i);
  } else {
    if (int_ > UINT32_MAX) {
      tok_warn("Warning: Integer constant %llu is out of range for unsigned int\n",
               (unsigned long long)int_);
    }
    return token_create_int(TOK_INTEGER_U32, line, column, int_, i);
  }
} else {
  if (is_long) {
    // If the highest bit is set, this will overflow a signed
    // long (due to two's complement)
    if (int_ & (1UL << 63)) {
      tok_warn("Warning: Integer constant %llu is out of range for long long\n",
               (unsigned long long)int_);
    }
    return token_create_int(TOK_INTEGER_S64, line, column, int_, i);
  } else {
    if (int_ & (1UL << 31)) {
      tok_warn("Warning: Integer constant %llu is out of range for int\n",
               (unsigned long long)int_);
    }
    return token_create_int(TOK_INTEGER_S32, line, column, int_, i);
  }
}
Used in section 65
The read_char_constant function reads a character constant from the input stream. It can be a single character or a multi-character escape sequence.
static token_t *read_char_constant(void) {
  int c;
  int len = 0;
  c = input_getc();
  if (c != '\'') {
    input_ungetc(c);
    return NULL;
  }
  len++;
  c = input_getc();
  if (c == '\'') {
    tok_error("Empty character constant\n");
    return NULL;
  }
  if (c == '\\') {
    c = read_escape_sequence(&len);
  }
  int val = c;
  c = input_getc();
  if (c != '\'') {
    tok_error("Expected closing quote for character constant\n");
    return NULL;
  }
  len++;
  return token_create_char(TOK_CHAR_CONST, line, column, val, len);
}
Used in section 59
The read_string_literal function reads a string literal from the input stream.
For this function, an automatic-lifetime buffer is used to store the string until it becomes too large; at that point, a heap-allocated buffer takes over. This way we avoid unnecessary heap allocations for small strings.
{Read Escape Sequence, 75}
static token_t *read_string_literal(void) {
int c;
c = input_getc();
if (c != '"') {
input_ungetc(c);
return NULL;
}
int i = 0;
char s_buf[512];
char *buf = s_buf;
int len = 512;
int esc_pad = 0;
while ((c = input_getc()) != EOF) {
if (c == '"') {
// Implicit skip of closing quote
break;
}
if (c == '\\') {
c = read_escape_sequence(&esc_pad);
if (c == 0) {
return NULL;
}
}
if (i + 1 >= len) { // +1 leaves room for the terminating NUL
if (buf == s_buf) {
buf = malloc(1024);
if (buf == NULL) {
fputs("Out of memory. Could not parse string literal.\n", stderr);
exit(1);
}
memcpy(buf, s_buf, 512);
len *= 2;
} else {
len *= 2;
char *new_buf = realloc(buf, len);
if (new_buf == NULL) {
fputs("Out of memory. Could not parse string literal.\n", stderr);
exit(1);
}
buf = new_buf;
}
}
buf[i++] = c;
}
buf[i] = '\0';
if (c == EOF) {
tok_error("Unterminated string literal\n");
if (buf != s_buf) {
free(buf);
}
return NULL;
}
token_t *tok = token_create_string(TOK_STRING_ASCII, line, column, buf,
i + esc_pad + 2);
if (buf != s_buf) {
free(buf);
}
return tok;
}
Used in section 59
Escape sequences in C can either be single characters or octal/hexadecimal values. We need to handle both cases.
static char read_escape_sequence(int *len) { int c = input_getc(); *len += 1; switch (c) { case 'a': return '\a'; case 'b': return '\b'; case 'f': return '\f'; case 'n': return '\n'; case 'r': return '\r'; case 't': return '\t'; case 'v': return '\v'; case '\'': return '\''; case '"': return '"'; case '?': return '?'; case '\\': return '\\'; case '0': return '\0'; case 'x': { c = input_getc(); if (!isxdigit(c)) { tok_error("Invalid hexadecimal escape sequence\n"); return 0; } int val = 0; while (isxdigit(c)) { *len += 1; val = val * 16 + (isdigit(c) ? c - '0' : tolower(c) - 'a' + 10); c = input_getc(); } input_ungetc(c); return (char)val; } default: if (!isdigit(c)) { tok_error("Invalid escape sequence\n"); return 0; } int val = 0; while (isdigit(c)) { *len += 1; val = val * 8 + c - '0'; c = input_getc(); } input_ungetc(c); return (char)val; } }
Used in section 74
Finally, I'll add some code for running the tokenizer as its own program. This way we can test it out.
char *preprocess(char *in) {
  char *output_name = malloc(1024);
  snprintf(output_name, 1024, "%s.preprocessed", in);
  char *command = malloc(2048);
  snprintf(command, 2048, "gcc -E -xc %s -o %s", in, output_name);
  system(command);
  free(command);
  return output_name;
}

// Tokenize the input file
int main(int argc, char **argv) {
  if (argc != 2) {
    fprintf(stderr, "Usage: %s <input.c>\n", argv[0]);
    return 1;
  }
  char *input_name = argv[1];
  char *preprocessed = preprocess(input_name);
  init_tokenizer(preprocessed);
  token_t *tok;
  while ((tok = next_token()) != NULL) {
    print_token(tok);
    token_destroy(tok);
  }
  destroy_tokenizer();
  remove(preprocessed);
  free(preprocessed);
  hash_table_destroy(string_table);
  return 0;
}
Used in section 59
I wrote this code in a single sitting, so there are bound to be bugs. I'll list them here as I find them. The code you see here is the final version, with all bugs fixed.
buffer_pos == buffer_size - 1: left in from trying to plug in some lookahead code. It didn't work out, but I forgot to remove it, causing fallthrough to the buffer_size == 0 check, which (if true) returns EOF, preventing input initialization. Fixed by changing it to buffer_pos == buffer_size.
token->kind == TOK_STRING_ASCII failed in token_string: I forgot to expand the check for identifiers, which also use token_string. Fixed by changing it to token->kind == TOK_STRING_ASCII || token->kind == TOK_ID || token->kind == TOK_TYPEDEF_NAME.
hash_table_get was called with a freed key. Fixed by moving the call to free after the call to hash_table_get.
hash_table_get in token_create_string created a double free. Fixed by rewriting part of the function.
int_ > INT32_MAX did not work due to some weird casting. Added an explicit cast.
That's it! The C Minus tokenizer is complete. It's hopefully pretty understandable, and given the testing I've put it through, it should be pretty robust.
Next time, we'll start on the parser.
Source code for the tokenizer is available here, header file is available here.
Source code for the input module is available here, header file is available here.
Source code for the hash table is available here, header file is available here.
Source code for the token module is available here, header file is available here.
A lot of the logic for this project is from either the Dragon Book, Engineering a Compiler, or LCC: A Retargetable Compiler for ANSI C. Grammars are from The C Reference Manual.
I got the idea for using zero-length arrays for optional content (the struct hack) from hacking on some weird virtio drivers (they seem to love it).
Crafting Interpreters by Bob Nystrom inspired me to do this project, so if you see any similarities, there's probably some unintentional influence there.
The code for the ELF hash function is from the glibc source code.
The idea for decision trees came from LCC.
Literate programming rendered using literate.