Defining the IR

Before we can define the parser, we need to define the intermediate representation (IR) that we will use for calc programs. In the basic structure, we defined some "pseudo-Rust" structures like Statement and Expression; now we are going to define them for real.

"Salsa structs"

In addition to regular Rust types, we will make use of various salsa structs. A salsa struct is a struct that has been annotated with one of the salsa annotations:

  • #[salsa::input], which designates the "base inputs" to your computation;
  • #[salsa::tracked], which designate intermediate values created during your computation;
  • #[salsa::interned], which designate small values that are easy to compare for equality.

All salsa structs store the actual values of their fields in the salsa database. This permits us to track when the values of those fields change to figure out what work will need to be re-executed.

When you annotate a struct with one of the above salsa attributes, salsa actually generates a bunch of code to link that struct into the database. This code must be connected to some jar. By default, this is crate::Jar, but you can specify a different jar with the jar= attribute (e.g., #[salsa::input(jar = MyJar)]). You must also list the struct in the jar definition itself, or you will get errors.

Input structs

The first thing we will define is our input. Every salsa program has some basic inputs that drive the rest of the computation. The rest of the program must be some deterministic function of those base inputs, such that when those inputs change, we can try to efficiently recompute the new result of that function.

Inputs are defined as Rust structs with a #[salsa::input] annotation:


#![allow(unused)]
fn main() {
#[salsa::input]
pub struct SourceProgram {
    #[return_ref]
    text: String,
}
}

In our compiler, we have just one simple input, the ProgramSource, which has a text field (the string).

The data lives in the database

Although they are declared like other Rust structs, salsa structs are implemented quite differently. The values of their fields are stored in the salsa database, and the struct itself just contains a numeric identifier. This means that the struct instances are copy (no matter what fields they contain). Creating instances of the struct and accessing fields is done by invoking methods like new as well as getters and setters.

More concretely, the #[salsa::input] annotation will generate a struct for ProgramSource like this:


#![allow(unused)]
fn main() {
#[define(Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct ProgramSource(salsa::Id);
}

It will also generate a method new that lets you create a ProgramSource in the database. For an input, a &mut db reference is required, along with the values for each field:


#![allow(unused)]
fn main() {
let source = ProgramSource::new(&mut db, "print 11 + 11".to_string());
}

You can read the value of the field with source.text(&db), and you can set the value of the field with source.set_text(&mut db, "print 11 * 2".to_string()).

Database revisions

Whenever a function takes an &mut reference to the database, that means that it can only be invoked from outside the incrementalized part of your program, as explained in the overview. When you change the value of an input field, that increments a 'revision counter' in the database, indicating that some inputs are different now. When we talk about a "revision" of the database, we are referring to the state of the database in between changes to the input values.

Tracked structs

Next we will define a tracked struct to represent the functions in our input. Whereas inputs represent the start of a computation, tracked structs represent intermediate values created during your computation. In this case, we are going to parse the raw input program, and create a Function for each of the functions defined by the user.


#![allow(unused)]
fn main() {
#[salsa::tracked]
pub struct Function {
    #[id]
    name: FunctionId,
    args: Vec<VariableId>,
    body: Expression,
}
}

Unlike with inputs, the fields of tracked structs are immutable once created. Otherwise, working with a tracked struct is quite similar to an input:

  • You can create a new value by using new, but with a tracked struct, you only need an &dyn database, not &mut (e.g., Function::new(&db, some_name, some_args, some_body))
  • You use a getter to read the value of a field, just like with an input (e.g., my_func.args(db) to read the args field).

id fields

To get better reuse across revisions, particularly when things are reordered, you can mark some entity fields with #[id]. Normally, you would do this on fields that represent the "name" of an entity. This indicates that, across two revisions R1 and R2, if two functions are created with the same name, they refer to the same entity, so we can compare their other fields for equality to determine what needs to be re-executed. Adding #[id] attributes is an optimization and never affects correctness. For more details, see the algorithm page of the reference.

Interned structs

The final kind of salsa struct are interned structs. As with input and tracked structs, the data for an interned struct is stored in the database, and you just pass around a single integer. Unlike those structs, if you intern the same data twice, you get back the same integer.

A classic use of interning is for small strings like function names and variables. It's annoying and inefficient to pass around those names with String values which must be cloned; it's also inefficient to have to compare them for equality via string comparison. Therefore, we define two interned structs, FunctionId and VariableId, each with a single field that stores the string:


#![allow(unused)]
fn main() {
#[salsa::interned]
pub struct VariableId {
    #[return_ref]
    pub text: String,
}

#[salsa::interned]
pub struct FunctionId {
    #[return_ref]
    pub text: String,
}
}

When you invoke e.g. FunctionId::new(&db, "my_string".to_string()), you will get back a FunctionId that is just a newtype'd integer. But if you invoke the same call to new again, you get back the same integer:


#![allow(unused)]
fn main() {
let f1 = FunctionId::new(&db, "my_string".to_string());
let f2 = FunctionId::new(&db, "my_string".to_string());
assert_eq!(f1, f2);
}

Expressions and statements

We'll also intern expressions and statements. This is convenient primarily because it allows us to have recursive structures very easily. Since we don't really need the "cheap equality comparison" aspect of interning, this isn't the most efficient choice, and many compilers would opt to represent expressions/statements in some other way.


#![allow(unused)]
fn main() {
#[salsa::interned]
pub struct Statement {
    data: StatementData,
}

#[derive(Eq, PartialEq, Clone, Hash)]
pub enum StatementData {
    /// Defines `fn <name>(<args>) = <body>`
    Function(Function),
    /// Defines `print <expr>`
    Print(Expression),
}

#[salsa::interned]
pub struct Expression {
    #[return_ref]
    data: ExpressionData,
}

#[derive(Eq, PartialEq, Clone, Hash)]
pub enum ExpressionData {
    Op(Expression, Op, Expression),
    Number(OrderedFloat<f64>),
    Variable(VariableId),
    Call(FunctionId, Vec<Expression>),
}

#[derive(Eq, PartialEq, Copy, Clone, Hash, Debug)]
pub enum Op {
    Add,
    Subtract,
    Multiply,
    Divide,
}
}

Interned ids are guaranteed to be consistent within a revision, but not across revisions (but you don't have to care)

Interned ids are guaranteed not to change within a single revision, so you can intern things from all over your program and get back consistent results. When you change the inputs, however, salsa may opt to clear some of the interned values and choose different integers. However, if this happens, it will also be sure to re-execute every function that interned that value, so all of them still see a consistent value, just a different one than they saw in a previous revision.

In other words, within a salsa computation, you can assume that interning produces a single consistent integer, and you don't have to think about it. If however you export interned identifiers outside the computation, and then change the inputs, they may not longer be valid or may refer to different values.