Summary
- We introduce
#[salsa::interned]
queries which convert aKey
type into a numeric index of typeValue
, whereValue
is either the typeInternId
(defined by a salsa) or some newtype thereof. - Each interned query
foo
also produces an inverselookup_foo
method that converts back from theValue
to theKey
that was interned. - The
InternId
type (defined by salsa) is basically a newtype'd integer, but it internally usesNonZeroU32
to enable space-saving optimizations in memory layout. - The
Value
types can be any type that implements thesalsa::InternIndex
trait, also introduced by this RFC. This trait has two methods,from_intern_id
andas_intern_id
. - The interning is integrated into the GC and tracked like any other query, which means that interned values can be garbage-collected, and any computation that was dependent on them will be collected.
Motivation
The need for interning
Many salsa applications wind up needing the ability to construct
"interned keys". Frequently this pattern emerges because we wish to
construct identifiers for things in the input. These identifiers
generally have a "tree-like shape". For example, in a compiler, there
may be some set of input files -- these are enumerated in the inputs
and serve as the "base" for a path that leads to items in the user's
input. But within an input file, there are additional structures, such
as struct
or impl
declarations, and these structures may contain
further structures within them (such as fields or methods). This gives
rise to a path like so that can be used to identify a given item:
PathData = <file-name>
| PathData / <identifier>
These paths could be represented in the compiler with an Arc
, but
because they are omnipresent, it is convenient to intern them instead
and use an integer. Integers are Copy
types, which is convenient,
and they are also small (32 bits typically suffices in practice).
Why interning is difficult today: garbage collection
Unfortunately, integrating interning into salsa at present presents some hard choices, particularly with a long-lived application. You can easily add an interning table into the database, but unless you do something clever, it will simply grow and grow forever. But as the user edits their programs, some paths that used to exist will no longer be relevant -- for example, a given file or impl may be removed, invalidating all those paths that were based on it.
Due to the nature of salsa's recomputation model, it is not easy to detect when paths that used to exist in a prior revision are no longer relevant in the next revision. This is because salsa never explicitly computes "diffs" of this kind between revisions -- it just finds subcomputations that might have gone differently and re-executes them. Therefore, if the code that created the paths (e.g., that processed the result of the parser) is part of a salsa query, it will simply not re-create the invalidated paths -- there is no explicit "deletion" point.
In fact, the same is true of all of salsa's memoized query values. We
may find that in a new revision, some memoized query values are no
longer relevant. For example, in revision R1, perhaps we computed
foo(22)
and foo(44)
, but in the new input, we now only need to
compute foo(22)
. The foo(44)
value is still memoized, we just
never asked for its value. This is why salsa includes a garbage
collector, which can be used to cleanup these memoized values that are
no longer relevant.
But using a garbage collection strategy with a hand-rolled interning scheme is not easy. You could trace through all the values in salsa's memoization tables to implement a kind of mark-and-sweep scheme, but that would require for salsa to add such a mechanism. It might also be quite a lot of tracing! The current salsa GC mechanism has no need to walk through the values themselves in a memoization table, it only examines the keys and the metadata (unless we are freeing a value, of course).
How this RFC changes the situation
This RFC presents an alternative. The idea is to move the interning into salsa itself by creating special "interning queries". Dependencies on these queries are tracked like any other query and hence they integrate naturally with salsa's garbage collection mechanisms.
User's guide
This section covers how interned queries are expected to be used.
Declaring an interned query
You can declare an interned query like so:
#[salsa::query_group]
trait Foo {
#[salsa::interned]
fn intern_path_data(&self, data: PathData) -> salsa::InternId;
]
Query keys. Like any query, these queries can take any number of keys. If multiple
keys are provided, then the interned key is a tuple of each key
value. In order to be interned, the keys must implement Clone
,
Hash
and Eq
.
Return type. The return type of an interned key may be of any type
that implements salsa::InternIndex
: salsa provides an impl for the
type salsa::InternId
, but you can implement it for your own.
Inverse query. For each interning query, we automatically generate
a reverse query that will invert the interning step. It is named
lookup_XXX
, where XXX
is the name of the query. Hence here it
would be fn lookup_intern_path(&self, key: salsa::InternId) -> Path
.
The expected us
Using an interned query is quite straightforward. You simply invoke it
with a key, and you will get back an integer, and you can use the
generated lookup
method to convert back to the original value:
let key = db.intern_path(path_data1);
let path_data2 = db.lookup_intern_path_data(key);
Note that the interned value will be cloned -- so, like all Salsa values, it is best if that is a cheap operation. Interestingly, interning can help to keep recursive, tree-shapes values cheap, because the "pointers" within can be replaced with interned keys.
Custom return types
The return type for an intern query does not have to be a InternId
. It can
be any type that implements the salsa::InternKey
trait:
pub trait InternKey {
/// Create an instance of the intern-key from a `InternId` value.
fn from_intern_id(v: InternId) -> Self;
/// Extract the `InternId` with which the intern-key was created.
fn as_intern_id(&self) -> InternId;
}
Recommended practice
This section shows the recommended practice for using interned keys,
building on the Path
and PathData
example that we've been working
with.
Naming Convention
First, note the recommended naming convention: the intern key is
Foo
and the key's associated data FooData
(in our case, Path
and
PathData
). The intern key is given the shorter name because it is
used far more often. Moreover, other types should never store the full
data, but rather should store the interned key.
Defining the intern key
The intern key should always be a newtype struct that implements
the InternKey
trait. So, something like this:
pub struct Path(InternId);
impl salsa::InternKey for Path {
fn from_intern_id(v: InternId) -> Self {
Path(v)
}
fn as_intern_id(&self) -> InternId {
self.0
}
}
Convenient lookup method
It is often convenient to add a lookup
method to the newtype key:
impl Path {
// Adding this method is often convenient, since you can then
// write `path.lookup(db)` to access the data, which reads a bit better.
pub fn lookup(&self, db: &impl MyDatabase) -> PathData {
db.lookup_intern_path_data(*self)
}
}
Defining the data type
Recall that our paths were defined by a recursive grammar like so:
PathData = <file-name>
| PathData / <identifier>
This recursion is quite typical of salsa applications. The recommended
way to encode it in the PathData
structure itself is to build on other
intern keys, like so:
#[derive(Clone, Hash, Eq, ..)]
enum PathData {
Root(String),
Child(Path, String),
// ^^^^ Note that the recursive reference here
// is encoded as a Path.
}
Note though that the PathData
type will be cloned whenever the value
for an interned key is looked up, and it may also be cloned to store
dependency information between queries. So, as an optimization, you
might prefer to avoid String
in favor of Arc<String>
-- or even
intern the strings as well.
Interaction with the garbage collector
Interned keys can be garbage collected as normal, with one caveat. Even if requested, Salsa will never collect the results generated in the current revision. This is because it would permit the same key to be interned twice in the same revision, possibly mapping to distinct intern keys each time.
Note that if an interned key is collected, its index will be re-used. Salsa's dependency tracking system should ensure that anything incorporating the older value is considered dirty, but you may see the same index showing up more than once in the logs.
Reference guide
Interned keys are implemented using a hash-map that maps from the interned data to its index, as well as a vector containing (for each index) various bits of data. In addition to the interned data, we must track the revision in which the value was interned and the revision in which it was last accessed, to help manage the interaction with the GC. Finally, we have to track some sort of free list that tracks the keys that are being re-used. The current implementation never actually shrinks the vectors and maps from their maximum size, but this might be a useful thing to be able to do (this is effectively a memory allocator, so standard allocation strategies could be used here).
InternId
Presently the InternId
type is implemented to wrap a NonZeroU32
:
pub struct InternId {
value: NonZeroU32,
}
This means that Option<InternId>
(or Option<Path>
, continuing our
example from before) will only be a single word. To accommodate this,
the InternId
constructors require that the value is less than
InternId::MAX
; the value is deliberately set low (currently to
0xFFFF_FF00
) to allow for more sentinel values in the future (Rust
doesn't presently expose the capability of having sentinel values
other than zero on stable, but it is possible on nightly).
Alternatives and future work
None at present.