# Kris's Research Notes

## June 29, 2012

### Memoization of Rates

Filed under: Implementation — Kris Reyes @ 4:38 am

In this note, we comment on an implementation detail that accounts for a 10x speed-up over a naive KMC implementation: memoization of rates. Memoization is the act of storing rates computed in the course of the simulation so that they do not have to be re-computed if they are needed again. We summarize why doing this makes sense in the KMC context and give two consequences of its implementation, which improve both performance and memory usage.

## Motivation

The rate of any event in our KMC simulations is a function of a local neighborhood about the atom performing the event. In the GaAs simulations, for example, this neighborhood is a 5×5 window centered at the atom, and all rates for events of the atom depend solely on the species of the 25 atoms in the window. Since each of the three species is encoded in 2 bits, a neighborhood is naturally represented as a 50-bit integer, so a direct lookup table would span $2^{50}$ slots (the number of realizable neighborhoods, $3^{25} \approx 8.5 \times 10^{11}$, is smaller, but still astronomically large). Hence it would be impossible to precompute the rates for all neighborhoods. Indeed, assuming 12 events per atom and 8 bytes to store the rate of each event, we would need a 96 PB table. We do not have a system with this much memory.
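The arithmetic behind that estimate is quick to check; this sketch takes the per-slot layout of 12 rates at 8 bytes each from the text and uses binary petabytes:

```python
# Size of a direct lookup table over all 50-bit neighborhood indices:
# 12 event rates per neighborhood, 8 bytes (one double) per rate.
slots = 2 ** 50              # 2 bits per atom x 25 atoms in the 5x5 window
bytes_per_slot = 12 * 8      # 96 bytes of rate data per neighborhood
table_bytes = slots * bytes_per_slot

PB = 2 ** 50                 # one (binary) petabyte
print(table_bytes // PB)     # -> 96
```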

Such a look-up table would be nice because it avoids the floating-point computations needed to compute the rates. Instead, we use a table that stores only the rates computed for neighborhoods actually encountered during the normal course of the simulation. That way, if we ever encounter a neighborhood previously seen, we have the rates at hand, saving a lot of work. This is called memoization of rates.

KMC simulations are particularly amenable to memoization because the set of “active” local neighborhoods — that is, all the neighborhoods about all atoms at a particular time in the simulation — is small (relative to the number of atoms) and slowly changing. Consider the diffusion of a Ga adatom:

After the diffusion event, we see that the adatom maintains the same local neighborhood as before. In fact, the set of local neighborhoods remains the same after the diffusion — any local neighborhood in the right configuration can be found in the left.

We can also measure the number of active neighborhoods during an actual simulation. We consider the Ga droplet growth and crystallization experiments previously discussed in these notes. The experiment proceeds in three phases.

1. Ga atoms are deposited on an As terminated surface at a rate of 0.1 ML/sec until $\theta = 4$ ML of Ga has been deposited. This forms liquid droplets.
2. The system is allowed to anneal in the absence of any deposition for 30 seconds.
3. An As flux is introduced at a deposition rate of 0.1 ML/sec until the droplets are fully crystallized.

The simulation was performed on a domain of 256×512 = 131072 atom positions. At periodic steps, we measure the number of distinct 5×5 neighborhoods present in the atomic configuration, allowing us to plot the number of active neighborhoods vs. the number of MC steps:

The dashed red lines indicate the start of each experimental phase (the beginning of annealing, then crystallization). We see an initial jump from the small number of neighborhoods found in the initially flat, As-terminated substrate. As the configuration fills in to a Ga-terminated substrate, the number of neighborhoods increases at first, then relaxes before nucleation of liquid droplets starts introducing neighborhoods not seen before. This leads to a slow increase in neighborhoods, which also relaxes during the annealing stage as the system equilibrates. When crystallization begins, many new neighborhoods show up, as indicated by the big jump in the above plot. However, this behavior also levels off as the droplet becomes fully crystallized.

From this plot, we see that the number of neighborhoods active during the simulation is small compared to the a priori upper bound. We also see that their growth with the number of MC steps is slow, with periods of no effective growth. Both observations imply that there is a lot of redundancy in the neighborhoods, in both space and time, and hence memoization of rates will probably save us a significant amount of work.

## Hash Table Implementation

As indicated above, an array indexed directly by local neighborhood is impossible because we would require a 50-bit index to represent an atom’s local neighborhood (5×5 atoms, with 2 bits per atom to represent the 3 species). Instead, we implement the memoization with a hash table, which makes sense given the small number of active neighborhoods relative to the size of the space of all neighborhoods.
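To make the 50-bit index concrete, here is a sketch of how a 5×5 window of species might be packed into an integer key. The species numbering 0, 1, 2 and the row-major window order are illustrative assumptions, not the simulator's actual conventions:

```python
def encode_neighborhood(species):
    """Pack a 5x5 window of species codes (each 0, 1, or 2) into a 50-bit key.

    `species` is a flat list of 25 integers in row-major order;
    2 bits per atom gives a 50-bit integer.
    """
    assert len(species) == 25
    key = 0
    for s in species:
        assert 0 <= s <= 2
        key = (key << 2) | s      # append this atom's 2-bit species code
    return key

# A window of all species-0 atoms encodes to 0; flipping the last atom
# to species 2 sets only the low 2 bits of the key.
window = [0] * 24 + [2]
print(encode_neighborhood(window))  # -> 2
```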

Here we briefly recall the basics of hash tables. Let $\Omega$ be the set of all neighborhoods (so that for our example $\left| \Omega \right| = 2^{50}$). The first ingredient in a hash table scheme is a hash function:

$\displaystyle h: \Omega \rightarrow \left\{1, 2, \ldots, N\right\}$,

where $N$ is some integer. We assume $N \ll \left| \Omega \right|$. Then, by the pigeonhole principle, there exist distinct $x, y \in \Omega$ such that $h(x) = h(y)$. We call such pairs collisions.

Instead of having a lookup table of size $\left| \Omega\right|$, we use an array $A$ of size $N$. Every entry of the array stores a pair $(x, r(x))$, where $r(x) = [r_1(x), \ldots, r_m(x)]$ is the set of rates associated with neighborhood $x$. We would like to use $A$ as a look-up table, so that entry $(x, r(x))$ is located at index $h(x)$, but the presence of collisions makes this impossible, so we must perform some work to deal with them. We do this by a scheme called linear open addressing: entry $(x, r(x))$ is stored in the table $A$ at the first unoccupied index at or after $h(x)$.
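A minimal sketch of such a table follows. The fixed table size, the toy hash function, and the list-of-rates payload are illustrative assumptions; the real implementation stores neighborhood keys and rate vectors:

```python
EMPTY = None

class RateTable:
    """Memoization table mapping neighborhood key -> rates, with linear probing."""

    def __init__(self, size):
        self.size = size
        self.slots = [EMPTY] * size   # each slot holds (key, rates) or EMPTY

    def _probe(self, key, hash_fn):
        """Yield indices starting at hash_fn(key), wrapping around the table."""
        i = hash_fn(key) % self.size
        for _ in range(self.size):
            yield i
            i = (i + 1) % self.size

    def lookup(self, key, hash_fn):
        """Return the stored rates for `key`, or None if `key` is absent."""
        for i in self._probe(key, hash_fn):
            if self.slots[i] is EMPTY:
                return None           # hit a free slot: key was never inserted
            if self.slots[i][0] == key:
                return self.slots[i][1]
        return None

    def insert(self, key, rates, hash_fn):
        """Store (key, rates) at the first free slot at or after hash_fn(key)."""
        for i in self._probe(key, hash_fn):
            if self.slots[i] is EMPTY or self.slots[i][0] == key:
                self.slots[i] = (key, rates)
                return
        raise RuntimeError("hash table is full")

# Toy usage: keys 5 and 13 collide under h(x) = x mod 8, so 13 is
# placed at the next free index after slot 5.
table = RateTable(8)
h = lambda x: x % 8
table.insert(5, [0.1, 0.2], h)
table.insert(13, [0.3, 0.4], h)     # collides with key 5, lands at index 6
print(table.lookup(13, h))          # -> [0.3, 0.4]
```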

### Example of hashing

Throughout the simulations, we continuously query our hash table for rates $r(x)$. To see how the hash table is used, we consider a toy example where we desire the rates associated with three neighborhoods $x, y, z$ where $x, z$ form a collision: $h(x) = h(z)$.

We first encounter a neighborhood $x$ and we desire the rates $r(x)$. We query the hash table for $r(x)$. We search the table starting at $h(x)$ until we either encounter the proper entry for $x$ or an empty spot, which tells us $x$ is not in the table.

Hence we need to calculate the rates $r(x)$:

A similar thing occurs when we encounter neighborhood $y$, which maps to $h(y) = 6$.

No collision occurred and $y$ was not in the table, so rates $r(y)$ must be computed:

When the simulation encounters neighborhood $z$, it queries the hash table for $r(z)$. In this case, there is a collision because $h(z) = h(x) = 5$:

Again, we traverse the table until either $z$ is found, or a free space is encountered. In this case, a free space is found, meaning $z$ was not in the table so once again we must calculate $r(z)$:

Finally, suppose we encounter the neighborhood $z$ again. We query the hash table for $r(z)$:

We again traverse down the list until either a free space is encountered (meaning $z$ was not in the table) or we find $z$. In this case, $z$ is in the table, so we do not need to calculate $r(z)$.

Queries to the hash table proceed in the above manner, adding entries to the table for all neighborhoods not encountered before. As illustrated above, if the hash function maps neighborhoods close together, long chains result. This is bad for performance, as longer chains mean longer searches to find neighborhoods within them. In the extreme case that $h(x)$ is constant, we get one long chain, and look-ups take time $\mathcal O($# neighborhoods$)$. This is undesirable, so some care must be taken to define a hash function that avoids long chains.

### Searching for good hash functions

We search for a good hash function $h$ by a simulated annealing algorithm. We assume the function is of the form:

$\displaystyle h(a; x) = (x\gg a_1) \oplus (x\gg a_2) \oplus (x\gg a_3)$,

$a = (a_1, a_2, a_3)$,

where we think of $x$ as a 50-bit number, $x \gg c$ denotes a (non-cyclic) right shift by $c$ bits, and $\oplus$ is the XOR operation. In other words, $h$ is an XOR of shifted copies of $x$, i.e. a linear function of the bits of $x$. While very simple, such a form suffices for the simulations we consider (as we see in the sequel).
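This family of hash functions can be written out directly. The reduction modulo the table size $N$, used to map the hash value to an array index, is my assumption; treating negative entries of $a$ as left shifts is likewise an assumption that lets the search range below include negative values:

```python
def shift_xor_hash(a, x, table_size):
    """h(a; x) = (x >> a1) xor (x >> a2) xor (x >> a3), mapped to a table index.

    `x` is a 50-bit neighborhood key and `a` a tuple of three shift amounts.
    Negative shifts are treated as left shifts (an assumption); the final
    modulo reduction to a table index is also an assumption.
    """
    h = 0
    for s in a:
        h ^= (x >> s) if s >= 0 else (x << -s)
    return h % table_size

# The optimized shifts reported later in the note, a* = (41, 19, 0):
a_star = (41, 19, 0)
key = 0x3_5A2C_91E4_B7D6          # an arbitrary 50-bit example key
print(shift_xor_hash(a_star, key, 1 << 20))
```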

We generate a set of neighborhoods $X \subset \Omega$ observed in the liquid Ga droplet formation/crystallization simulations. For a tuple $a = (a_1, a_2, a_3)$, we define the number of collisions in $X$:

$\displaystyle N(a; X) = \left|\left\{ (x,y) \in X \times X : x \neq y,\ h(a; x) = h(a; y) \right\}\right|.$

Starting with a uniform random $a^{(0)} \in \left\{-64, \ldots, 64\right\}^3$, we perform a random mutation

$a^\prime = a^{(t)} + \delta$,

where $\delta = \left(\delta_1, \delta_2, \delta_3\right)$ with each $\delta_i \in \left\{-4, -3, \ldots, 3, 4\right\}$ chosen uniformly and independently. We accept the mutation according to a Metropolis selection rule:

$\displaystyle a^{(t+1)} = \begin{cases} a^\prime & \text{ if } N(a^\prime; X) < N(a^{(t)}; X), \\ a^\prime & \text{ with probability } e^{-\beta \left( N(a^\prime; X) - N(a^{(t)}; X) \right)} \text{ if } N(a^\prime; X) \ge N(a^{(t)}; X), \\ a^{(t)} & \text{ with probability } 1 - e^{-\beta \left( N(a^\prime; X) - N(a^{(t)}; X) \right)} \text{ if } N(a^\prime; X) \ge N(a^{(t)}; X), \end{cases}$

where $\beta$ is some notion of inverse temperature. Iterating over a long time $t$ and over several starting points $a^{(0)}$ yields an optimized $a^\star$, and hence a hash function $h(a^\star; x)$ that attempts to minimize collisions on the training set.
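The search loop can be sketched as follows. The training set here is a small synthetic collection of random 50-bit keys rather than neighborhoods from a real simulation, the fixed $\beta$ is an illustrative choice, and the collision count uses unordered pairs (half the ordered-pair count defined above, which changes nothing about the minimizer):

```python
import math
import random
from collections import Counter

def shift_xor(a, x):
    """XOR of x shifted by each entry of a (negative entries = left shifts)."""
    h = 0
    for s in a:
        h ^= (x >> s) if s >= 0 else (x << -s)
    return h

def collisions(a, X, table_size):
    """Number of unordered colliding pairs in X under h(a; .) mod table_size."""
    counts = Counter(shift_xor(a, x) % table_size for x in X)
    return sum(c * (c - 1) // 2 for c in counts.values())

def anneal(X, table_size, steps=2000, beta=0.5, seed=0):
    """Metropolis search over shift tuples a, minimizing collisions on X."""
    rng = random.Random(seed)
    a = tuple(rng.randint(-64, 64) for _ in range(3))
    n = collisions(a, X, table_size)
    best, best_n = a, n
    for _ in range(steps):
        cand = tuple(s + rng.randint(-4, 4) for s in a)   # mutate each shift
        cand_n = collisions(cand, X, table_size)
        # Accept improvements always; accept worse moves with Boltzmann probability.
        if cand_n < n or rng.random() < math.exp(-beta * (cand_n - n)):
            a, n = cand, cand_n
            if n < best_n:
                best, best_n = a, n
    return best, best_n

# Toy training set: 200 random 50-bit "neighborhood" keys.
rng = random.Random(1)
X = [rng.getrandbits(50) for _ in range(200)]
a_opt, n_opt = anneal(X, table_size=1 << 10, steps=500)
print(a_opt, n_opt)
```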

Here is a plot of the number of collisions $N(a^{(t)}; X)$ vs. the number of mutations $t$, for three independently selected $a^{(0)}$. We see that our simulated annealing yields efficient hash functions after 2000 mutations:

We use the values $a^\star = (41, 19, 0)$ throughout our simulations.

## Memory Gains

Using a hash table, rates of events are now indexed by neighborhood instead of by atom. Since several atoms share the same neighborhood, collecting rates by neighborhood rather than associating them with individual atoms saves memory.

Consider the implementation of the CDF we had used previously: a binary tree in which the leaf nodes represent the rates of events for each specific atom. If we have $N$ atoms and $M$ potential events per atom, then there are $NM$ leaves in the tree and $2NM - 1$ nodes in total:

If we instead store rate information by neighborhood, within the hash table, we can remove redundancy from the last layers of the CDF by eliminating repeated leaf nodes for all atoms that have the same neighborhood, and hence the same rates:

Then the CDF tree (terminated at the atom level) has $2N-1$ nodes, while each active neighborhood is associated with its own event CDF tree of $2M-1$ nodes, for a total of $L(2M-1)$ such nodes if there are $L$ neighborhoods in the hash table. If we assume $L \ll N$, then we have eliminated $(2NM - 1) - (2N - 1) - L(2M-1) = 2NM - 2N - 2ML + L$ nodes, a significantly large portion of the total!
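The node-count bookkeeping is easy to verify numerically. Below, $N$ matches the domain size used earlier and $M = 12$ comes from the text, while the value of $L$ is purely illustrative:

```python
# Node counts for the flat CDF tree vs. the memoized two-level layout.
N = 256 * 512    # atom positions (domain size from the text)
M = 12           # potential events per atom
L = 3000         # active neighborhoods in the hash table (illustrative)

naive = 2 * N * M - 1                       # one leaf per (atom, event) pair
memoized = (2 * N - 1) + L * (2 * M - 1)    # atom tree + per-neighborhood event trees

saved = naive - memoized
assert saved == 2 * N * M - 2 * N - 2 * M * L + L
print(saved, naive)   # roughly 89% of the nodes are eliminated for these values
```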

## Simulation Results

We perform the Ga droplet formation and crystallization simulations, varying the Ga thickness $\theta$ in order to establish a relationship between the number of MC steps (which increases with $\theta$) and wall clock time. This gives us a rough measure of performance. We do so both with and without memoization. When memoization is turned off, we explicitly compute the rates of all affected atoms at each MC step. Here is a plot of wall clock time vs. MC steps:

Here the blue triangles correspond to simulations with memoization, while red circles are simulations without memoization. We see the expected linear relationship between time and MC steps, and by examining the slope of the fitted lines, we observe a 10x speed-up when memoization is implemented.

Moreover, we counted the number of times the hash table was queried in a specific simulation. When we deposit $\theta = 2.0$ monolayers of Ga and fully crystallize as described above, the hash table was queried 557,821,194 times over 18,532,977 MC steps (an average of about 30 neighborhood queries per step). Among those queries, 557,750,012 found the neighborhood already in the table with precomputed rates. That is, a neighborhood’s rates were found in the hash table 99.987% of the time.