diff --git a/docs/soar_manual/05_ReinforcementLearning.md b/docs/soar_manual/05_ReinforcementLearning.md
index f49665a3..7f1d959f 100644
--- a/docs/soar_manual/05_ReinforcementLearning.md
+++ b/docs/soar_manual/05_ReinforcementLearning.md
@@ -1,3 +1,4 @@
+
{{manual_wip_warning}}
# Reinforcement Learning
@@ -7,81 +8,91 @@ knowledge based on a given reward function. This chapter describes the RL
mechanism and how it is integrated with production memory, the decision cycle,
and the state stack. We assume that the reader is familiar with basic
reinforcement learning concepts and notation. If not, we recommend first
-reading *Reinforcement Learning: An Introduction* (1998) by Richard S. Sutton and
+reading _Reinforcement Learning: An Introduction_ (1998) by Richard S. Sutton and
Andrew G. Barto. The detailed behavior of the RL mechanism is determined by
-numerous parameters that can be controlled and configured via the `rl` command.
-Please refer to the documentation for that command in section 9.4.2 on page 238.
+numerous parameters that can be controlled and configured via the
+[`rl` command](../reference/cli/cmd_rl.md).
## RL Rules
-Soar’s RL mechanism learns Q-values for state-operator[1](#footnote1)
-pairs. Q-values are stored as numeric-indifferent preferences created by
-specially formulated productions called **RL rules**. RL rules are identified
+???+ info
+    In this context, the term "state" refers to the state of the task or
+    environment, not a state identifier. For the rest of this chapter, capital
+    letter names such as `S1` will refer to identifiers and italic lowercase
+    names such as $s_1$ will refer to task states.
+
+Soar’s RL mechanism learns $Q$-values for state-operator
+pairs. $Q$-values are stored as numeric-indifferent preferences created by
+specially formulated productions called **RL rules**. RL rules are identified
by syntax. A production is an RL rule if and only if its left-hand side tests for
a proposed operator, its right-hand side creates a single numeric-indifferent
-preference, and it is not a template rule (see Section 5.4.2 for template
-rules). These constraints ease the technical requirements of identifying/
-updating RL rules and makes it easy for the agent programmer to add/ maintain RL
-capabilities within an agent. We define an
-**RL operator** as an operator with numeric-indifferent preferences created by
-RL rules.
+preference, and it is not a template rule (see
+[rule templates](#rule-templates)). These constraints ease the technical
+requirements of identifying/updating RL rules and make it easy for the agent
+programmer to add/maintain RL capabilities within an agent. We define an **RL
+operator** as an operator with numeric-indifferent preferences created by RL
+rules.
The following is an RL rule:
```Soar
sp {rl*3*12*left
- (state ^name task-name
- ^x 3
- ^y 12
- ^operator +)
- ( ^name move
- ^direction left)
--->
-( ^operator = 1.5)
+    (state <s> ^name task-name
+               ^x 3
+               ^y 12
+               ^operator <o> +)
+    (<o> ^name move
+         ^direction left)
+    -->
+    (<s> ^operator <o> = 1.5)
}
```
-
-
Note that the LHS of the rule can test for anything as long as it contains a
test for a proposed operator. The RHS is constrained to exactly one action:
creating a numeric-indifferent preference for the proposed operator.
The following are not RL rules:
-```Soar
+```Soar hl_lines="4"
sp {multiple*preferences
-(state ^operator +)
--->
-( ^operator = 5, >)
+    (state <s> ^operator <o> +)
+    -->
+    (<s> ^operator <o> = 5, >) # (1)
}
+```
+
+1. Proposes multiple preferences for the proposed operator and thus does not
+ comply with the rule format
+```Soar hl_lines="5"
sp {variable*binding
-(state ^operator +
-^value )
--->
-( ^operator = )
+    (state <s> ^operator <o> +
+               ^value <v>)
+    -->
+    (<s> ^operator <o> = <v>) # (1)
}
+```
+
+1. Does not provide a constant for the numeric-indifferent preference value
+```Soar hl_lines="5"
sp {invalid*actions
-(state ^operator +)
--->
-( ^operator = 5)
-(write (crlf) |This is not an RL rule.|)
+    (state <s> ^operator <o> +)
+    -->
+    (<s> ^operator <o> = 5)
+    (write (crlf) |This is not an RL rule.|) # (1)
}
```
-The first rule proposes multiple preferences for the proposed operator and thus
-does not comply with the rule format. The second rule does not comply because it
-does not provide a constant for the numeric-indifferent preference value. The
-third rule does not comply because it includes a RHS function action in addition
-to the numeric-indifferent preference action.
+1. Includes an RHS function action in addition to the numeric-indifferent
+   preference action.
In the typical RL use case, the user intends for the agent to learn the best
operator in each possible state of the environment. The most straightforward way
to achieve this is to give the agent a set of RL rules, each matching exactly
-one possible state-operator pair. This approach is equivalent to a table-based
-RL algorithm, where the Q-value of each state- operator pair corresponds to the
+one possible state-operator pair. This approach is equivalent to a table-based
+RL algorithm, where the $Q$-value of each state-operator pair corresponds to the
numeric-indifferent preference created by exactly one RL rule.
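+
+For example, a task with 16 distinct environment states and 4 operators
+available in each would need $16 \times 4 = 64$ such RL rules, one per
+state-operator pair.
+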
In the more general case, multiple RL rules can match a single state-operator
@@ -91,26 +102,34 @@ memory context, and multiple rules can modify the preferences for a single
operator, and a single rule can be instantiated multiple ways to modify
preferences for multiple operators. For RL in Soar, all numeric-indifferent
preferences for an operator are summed when calculating the operator’s
-Q-value[2](#footnote2). In this context, RL rules can be interpreted more generally as binary
-features in a linear approximator of each state-operator pair’s Q-value, and
-their numeric-indifferent preference values their weights. In other words,
+$Q$-value.
+
+???+ info
+ This is assuming the value of
+ [**numeric-indifferent-mode**](../reference/cli/cmd_decide.md#decide-numeric-indifferent-mode)
+ is set to **sum**. In general, the RL mechanism only works correctly when
+ this is the case, and we assume this case in the rest of the chapter.
+
+In this context, RL rules can be interpreted
+more generally as binary features in a linear approximator of each
+state-operator pair’s $Q$-value, and their numeric-indifferent preference values
+as their weights. In other words,
$$Q(s, a) = w_1 \phi_1 (s, a) + w_2 \phi_2 (s, a) + \ldots + w_n \phi_n (s, a)$$
where all RL rules in production memory are numbered $1 \dots n$, $Q(s, a)$ is
-the Q-value of the state-operator pair $(s, a)$, $w_i$ is the
+the $Q$-value of the state-operator pair $(s, a)$, $w_i$ is the
numeric-indifferent preference value of RL rule $i$, $\phi_i (s, a) = 0$ if RL
-rule $i$ does not match $(s, a)$, and $\phi_i (s, a) = 1$ if it does. This
+rule $i$ does not match $(s, a)$, and $\phi_i (s, a) = 1$ if it does. This
interpretation allows RL rules to simulate a number of popular function
approximation schemes used in RL such as tile coding and sparse coding.
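+
+To make the linear-feature interpretation concrete, here is a sketch in the
+spirit of the earlier `rl*3*12*left` example (the weights are illustrative):
+two RL rules that act as binary features for the same move-left operator. When
+both match, the operator's $Q$-value is the sum of their weights,
+$2.1 + (-0.6) = 1.5$.
+
+```Soar
+sp {rl*feature*x3*left
+    (state <s> ^x 3
+               ^operator <o> +)
+    (<o> ^name move
+         ^direction left)
+    -->
+    (<s> ^operator <o> = 2.1)
+}
+
+sp {rl*feature*y12*left
+    (state <s> ^y 12
+               ^operator <o> +)
+    (<o> ^name move
+         ^direction left)
+    -->
+    (<s> ^operator <o> = -0.6)
+}
+```
+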
-
## Reward Representation
RL updates are driven by reward signals. In Soar, these reward signals are given
to the RL mechanism through a working memory link called the **reward-link**.
Each state in Soar’s state stack is automatically populated with
-a reward-link structure upon creation. Soar will check each structure for a
+a `reward-link` structure upon creation. Soar will check each structure for a
numeric reward signal for the last operator executed in the associated state at
the beginning of every decision phase. Reward is also collected when the agent
is halted or a state is retracted.
@@ -122,25 +141,24 @@ In order to be recognized, the reward signal must follow this pattern:
(<r2> ^value [val])
```
-where `` is the reward-link identifier, `` is some intermediate
+where `<r1>` is the `reward-link` identifier, `<r2>` is some intermediate
identifier, and `[val]` is any constant numeric value. Any structure that does not
match this pattern is ignored. If there are multiple valid reward signals, their
-values are summed into a single reward signal. As an example, consider the
+values are summed into a single reward signal. As an example, consider the
following state:
```Soar
(S1 ^reward-link R1)
-(R1 ^reward R2)
-(R2 ^value 1.0)
-(R1 ^reward R3)
-(R3 ^value -0.2)
+ (R1 ^reward R2)
+ (R2 ^value 1.0)
+ (R1 ^reward R3)
+ (R3 ^value -0.2)
```
In this state, there are two reward signals with values 1.0 and -0.2. They will
be summed together for a total reward of 0.8 and this will be the value given to
the RL update algorithm.
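+
+Reward signals are typically created by the agent's own rules. As a hedged
+sketch (the `^reward-value` input attribute is hypothetical), a rule along
+these lines would copy a value from the input-link into the structure shown
+above:
+
+```Soar
+sp {elaborate*reward*from*input
+    (state <s> ^reward-link <r>
+               ^io.input-link.reward-value <v>)
+    -->
+    (<r> ^reward.value <v>)
+}
+```
+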
-
There are two reasons for requiring the intermediate identifier. The first is so
that multiple reward signals with the same value can exist simultaneously. Since
working memory is a set, multiple WMEs with identical values in all three
@@ -148,21 +166,21 @@ positions (identifier, attribute, value) cannot exist simultaneously. Without an
intermediate identifier, specifying two rewards with the same value would
require a WME structure such as
-
```Soar
(S1 ^reward-link R1)
-(R1 ^reward 1.0)
-(R1 ^reward 1.0)
+ (R1 ^reward 1.0)
+ (R1 ^reward 1.0)
```
-which is invalid. With the intermediate identifier, the rewards would be specified as
+which is invalid. With the intermediate identifier, the rewards would be
+specified as
```Soar
(S1 ^reward-link R1)
-(R1 ^reward R2)
-(R2 ^value 1.0)
-(R1 ^reward R3)
-(R3 ^value 1.0)
+ (R1 ^reward R2)
+ (R2 ^value 1.0)
+ (R1 ^reward R3)
+ (R3 ^value 1.0)
```
which is valid. The second reason for requiring an intermediate identifier in
@@ -173,23 +191,23 @@ or programmer. For example:
```Soar
(S1 ^reward-link R1)
-(R1 ^reward R2)
-(R2 ^value 1.0)
-(R2 ^source environment)
-(R1 ^reward R3)
-(R3 ^value -0.2)
-(R3 ^source intrinsic)
-(R3 ^duration 5)
+ (R1 ^reward R2)
+ (R2 ^value 1.0)
+ (R2 ^source environment)
+ (R1 ^reward R3)
+ (R3 ^value -0.2)
+ (R3 ^source intrinsic)
+ (R3 ^duration 5)
```
The `(R2 ^source environment)`, `(R3 ^source intrinsic)`, and `(R3 ^duration 5)`
WMEs are arbitrary and ignored by RL, but were added by the agent to keep track
of where the rewards came from and for how long.
-Note that the reward-link is not part of the io structure and is not modified
+Note that the `reward-link` is not part of the io structure and is not modified
directly by the environment. Reward information from the environment should be
-copied, via rules, from the input-link to the reward-link. Also note that when
-collecting rewards, Soar simply scans the reward-link and sums the values of all
+copied, via rules, from the `input-link` to the `reward-link`. Also note that when
+collecting rewards, Soar simply scans the `reward-link` and sums the values of all
valid reward WMEs. The WMEs are not modified and no bookkeeping is done to keep
track of previously seen WMEs. This means that reward WMEs that exist for
multiple decision cycles will be collected multiple times if not removed or
@@ -201,236 +219,279 @@ Soar’s RL mechanism is integrated naturally with the decision cycle and perfor
online updates of RL rules. Whenever an RL operator is selected, the values of
the corresponding RL rules will be updated. The update can be on-policy (Sarsa)
or off-policy (Q-Learning), as controlled by the **learning-policy** parameter
-of the rl command. (See page 238.) Let $\delta_t$ be the amount of change for the
-Q-value of an RL operator in a single update. For Sarsa, we have
+of the [`rl`](../reference/cli/cmd_rl.md) command. Let $\delta_t$ be the amount
+of change for the $Q$-value of an RL operator in a single update. For Sarsa, we
+have
-$$ \delta_t = \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)
-\right] $$
+$$
+\delta_t = \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)
+\right]
+$$
where
-- $Q(s_t, a_t)$ is the Q-value of the state and chosen operator in decision cycle $t$.
-- $Q(s_{t+1}, a_{t+1})$ is the Q-value of the state and chosen RL operator in the next decision cycle.
-- $r_{t+1}$ is the total reward collected in the next decision cycle.
-- $\alpha$ and $\gamma$ are the settings of the learning-rate and discount-rate parameters of the `rl` command, respectively.
+- $Q(s_t, a_t)$ is the $Q$-value of the state and chosen operator in decision
+ cycle $t$.
+- $Q(s_{t+1}, a_{t+1})$ is the $Q$-value of the state and chosen RL operator in
+ the next decision cycle.
+- $r_{t+1}$ is the total reward collected in the next decision cycle.
+- $\alpha$ and $\gamma$ are the settings of the `learning-rate` and
+ `discount-rate` parameters of the `rl` command, respectively.
Note that since $\delta_t$ depends on $Q(s_{t+1}, a_{t+1})$, the update for the
operator selected in decision cycle $t$ is not applied until the next RL
-operator is chosen. For Q-Learning, we have
-$$ \delta_t = \alpha \left[ r_{t+1} + \gamma \underset{a \in A_{t+1}}{\max} Q(s_{t+1}, a) - Q(s_t, a_t) \right] $$
+operator is chosen. For Q-Learning, we have
+
+$$
+\delta_t = \alpha \left[ r_{t+1} + \gamma \underset{a \in A_{t+1}}{\max}
+Q(s_{t+1}, a) - Q(s_t, a_t) \right]
+$$
+
where $A_{t+1}$ is the set of RL operators proposed in the next decision cycle.
-Finally, $\delta_t$ is divided by the number of RL rules comprising the Q-value
+Finally, $\delta_t$ is divided by the number of RL rules comprising the $Q$-value
for the operator, and the numeric-indifferent value for each RL rule is updated
by that amount.
An example walkthrough of a Sarsa update with $\alpha = 0.3$ and $\gamma = 0.9$
(the default settings in Soar) follows.
+1. In decision cycle $t$, an operator `O1` is proposed, and RL rules `rl-1`
+ and `rl-2` create the following numeric-indifferent preferences for it:
-1. In decision cycle $t$, an operator `O1` is proposed, and RL rules **rl-1**
- and **rl-2** create the following numeric-indifferent preferences for it:
- ```
- rl-1: (S1 ^operator O1 = 2.3)
- rl-2: (S1 ^operator O1 = -1)
- ```
- The Q-value for `O1` is $Q(s_t, \textbf{O1}) = 2.3 - 1 = 1.3$.
+ ```Soar
+ rl-1: (S1 ^operator O1 = 2.3)
+ rl-2: (S1 ^operator O1 = -1)
+ ```
-2. `O1` is selected and executed, so $Q(s_t, a_t) = Q(s_t, \textbf{O1}) = 1.3$.
+ The $Q$-value for `O1` is $Q(s_t, \textbf{O1}) = 2.3 - 1 = 1.3$.
-3. In decision cycle $t+1$, a total reward of 1.0 is collected on the
- reward-link, an operator O2is proposed, and another RL rule **rl-3** creates the
- following numeric-indifferent preference for it:
- ```
- rl-3: (S1 ^operator O2 = 0.5)
- ```
+2. `O1` is selected and executed, so $Q(s_t, a_t) = Q(s_t, \textbf{O1}) = 1.3$.
- So $Q(s_{t+1}, \textbf{O2}) = 0.5$.
+3. In decision cycle $t+1$, a total reward of 1.0 is collected on the
+ `reward-link`, an operator `O2` is proposed, and another RL rule `rl-3`
+ creates the following numeric-indifferent preference for it:
-4. `O2` is selected, so $Q(s_{t+1}, a_{t+1}) = Q(s_{t+1}, \textbf{O2}) = 0.5$ Therefore,
+ ```Soar
+ rl-3: (S1 ^operator O2 = 0.5)
+ ```
-$$\delta_t = \alpha \left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] = 0.3 \times [ 1.0 + 0.9 \times 0.5 - 1.3 ] = 0.045$$
+ So $Q(s_{t+1}, \textbf{O2}) = 0.5$.
-Since rl-1 and rl-2 both contributed to the Q-value of O1, $\delta_t$ is evenly divided
-amongst them, resulting in updated values of
+4. `O2` is selected, so $Q(s_{t+1}, a_{t+1}) = Q(s_{t+1}, \textbf{O2}) = 0.5$. Therefore,
-```
-rl-1: ( ^operator = 2.3225)
-rl-2: ( ^operator = -0.9775)
-```
+ $$
+ \delta_t = \alpha \left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t,
+ a_t) \right] = 0.3 \times [ 1.0 + 0.9 \times 0.5 - 1.3 ] = 0.045
+ $$
+
+ Since `rl-1` and `rl-2` both contributed to the $Q$-value of `O1`,
+ $\delta_t$ is evenly divided amongst them, resulting in updated values of
-5. **rl-3** will be updated when the next RL operator is selected.
+ ```Soar
+    rl-1: (S1 ^operator O1 = 2.3225)
+    rl-2: (S1 ^operator O1 = -0.9775)
+ ```
+
+5. `rl-3` will be updated when the next RL operator is selected.
### Gaps in Rule Coverage
-The previous description had assumed that RL operators were selected in both decision
-cyclestandt+ 1. If the operator selected int+ 1 is not an RL operator, thenQ(st+1,at+1)
-would not be defined, and an update for the RL operator selected at timetwill be undefined.
-We will call a sequence of one or more decision cycles in which RL operators are not selected
-between two decision cycles in which RL operators are selected agap. Conceptually, it is
-desirable to use the temporal difference information from the RL operator after the gap to
-update the Q-value of the RL operator before the gap. There are no intermediate storage
-locations for these updates. Requiring that RL rules support operators at every decision
-can be difficult for agent programmers, particularly for operators that do not represent steps
-in a task, but instead perform generic maintenance functions, such as cleaning processed
-output-link structures.
-
-To address this issue, Soar's RL mechanism supports automatic propagation of updates over gaps.
-For a gap of length $n$, the Sarsa update is
-$$\delta_t = \alpha \left[ \sum_{i=t}^{t+n}{\gamma^{i-t} r_i} + \gamma^{n+1} Q(s_{t+n+1}, a_{t+n+1}) - Q(s_t, a_t) \right]$$
+The previous description assumed that RL operators were selected in both
+decision cycles $t$ and $t+1$. If the operator selected in $t+1$ is not an RL
+operator, then $Q(s_{t+1}, a_{t+1})$ would not be defined, and the update for
+the RL operator selected at time $t$ would be undefined. We will call a
+sequence of one or more decision cycles in which RL operators are not selected
+between two decision cycles in which RL operators are selected a _gap_.
+Conceptually, it is desirable to use the temporal difference information from
+the RL operator after the gap to update the $Q$-value of the RL operator
+before the gap. There are no
+intermediate storage locations for these updates. Requiring that RL rules
+support operators at every decision can be difficult for agent programmers,
+particularly for operators that do not represent steps in a task, but instead
+perform generic maintenance functions, such as cleaning processed output-link
+structures.
+
+To address this issue, Soar's RL mechanism supports automatic propagation of
+updates over gaps. For a gap of length $n$, the Sarsa update is
+
+$$
+\delta_t = \alpha \left[ \sum_{i=t}^{t+n}{\gamma^{i-t} r_i} + \gamma^{n+1}
+Q(s_{t+n+1}, a_{t+n+1}) - Q(s_t, a_t) \right]
+$$
+
and the Q-Learning update is
-$$\delta_t = \alpha \left[ \sum_{i=t}^{t+n}{\gamma^{i-t} r_i} + \gamma^{n+1} \underset{a \in A_{t+n+1}}{\max} Q(s_{t+n+1}, a) - Q(s_t, a_t) \right]$$
-Note that rewards will still be collected during the gap, but they are discounted based on the number of decisions they are removed from the initial RL operator.
+$$
+\delta_t = \alpha \left[ \sum_{i=t}^{t+n}{\gamma^{i-t} r_i} + \gamma^{n+1}
+\underset{a \in A_{t+n+1}}{\max} Q(s_{t+n+1}, a) - Q(s_t, a_t) \right]
+$$
-Gap propagation can be disabled by setting the **temporal-extension** parameter of the
-rl command to off. When gap propagation is disabled, the RL rules preceding a gap are
-updated usingQ(st+1,at+1) = 0. The rl setting of the watch command (see Section 9.6.1
-on page 259) is useful in identifying gaps.
+Note that rewards will still be collected during the gap, but they are
+discounted based on the number of decisions they are removed from the initial
+RL operator.
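+
+For example, with the default $\alpha = 0.3$ and $\gamma = 0.9$ and a gap of
+length $n = 1$, the Sarsa update above becomes
+
+$$
+\delta_t = 0.3 \left[ r_t + 0.9\, r_{t+1} + 0.81\, Q(s_{t+2}, a_{t+2}) -
+Q(s_t, a_t) \right]
+$$
+
+so the reward collected during the gap is discounted once and the post-gap
+$Q$-value twice.
+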
-![Example Soar substate operator trace.](Images/rl-optrace.svg)
+Gap propagation can be disabled by setting the **temporal-extension** parameter
+of the [`rl` command](../reference/cli/cmd_rl.md) to `off`. When gap propagation
+is disabled, the RL rules preceding a gap are updated using $Q(s_{t+1}, a_{t+1})
+= 0$. The `rl` setting of the [`watch`](../reference/cli/cmd_trace.md) command is
+useful in identifying gaps.
### RL and Substates
-When an agent has multiple states in its state stack, the RL mechanism will treat each
-substate independently. As mentioned previously, each state has its own reward-link.
-When an RL operator is selected in a stateS, the RL updates for that operator are only
-affected by the rewards collected on the reward-link for Sand the Q-values of subsequent
-RL operators selected inS.
+When an agent has multiple states in its state stack, the RL mechanism will
+treat each substate independently. As mentioned previously, each state has its
+own `reward-link`. When an RL operator is selected in a state `S`, the RL updates
+for that operator are only affected by the rewards collected on the `reward-link`
+for `S` and the $Q$-values of subsequent RL operators selected in `S`.
The only exception to this independence is when a selected RL operator forces an
operator- no-change impasse. When this occurs, the number of decision cycles the
RL operator at the superstate remains selected is dependent upon the processing
-in the impasse state. Consider the operator trace in Figure 5.1.
-
-- At decision cycle 1, RL operatorO1is selected inS1and causes an
-operator-no-change impass for three decision cycles.
-- In the substateS2,
-operatorsO2,O3, andO4are selected and applied sequentially.
-- Meanwhile inS1,
-rewardsr 2 ,r 3 , andr 4 are put on thereward-linksequentially. - Finally, the
-impasse is resolved by O4, the proposal for O1 is retracted, and RL operatorO5is
-selected inS1.
+in the impasse state. Consider the operator trace in the following figure:
-In this scenario, only the RL update forQ(s 1 ,O1) will be different from the
-ordinary case. Its value depends on the setting of the **hrl-discount**
-parameter of the rlcommand. When this parameter is set to the default valueon,
-the rewards onS1and the Q-value of O5are discounted by the number of decision
-cycles they are removed from the selection of O1. In this case the update for $Q(s_1, \textbf{O1})$ is
+![Example Soar substate operator trace.](Images/rl-optrace.svg)
-$$\delta_1 = \alpha \left[ r_2 + \gamma r_3 + \gamma^2 r_4 + \gamma^3 Q(s_5, \textbf{O5}) - Q(s_1, \textbf{O1}) \right]$$
+- At decision cycle 1, RL operator `O1` is selected in `S1` and causes an
+  operator-no-change impasse for three decision cycles.
+- In the substate `S2`, operators `O2`, `O3`, and `O4` are selected and
+  applied sequentially.
+- Meanwhile in `S1`, rewards $r_2$, $r_3$, and $r_4$ are put on the
+  `reward-link` sequentially.
+- Finally, the impasse is resolved by `O4`, the proposal for `O1` is retracted,
+ and RL operator `O5` is selected in `S1`.
+
+In this scenario, only the RL update for $Q(s_1, \textbf{O1})$ will be different
+from the ordinary case. Its value depends on the setting of the `hrl-discount`
+parameter of the [`rl` command](../reference/cli/cmd_rl.md). When this
+parameter is set to the default value `on`, the rewards on `S1` and the $Q$-value
+of `O5` are discounted by the number of decision cycles they are removed from the
+selection of `O1`. In this case the update for $Q(s_1, \textbf{O1})$ is
+
+$$
+\delta_1 = \alpha \left[ r_2 + \gamma r_3 + \gamma^2 r_4 + \gamma^3 Q(s_5,
+\textbf{O5}) - Q(s_1, \textbf{O1}) \right]
+$$
which is equivalent to having a three decision gap separating `O1` and `O5`.
-When hrl-discount is set to off, the number of cycles O1has been impassed will be
-ignored. Thus the update would be
+When `hrl-discount` is set to `off`, the number of cycles `O1` has been impassed
+will be ignored. Thus the update would be
-$$\delta_1 = \alpha \left[ r_2 + r_3 + r_4 + \gamma Q(s_5, \textbf{O5}) - Q(s_1, \textbf{O1}) \right]$$
+$$
+\delta_1 = \alpha \left[ r_2 + r_3 + r_4 + \gamma Q(s_5, \textbf{O5}) -
+Q(s_1, \textbf{O1}) \right]
+$$
For impasses other than operator no-change, RL acts as if the impasse hadn’t
-occurred. If O1is the last RL operator selected before the impasse,r 2 the
-reward received in the decision cycle immediately following, and On, the first
-operator selected after the impasse, thenO1 is updated with
+occurred. If `O1` is the last RL operator selected before the impasse, $r_2$ is
+the reward received in the decision cycle immediately following, and $O_n$ is
+the first operator selected after the impasse, then `O1` is updated with
-$$\delta_1 = \alpha \left[ r_2 + \gamma Q(s_n, \textbf{O}_\textbf{n}) - Q(s_1, \textbf{O1}) \right]$$
+$$
+\delta_1 = \alpha \left[ r_2 + \gamma Q(s_n, \textbf{O}_\textbf{n}) - Q(s_1,
+\textbf{O1}) \right]
+$$
If an RL operator is selected in a substate immediately prior to the state’s
retraction, the RL rules will be updated based only on the reward signals
-present and not on the Q-values of future operators. This point is not covered
+present and not on the $Q$-values of future operators. This point is not covered
in traditional RL theory. The retraction of a substate corresponds to a
suspension of the RL task in that state rather than its termination, so the last
update assumes the lack of information about future rewards rather than the
discontinuation of future rewards. To handle this case, the numeric-indifferent
preference value of each RL rule is stored as two separate values, the expected
-current reward(ECR) and expected future reward (EFR). The ECR is an estimate of
+current reward (ECR) and expected future reward (EFR). The ECR is an estimate of
the expected immediate reward signal for executing the corresponding RL
-operator. The EFR is an estimate of the time discounted Q-value of the next RL
+operator. The EFR is an estimate of the time-discounted $Q$-value of the next RL
operator. Normal updates correspond to traditional RL theory (showing the Sarsa
case for simplicity):
-$$ \delta_{ECR} = \alpha \left[ r_t - ECR(s_t, a_t) \right] $$
-
-$$ \delta_{EFR} = \alpha \left[ \gamma Q(s_{t+1}, a_{t+1}) - EFR(s_t, a_t) \right] $$
-
-$$ \delta_t = \delta_{ECR} + \delta_{EFR} $$
-
-$$ = \alpha \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) - \left( ECR(s_t, a_t) + EFR(s_t, a_t) \right) \right] $$
-
-$$ = \alpha \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] $$
+$$
+\begin{aligned}
+\delta_{ECR} &= \alpha \left[ r_t - ECR(s_t, a_t) \right] \\
+\delta_{EFR} &= \alpha \left[ \gamma Q(s_{t+1}, a_{t+1}) - EFR(s_t, a_t)\right] \\
+\delta_t &= \delta_{ECR} + \delta_{EFR} \\
+ &= \alpha \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) - \left( ECR(s_t, a_t) +
+EFR(s_t, a_t) \right) \right] \\
+ &= \alpha \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]
+\end{aligned}
+$$
During substate retraction, only the ECR is updated based on the reward signals
present at the time of retraction, and the EFR is unchanged.
Soar’s automatic subgoaling and RL mechanisms can be combined to naturally implement
-hierarchical reinforcement learning algorithms such as MAXQ and options.
+hierarchical reinforcement learning algorithms such as `MAXQ` and options.
### Eligibility Traces
The RL mechanism supports eligibility traces, which can improve the speed of
-learning by updating RL rules across multiple sequential steps. The
-**eligibility-trace-decay-rate** and **eligibility-trace-tolerance** parameters
-*control this mechanism. By setting eligibility-trace-decay-rate to 0
-(de- fault), eligibility traces are in effect disabled. When eligibility traces
+learning by updating RL rules across multiple sequential steps.
+
+The `eligibility-trace-decay-rate` and `eligibility-trace-tolerance`
+parameters control this mechanism. By setting `eligibility-trace-decay-rate` to 0
+(default), eligibility traces are in effect disabled. When eligibility traces
are enabled, the particular algorithm used is dependent upon the learning
-policy. For Sarsa, the eligibility trace implementation isSarsa($\lambda$). For
-Q-Learning, the eligibility trace implementation is *Watkin's Q($\lambda$)*.
-
-#### Exploration
-
-The decide indifferent-selection command (page 198) determines how operators are
-selected based on their numeric-indifferent preferences. Although all the
-indifferent selection settings are valid regardless of how the
-numeric-indifferent preferences were arrived at, the epsilon-greedy and
-boltzmann settings are specifically designed for use with RL and cor- respond to
-the two most common exploration strategies. In an effort to maintain backwards
-compatibility, the default exploration policy is soft max. As a result, one
-should change to epsilon-greedy or boltzmann when the reinforcement learning
-mechanism is enabled.
-
-### GQ($\lambda$)
-
-Sarsa($\lambda$) and Watkin’s Q($\lambda$) help agents to solve the temporal
-credit assignment problem more quickly. However, if you wish to implement
-something akin to CMACs to generalize from experience, convergence is not
-guaranteed by these algorithms. GQ($\lambda$) is a gradient descent algorithm
-designed to ensure convergence when learning off-policy. Soar’s learning-policy
-can be set to
-**on-policy-gq-lambda** or **off-policy-gq-lambda** to increase the likelihood
-of convergence when learning under these conditions. If you should choose to use
-one of these algorithms, we recommend setting the rl **step-size-parameter** to
-something small, such as 0.01 in order to ensure that the secondary set of
-weights used by GQ($\lambda$)change slowly enough for efficient convergence.
+policy. For Sarsa, the eligibility trace implementation is _Sarsa($\lambda$)_. For
+Q-Learning, the eligibility trace implementation is _Watkin's Q($\lambda$)_.
+
+#### Exploration
+
+The [`decide indifferent-selection` command](../reference/cli/cmd_decide.md#decide-indifferent-selection)
+determines how operators are selected based on their
+numeric-indifferent preferences. Although all the indifferent selection
+settings are valid regardless of how the numeric-indifferent preferences were
+arrived at, the `epsilon-greedy` and `boltzmann` settings are specifically designed
+for use with RL and correspond to the two most common exploration strategies.
+In an effort to maintain backwards compatibility, the default exploration
+policy is `softmax`. As a result, one should change to `epsilon-greedy` or
+`boltzmann` when the reinforcement learning mechanism is enabled.
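+
+As a sketch, an agent source file might enable RL and switch to epsilon-greedy
+exploration with commands along the following lines (verify the exact parameter
+names against the `rl` and `decide` command documentation):
+
+```Soar
+rl --set learning on
+decide indifferent-selection --epsilon-greedy
+```
+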
+
+### GQ(λ)
+
+_Sarsa($\lambda$)_ and _Watkin’s Q($\lambda$)_ help agents to solve the
+temporal credit assignment problem more quickly. However, if you wish to
+implement something akin to CMACs to generalize from experience, convergence is
+not guaranteed by these algorithms. $GQ(\lambda)$ is a gradient descent
+algorithm designed to ensure convergence when learning off-policy. Soar’s
+`learning-policy` can be set to **on-policy-gq-lambda** or
+**off-policy-gq-lambda** to increase the likelihood of convergence when
+learning under these conditions. If you should choose to use one of these
+algorithms, we recommend setting the `rl` **step-size-parameter** to something
+small, such as 0.01, in order to ensure that the secondary set of weights used
+by $GQ(\lambda)$ change slowly enough for efficient convergence.
## Automatic Generation of RL Rules
The number of RL rules required for an agent to accurately approximate operator
-Q-values is usually unfeasibly large to write by hand, even for small domains.
+$Q$-values is usually far too large to write by hand, even for small domains.
Therefore, several methods exist to automate this.
### The gp Command
-The gp command can be used to generate productions based on simple patterns.
-This is useful if the states and operators of the environment can be
-distinguished by a fixed number of dimensions with finite domains. An example is
-a grid world where the states are described by integer row/column coordinates,
-and the available operators are to move north, south, east, or west. In this
-case, a single gp command will generate all necessary RL rules:
+The [`gp` command](../reference/cli/cmd_gp.md) can be used to generate
+productions based on simple patterns. This is useful if the states and
+operators of the environment can be distinguished by a fixed number of
+dimensions with finite domains. An example is a grid world where the states are
+described by integer row/column coordinates, and the available operators are to
+move north, south, east, or west. In this case, a single `gp` command will
+generate all necessary RL rules:
```Soar
gp {gen*rl*rules
-(state ^name gridworld
-^operator +
-^row [ 1 2 3 4 ]
-^col [ 1 2 3 4 ])
-( ^name move
-^direction [ north south east west ])
--->
-( ^operator = 0.0)
-}
+    (state <s> ^name gridworld
+               ^operator <o> +
+               ^row [ 1 2 3 4 ]
+               ^col [ 1 2 3 4 ])
+    (<o> ^name move
+         ^direction [ north south east west ])
+    -->
+    (<s> ^operator <o> = 0.0)
+}
```
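+
+Because `gp` expands the bracketed value lists combinatorially, this single
+command generates $4 \times 4 \times 4 = 64$ RL rules, one for each row,
+column, and direction combination.
+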
-For more information see the documentation for this command on page 205.
-
### Rule Templates
Rule templates allow Soar to dynamically generate new RL rules based on a
@@ -443,30 +504,31 @@ with 1000 rows and columns. Attempting to generate RL rules for each grid cell
and action a priori will result in $1000 \times 1000 \times 4 = 4 \times 10^6$
productions. However, if most of those cells are unreachable due to walls, then
the agent will never fire or update most of those productions. Templates give
-the programmer the convenience of the gp command without filling production
-memory with unnecessary rules.
+the programmer the convenience of the [`gp`
+command](../reference/cli/cmd_gp.md) without filling production memory with
+unnecessary rules.
Rule templates have variables that are filled in to generate RL rules as the
agent encounters novel combinations of variable values. A rule template is valid
if and only if it is marked with the **:template** flag and, in all other
-respects, adheres to the format of an RL rule. However, whereas an RL rule may
+respects, adheres to the format of an RL rule. However, whereas an RL rule may
only use constants as the numeric-indifferent preference value, a rule template
may use a variable. Consider the following rule template:
```Soar
sp {sample*rule*template
-:template
-(state ^operator +
-^value )
--->
-( ^operator = )
-}
+    :template
+    (state <s> ^operator <o> +
+               ^value <v>)
+    -->
+    (<s> ^operator <o> = <v>)
+}
```
-During agent execution, this rule template will match working memory and create new
-productions by substituting all variables in the rule template that matched against constant
-values with the values themselves. Suppose that the LHS of the rule template matched
-against the state
+During agent execution, this rule template will match working memory and create
+new productions by substituting all variables in the rule template that matched
+against constant values with the values themselves. Suppose that the LHS of the
+rule template matched against the state
```Soar
(S1 ^value 3.2)
@@ -477,23 +539,23 @@ Then the following production will be added to production memory:
```Soar
sp {rl*sample*rule*template*1
-(state ^operator +
-^value 3.2)
--->
-( ^operator = 3.2)
-}
+    (state <s> ^operator <o> +
+               ^value 3.2)
+    -->
+    (<s> ^operator <o> = 3.2)
+}
```
-The variable `` is replaced by3.2on both the LHS and the RHS, but `` and
-`` are not replaced because they matches against identifiers (S1andO1). As
-with other RL rules, the value of3.2on the RHS of this rule may be updated later
-by reinforcement learning, whereas the value of 3.2 on the LHS will remain
-unchanged. If `` had matched against a non-numeric constant, it will be
-replaced by that constant on the LHS, but the RHS numeric-indifference
+The variable `<v>` is replaced by `3.2` on both the LHS and the RHS, but `<s>`
+and `<o>` are not replaced because they matched against identifiers (`S1` and
+`O1`). As with other RL rules, the value of `3.2` on the RHS of this rule may
+be updated later by reinforcement learning, whereas the value of `3.2` on the LHS
+will remain unchanged. If `<v>` had matched against a non-numeric constant, it
+would be replaced by that constant on the LHS, but the RHS numeric-indifferent
preference value would be set to zero to make the new rule valid.
-The new production’s name adheres to the following pattern:rl*template-name*id,
-where template-name is the name of the originating rule template and id is
+The new production’s name adheres to the following pattern: `rl*template-name*id`,
+where `template-name` is the name of the originating rule template and `id` is a
monotonically increasing integer that guarantees the uniqueness of the name.
If an identical production already exists in production memory, then the newly
@@ -505,18 +567,7 @@ using the gp command or via custom scripting when possible.
### Chunking
Since RL rules are regular productions, they can be learned by chunking just
-like any other production. This method is more general than using the gp command
-or rule templates, and is useful if the environment state consists of
-arbitrarily complex relational structures that cannot be enumerated.
-
-## Footnotes
-
-- [1]: In this context, the term "state" refers to the
-state of the task or environment, not a state identifier. For the rest of this
-chapter, bold capital letter names such as S1 will refer to identifiers and italic
-lowercase names such as $s_1$ will refer to task states.
-- [2]: This is assuming the value of
-**numeric-indifferent-mode** is set to
-**sum**. In general, the RL mechanism only works correctly when this is the
-case, and we assume this case in the rest of the chapter. See page 198 for more
-information about this parameter.
+like any other production. This method is more general than using the
+[`gp` command](../reference/cli/cmd_gp.md) or rule templates, and is useful if
+the environment state consists of arbitrarily complex relational structures
+that cannot be enumerated.
diff --git a/docs/soar_manual/06_SemanticMemory.md b/docs/soar_manual/06_SemanticMemory.md
index eed10725..8069f2ef 100644
--- a/docs/soar_manual/06_SemanticMemory.md
+++ b/docs/soar_manual/06_SemanticMemory.md
@@ -1,14 +1,16 @@
+
{{manual_wip_warning}}
# Semantic Memory
-Soar’s semantic memory is a repository for long-term declarative knowledge, supplement-
-ing what is contained in short-term working memory (and production memory). Episodic
-memory, which contains memories of the agent’s experiences, is described in [Chapter 7](./07_EpisodicMemory.md). The
-knowledge encoded in episodic memory is organized temporally, and specific information is
-embedded within the context of when it was experienced, whereas knowledge in semantic
-memory is independent of any specific context, representing more general facts about the
-world.
+Soar’s semantic memory is a repository for long-term declarative knowledge,
+supplementing what is contained in short-term working memory (and production
+memory). Episodic memory, which contains memories of the agent’s experiences,
+is described in [Chapter 7](./07_EpisodicMemory.md). The knowledge encoded in
+episodic memory is organized temporally, and specific information is embedded
+within the context of when it was experienced, whereas knowledge in semantic
+memory is independent of any specific context, representing more general facts
+about the world.
This chapter is organized as follows: [semantic memory structures in working
memory](#working-memory-structure); [representation of knowledge in semantic
@@ -17,53 +19,60 @@ knowledge](#storing-semantic-knowledge); [retrieving semantic
knowledge](#retrieving-semantic-knowledge); and a [discussion of
performance](#performance). The detailed behavior of semantic memory is
determined by numerous parameters that can be controlled and configured via the
-**smem** command. Please refer to the documentation for that command in Section
-9.5.1 on page 243.
+[`smem` command](../reference/cli/cmd_smem.md).
## Working Memory Structure
-Upon creation of a new state in working memory (see Section 2.7.1 on page 28; Section 3.4 on
-page 85), the architecture creates the following augmentations to facilitate agent interaction
-with semantic memory:
+Upon creation of a new state in working memory (see
+[Impasse Types](02_TheSoarArchitecture.md#impasse-types);
+[Impasses in Working Memory and in Productions](03_SyntaxOfSoarPrograms.md#impasses-in-working-memory-and-in-productions)),
+the architecture creates the following augmentations to facilitate agent
+interaction with semantic memory:
```Soar
(<s> ^smem <smem>)
-( ^command )
-( ^result )
+  (<smem> ^command <smem-c>)
+  (<smem> ^result <smem-r>)
```
-As rules augment the command structure in order to access/change semantic knowledge (6.3,
-6.4), semantic memory augments the result structure in response. Production actions
-should not remove augmentations of the result structure directly, as semantic memory will
-maintain these WMEs.
-
-Figure 6.1: Example long-term identifier with four augmentations.
+As rules augment the `command` structure in order to access/change semantic
+knowledge
+([Storing Semantic Knowledge](#storing-semantic-knowledge),
+[Retrieving Semantic Knowledge](#retrieving-semantic-knowledge)),
+semantic memory augments the `result` structure in response. Production actions
+should not remove augmentations of the `result` structure directly, as semantic
+memory will maintain these WMEs.
## Knowledge Representation
-The representation of knowledge in semantic memory is similar to that in working memory
-(see Section 2.2 on page 14) – both include graph structures that are composed of symbolic
-elements consisting of an identifier, an attribute, and a value. It is important to note,
-however, key differences:
+The representation of knowledge in semantic memory is similar to that in
+[working memory](02_TheSoarArchitecture.md#working-memory-the-current-situation)
+– both include graph structures that are composed of symbolic elements
+consisting of an identifier, an attribute, and a value. There are, however,
+key differences to note:
-Currently semantic memory only supports attributes that are symbolic constants
-(string, integer, or decimal), but not attributes that are identifiers
+- Currently semantic memory only supports attributes that are symbolic constants
+ (string, integer, or decimal), but not attributes that are identifiers
-Whereas working memory is a single, connected, directed graph, semantic memory can
-be disconnected, consisting of multiple directed, connected sub-graphs
+- Whereas working memory is a single, connected, directed graph, semantic
+ memory can be disconnected, consisting of multiple directed, connected
+ sub-graphs
-From Soar 9.6 onward, **Long-termidentifiers** (LTIs) are defined as identifiers that
-exist in semantic memory only. Each LTI is permanently associated with a specific number
-that labels it (e.g. @5 or @7). Instances of an LTI can be loaded into working memory as
-regular short-term identifiers (STIs) linked with that specific LTI. For clarity, when printed,
-a short-term identifier associated with an LTI is followed with the label of that LTI. For
-example, if the working memory ID L7 is associated with the LTI named `@29`, printing that
-STI would appear as `L7` (`@29`).
+From Soar 9.6 onward, **Long-term identifiers** (LTIs) are defined as
+identifiers that exist in semantic memory _only_. Each LTI is permanently
+associated with a specific number that labels it (e.g. `@5` or `@7`). Instances of
+an LTI can be loaded into working memory as regular short-term identifiers
+(STIs) linked with that specific LTI. For clarity, when printed, a short-term
+identifier associated with an LTI is followed with the label of that LTI. For
+example, if the working memory ID `L7` is associated with the LTI named `@29`,
+printing that STI would appear as `L7 (@29)`.
When presented in a figure, long-term identifiers will be indicated by a
-double-circle. For instance, Figure 6.1 depicts the long-term identifier @68,
-with four augmentations, representing the addition fact of `6 + 7 = 13` (or,
-rather, 3, carry 1, in context of multi-column arithmetic).
+double-circle. For instance, the following figure depicts the long-term
+identifier `@1`, with four augmentations, representing the addition fact of
+$6 + 7 = 13$ (or, rather, 3, carry 1, in context of multi-column arithmetic).
+
+![Example long-term identifier with four augmentations.](Images/smem-concept.svg)
### Integrating Long-Term Identifiers with Soar
@@ -71,97 +80,109 @@ Integrating long-term identifiers in Soar presents a number of theoretical and
implementation challenges. This section discusses the state of integration with
each of Soar’s memories/learning mechanisms.
-
-#### Working Memory
-
-Long-term identifiers themselves never exist in working memory. Rather, instances of long
-term memories are loaded into working memory as STIs through queries or retrievals, and
-manipulated just like any other WMEs. Changes to any STI augmentations do not directly
-have any effect upon linked LTIs in semantic memory. Changes to LTIs themselves only
-occur though store commands on the command link or through command-line directives
-such as `smem --add` (see below).
-
-Each time an agent loads an instance of a certain LTI from semantic memory into working
-memory using queries or retrievals, the instance created will always be a new unique STI.
-This means that if same long-term memory is retrieved multiple times in succession, each
-retrieval will result in a different STI instance, each linked to the same LTI. A benefit of this
-is that a retrieved long-term memory can be modified without compromising the ability to
-recall what the actual stored memory is.^1
-
-(^1) Before Soar 9.6, LTIs were themselves retrieved into working memory. This meant all augmentations
-to such IDs, whether from the original retrieval or added after retrieval, would always be merged under the
-same ID, unless deep-copy was used to make a duplicate short-term memory.
-
-#### Procedural Memory
-
-Soar productions can use various conditions to test whether an STI is associated with an
-LTI or whether two STIs are linked to the same LTI (see Section 3.3.5.3 on page 53). LTI
+#### Working Memory
+
+Long-term identifiers themselves never exist in working memory. Rather,
+instances of long-term memories are loaded into working memory as STIs through
+queries or retrievals, and manipulated just like any other WMEs. Changes to any
+STI augmentations do not directly have any effect upon linked LTIs in semantic
+memory. Changes to LTIs themselves only occur through `store` commands on the
+command link or through command-line directives such as `smem --add` (see
+below).
+
+Each time an agent loads an instance of a certain LTI from semantic memory into
+working memory using queries or retrievals, the instance created will always be
+a new unique STI. This means that if the same long-term memory is retrieved
+multiple times in succession, each retrieval will result in a different STI
+instance, each linked to the same LTI. A benefit of this is that a retrieved
+long-term memory can be modified without compromising the ability to recall
+what the actual stored memory is.
+
+???+ info
+ Before Soar 9.6, LTIs were themselves retrieved into working memory. This
+ meant all augmentations to such IDs, whether from the original retrieval or
+ added after retrieval, would always be merged under the same ID, unless
+ deep-copy was used to make a duplicate short-term memory.
+
+#### Procedural Memory
+
+Soar productions can use various conditions to test whether an STI is
+associated with an LTI or whether two STIs are linked to the same LTI (see
+[Predicates for Values](03_SyntaxOfSoarPrograms.md#predicates-for-values)). LTI
names (e.g. `@6`) may not appear in the action side of productions.
-#### Episodic Memory
+#### Episodic Memory
-Episodic memory (see Section 7 on page 157) faithfully captures LTI-linked STIs, including
-the episode of transition. Retrieved episodes contain STIs as they existed during the episode,
-regardless of any changes to linked LTIs that transpired since the episode occurred.
+[Episodic memory](07_EpisodicMemory.md) faithfully captures LTI-linked
+STIs, including the episode of transition. Retrieved episodes contain STIs as
+they existed during the episode, regardless of any changes to linked LTIs that
+transpired since the episode occurred.
## Storing Semantic Knowledge
-###Store command
+### Store command
-An agent stores a long-term identifier in semantic memory by creating a **^store** command:
-this is a WME whose identifier is the command link of a state’s smem structure, the attribute
-is store, and the value is a short-term identifier.
+An agent stores a long-term identifier in semantic memory by creating a
+`^store` command: this is a WME whose identifier is the command link of a
+state’s `smem` structure, the attribute is `store`, and the value is a short-term
+identifier.
```Soar
<s> ^smem.command.store <identifier>
```
-Semantic memory will encode and store all WMEs whose identifier is the value of the store
-command. Storing deeper levels of working memory is achieved through multiple store commands.
+Semantic memory will encode and store all WMEs whose identifier is the value of
+the `store` command. Storing deeper levels of working memory is achieved through
+multiple `store` commands.
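+
+For example, a rule along the following lines (the `^object-to-store`
+attribute is purely illustrative) would issue such a command for an object in
+working memory:
+
+```Soar
+sp {smem*request*store
+    (state <s> ^smem.command <cmd>
+               ^object-to-store <obj>)
+    -->
+    (<cmd> ^store <obj>)
+}
+```
+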
-Multiple store commands can be issued in parallel. Storage commands are processed on
-every state at the end of every phase of every decision cycle. Storage is guaranteed to
-succeed and a status WME will be created, where the identifier is the **^result** link of the
-smem structure of that state, the attribute is success, and the value is the value of the store
-command above.
+Multiple `store` commands can be issued in parallel. Storage commands are
+processed on every state at the end of every phase of every decision cycle.
+Storage is guaranteed to succeed and a status WME will be created, where the
+identifier is the `^result` link of the `smem` structure of that state, the
+attribute is `success`, and the value is the value of the `store` command above.
```Soar
<s> ^smem.result.success <identifier>
```
-If the identifier used in the store command is not linked to any existing LTIs, a new LTI
-will be created in smem and the stored STI will be linked to it. If the identifier used in
-the store command is already linked to an LTI, the store will overwrite that long-term
-memory. For example, if an existing LTI@5had augmentations^A do ^B re ^C mi, and a
-storecommand stored short-term identifierL35which was linked to@5but had only the
-augmentation^D fa, the LTI@5would be changed to only have^D fa.
+If the identifier used in the store command is not linked to any existing LTIs,
+a new LTI will be created in smem and the stored STI will be linked to it. If
+the identifier used in the store command is already linked to an LTI, the store
+will overwrite that long-term memory. For example, if an existing LTI `@5` had
+augmentations `^A do` `^B re` `^C mi`, and a `store` command stored short-term
+identifier `L35` which was linked to `@5` but had only the augmentation
+`^D fa`, the LTI `@5` would be changed to only have `^D fa`.
### Store-new command
-The **^store-new** command structure is just like the ^store command, except that smem
-will always store the given memory as an entirely new structure, regardless of whether the
-given STI was linked to an existing LTI or not. Any STIs that don’t already have links will
-get linked to the newly created LTIs. But if a stored STI was already linked to some LTI,
-Soar will not re-link it to the newly created LTI.
+The `^store-new` command structure is just like the `^store` command, except
+that smem will always store the given memory as an entirely new structure,
+regardless of whether the given STI was linked to an existing LTI or not. Any
+STIs that don’t already have links will get linked to the newly created LTIs.
+But if a stored STI was already linked to some LTI, Soar will not re-link it to
+the newly created LTI.
-If this behavior is not desired, the agent can add a **^link-to-new-LTM yes** augmentation
-to override this behavior. One use for this setting is to allow chunking to backtrace through
-a stored memory in a manner that will be consistent with a later state of memory when the
-newly stored LTI is retrieved again.
+If this behavior is not desired, the agent can add a `^link-to-new-LTM yes`
+augmentation to override this behavior. One use for this setting is to allow
+chunking to backtrace through a stored memory in a manner that will be
+consistent with a later state of memory when the newly stored LTI is retrieved
+again.
### User-Initiated Storage
-Semantic memory provides agent designers the ability to store semantic knowledge via the
-**add** switch of the **smem** command (see Section 9.5.1 on page 243). The format of the
-command is nearly identical to the working memory manipulation components of the RHS
-of a production (i.e. no RHS-functions; see Section 3.3.6 on page 67). For instance:
+Semantic memory provides agent designers the ability to store semantic
+knowledge via the `add` switch of the [`smem` command](../reference/cli/cmd_smem.md).
+The format of the command is nearly identical to the working memory
+manipulation components of the RHS of a production (i.e. no RHS-functions; see
+[The action side of productions](03_SyntaxOfSoarPrograms.md#the-action-side-of-productions-or-rhs)).
+For instance:
```Soar
smem --add {
-( ^add10-facts )
-( ^digit1 1 ^digit-10 11)
-( ^digit1 2 ^digit-10 12)
-( ^digit1 3 ^digit-10 13)
+    (<arithmetic> ^add10-facts <a01> <a02> <a03>)
+    (<a01> ^digit1 1 ^digit-10 11)
+    (<a02> ^digit1 2 ^digit-10 12)
+    (<a03> ^digit1 3 ^digit-10 13)
}
```
@@ -170,68 +191,76 @@ command instance will add a new long-term identifier (represented by the
temporary ’arithmetic’ variable) with three augmentations. The value of each
augmentation will become an LTI with two constant attribute/value pairs.
Manual storage can be arbitrarily complex and use standard dot-notation. The add
-command also supports hardcoded LTI ids such as@1in place of variables.
+command also supports hardcoded LTI ids such as `@1` in place of variables.
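+
+For example, a hedged sketch that attaches facts directly to an existing LTI
+(the attribute names are illustrative) might look like:
+
+```Soar
+smem --add {
+    (@1 ^name dog ^legs 4)
+}
+```
+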
### Storage Location
-Semantic memory uses SQLite to facilitate efficient and standardized storage and querying of
-knowledge. The semantic store can be maintained in memory or on disk (per the database
-and path parameters; see Section 9.5.1). If the store is located on disk, users can use any
-standard SQLite programs/components to access/query its contents. However, using a disk-
-based semantic store is very costly (performance is discussed in greater detail in Section 6.5
-on page 155), and running in memory is recommended for most runs.
-
-Note that changes to storage parameters, for example database, path and append will
-not have an effect until the database is used after an initialization. This happens either
-shortly after launch (on first use) or after a database initialization command is issued. To
-switch databases or database storage types while running, set your new parameters and then
-perform an –init command.
-
-The **path** parameter specifies the file system path the database is stored in. When path is
-set to a valid file system path and database mode is set to file, then the SQLite database is
-written to that path.
-
-The **append** parameter will determine whether all existing facts stored in a database on
-disk will be erased when semantic memory loads. Note that this affects soar init also. In
-other words, if the append setting is off, all semantic facts stored to disk will be lost when
-a soar init is performed. For semantic memory,append mode is on by default.
-
-Note: As of version 9.3.3, Soar used a new schema for the semantic memory database.
-This means databases from 9.3.2 and below can no longer be loaded. A conversion utility is
-available in Soar 9.4 to convert from the old schema to the new one.
-
-The **lazy-commit** parameter is a performance optimization. If set to on(default), disk
-databases will not reflect semantic memory changes until the Soar kernel shuts down. This
-improves performance by avoiding disk writes. The optimization parameter (see Section
-
+Semantic memory uses SQLite to facilitate efficient and standardized storage
+and querying of knowledge. The semantic store can be maintained in memory or on
+disk (per the database and path parameters; see
+[`smem` command](../reference/cli/cmd_smem.md)). If the store is located on
+disk, users can use any standard SQLite programs/components to access/query its
+contents. However, using a disk-based semantic store is very costly
+(performance is discussed in greater detail in the
+[Performance](#performance) section), and running in memory is recommended for
+most runs.
+
+Note that changes to storage parameters, for example `database`, `path`, and
+`append`, will not have an effect until the database is used after an
+initialization. This happens either shortly after launch (on first use) or
+after a database initialization command is issued. To switch databases or
+database storage types while running, set your new parameters and then perform
+an `--init` command.
+
+The **path** parameter specifies the file system path the database is stored
+in. When `path` is set to a valid file system path and the `database` mode is
+set to `file`, then the SQLite database is written to that path.
+
+The **append** parameter will determine whether all existing facts stored in a
+database on disk will be erased when semantic memory loads. Note that this
+affects `soar init` also. In other words, if the `append` setting is `off`, all
+semantic facts stored to disk will be lost when a `soar init` is performed. For
+semantic memory, `append` mode is `on` by default.
+
+Note: As of version 9.3.3, Soar uses a new schema for the semantic memory
+database. This means databases from 9.3.2 and below can no longer be loaded. A
+conversion utility is available in Soar 9.4 to convert from the old schema to
+the new one.
+
+The **lazy-commit** parameter is a performance optimization. If set to `on`
+(the default), disk databases will not reflect semantic memory changes until
+the Soar kernel shuts down. This improves performance by avoiding disk writes.
+The **optimization** parameter (see [Performance](#performance)) will have an
+effect on whether databases on disk can be opened while the Soar kernel is
+running.
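+
+For example, an agent that wants a persistent on-disk store could set the
+parameters described above before the database is first used and then
+re-initialize. This is only a sketch; the file name is purely illustrative:
+
+```bash
+# hypothetical on-disk configuration
+smem --set database file
+smem --set path "semantic-store.db"
+smem --set append on
+smem --set lazy-commit on
+# apply the new storage settings
+smem --init
+```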
## Retrieving Semantic Knowledge
-An agent retrieves knowledge from semantic memory by creating an appropriate command
-(we detail the types of commands below) on the command link of a state’s smem structure.
-At the end of the output of each decision, semantic memory processes each state’s smem
-`^command` structure. Results, meta-data, and errors are added to the result structure of
-that state’s smems tructure.
+An agent retrieves knowledge from semantic memory by creating an appropriate
+command (we detail the types of commands below) on the `command` link of a
+state’s `smem` structure. At the end of the output phase of each decision,
+semantic memory processes each state’s smem `^command` structure. Results,
+meta-data, and errors are added to the result structure of that state’s `smem`
+structure.
-Only one type of retrieval command (which may include optional modifiers) can be issued
-per state in a single decision cycle. Malformed commands (including attempts at multiple
-retrieval types) will result in an error:
+Only one type of retrieval command (which may include optional modifiers) can
+be issued per state in a single decision cycle. Malformed commands (including
+attempts at multiple retrieval types) will result in an error:
```Soar
<s> ^smem.result.bad-cmd <cmd>
```
-Where the `` variable refers to the command structure of the state.
+Where the `<cmd>` variable refers to the `command` structure of the state.
-After a command has been processed, semantic memory will ignore it until some aspect of
-the command structure changes (via addition/removal of WMEs). When this occurs, the
-result structure is cleared and the new command (if one exists) is processed.
+After a command has been processed, semantic memory will ignore it until some
+aspect of the command structure changes (via addition/removal of WMEs). When
+this occurs, the result structure is cleared and the new command (if one
+exists) is processed.
### Non-Cue-Based Retrievals
-A non-cue-based retrieval is a request by the agent to reflect in working memory the current
-augmentations of an LTI in semantic memory. The command WME has a **retrieve**
-attribute and an LTI-linked identifier value:
+A non-cue-based retrieval is a request by the agent to reflect in working
+memory the current augmentations of an LTI in semantic memory. The command WME
+has a `retrieve` attribute and an LTI-linked identifier value:
```Soar
<s> ^smem.command.retrieve <lti>
@@ -250,54 +279,60 @@ Otherwise, two new WMEs will be placed on the result structure:
<s> ^smem.result.retrieved <lti>
```
-All augmentations of the long-term identifier in semantic memory will be created as new
-WMEs in working memory.
+All augmentations of the long-term identifier in semantic memory will be
+created as new WMEs in working memory.
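+
+For example, a rule along the following lines could issue a non-cue-based
+retrieval for an LTI-linked identifier already in working memory (a sketch; the
+`^expand` attribute is hypothetical bookkeeping, not part of the smem
+interface):
+
+```Soar
+sp {smem*sample*retrieve
+    (state <s> ^smem.command <cmd>
+               ^expand <lti>)
+-->
+    # request that all augmentations of the linked LTI be added to working memory
+    (<cmd> ^retrieve <lti>)
+}
+```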
### Cue-Based Retrievals
-A cue-based retrieval performs a search for a long-term identifier in semantic memory whose
-augmentations exactly match an agent-supplied cue, as well as optional cue modifiers.
+A cue-based retrieval performs a search for a long-term identifier in semantic
+memory whose augmentations exactly match an agent-supplied cue, as well as
+optional cue modifiers.
-A cue is composed of WMEs that describe the augmentations of a long-term identifier. A
-cue WME with a constant value denotes an exact match of both attribute and value. A
-cue WME with an LTI-linked identifier as its value denotes an exact match of attribute and
-linked LTI. A cue WME with a short-term identifier as its value denotes an exact match of
-attribute, but with any value (constant or identifier).
+A cue is composed of WMEs that describe the augmentations of a long-term
+identifier. A cue WME with a constant value denotes an exact match of both
+attribute and value. A cue WME with an LTI-linked identifier as its value
+denotes an exact match of attribute and linked LTI. A cue WME with a short-term
+identifier as its value denotes an exact match of attribute, but with any value
+(constant or identifier).
-A cue-based retrieval command has a **query** attribute and an identifier value, the cue:
+A cue-based retrieval command has a **query** attribute and an identifier
+value, the cue:
```Soar
<s> ^smem.command.query <cue>
```
-For instance, consider the following rule that creates a cue-based retrieval command:
+For instance, consider the following rule that creates a cue-based retrieval
+command:
```Soar
sp {smem*sample*query
-(state ^smem.command
-^lti
-^input-link.foo )
--->
-( ^query )
-( ^name
-^foo
-^associate
-^age 25)
-}
+    (state <s> ^smem.command <cmd>
+               ^lti <lti>
+               ^input-link.foo <foo>)
+    -->
+    (<cmd> ^query <q>)
+    (<q> ^name <any-name>
+         ^foo <foo>
+         ^associate <lti>
+         ^age 25)
+}
```
-In this example, assume that the `` variable will match a short-term identifier which is
-linked to a long-term identifier and that the `` variable will match a constant. Thus,
-the query requests retrieval of a long-term memory with augmentations that satisfy ALL of
-the following requirements:
+In this example, assume that the `<lti>` variable will match a short-term
+identifier which is linked to a long-term identifier and that the `<foo>`
+variable will match a constant. Thus, the query requests retrieval of a
+long-term memory with augmentations that satisfy **ALL** of the following
+requirements:
-- Attribute name with ANY value
-- Attribute foo with value equal to that of variable `` at the time this rule fires
-- Attribute associate with value that is the same long-term identifier as that linked to
-by the `` STI at the time this rule fires
-- Attribute age with integer value 25
+- Attribute `name` with ANY value
+- Attribute `foo` with value equal to that of variable `<foo>` at the time this
+  rule fires
+- Attribute `associate` with value that is the same long-term identifier as
+  that linked to by the `<lti>` STI at the time this rule fires
+- Attribute `age` with integer value 25
-If no long-term identifier satisfies ALL of these requirements, an error is returned:
+If no long-term identifier satisfies **ALL** of these requirements, an error is returned:
```Soar
<s> ^smem.result.failure <cue>
@@ -310,88 +345,102 @@ Otherwise, two WMEs are added:
<s> ^smem.result.retrieved <retrieved-lti>
```
-The result `` will be a new short-term identifier linked to the result LTI.
+The result `<retrieved-lti>` will be a new short-term identifier linked to the
+result LTI.
-As with non-cue-based retrievals, all of the augmentations of the long-term identifier in
-semantic memory are added as new WMEs to working memory. If these augmentations
-include other LTIs in smem, they too are instantiated into new short-term identifiers in
-working memory.
+As with non-cue-based retrievals, all of the augmentations of the long-term
+identifier in semantic memory are added as new WMEs to working memory. If these
+augmentations include other LTIs in smem, they too are instantiated into new
+short-term identifiers in working memory.
-It is possible that multiple long-term identifiers match the cue equally well. In this case, se-
-mantic memory will retrieve the long-term identifier that was most recently stored/retrieved.
-(More accurately, it will retrieve the LTI with the greatest activation value. See below.)
+It is possible that multiple long-term identifiers match the cue equally well.
+In this case, semantic memory will retrieve the long-term identifier that was
+most recently stored/retrieved. (More accurately, it will retrieve the LTI with
+the greatest activation value. See below.)
The cue-based retrieval process can be further tempered using optional modifiers:
-The prohibit command requires that the retrieved long-term identifier is not equal
-to that linked with the supplied long-term identifier:
+- The prohibit command requires that the retrieved long-term identifier is not
+ equal to that linked with the supplied long-term identifier:
-```
- ^smem.command.prohibit
-```
+ ```Soar
+    <s> ^smem.command.prohibit <lti>
+ ```
+  Multiple prohibit command WMEs may be issued as modifiers to a single
+  cue-based retrieval. This method can be used to iterate over all matching
+  long-term identifiers (see the sketch after this list).
-Multiple prohibit command WMEs may be issued as modifiers to a single cue-based
-retrieval. This method can be used to iterate over all matching long-term identifiers.
+- The neg-query command requires that the retrieved long-term identifier does
+ NOT contain a set of attributes/attribute-value pairs:
-The neg-query command requires that the retrieved long-term identifier does NOT
-contain a set of attributes/attribute-value pairs:
+ ```Soar
+    <s> ^smem.command.neg-query <negative-cue>
+ ```
-```Soar
- ^smem.command.neg-query
-```
+  The syntax of this command is identical to that of the regular/positive
+  query command.
-The syntax of this command is identical to that of regular/ positive query command.
+- The math-query command requires that the retrieved long-term identifier
+  contains an attribute-value pair that meets a specified mathematical
+  condition. This condition can either be a conditional query or a superlative
+  query. Conditional queries are of the format:
-The math-query command requires that the retrieved long term identifier contains
-an attribute value pair that meets a specified mathematical condition. This condition
-can either be a conditional query or a superlative query.
-Conditional queries are of the format:
+ ```Soar
+    <s> ^smem.command.math-query.<attribute>.<condition-name> <value>
+ ```
-```Soar
- ^smem.command.math-query..
-```
+ Superlative queries do not use a value argument and are of the format:
-Superlative queries do not use a value argument and are of the format:
+ ```Soar
+    <s> ^smem.command.math-query.<attribute>.<condition-name>
+ ```
-```Soar
- ^smem.command.math-query..
-```
+ Values used in math queries must be integer or float type values. Currently
+ supported condition names are:
-Values used in math queries must be integer or float type values. Currently supported
-condition names are:
+ - `less` A value less than the given argument
+ - `greater` A value greater than the given argument
+ - `less-or-equal` A value less than or equal to the given argument
+ - `greater-or-equal` A value greater than or equal to the given argument
+ - `max` The maximum value for the attribute
+ - `min` The minimum value for the attribute
-- less A value less than the given argument
-- greater A value greater than the given argument
-- less-or-equal A value less than or equal to the given argument
-- greater-or-equal A value greater than or equal to the given argument
-- max The maximum value for the attribute
-- min The minimum value for the attribute
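+
+As a sketch of the prohibit modifier mentioned above, the following rule issues
+a query while excluding a previously retrieved LTI (the `^already-retrieved`
+attribute is hypothetical bookkeeping the agent would maintain itself):
+
+```Soar
+sp {smem*sample*query*prohibit
+    (state <s> ^smem.command <cmd>
+               ^already-retrieved <seen>)
+-->
+    # exclude the LTI linked to <seen> from matching this cue
+    (<cmd> ^query <q>
+           ^prohibit <seen>)
+    (<q> ^name <any-name>
+         ^age 25)
+}
+```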
+#### Activation
-#### Activation
+When an agent issues a cue-based retrieval and multiple LTIs match the cue, the
+LTI which semantic memory provides to working memory as the result is the LTI
+which not only matches the cue, but also has the highest `activation` value.
+Semantic memory has several activation methods available for this purpose.
-When an agent issues a cue-based retrieval and multiple LTIs match the cue, the LTI which
-semantic memory provides to working memory as the result is the LTI which not only
-matches the cue, but also has the highest **activation** value. Semantic memory has several
-activation methods available for this purpose.
+The simplest activation methods are `recency` and `frequency` activation.
+Recency activation attaches a time-stamp to each LTI and records the time of
+last retrieval. Using recency activation, the LTI which matches the cue and was
+also most-recently retrieved is the one which is returned as the result for a
+query. Frequency activation attaches a counter to each LTI and records the
+number of retrievals for that LTI. Using frequency activation, the LTI which
+matches the cue and also was most frequently used is returned as the result of
+the query. By default, Soar uses recency activation.
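+
+The method is selected via the `activation-mode` parameter. As a sketch, and
+assuming `recency` and `frequency` are accepted values alongside the
+`base-level` setting shown below, frequency activation could be enabled with:
+
+```bash
+smem --set activation-mode frequency
+```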
-The simplest activation methods are **recency** and **frequency** activation. Recency activa-
-tion attaches a time-stamp to each LTI and records the time of last retrieval. Using recency
-activation, the LTI which matches the cue and was also most-recently retrieved is the one
-which is returned as the result for a query. Frequency activation attaches a counter to each
-LTI and records the number of retrievals for that LTI. Using frequency activation, the LTI
-which matches the cue and also was most frequently used is returned as the result of the
-query. By default, Soar uses recency activation.
+**Base-level activation** can be thought of as a mixture of both recency and
+frequency. Soar makes use of the following equation (known as the Petrov
+approximation) for calculating base-level activation:
-**Base-level activation** can be thought of as a mixture of both recency and frequency.
-Soar makes use of the following equation (known as the Petrov approximation^2 ) for calculating base-level activation:
+???+ info
+ Petrov, Alexander A. “Computationally efficient approximation of the base-level
+ learning equation in ACT-R.” Proceedings of the seventh international
+ conference on cognitive modeling. 2006.
+$$
+BLA = \log \left[ \sum\limits_{i=1}^{k} t_i^{-d} + \dfrac{(n-k)(t_n^{1-d} -
+t_k^{1-d})}{(1-d)(t_n-t_k)} \right]
+$$
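+
+As a hedged numeric illustration (the numbers are invented for clarity):
+suppose an LTI has received $n = 3$ boosts, at $2$, $5$, and $9$ decision
+cycles in the past, and all of them are still among the $k$ stored boosts, so
+only the exact sum is needed. With a decay factor of $d = 0.5$, the activation
+would be
+
+$$
+BLA = \log \left( 2^{-0.5} + 5^{-0.5} + 9^{-0.5} \right) \approx \log(1.49)
+\approx 0.40
+$$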
-
-where n is the number of activation boosts, tnis the time since the first boost, tkis the time
-of the kth boost, dis the decay factor, and kis the number of recent activation boosts which
-are stored. (In Soar,kis hard-coded to 10.) To use base-level activation, use the following
-CLI command when sourcing an agent:
+where $n$ is the number of activation boosts, $t_n$ is the time since the first
+boost, $t_k$ is the time of the $k$th boost, $d$ is the decay factor, and $k$
+is the number of recent activation boosts which are stored. (In Soar, $k$ is
+hard-coded to $10$.) To use base-level activation, use the following CLI
+command when sourcing an agent:
```bash
smem --set activation-mode base-level
@@ -402,29 +451,35 @@ activation beyond the previous methods. First, spreading activation requires tha
base-level activation is also being used. They are considered additive. This
value does not represent recency or frequency of use, but rather
context-relatedness. Spreading activation increases the activation of LTIs which
-are linked to by identifiers currently present in working memory.^3 Such LTIs
-may be thought of as spreading sources.
-
-Spreading activation values spread according to network structure. That is, spreading sources
-will add to the spreading activation values of any of their child LTIs, according to the directed
-graph structure with in smem(not working memory). The amount of spread is controlled by
-the
-**spreading-continue-probability** parameter. By default this value is set to0.9.
-This would mean that90%of an LTI’s spreading activation value would be divided among
-its direct children (without subtracting from its own value). This value is multiplicative with
-depth. A "grandchild" LTI, connected at a distance of two from a source LTI, would receive
-spreading according to 0. 9 × 0 .9 = 0.81 of the source spreading activation value.
-
-Spreading activation values are updated each decision cycle only as needed for specific
-smem retrievals. For efficiency, two limits exist for the amount of spread calculated. The
-**spreading-limit** parameter limits how many LTIs can receive spread from a given
-spreading source LTI. By default, this value is ( 300 ). Spread is distributed in a magnitude-
-first manner to all descendants of a source. (Without edge-weights, this simplifies to breadth-
-first.) Once the number of LTIs that have been given spread from a given source reaches
-the max value indicated by spreading-limit, no more is calculated for that source that
-update cycle, and the next spreading source’s contributions are calculated. The maximum
-depth of descendants that can receive spread contributions from a source is similarly given
-by the **spreading-depth-limit** parameter. By default, this value is ( 10 ).
+are linked to by identifiers currently present in working memory. Such LTIs
+may be thought of as _spreading sources_.
+
+???+ info
+ Specifically, linked to by STIs that have augmentations.
+
+Spreading activation values spread according to network structure. That is,
+spreading sources will add to the spreading activation values of any of their
+child LTIs, according to the directed graph structure within smem (not working
+memory). The amount of spread is controlled by the
+`spreading-continue-probability` parameter. By default this value is set to
+0.9. This means that $90\%$ of an LTI’s spreading activation value would be
+divided among its direct children (without subtracting from its own value).
+This value is multiplicative with depth. A "grandchild" LTI, connected at a
+distance of two from a source LTI, would receive spreading according to
+$0.9 \times 0.9 = 0.81$ of the source spreading activation value.
+
+Spreading activation values are updated each decision cycle only as needed for
+specific smem retrievals. For efficiency, two limits exist for the amount of
+spread calculated. The `spreading-limit` parameter limits how many LTIs can
+receive spread from a given spreading source LTI. By default, this value is
+300. Spread is distributed in a magnitude-first manner to all descendants of a
+source. (Without edge-weights, this simplifies to breadth-first.) Once the
+number of LTIs that have been given spread from a given source reaches the max
+value indicated by `spreading-limit`, no more is calculated for that source
+that update cycle, and the next spreading source’s contributions are
+calculated. The maximum depth of descendants that can receive spread
+contributions from a source is similarly given by the `spreading-depth-limit`
+parameter. By default, this value is 10.
In order to use spreading activation, use the following command:
@@ -432,89 +487,89 @@ In order to use spreading activation, use the following command:
smem --set spreading on
```
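+
+The spread-related parameters described above can be adjusted with the same
+`smem --set` syntax; the following sketch merely restates the documented
+defaults:
+
+```bash
+smem --set spreading-continue-probability 0.9
+smem --set spreading-limit 300
+smem --set spreading-depth-limit 10
+```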
-(^2) Petrov, Alexander A. "Computationally efficient approximation of the base-level learning equation in
-ACT-R."Proceedings of the seventh international conference on cognitive modeling.2006.
-(^3) Specifically, linked to by STIs that have augmentations.
-
-Also, spreading activation can make use of working memory activation for adjusting edge
-weights and for providing nonuniform initial magnitude of spreading for sources of spread.
-This functionality is optional. To enable the updating of edge-weights, use the command:
+Also, spreading activation can make use of working memory activation for
+adjusting edge weights and for providing nonuniform initial magnitude of
+spreading for sources of spread. This functionality is optional. To enable the
+updating of edge-weights, use the command:
```bash
smem --set spreading-edge-updating on
```
-and to enable working memory activation to modulate the magnitude of spread from sources,
-use the command:
+and to enable working memory activation to modulate the magnitude of spread
+from sources, use the command:
```bash
smem --set spreading-wma-source on
```
-For most use-cases, base-level activation is sufficient to provide an agent with relevant knowl-
-edge in response to a query. However, to provide an agent with more context-relevant results
-as opposed to results based only on historical usage, one must use spreading activation.
+For most use-cases, base-level activation is sufficient to provide an agent
+with relevant knowledge in response to a query. However, to provide an agent
+with more context-relevant results as opposed to results based only on
+historical usage, one must use spreading activation.
### Retrieval with Depth
-For either cue-based or non-cue-based retrieval, it is possible to retrieve a long-term identifier
-with additional depth. Using the **depth** parameter allows the agent to retrieve a greater
-amount of the memory structure than it would have by retrieving not only the long-term
-identifier’s attributes and values, but also by recursively adding to working memory the
-attributes and values of that long-term identifier’s children.
+For either cue-based or non-cue-based retrieval, it is possible to retrieve a
+long-term identifier with additional depth. Using the **depth** parameter
+allows the agent to retrieve a greater amount of the memory structure than it
+otherwise would: not only are the long-term identifier’s attributes and values
+retrieved, but the attributes and values of that long-term identifier’s
+children are also recursively added to working memory.
Depth is an additional command attribute, like query:
```Soar
<s> ^smem.command.query <cue>
-^smem.command.depth
+    ^smem.command.depth <depth>
```
For instance, the following rule uses depth with a cue-based retrieval:
```Soar
- ^smem.command.query
sp {smem*sample*query
-(state ^smem.command
-^input-link.foo )
--->
-( ^query
-^depth 2)
-( ^name
-^foo
-^associate
-^age 25)
+    (state <s> ^smem.command <cmd>
+               ^input-link.foo <foo>)
+    -->
+    (<cmd> ^query <q>
+           ^depth 2)
+    (<q> ^name <any-name>
+         ^foo <foo>
+         ^associate <lti>
+         ^age 25)
}
```
-In the example above and without using depth, the long-term identifier referenced by
+In the example above and without using depth, the long-term identifier
+referenced by
```Soar
^associate <lti>
```
-would not also have its attributes and values be retrieved. With a depth of 2 or more, that
-long-term identifier also has its attributes and values added to working memory.
+would not also have its attributes and values be retrieved. With a depth of 2
+or more, that long-term identifier also has its attributes and values added to
+working memory.
-Depth can incur a large cost depending on the specified depth and the structures stored in
-semantic memory.
+Depth can incur a large cost depending on the specified depth and the
+structures stored in semantic memory.
## Performance
Initial empirical results with toy agents show that semantic memory queries
-carry up to a 40% overhead as compared to comparable rete matching. However, the
-retrieval mechanism implements some basic query optimization: statistics are
-maintained about all stored knowledge. When a query is issued, semantic memory
-re-orders the cue such as to minimize expected query time. Because only perfect
-matches are acceptable, and there is no symbol variablization, semantic memory
-retrievals do not contend with the same combinatorial search space as the rete.
-Preliminary empirical study shows that semantic memory maintains sub-millisecond
-retrieval time for a large class of queries, even in very large stores (millions
-of nodes/edges).
-
-Once the number of long-term identifiers overcomes initial overhead (about 1000 WMEs),
-initial empirical study shows that semantic storage requires far less than 1KB per stored
-WME.
+carry up to a $40\%$ overhead as compared to comparable rete matching.
+However, the retrieval mechanism implements some basic query optimization:
+statistics are maintained about all stored knowledge. When a query is issued,
+semantic memory re-orders the cue so as to minimize expected query time.
+Because only perfect matches are acceptable, and there is no symbol
+variablization, semantic memory retrievals do not contend with the same
+combinatorial search space as the rete. Preliminary empirical study shows that
+semantic memory maintains sub-millisecond retrieval time for a large class of
+queries, even in very large stores (millions of nodes/edges).
+
+Once the number of long-term identifiers overcomes initial overhead (about 1000
+WMEs), initial empirical study shows that semantic storage requires far less
+than 1KB per stored WME.
### Math queries
@@ -528,11 +583,12 @@ iterate over any memory that matches all other involved cues.
### Performance Tweaking
-When using a database stored to disk, several parameters become crucial to performance.
-The first is **lazy-commit** , which controls when database changes are written to disk.
-The default setting (on) will keep all writes in memory and only commit to disk upon re-
-initialization (quitting the agent or issuing the init command). The off setting will write
-each change to disk and thus incurs massive I/O delay.
+When using a database stored to disk, several parameters become crucial to
+performance. The first is **lazy-commit**, which controls when database changes
+are written to disk. The default setting (`on`) will keep all writes in memory
+and only commit to disk upon re-initialization (quitting the agent or issuing
+the `init` command). The `off` setting will write each change to disk and thus
+incurs massive I/O delay.
The next parameter is **thresh**. This has to do with the locality of
storing/updating activation information with semantic augmentations. By default,
@@ -542,7 +598,7 @@ identifiers on demand, and thus retrieval time is independent of cue
selectivity. However, each activation update (such as after a retrieval) incurs
an update cost linear in the number of augmentations. If the number of
augmentations for a long-term identifier is large, this cost can dominate. Thus,
-the thresh parameter sets the upper bound of augmentations, after which
+the `thresh` parameter sets the upper bound of augmentations, after which
activation is stored with the long-term identifier. This allows the user to
establish a balance between cost of updating augmentation activation and the
number of long-term identifiers that must be pre-sorted during a cue-based
@@ -550,22 +606,26 @@ retrieval. As long as the threshold is greater than the number of augmentations
of most long-term identifiers, performance should be fine (as it will bound the
effects of selectivity).
-The next two parameters deal with the SQLite cache, which is a memory store used to speed
-operations like queries by keeping in memory structures like levels of index B+-trees. The
-first parameter, **page-size** , indicates the size, in bytes, of each cache page. The second
-parameter, **cache-size** , suggests to SQLite how many pages are available for the cache.
-Total cache size is the product of these two parameter settings. The cache memory is not pre-
-allocated, so short/small runs will not necessarily make use of this space. Generally speaking,
-a greater number of cache pages will benefit query time, as SQLite can keep necessary meta-
-data in memory. However, some documented situations have shown improved performance
-from decreasing cache pages to increase memory locality. This is of greater concern when
-dealing with file-based databases, versus in-memory. The size of each page, however, may be
-important whether databases are disk- or memory-based. This setting can have far-reaching
-consequences, such as index B+-tree depth. While this setting can be dependent upon a
-particular situation, a good heuristic is that short, simple runs should use small values of
-the page size (1k,2k,4k), whereas longer, more complicated runs will benefit from larger
-values (8k,16k,32k,64k). The episodic memory chapter (see Section 7.4 on page 163) has
-some further empirical evidence to assist in setting these parameters for very large stores.
+The next two parameters deal with the SQLite cache, which is a memory store
+used to speed operations like queries by keeping in memory structures like
+levels of index B+-trees. The first parameter, **page-size**, indicates the
+size, in bytes, of each cache page. The second parameter, **cache-size**,
+suggests to SQLite how many pages are available for the cache. Total cache size
+is the product of these two parameter settings. The cache memory is not
+pre-allocated, so short/small runs will not necessarily make use of this space.
+Generally speaking, a greater number of cache pages will benefit query time, as
+SQLite can keep necessary meta-data in memory. However, some documented
+situations have shown improved performance from decreasing cache pages to
+increase memory locality. This is of greater concern when dealing with
+file-based databases, versus in-memory. The size of each page, however, may be
+important whether databases are disk- or memory-based. This setting can have
+far-reaching consequences, such as index B+-tree depth. While this setting can
+be dependent upon a particular situation, a good heuristic is that short,
+simple runs should use small values of the page size (1k, 2k, 4k), whereas
+longer, more complicated runs will benefit from larger values (8k, 16k, 32k,
+64k). The episodic memory [chapter on performance](07_EpisodicMemory.md#performance)
+has some further empirical evidence to assist in setting these parameters for
+very large stores.
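+
+As a sketch (the values are illustrative, not recommendations, and assume the
+parameters accept the size tokens listed above), a long run against a large
+disk-based store might raise both settings:
+
+```bash
+smem --set page-size 32k
+smem --set cache-size 10000
+```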
The next parameter is **optimization**. The safety parameter setting will use
SQLite default settings. If data integrity is of importance, this setting is
@@ -581,10 +641,11 @@ lock to the database (locking mode pragma), thus other applications/agents
cannot make simultaneous read/write calls to the database (thereby reducing the
need for potentially expensive system calls to secure/release file locks).
-Finally, maintaining accurate operation timers can be relatively expensive in Soar. Thus,
-these should be enabled with caution and understanding of their limitations. First, they
-will affect performance, depending on the level (set via the **timers** parameter). A level
-of three, for instance, times every modification to long-term identifier recency statistics.
-Furthermore, because these iterations are relatively cheap (typically a single step in the
-linked-list of a b+-tree), timer values are typically unreliable (depending upon the system,
-resolution is 1 microsecond or more).
+Finally, maintaining accurate operation timers can be relatively expensive in
+Soar. Thus, these should be enabled with caution and understanding of their
+limitations. First, they will affect performance, depending on the level (set
+via the **timers** parameter). A level of three, for instance, times every
+modification to long-term identifier recency statistics. Furthermore, because
+these operations are relatively cheap (typically a single step in the
+linked list of a B+-tree), timer values are typically unreliable (depending
+upon the system, resolution is 1 microsecond or more).
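+
+For example, assuming the level names follow the `off`/`one`/`two`/`three`
+convention implied by the level-of-three description above, the most detailed
+(and most expensive) timing could be enabled with:
+
+```bash
+smem --set timers three
+```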
diff --git a/docs/soar_manual/Images/rl-optrace.svg b/docs/soar_manual/Images/rl-optrace.svg
index f0bb0470..d72e53fb 100644
--- a/docs/soar_manual/Images/rl-optrace.svg
+++ b/docs/soar_manual/Images/rl-optrace.svg
@@ -1,9 +1,8 @@
-
-
+
diff --git a/docs/soar_manual/Images/smem-concept.svg b/docs/soar_manual/Images/smem-concept.svg
new file mode 100644
index 00000000..5898468c
--- /dev/null
+++ b/docs/soar_manual/Images/smem-concept.svg
@@ -0,0 +1,109 @@
+
+
+
+
+
diff --git a/docs/tutorials/soar_tutorial/05.md b/docs/tutorials/soar_tutorial/05.md
index 96b2809f..70ad35c9 100644
--- a/docs/tutorials/soar_tutorial/05.md
+++ b/docs/tutorials/soar_tutorial/05.md
@@ -1,3 +1,4 @@
+
{{tutorial_wip_warning("Soar.Tutorial.Part.5.-.Planning.and.Learning.pdf")}}
# Part V: Planning and Learning
@@ -73,25 +74,25 @@ without the indifferent preference.
```Soar
sp {water-jug*propose*fill
- (state ^name water-jug
- ^jug )
- ( ^empty > 0)
--->
- ( ^operator +)
- ( ^name fill
- ^jug )}
+    (state <s> ^name water-jug
+               ^jug <j>)
+    (<j> ^empty > 0)
+    -->
+    (<s> ^operator <o> +)
+    (<o> ^name fill
+         ^jug <j>)}
```
Once that indifferent preference is removed, rerun your program and step
through the first decision. You will discover that Soar automatically
generates a new substate in situations where the operator preferences
are insufficient to pick a single operator. This is called a _tie
-impasse._
+impasse_.
When a tie impasse arises, a new substate is created that is very
similar to the substate created in response to an operator no-change.
Print out the substate and examine the augmentations. The key
-differences are that an operator tie has `^choices multiple `, `^impasse
+differences are that an operator tie has `^choices multiple`, `^impasse
tie`, and `^item` augmentations for each of the tied operators.
For the operator no-change impasses you used in TankSoar, the goal was
@@ -139,7 +140,7 @@ Soar operators/rules that carry out the evaluation and comparison of
operators using this approach. This set of generic rules can be used in
a wide variety of problems for simple planning and they are available as
part of the Soar release. These rules are part of the default rules that
-come with the Soar release in the file Agents/default/selection.soar.
+come with the Soar release in the file `Agents/default/selection.soar`.
You should load these rules when you load in the Water-jug rules. You
can do this by adding the following commands to your files (assuming
that your program is in a subdirectory of the Agents directory):
@@ -184,7 +185,7 @@ but those do not have to be represented in the Selection problem.
If the evaluations had only a single value, namely the evaluation, then
they could be represented as simple augmentations of the state, such as:
-`^evaluation` success. However, the evaluations must also include the task
+`^evaluation success`. However, the evaluations must also include the task
operator that the evaluation refers to. Moreover, we will find it useful
to have different types of evaluations that can be compared in different
ways, such as symbolic evaluations (success, failure) or numeric
@@ -196,11 +197,12 @@ Therefore, the state should consist of a set of evaluation
augmentations, with each evaluation object having the following
structure:
-- operator `` the identifier of the task operator being evaluated
-- symbolic-value success/partial-success/partial-failure/failure/indifferent
-- numeric-value `[number]`
-- value true indicates that there is either a symbolic or numeric value.
-- desired `` the identifier of a desired state to use for evaluation if there is one
+- `operator <o>` the identifier of the task operator being evaluated
+- `symbolic-value success/partial-success/partial-failure/failure/indifferent`
+- `numeric-value [number]`
+- `value true` indicates that there is either a symbolic or numeric value.
+- `desired <d>` the identifier of a desired state to use for evaluation if
+  there is one
Another alternative to creating evaluations on the state is to create
evaluations on the operators themselves. The advantage of creating
@@ -219,12 +221,11 @@ but will allow us to test for the name of the state instead of the
impasse type in all of the remaining rules.
```Soar
-sp {default\*selection*elaborate*name
-
-:default
-(state ^type tie)
+sp {default*selection*elaborate*name
+    :default
+    (state <s> ^type tie)
-->
-( ^name selection)}
+    (<s> ^name selection)}
```
This rule uses a new bit of syntax, the :default, which tells Soar that
@@ -241,12 +242,12 @@ does not have an evaluation with a value. The operator will first create
an evaluation and later compute the value.
```Soar
-Selection*propose*evaluate-operator
+selection*propose*evaluate-operator
```
-If the state is named selection and there is an item that does not have
-an evaluation with a value, then propose the evaluate-operator for that
-item.
+> If the state is named selection and there is an item that does not have
+> an evaluation with a value, then propose the evaluate-operator for that
+> item.
The tricky part of translating this into a rule is the test that there
is an item without an evaluation with a value. The item can be matched
@@ -259,16 +260,16 @@ match only if there does not exist an evaluate with a value, giving:
```Soar
sp {selection*propose*evaluate-operator
- :default
- (state ^name selection
- ^item )
- -{(state ^evaluation )
- ( ^operator
- ^value true)}
+    :default
+    (state <s> ^name selection
+               ^item <i>)
+   -{(state <s> ^evaluation <e>)
+     (<e> ^operator <i>
+          ^value true)}
-->
    (<s> ^operator <o> +, =)
    (<o> ^name evaluate-operator
- ^operator )}
+         ^operator <i>)}
```
Given these conditions, once an evaluate-operator operator is selected,
@@ -296,14 +297,14 @@ application simpler. The end result is the following:
```Soar
( ^evaluation
- ^operator )
+ ^operator )
( ^superoperator
- ^desired )
+ ^desired