diff --git a/docs/soar_manual/05_ReinforcementLearning.md b/docs/soar_manual/05_ReinforcementLearning.md index f49665a3..7f1d959f 100644 --- a/docs/soar_manual/05_ReinforcementLearning.md +++ b/docs/soar_manual/05_ReinforcementLearning.md @@ -1,3 +1,4 @@ + {{manual_wip_warning}} # Reinforcement Learning @@ -7,81 +8,91 @@ knowledge based on a given reward function. This chapter describes the RL mechanism and how it is integrated with production memory, the decision cycle, and the state stack. We assume that the reader is familiar with basic reinforcement learning concepts and notation. If not, we recommend first -reading *Reinforcement Learning: An Introduction* (1998) by Richard S. Sutton and +reading _Reinforcement Learning: An Introduction_ (1998) by Richard S. Sutton and Andrew G. Barto. The detailed behavior of the RL mechanism is determined by -numerous parameters that can be controlled and configured via the `rl` command. -Please refer to the documentation for that command in section 9.4.2 on page 238. +numerous parameters that can be controlled and configured via the +[`rl` command](../reference/cli/cmd_rl.md). ## RL Rules -Soar’s RL mechanism learns Q-values for state-operator[1](#footnote1) -pairs. Q-values are stored as numeric-indifferent preferences created by -specially formulated productions called **RL rules**. RL rules are identified +???+ info + In this context, the term "state" refers to the state of the task or + environment, not a state identifier. For the rest of this chapter, bold capital + letter names such as `S1` will refer to identifiers and italic lowercase names + such as $s_1$ will refer to task states. + +Soar’s RL mechanism learns $Q$-values for state-operator +pairs. $Q$-values are stored as numeric-indifferent preferences created by +specially formulated productions called **RL rules**. RL rules are identified by syntax. A production is a RL rule if and only if its left hand side tests for a proposed operator, its right hand side creates a single numeric-indifferent -preference, and it is not a template rule (see Section 5.4.2 for template -rules). These constraints ease the technical requirements of identifying/ -updating RL rules and makes it easy for the agent programmer to add/ maintain RL -capabilities within an agent. We define an -**RL operator** as an operator with numeric-indifferent preferences created by -RL rules. +preference, and it is not a template rule (see +[rule templates](#rule-templates)). These constraints ease the technical +requirements of identifying/ updating RL rules and makes it easy for the agent +programmer to add/ maintain RL capabilities within an agent. We define an **RL +operator** as an operator with numeric-indifferent preferences created by RL +rules. The following is an RL rule: ```Soar sp {rl*3*12*left - (state ^name task-name - ^x 3 - ^y 12 - ^operator +) - ( ^name move - ^direction left) ---> -( ^operator = 1.5) + (state ^name task-name + ^x 3 + ^y 12 + ^operator +) + ( ^name move + ^direction left) + --> + ( ^operator = 1.5) } ``` - - Note that the LHS of the rule can test for anything as long as it contains a test for a proposed operator. The RHS is constrained to exactly one action: creating a numeric-indifferent preference for the proposed operator. The following are not RL rules: -```Soar +```Soar hl_lines="4" sp {multiple*preferences -(state ^operator +) ---> -( ^operator = 5, >) + (state ^operator +) + --> + ( ^operator = 5, >) # (1) } +``` + +1. 
Proposes multiple preferences for the proposed operator and thus does not
+   comply with the rule format

+```Soar hl_lines="5"
 sp {variable*binding
-(state <s> ^operator <o> +
-^value <v>)
--->
-(<s> ^operator <o> = <v>)
+    (state <s> ^operator <o> +
+               ^value <v>)
+    -->
+    (<s> ^operator <o> = <v>) # (1)
 }
+```
+
+1. Does not provide a constant for the numeric-indifferent preference value

+```Soar hl_lines="5"
 sp {invalid*actions
-(state <s> ^operator <o> +)
--->
-(<s> ^operator <o> = 5)
-(write (crlf) |This is not an RL rule.|)
+    (state <s> ^operator <o> +)
+    -->
+    (<s> ^operator <o> = 5)
+    (write (crlf) |This is not an RL rule.|) # (1)
 }
 ```

-The first rule proposes multiple preferences for the proposed operator and thus
-does not comply with the rule format. The second rule does not comply because it
-does not provide a constant for the numeric-indifferent preference value. The
-third rule does not comply because it includes a RHS function action in addition
-to the numeric-indifferent preference action.

+1. Includes an RHS function action in addition to the numeric-indifferent
+   preference action.

In the typical RL use case, the user intends for the agent to learn the best
operator in each possible state of the environment. The most straightforward way
to achieve this is to give the agent a set of RL rules, each matching exactly
-one possible state-operator pair. This approach is equivalent to a table-based
-RL algorithm, where the Q-value of each state- operator pair corresponds to the
+one possible state-operator pair. This approach is equivalent to a table-based
+RL algorithm, where the $Q$-value of each state-operator pair corresponds to the
numeric-indifferent preference created by exactly one RL rule.

In the more general case, multiple RL rules can match a single state-operator
@@ -91,26 +102,34 @@ memory context, and multiple rules can modify the preferences for a single
operator, and a single rule can be instantiated multiple ways to modify
preferences for multiple operators. For RL in Soar, all numeric-indifferent
preferences for an operator are summed when calculating the operator’s
-Q-value[2](#footnote2). In this context, RL rules can be interpreted more generally as binary
-features in a linear approximator of each state-operator pair’s Q-value, and
-their numeric-indifferent preference values their weights. In other words,
+$Q$-value.
+
+???+ info
+    This is assuming the value of
+    [**numeric-indifferent-mode**](../reference/cli/cmd_decide.md#decide-numeric-indifferent-mode)
+    is set to **sum**. In general, the RL mechanism only works correctly when
+    this is the case, and we assume this case in the rest of the chapter.
+
+In this context, RL rules can be interpreted
+more generally as binary features in a linear approximator of each
+state-operator pair’s $Q$-value, and their numeric-indifferent preference values
+are their weights. In other words,

$$Q(s, a) = w_1 \phi_1 (s, a) + w_2 \phi_2 (s, a) + \ldots + w_n \phi_n (s, a)$$

where all RL rules in production memory are numbered $1 \dots n$, $Q(s, a)$ is
-the Q-value of the state-operator pair $(s, a)$, $w_i$ is the
+the $Q$-value of the state-operator pair $(s, a)$, $w_i$ is the
numeric-indifferent preference value of RL rule $i$, $\phi_i (s, a) = 0$ if RL
-rule $i$ does not match $(s, a)$, and $\phi_i (s, a) = 1$ if it does. This
+rule $i$ does not match $(s, a)$, and $\phi_i (s, a) = 1$ if it does. This
interpretation allows RL rules to simulate a number of popular function
approximation schemes used in RL such as tile coding and sparse coding.
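The following is a small illustrative sketch (not taken from the manual; the
rule names, the `^x`/`^y` tests, and the zero initial values are arbitrary) of
how overlapping RL rules can act as coarse-coded features. Each rule tests only
one of the two coordinates used by `rl*3*12*left` above, so in the state with
$x = 3$ and $y = 12$ both rules match the same proposed operator and their
numeric-indifferent preferences are summed into a single $Q$-value:

```Soar
# Hypothetical coarse-coded RL rules: each tests a single state dimension,
# so both match the same state-operator pair and their values are summed.
sp {rl*x*3*left
    (state <s> ^x 3
               ^operator <o> +)
    (<o> ^name move
         ^direction left)
    -->
    (<s> ^operator <o> = 0.0)
}

sp {rl*y*12*left
    (state <s> ^y 12
               ^operator <o> +)
    (<o> ^name move
         ^direction left)
    -->
    (<s> ^operator <o> = 0.0)
}
```

Because each rule generalizes over one dimension, its learned weight is shared
by every state that agrees on that dimension, which is the kind of function
approximation described above.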
- ## Reward Representation RL updates are driven by reward signals. In Soar, these reward signals are given to the RL mechanism through a working memory link called the **reward-link**. Each state in Soar’s state stack is automatically populated with -a reward-link structure upon creation. Soar will check each structure for a +a `reward-link` structure upon creation. Soar will check each structure for a numeric reward signal for the last operator executed in the associated state at the beginning of every decision phase. Reward is also collected when the agent is halted or a state is retracted. @@ -122,25 +141,24 @@ In order to be recognized, the reward signal must follow this pattern: ( ^value [val]) ``` -where `` is the reward-link identifier, `` is some intermediate +where `` is the `reward-link` identifier, `` is some intermediate identifier, and `[val]` is any constant numeric value. Any structure that does not match this pattern is ignored. If there are multiple valid reward signals, their -values are summed into a single reward signal. As an example, consider the +values are summed into a single reward signal. As an example, consider the following state: ```Soar (S1 ^reward-link R1) -(R1 ^reward R2) -(R2 ^value 1.0) -(R1 ^reward R3) -(R3 ^value -0.2) + (R1 ^reward R2) + (R2 ^value 1.0) + (R1 ^reward R3) + (R3 ^value -0.2) ``` In this state, there are two reward signals with values 1.0 and -0.2. They will be summed together for a total reward of 0.8 and this will be the value given to the RL update algorithm. - There are two reasons for requiring the intermediate identifier. The first is so that multiple reward signals with the same value can exist simultaneously. Since working memory is a set, multiple WMEs with identical values in all three @@ -148,21 +166,21 @@ positions (identifier, attribute, value) cannot exist simultaneously. Without an intermediate identifier, specifying two rewards with the same value would require a WME structure such as - ```Soar (S1 ^reward-link R1) -(R1 ^reward 1.0) -(R1 ^reward 1.0) + (R1 ^reward 1.0) + (R1 ^reward 1.0) ``` -which is invalid. With the intermediate identifier, the rewards would be specified as +which is invalid. With the intermediate identifier, the rewards would be +specified as ```Soar (S1 ^reward-link R1) -(R1 ^reward R2) -(R2 ^value 1.0) -(R1 ^reward R3) -(R3 ^value 1.0) + (R1 ^reward R2) + (R2 ^value 1.0) + (R1 ^reward R3) + (R3 ^value 1.0) ``` which is valid. The second reason for requiring an intermediate identifier in @@ -173,23 +191,23 @@ or programmer. For example: ```Soar (S1 ^reward-link R1) -(R1 ^reward R2) -(R2 ^value 1.0) -(R2 ^source environment) -(R1 ^reward R3) -(R3 ^value -0.2) -(R3 ^source intrinsic) -(R3 ^duration 5) + (R1 ^reward R2) + (R2 ^value 1.0) + (R2 ^source environment) + (R1 ^reward R3) + (R3 ^value -0.2) + (R3 ^source intrinsic) + (R3 ^duration 5) ``` The `(R2 ^source environment)`,`(R3 ^source intrinsic)`, and `(R3 ^duration 5)` WMEs are arbitrary and ignored by RL, but were added by the agent to keep track of where the rewards came from and for how long. -Note that the reward-link is not part of the io structure and is not modified +Note that the `reward-link` is not part of the io structure and is not modified directly by the environment. Reward information from the environment should be -copied, via rules, from the input-link to the reward-link. 
Also note that when
-collecting rewards, Soar simply scans the reward-link and sums the values of all
+copied, via rules, from the `input-link` to the `reward-link`. Also note that when
+collecting rewards, Soar simply scans the `reward-link` and sums the values of all
valid reward WMEs. The WMEs are not modified and no bookkeeping is done to keep
track of previously seen WMEs. This means that reward WMEs that exist for
multiple decision cycles will be collected multiple times if not removed or
@@ -201,236 +219,279 @@ Soar’s RL mechanism is integrated naturally with the decision cycle and performs
online updates of RL rules. Whenever an RL operator is selected, the values of
the corresponding RL rules will be updated. The update can be on-policy (Sarsa)
or off-policy (Q-Learning), as controlled by the **learning-policy** parameter
-of the rl command. (See page 238.) Let $\delta_t$ be the amount of change for the
-Q-value of an RL operator in a single update. For Sarsa, we have
+of the [`rl`](../reference/cli/cmd_rl.md) command. Let $\delta_t$ be the amount
+of change for the $Q$-value of an RL operator in a single update. For Sarsa, we
+have

-$$ \delta_t = \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)
-\right] $$
+$$
+\delta_t = \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)
+\right]
+$$

where

-- $Q(s_t, a_t)$ is the Q-value of the state and chosen operator in decision cycle $t$.
-- $Q(s_{t+1}, a_{t+1})$ is the Q-value of the state and chosen RL operator in the next decision cycle.
-- $r_{t+1}$ is the total reward collected in the next decision cycle.
-- $\alpha$ and $\gamma$ are the settings of the learning-rate and discount-rate parameters of the `rl` command, respectively.
+- $Q(s_t, a_t)$ is the $Q$-value of the state and chosen operator in decision
+  cycle $t$.
+- $Q(s_{t+1}, a_{t+1})$ is the $Q$-value of the state and chosen RL operator in
+  the next decision cycle.
+- $r_{t+1}$ is the total reward collected in the next decision cycle.
+- $\alpha$ and $\gamma$ are the settings of the `learning-rate` and
+  `discount-rate` parameters of the `rl` command, respectively.

Note that since $\delta_t$ depends on $Q(s_{t+1}, a_{t+1})$, the update for the
operator selected in decision cycle $t$ is not applied until the next RL
-operator is chosen. For Q-Learning, we have
-$$ \delta_t = \alpha \left[ r_{t+1} + \gamma \underset{a \in A_{t+1}}{\max} Q(s_{t+1}, a) - Q(s_t, a_t) \right] $$
+operator is chosen. For Q-Learning, we have
+
+$$
+\delta_t = \alpha \left[ r_{t+1} + \gamma \underset{a \in A_{t+1}}{\max}
+Q(s_{t+1}, a) - Q(s_t, a_t) \right]
+$$
+
where $A_{t+1}$ is the set of RL operators proposed in the next decision cycle.
-Finally, $\delta_t$ is divided by the number of RL rules comprising the Q-value
+Finally, $\delta_t$ is divided by the number of RL rules comprising the $Q$-value
for the operator, and the numeric-indifferent value of each RL rule is updated
by that amount. An example walkthrough of a Sarsa update with $\alpha = 0.3$ and
$\gamma = 0.9$ (the default settings in Soar) follows.

+1. In decision cycle $t$, an operator `O1` is proposed, and RL rules `rl-1`
+   and `rl-2` create the following numeric-indifferent preferences for it:
-1. In decision cycle $t$, an operator `O1` is proposed, and RL rules **rl-1**
-   and **rl-2** create the following numeric-indifferent preferences for it:
-   ```
-   rl-1: (S1 ^operator O1 = 2.3)
-   rl-2: (S1 ^operator O1 = -1)
-   ```
-   The Q-value for `O1` is $Q(s_t, \textbf{O1}) = 2.3 - 1 = 1.3$.
+ ```Soar + rl-1: (S1 ^operator O1 = 2.3) + rl-2: (S1 ^operator O1 = -1) + ``` -2. `O1` is selected and executed, so $Q(s_t, a_t) = Q(s_t, \textbf{O1}) = 1.3$. + The $Q$-value for `O1` is $Q(s_t, \textbf{O1}) = 2.3 - 1 = 1.3$. -3. In decision cycle $t+1$, a total reward of 1.0 is collected on the - reward-link, an operator O2is proposed, and another RL rule **rl-3** creates the - following numeric-indifferent preference for it: - ``` - rl-3: (S1 ^operator O2 = 0.5) - ``` +2. `O1` is selected and executed, so $Q(s_t, a_t) = Q(s_t, \textbf{O1}) = 1.3$. - So $Q(s_{t+1}, \textbf{O2}) = 0.5$. +3. In decision cycle $t+1$, a total reward of 1.0 is collected on the + `reward-link`, an operator `O2` is proposed, and another RL rule `rl-3` + creates the following numeric-indifferent preference for it: -4. `O2` is selected, so $Q(s_{t+1}, a_{t+1}) = Q(s_{t+1}, \textbf{O2}) = 0.5$ Therefore, + ```Soar + rl-3: (S1 ^operator O2 = 0.5) + ``` -$$\delta_t = \alpha \left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] = 0.3 \times [ 1.0 + 0.9 \times 0.5 - 1.3 ] = 0.045$$ + So $Q(s_{t+1}, \textbf{O2}) = 0.5$. -Since rl-1 and rl-2 both contributed to the Q-value of O1, $\delta_t$ is evenly divided -amongst them, resulting in updated values of +4. `O2` is selected, so $Q(s_{t+1}, a_{t+1}) = Q(s_{t+1}, \textbf{O2}) = 0.5$ Therefore, -``` -rl-1: ( ^operator = 2.3225) -rl-2: ( ^operator = -0.9775) -``` + $$ + \delta_t = \alpha \left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, + a_t) \right] = 0.3 \times [ 1.0 + 0.9 \times 0.5 - 1.3 ] = 0.045 + $$ + + Since `rl-1` and `rl-2` both contributed to the $Q$-value of `O1`, + $\delta_t$ is evenly divided amongst them, resulting in updated values of -5. **rl-3** will be updated when the next RL operator is selected. + ```Soar + rl-1: ( ^operator = 2.3225) + rl-2: ( ^operator = -0.9775) + ``` + +5. **rl-3** will be updated when the next RL operator is selected. ### Gaps in Rule Coverage -The previous description had assumed that RL operators were selected in both decision -cyclestandt+ 1. If the operator selected int+ 1 is not an RL operator, thenQ(st+1,at+1) -would not be defined, and an update for the RL operator selected at timetwill be undefined. -We will call a sequence of one or more decision cycles in which RL operators are not selected -between two decision cycles in which RL operators are selected agap. Conceptually, it is -desirable to use the temporal difference information from the RL operator after the gap to -update the Q-value of the RL operator before the gap. There are no intermediate storage -locations for these updates. Requiring that RL rules support operators at every decision -can be difficult for agent programmers, particularly for operators that do not represent steps -in a task, but instead perform generic maintenance functions, such as cleaning processed -output-link structures. - -To address this issue, Soar's RL mechanism supports automatic propagation of updates over gaps. -For a gap of length $n$, the Sarsa update is -$$\delta_t = \alpha \left[ \sum_{i=t}^{t+n}{\gamma^{i-t} r_i} + \gamma^{n+1} Q(s_{t+n+1}, a_{t+n+1}) - Q(s_t, a_t) \right]$$ +The previous description had assumed that RL operators were selected in both +decision cycles $t$ and $t+1$. If the operator selected in $t+1$ is not an RL operator, +then $Q(s_{t+1}, a_{t+1})$ would not be defined, and an update for the RL operator +selected at time $t$ will be undefined. 
We will call a sequence of one or more
+decision cycles in which RL operators are not selected between two decision
+cycles in which RL operators are selected a _gap_. Conceptually, it is desirable
+to use the temporal difference information from the RL operator after the gap
+to update the $Q$-value of the RL operator before the gap. There are no
+intermediate storage locations for these updates. Requiring that RL rules
+support operators at every decision can be difficult for agent programmers,
+particularly for operators that do not represent steps in a task, but instead
+perform generic maintenance functions, such as cleaning processed output-link
+structures.
+
+To address this issue, Soar's RL mechanism supports automatic propagation of
+updates over gaps. For a gap of length $n$, the Sarsa update is
+
+$$
+\delta_t = \alpha \left[ \sum_{i=t}^{t+n}{\gamma^{i-t} r_i} + \gamma^{n+1}
+Q(s_{t+n+1}, a_{t+n+1}) - Q(s_t, a_t) \right]
+$$
+
and the Q-Learning update is

-$$\delta_t = \alpha \left[ \sum_{i=t}^{t+n}{\gamma^{i-t} r_i} + \gamma^{n+1} \underset{a \in A_{t+n+1}}{\max} Q(s_{t+n+1}, a) - Q(s_t, a_t) \right]$$
-Note that rewards will still be collected during the gap, but they are discounted based on the number of decisions they are removed from the initial RL operator.

+$$
+\delta_t = \alpha \left[ \sum_{i=t}^{t+n}{\gamma^{i-t} r_i} + \gamma^{n+1}
+\underset{a \in A_{t+n+1}}{\max} Q(s_{t+n+1}, a) - Q(s_t, a_t) \right]
+$$

-Gap propagation can be disabled by setting the **temporal-extension** parameter of the
-rl command to off. When gap propagation is disabled, the RL rules preceding a gap are
-updated usingQ(st+1,at+1) = 0. The rl setting of the watch command (see Section 9.6.1
-on page 259) is useful in identifying gaps.

+Note that rewards will still be collected during the gap, but they are
+discounted based on the number of decisions they are removed from the initial
+RL operator.

-![Example Soar substate operator trace.](Images/rl-optrace.svg)

+Gap propagation can be disabled by setting the **temporal-extension** parameter
+of the [`rl` command](../reference/cli/cmd_rl.md) to `off`. When gap propagation
+is disabled, the RL rules preceding a gap are updated using $Q(s_{t+1}, a_{t+1})
+= 0$. The `rl` setting of the [`watch`](../reference/cli/cmd_trace.md) command is
+useful in identifying gaps.

### RL and Substates

-When an agent has multiple states in its state stack, the RL mechanism will treat each
-substate independently. As mentioned previously, each state has its own reward-link.
-When an RL operator is selected in a stateS, the RL updates for that operator are only
-affected by the rewards collected on the reward-link for Sand the Q-values of subsequent
-RL operators selected inS.

+When an agent has multiple states in its state stack, the RL mechanism will
+treat each substate independently. As mentioned previously, each state has its
+own `reward-link`. When an RL operator is selected in a state `S`, the RL updates
+for that operator are only affected by the rewards collected on the `reward-link`
+for `S` and the $Q$-values of subsequent RL operators selected in `S`.

The only exception to this independence is when a selected RL operator forces
an operator no-change impasse. When this occurs, the number of decision cycles
the RL operator at the superstate remains selected is dependent upon the processing
-in the impasse state. Consider the operator trace in Figure 5.1.
- -- At decision cycle 1, RL operatorO1is selected inS1and causes an -operator-no-change impass for three decision cycles. -- In the substateS2, -operatorsO2,O3, andO4are selected and applied sequentially. -- Meanwhile inS1, -rewardsr 2 ,r 3 , andr 4 are put on thereward-linksequentially. - Finally, the -impasse is resolved by O4, the proposal for O1 is retracted, and RL operatorO5is -selected inS1. +in the impasse state. Consider the operator trace in the following figure: -In this scenario, only the RL update forQ(s 1 ,O1) will be different from the -ordinary case. Its value depends on the setting of the **hrl-discount** -parameter of the rlcommand. When this parameter is set to the default valueon, -the rewards onS1and the Q-value of O5are discounted by the number of decision -cycles they are removed from the selection of O1. In this case the update for $Q(s_1, \textbf{O1})$ is +![Example Soar substate operator trace.](Images/rl-optrace.svg) -$$\delta_1 = \alpha \left[ r_2 + \gamma r_3 + \gamma^2 r_4 + \gamma^3 Q(s_5, \textbf{O5}) - Q(s_1, \textbf{O1}) \right]$$ +- At decision cycle 1, RL operator `O1` is selected in `S1` and causes an + operator-no-change impass for three decision cycles. +- In the substate `S2`, operators `O2`, `O3`, and `O4` are selected and + applied sequentially. +- Meanwhile in `S1`, rewards $r_2$ ,$r_3$ , and $r_4$ are put on the + `reward-link` sequentially. +- Finally, the impasse is resolved by `O4`, the proposal for `O1` is retracted, + and RL operator `O5` is selected in `S1`. + +In this scenario, only the RL update for $Q(s_1, O1)$ will be different from the +ordinary case. Its value depends on the setting of the `hrl-discount` +parameter of the [`rl` command](../reference/cli/cmd_rl.md). When this +parameter is set to the default value on, the rewards on `S1` and the $Q$-value +of `O5` are discounted by the number of decision cycles they are removed from the +selection of `O1`. In this case the update for $Q(s_1, \textbf{O1})$ is + +$$ +\delta_1 = \alpha \left[ r_2 + \gamma r_3 + \gamma^2 r_4 + \gamma^3 Q(s_5, +\textbf{O5}) - Q(s_1, \textbf{O1}) \right] +$$ which is equivalent to having a three decision gap separating `O1` and `O5`. -When hrl-discount is set to off, the number of cycles O1has been impassed will be -ignored. Thus the update would be +When `hrl-discount` is set to `off`, the number of cycles `O1` has been impassed +will be ignored. Thus the update would be -$$\delta_1 = \alpha \left[ r_2 + r_3 + r_4 + \gamma Q(s_5, \textbf{O5}) - Q(s_1, \textbf{O1}) \right]$$ +$$ +\delta_1 = \alpha \left[ r_2 + r_3 + r_4 + \gamma Q(s_5, \textbf{O5}) - +Q(s_1, \textbf{O1}) \right] +$$ For impasses other than operator no-change, RL acts as if the impasse hadn’t -occurred. If O1is the last RL operator selected before the impasse,r 2 the -reward received in the decision cycle immediately following, and On, the first -operator selected after the impasse, thenO1 is updated with +occurred. 
If `O1` is the last RL operator selected before the impasse, $r_2$ +the reward received in the decision cycle immediately following, and $O_n$, the +first operator selected after the impasse, then `O1` is updated with -$$\delta_1 = \alpha \left[ r_2 + \gamma Q(s_n, \textbf{O}_\textbf{n}) - Q(s_1, \textbf{O1}) \right]$$ +$$ +\delta_1 = \alpha \left[ r_2 + \gamma Q(s_n, \textbf{O}_\textbf{n}) - Q(s_1, +\textbf{O1}) \right] +$$ If an RL operator is selected in a substate immediately prior to the state’s retraction, the RL rules will be updated based only on the reward signals -present and not on the Q-values of future operators. This point is not covered +present and not on the $Q$-values of future operators. This point is not covered in traditional RL theory. The retraction of a substate corresponds to a suspension of the RL task in that state rather than its termination, so the last update assumes the lack of information about future rewards rather than the discontinuation of future rewards. To handle this case, the numeric-indifferent preference value of each RL rule is stored as two separate values, the expected -current reward(ECR) and expected future reward (EFR). The ECR is an estimate of +current reward (ECR) and expected future reward (EFR). The ECR is an estimate of the expected immediate reward signal for executing the corresponding RL -operator. The EFR is an estimate of the time discounted Q-value of the next RL +operator. The EFR is an estimate of the time discounted $Q$-value of the next RL operator. Normal updates correspond to traditional RL theory (showing the Sarsa case for simplicity): -$$ \delta_{ECR} = \alpha \left[ r_t - ECR(s_t, a_t) \right] $$ - -$$ \delta_{EFR} = \alpha \left[ \gamma Q(s_{t+1}, a_{t+1}) - EFR(s_t, a_t) \right] $$ - -$$ \delta_t = \delta_{ECR} + \delta_{EFR} $$ - -$$ = \alpha \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) - \left( ECR(s_t, a_t) + EFR(s_t, a_t) \right) \right] $$ - -$$ = \alpha \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] $$ +$$ +\begin{aligned} +\delta_{ECR} &= \alpha \left[ r_t - ECR(s_t, a_t) \right] \\ +\delta_{EFR} &= \alpha \left[ \gamma Q(s_{t+1}, a_{t+1}) - EFR(s_t, a_t)\right] \\ +\delta_t &= \delta_{ECR} + \delta_{EFR} \\ + &= \alpha \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) - \left( ECR(s_t, a_t) + +EFR(s_t, a_t) \right) \right] \\ + &= \alpha \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] +\end{aligned} +$$ During substate retraction, only the ECR is updated based on the reward signals present at the time of retraction, and the EFR is unchanged. Soar’s automatic subgoaling and RL mechanisms can be combined to naturally implement -hierarchical reinforcement learning algorithms such as MAXQ and options. +hierarchical reinforcement learning algorithms such as `MAXQ` and options. ### Eligibility Traces The RL mechanism supports eligibility traces, which can improve the speed of -learning by updating RL rules across multiple sequential steps. The -**eligibility-trace-decay-rate** and **eligibility-trace-tolerance** parameters -*control this mechanism. By setting eligibility-trace-decay-rate to 0 -(de- fault), eligibility traces are in effect disabled. When eligibility traces +learning by updating RL rules across multiple sequential steps. + +The `eligibility-trace-decay-rate` and `eligibility-trace-tolerance` +parameters control this mechanism. By setting `eligibility-trace-decay-rate` to 0 +(default), eligibility traces are in effect disabled. 
When eligibility traces
are enabled, the particular algorithm used is dependent upon the learning
-policy. For Sarsa, the eligibility trace implementation isSarsa($\lambda$). For
-Q-Learning, the eligibility trace implementation is *Watkin's Q($\lambda$)*.
-
-#### Exploration
-
-The decide indifferent-selection command (page 198) determines how operators are
-selected based on their numeric-indifferent preferences. Although all the
-indifferent selection settings are valid regardless of how the
-numeric-indifferent preferences were arrived at, the epsilon-greedy and
-boltzmann settings are specifically designed for use with RL and cor- respond to
-the two most common exploration strategies. In an effort to maintain backwards
-compatibility, the default exploration policy is soft max. As a result, one
-should change to epsilon-greedy or boltzmann when the reinforcement learning
-mechanism is enabled.
-
-### GQ($\lambda$)
-
-Sarsa($\lambda$) and Watkin’s Q($\lambda$) help agents to solve the temporal
-credit assignment problem more quickly. However, if you wish to implement
-something akin to CMACs to generalize from experience, convergence is not
-guaranteed by these algorithms. GQ($\lambda$) is a gradient descent algorithm
-designed to ensure convergence when learning off-policy. Soar’s learning-policy
-can be set to
-**on-policy-gq-lambda** or **off-policy-gq-lambda** to increase the likelihood
-of convergence when learning under these conditions. If you should choose to use
-one of these algorithms, we recommend setting the rl **step-size-parameter** to
-something small, such as 0.01 in order to ensure that the secondary set of
-weights used by GQ($\lambda$)change slowly enough for efficient convergence.
+policy. For Sarsa, the eligibility trace implementation is _Sarsa($\lambda$)_. For
+Q-Learning, the eligibility trace implementation is _Watkins's Q($\lambda$)_.
+
+#### Exploration
+
+The [`decide indifferent-selection` command](../reference/cli/cmd_decide.md#decide-indifferent-selection)
+determines how operators are selected based on their
+numeric-indifferent preferences. Although all the indifferent selection
+settings are valid regardless of how the numeric-indifferent preferences were
+arrived at, the `epsilon-greedy` and `boltzmann` settings are specifically designed
+for use with RL and correspond to the two most common exploration strategies.
+In an effort to maintain backwards compatibility, the default exploration
+policy is `softmax`. As a result, one should change to `epsilon-greedy` or
+`boltzmann` when the reinforcement learning mechanism is enabled.
+
+### GQ(λ)
+
+_Sarsa($\lambda$)_ and _Watkins's Q($\lambda$)_ help agents to solve the
+temporal credit assignment problem more quickly. However, if you wish to
+implement something akin to CMACs to generalize from experience, convergence is
+not guaranteed by these algorithms. _GQ($\lambda$)_ is a gradient-descent
+algorithm designed to ensure convergence when learning off-policy. Soar’s
+`learning-policy` can be set to **on-policy-gq-lambda** or
+**off-policy-gq-lambda** to increase the likelihood of convergence when
+learning under these conditions. If you choose to use one of these
+algorithms, we recommend setting the `rl` **step-size-parameter** to something
+small, such as 0.01, in order to ensure that the secondary set of weights used
+by _GQ($\lambda$)_ change slowly enough for efficient convergence.
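As a configuration sketch (typical usage assumed rather than taken from the
manual; verify the exact option names against the `rl` and `decide` command
documentation), the parameters discussed in this chapter are usually set once
in an agent source file before the run:

```Soar
# Enable RL and select the update rule and step sizes discussed above.
rl --set learning on
rl --set learning-policy sarsa
rl --set learning-rate 0.3
rl --set discount-rate 0.9

# A non-zero decay rate enables eligibility traces (0, the default, disables them).
rl --set eligibility-trace-decay-rate 0.4

# Pair RL with an exploration policy designed for it rather than the default softmax.
decide indifferent-selection --epsilon-greedy
```

Switching `learning-policy` to `q-learning` or one of the GQ(λ) settings
changes only how the same set of RL rules is updated.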
## Automatic Generation of RL Rules The number of RL rules required for an agent to accurately approximate operator -Q-values is usually unfeasibly large to write by hand, even for small domains. +$Q$-values is usually unfeasibly large to write by hand, even for small domains. Therefore, several methods exist to automate this. ### The gp Command -The gp command can be used to generate productions based on simple patterns. -This is useful if the states and operators of the environment can be -distinguished by a fixed number of dimensions with finite domains. An example is -a grid world where the states are described by integer row/column coordinates, -and the available operators are to move north, south, east, or west. In this -case, a single gp command will generate all necessary RL rules: +The [`gp` command](../reference/cli/cmd_gp.md) can be used to generate +productions based on simple patterns. This is useful if the states and +operators of the environment can be distinguished by a fixed number of +dimensions with finite domains. An example is a grid world where the states are +described by integer row/column coordinates, and the available operators are to +move north, south, east, or west. In this case, a single `gp` command will +generate all necessary RL rules: ```Soar gp {gen*rl*rules -(state ^name gridworld -^operator + -^row [ 1 2 3 4 ] -^col [ 1 2 3 4 ]) -( ^name move -^direction [ north south east west ]) ---> -( ^operator = 0.0) -} + (state ^name gridworld + ^operator + + ^row [ 1 2 3 4 ] + ^col [ 1 2 3 4 ]) + ( ^name move + ^direction [ north south east west ]) + --> + ( ^operator = 0.0) + } ``` -For more information see the documentation for this command on page 205. - ### Rule Templates Rule templates allow Soar to dynamically generate new RL rules based on a @@ -443,30 +504,31 @@ with 1000 rows and columns. Attempting to generate RL rules for each grid cell and action a priori will result in $1000 \times 1000 \times 4 = 4 \times 10^6$ productions. However, if most of those cells are unreachable due to walls, then the agent will never fire or update most of those productions. Templates give -the programmer the convenience of the gp command without filling production -memory with unnecessary rules. +the programmer the convenience of the [`gp` +command](../reference/cli/cmd_gp.md) without filling production memory with +unnecessary rules. Rule templates have variables that are filled in to generate RL rules as the agent encounters novel combinations of variable values. A rule template is valid if and only if it is marked with the **:template** flag and, in all other -respects, adheres to the format of an RL rule. However, whereas an RL rule may +respects, adheres to the format of an RL rule. However, whereas an RL rule may only use constants as the numeric-indifference preference value, a rule template may use a variable. Consider the following rule template: ```Soar sp {sample*rule*template -:template -(state ^operator + -^value ) ---> -( ^operator = ) -} + :template + (state ^operator + + ^value ) + --> + ( ^operator = ) + } ``` -During agent execution, this rule template will match working memory and create new -productions by substituting all variables in the rule template that matched against constant -values with the values themselves. 
Suppose that the LHS of the rule template matched -against the state +During agent execution, this rule template will match working memory and create +new productions by substituting all variables in the rule template that matched +against constant values with the values themselves. Suppose that the LHS of the +rule template matched against the state ```Soar (S1 ^value 3.2) @@ -477,23 +539,23 @@ Then the following production will be added to production memory: ```Soar sp {rl*sample*rule*template*1 -(state ^operator + -^value 3.2) ---> -( ^operator = 3.2) -} + (state ^operator + + ^value 3.2) + --> + ( ^operator = 3.2) + } ``` -The variable `` is replaced by3.2on both the LHS and the RHS, but `` and -`` are not replaced because they matches against identifiers (S1andO1). As -with other RL rules, the value of3.2on the RHS of this rule may be updated later -by reinforcement learning, whereas the value of 3.2 on the LHS will remain -unchanged. If `` had matched against a non-numeric constant, it will be -replaced by that constant on the LHS, but the RHS numeric-indifference +The variable `` is replaced by `3.2` on both the LHS and the RHS, but `` +and `` are not replaced because they matches against identifiers (`S1` and +`O1`). As with other RL rules, the value of `3.2` on the RHS of this rule may +be updated later by reinforcement learning, whereas the value of `3.2` on the LHS +will remain unchanged. If `` had matched against a non-numeric constant, it +will be replaced by that constant on the LHS, but the RHS numeric-indifference preference value will be set to zero to make the new rule valid. -The new production’s name adheres to the following pattern:rl*template-name*id, -where template-name is the name of the originating rule template and id is +The new production’s name adheres to the following pattern: `rl*template-name*id`, +where `template-name` is the name of the originating rule template and id is monotonically increasing integer that guarantees the uniqueness of the name. If an identical production already exists in production memory, then the newly @@ -505,18 +567,7 @@ using the gp command or via custom scripting when possible. ### Chunking Since RL rules are regular productions, they can be learned by chunking just -like any other production. This method is more general than using the gp command -or rule templates, and is useful if the environment state consists of -arbitrarily complex relational structures that cannot be enumerated. - -## Footnotes - -- [1]: In this context, the term "state" refers to the -state of the task or environment, not a state identifier. For the rest of this -chapter, bold capital letter names such as S1 will refer to identifiers and italic -lowercase names such as $s_1$ will refer to task states. -- [2]: This is assuming the value of -**numeric-indifferent-mode** is set to -**sum**. In general, the RL mechanism only works correctly when this is the -case, and we assume this case in the rest of the chapter. See page 198 for more -information about this parameter. +like any other production. This method is more general than using the +[`gp` command](../reference/cli/cmd_gp.md) or rule templates, and is useful if +the environment state consists of arbitrarily complex relational structures +that cannot be enumerated. 
diff --git a/docs/soar_manual/06_SemanticMemory.md b/docs/soar_manual/06_SemanticMemory.md
index eed10725..8069f2ef 100644
--- a/docs/soar_manual/06_SemanticMemory.md
+++ b/docs/soar_manual/06_SemanticMemory.md
@@ -1,14 +1,16 @@
+ {{manual_wip_warning}}

 # Semantic Memory

-Soar’s semantic memory is a repository for long-term declarative knowledge, supplement-
-ing what is contained in short-term working memory (and production memory). Episodic
-memory, which contains memories of the agent’s experiences, is described in [Chapter 7](./07_EpisodicMemory.md). The
-knowledge encoded in episodic memory is organized temporally, and specific information is
-embedded within the context of when it was experienced, whereas knowledge in semantic
-memory is independent of any specific context, representing more general facts about the
-world.
+Soar’s semantic memory is a repository for long-term declarative knowledge,
+supplementing what is contained in short-term working memory (and production
+memory). Episodic memory, which contains memories of the agent’s experiences,
+is described in [Chapter 7](./07_EpisodicMemory.md). The knowledge encoded in
+episodic memory is organized temporally, and specific information is embedded
+within the context of when it was experienced, whereas knowledge in semantic
+memory is independent of any specific context, representing more general facts
+about the world.

 This chapter is organized as follows: [semantic memory structures in working
 memory](#working-memory-structure); [representation of knowledge in semantic
@@ -17,53 +19,60 @@ knowledge](#storing-semantic-knowledge); [retrieving semantic
 knowledge](#retrieving-semantic-knowledge); and a [discussion of
 performance](#performance). The detailed behavior of semantic memory is
 determined by numerous parameters that can be controlled and configured via the
-**smem** command. Please refer to the documentation for that command in Section
-9.5.1 on page 243.
+[`smem` command](../reference/cli/cmd_smem.md).

 ## Working Memory Structure

-Upon creation of a new state in working memory (see Section 2.7.1 on page 28; Section 3.4 on
-page 85), the architecture creates the following augmentations to facilitate agent interaction
-with semantic memory:
+Upon creation of a new state in working memory (see
+[Impasse Types](02_TheSoarArchitecture.md#impasse-types);
+[Impasses in Working Memory and in Productions](03_SyntaxOfSoarPrograms.md#impasses-in-working-memory-and-in-productions)),
+the architecture creates the following augmentations
+to facilitate agent interaction with semantic memory:

 ```Soar
 (<s> ^smem <smem>)
-(<smem> ^command <cmd>)
-(<smem> ^result <r>)
+    (<smem> ^command <cmd>)
+    (<smem> ^result <r>)
 ```

-As rules augment the command structure in order to access/change semantic knowledge (6.3,
-6.4), semantic memory augments the result structure in response. Production actions
-should not remove augmentations of the result structure directly, as semantic memory will
-maintain these WMEs.
-
-Figure 6.1: Example long-term identifier with four augmentations.
+As rules augment the `command` structure in order to access/change semantic
+knowledge
+([Storing Semantic Knowledge](#storing-semantic-knowledge),
+[Retrieving Semantic Knowledge](#retrieving-semantic-knowledge)),
+semantic memory augments the `result` structure in response. Production actions
+should not remove augmentations of the `result` structure directly, as semantic
+memory will maintain these WMEs.
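As a minimal sketch (not from the manual; the `^plan-to-remember` attribute and
the rule name are hypothetical), a production can use the automatically created
`^smem.command` link to request storage of a working-memory structure. The
`store` command used here is described under Storing Semantic Knowledge below:

```Soar
# Illustrative only: place a store command on the state's smem command link
# for a structure the agent has built up in working memory.
sp {smem*store*plan*sketch
    (state <s> ^smem.command <cmd>
               ^plan-to-remember <plan>)
    -->
    (<cmd> ^store <plan>)
}
```

Semantic memory then reports the outcome on the corresponding `^result`
structure, which other rules can test.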
## Knowledge Representation -The representation of knowledge in semantic memory is similar to that in working memory -(see Section 2.2 on page 14) – both include graph structures that are composed of symbolic -elements consisting of an identifier, an attribute, and a value. It is important to note, -however, key differences: +The representation of knowledge in semantic memory is similar to that in +[working memory](02_TheSoarArchitecture.md#working-memory-the-current-situation) +– both include graph structures that are composed of symbolic elements +consisting of an identifier, an attribute, and a value. It is important to +note, however, key differences: -Currently semantic memory only supports attributes that are symbolic constants -(string, integer, or decimal), but not attributes that are identifiers +- Currently semantic memory only supports attributes that are symbolic constants + (string, integer, or decimal), but not attributes that are identifiers -Whereas working memory is a single, connected, directed graph, semantic memory can -be disconnected, consisting of multiple directed, connected sub-graphs +- Whereas working memory is a single, connected, directed graph, semantic + memory can be disconnected, consisting of multiple directed, connected + sub-graphs -From Soar 9.6 onward, **Long-termidentifiers** (LTIs) are defined as identifiers that -exist in semantic memory only. Each LTI is permanently associated with a specific number -that labels it (e.g. @5 or @7). Instances of an LTI can be loaded into working memory as -regular short-term identifiers (STIs) linked with that specific LTI. For clarity, when printed, -a short-term identifier associated with an LTI is followed with the label of that LTI. For -example, if the working memory ID L7 is associated with the LTI named `@29`, printing that -STI would appear as `L7` (`@29`). +From Soar 9.6 onward, **Long-term identifiers** (LTIs) are defined as +identifiers that exist in semantic memory _only_. Each LTI is permanently +associated with a specific number that labels it (e.g. `@5` or `@7`). Instances of +an LTI can be loaded into working memory as regular short-term identifiers +(STIs) linked with that specific LTI. For clarity, when printed, a short-term +identifier associated with an LTI is followed with the label of that LTI. For +example, if the working memory `ID L7` is associated with the LTI named `@29`, +printing that STI would appear as `L7 (@29)`. When presented in a figure, long-term identifiers will be indicated by a -double-circle. For instance, Figure 6.1 depicts the long-term identifier @68, -with four augmentations, representing the addition fact of `6 + 7 = 13` (or, -rather, 3, carry 1, in context of multi-column arithmetic). +double-circle. For instance, the following figure depicts the long-term +identifier `@1`, with four augmentations, representing the addition fact of +$6 + 7 = 13$ (or, rather, 3, carry 1, in context of multi-column arithmetic). + +![Example long-term identifier with four augmentations.](Images/smem-concept.svg) ### Integrating Long-Term Identifiers with Soar @@ -71,97 +80,109 @@ Integrating long-term identifiers in Soar presents a number of theoretical and implementation challenges. This section discusses the state of integration with each of Soar’s memories/learning mechanisms. - -#### Working Memory - -Long-term identifiers themselves never exist in working memory. 
Rather, instances of long -term memories are loaded into working memory as STIs through queries or retrievals, and -manipulated just like any other WMEs. Changes to any STI augmentations do not directly -have any effect upon linked LTIs in semantic memory. Changes to LTIs themselves only -occur though store commands on the command link or through command-line directives -such as `smem --add` (see below). - -Each time an agent loads an instance of a certain LTI from semantic memory into working -memory using queries or retrievals, the instance created will always be a new unique STI. -This means that if same long-term memory is retrieved multiple times in succession, each -retrieval will result in a different STI instance, each linked to the same LTI. A benefit of this -is that a retrieved long-term memory can be modified without compromising the ability to -recall what the actual stored memory is.^1 - -(^1) Before Soar 9.6, LTIs were themselves retrieved into working memory. This meant all augmentations -to such IDs, whether from the original retrieval or added after retrieval, would always be merged under the -same ID, unless deep-copy was used to make a duplicate short-term memory. - -#### Procedural Memory - -Soar productions can use various conditions to test whether an STI is associated with an -LTI or whether two STIs are linked to the same LTI (see Section 3.3.5.3 on page 53). LTI +#### Working Memory + +Long-term identifiers themselves never exist in working memory. Rather, +instances of long term memories are loaded into working memory as STIs through +queries or retrievals, and manipulated just like any other WMEs. Changes to any +STI augmentations do not directly have any effect upon linked LTIs in semantic +memory. Changes to LTIs themselves only occur though `store` commands on the +command link or through command-line directives such as `smem --add` (see +below). + +Each time an agent loads an instance of a certain LTI from semantic memory into +working memory using queries or retrievals, the instance created will always be +a new unique STI. This means that if same long-term memory is retrieved +multiple times in succession, each retrieval will result in a different STI +instance, each linked to the same LTI. A benefit of this is that a retrieved +long-term memory can be modified without compromising the ability to recall +what the actual stored memory is. + +???+ info + Before Soar 9.6, LTIs were themselves retrieved into working memory. This + meant all augmentations to such IDs, whether from the original retrieval or + added after retrieval, would always be merged under the same ID, unless + deep-copy was used to make a duplicate short-term memory. + +#### Procedural Memory + +Soar productions can use various conditions to test whether an STI is +associated with an LTI or whether two STIs are linked to the same LTI (see +[Predicates for Values](03_SyntaxOfSoarPrograms.md#predicates-for-values)). LTI names (e.g. `@6`) may not appear in the action side of productions. -#### Episodic Memory +#### Episodic Memory -Episodic memory (see Section 7 on page 157) faithfully captures LTI-linked STIs, including -the episode of transition. Retrieved episodes contain STIs as they existed during the episode, -regardless of any changes to linked LTIs that transpired since the episode occurred. +[Episodic memory](07_EpisodicMemory.md) faithfully captures LTI-linked +STIs, including the episode of transition. 
Retrieved episodes contain STIs as +they existed during the episode, regardless of any changes to linked LTIs that +transpired since the episode occurred. ## Storing Semantic Knowledge -###Store command +### Store command -An agent stores a long-term identifier in semantic memory by creating a **^store** command: -this is a WME whose identifier is the command link of a state’s smem structure, the attribute -is store, and the value is a short-term identifier. +An agent stores a long-term identifier in semantic memory by creating a +`^store` command: this is a WME whose identifier is the command link of a +state’s smem structure, the attribute is store, and the value is a short-term +identifier. ```Soar ^smem.command.store ``` -Semantic memory will encode and store all WMEs whose identifier is the value of the store -command. Storing deeper levels of working memory is achieved through multiple store commands. +Semantic memory will encode and store all WMEs whose identifier is the value of +the store command. Storing deeper levels of working memory is achieved through +multiple store commands. -Multiple store commands can be issued in parallel. Storage commands are processed on -every state at the end of every phase of every decision cycle. Storage is guaranteed to -succeed and a status WME will be created, where the identifier is the **^result** link of the -smem structure of that state, the attribute is success, and the value is the value of the store -command above. +Multiple store commands can be issued in parallel. Storage commands are +processed on every state at the end of every phase of every decision cycle. +Storage is guaranteed to succeed and a status WME will be created, where the +identifier is the `^result` link of the smem structure of that state, the +attribute is success, and the value is the value of the store command above. ```Soar ^smem.result.success ``` -If the identifier used in the store command is not linked to any existing LTIs, a new LTI -will be created in smem and the stored STI will be linked to it. If the identifier used in -the store command is already linked to an LTI, the store will overwrite that long-term -memory. For example, if an existing LTI@5had augmentations^A do ^B re ^C mi, and a -storecommand stored short-term identifierL35which was linked to@5but had only the -augmentation^D fa, the LTI@5would be changed to only have^D fa. +If the identifier used in the store command is not linked to any existing LTIs, +a new LTI will be created in smem and the stored STI will be linked to it. If +the identifier used in the store command is already linked to an LTI, the store +will overwrite that long-term memory. For example, if an existing LTI `@5` had +augmentations `^A do` `^B re` `^C mi`, and a `store` command stored short-term +identifier `L35` which was linked to `@5` but had only the augmentation +`^D fa`, the LTI `@5` would be changed to only have `^D fa`. ### Store-new command -The **^store-new** command structure is just like the ^store command, except that smem -will always store the given memory as an entirely new structure, regardless of whether the -given STI was linked to an existing LTI or not. Any STIs that don’t already have links will -get linked to the newly created LTIs. But if a stored STI was already linked to some LTI, -Soar will not re-link it to the newly created LTI. 
+The `^store-new` command structure is just like the `^store` command, except +that smem will always store the given memory as an entirely new structure, +regardless of whether the given STI was linked to an existing LTI or not. Any +STIs that don’t already have links will get linked to the newly created LTIs. +But if a stored STI was already linked to some LTI, Soar will not re-link it to +the newly created LTI. -If this behavior is not desired, the agent can add a **^link-to-new-LTM yes** augmentation -to override this behavior. One use for this setting is to allow chunking to backtrace through -a stored memory in a manner that will be consistent with a later state of memory when the -newly stored LTI is retrieved again. +If this behavior is not desired, the agent can add a `^link-to-new-LTM yes` +augmentation to override this behavior. One use for this setting is to allow +chunking to backtrace through a stored memory in a manner that will be +consistent with a later state of memory when the newly stored LTI is retrieved +again. ### User-Initiated Storage -Semantic memory provides agent designers the ability to store semantic knowledge via the -**add** switch of the **smem** command (see Section 9.5.1 on page 243). The format of the -command is nearly identical to the working memory manipulation components of the RHS -of a production (i.e. no RHS-functions; see Section 3.3.6 on page 67). For instance: +Semantic memory provides agent designers the ability to store semantic +knowledge via the `add` switch of the [`smem` command](../reference/cli/cmd_smem.md). +The format of the command is nearly identical to the working memory +manipulation components of the RHS of a production (i.e. no RHS-functions; see +[The action side of productions](03_SyntaxOfSoarPrograms.md#the-action-side-of-productions-or-rhs)). +For instance: ```Soar smem --add { -( ^add10-facts ) -( ^digit1 1 ^digit-10 11) -( ^digit1 2 ^digit-10 12) -( ^digit1 3 ^digit-10 13) + ( ^add10-facts ) + ( ^digit1 1 ^digit-10 11) + ( ^digit1 2 ^digit-10 12) + ( ^digit1 3 ^digit-10 13) } ``` @@ -170,68 +191,76 @@ command instance will add a new long-term identifier (represented by the temporary ’arithmetic’ variable) with three augmentations. The value of each augmentation will each become an LTI with two constant attribute/value pairs. Manual storage can be arbitrarily complex and use standard dot-notation. The add -command also supports hardcoded LTI ids such as@1in place of variables. +command also supports hardcoded LTI ids such as `@1` in place of variables. ### Storage Location -Semantic memory uses SQLite to facilitate efficient and standardized storage and querying of -knowledge. The semantic store can be maintained in memory or on disk (per the database -and path parameters; see Section 9.5.1). If the store is located on disk, users can use any -standard SQLite programs/components to access/query its contents. However, using a disk- -based semantic store is very costly (performance is discussed in greater detail in Section 6.5 -on page 155), and running in memory is recommended for most runs. - -Note that changes to storage parameters, for example database, path and append will -not have an effect until the database is used after an initialization. This happens either -shortly after launch (on first use) or after a database initialization command is issued. To -switch databases or database storage types while running, set your new parameters and then -perform an –init command. 
- -The **path** parameter specifies the file system path the database is stored in. When path is -set to a valid file system path and database mode is set to file, then the SQLite database is -written to that path. - -The **append** parameter will determine whether all existing facts stored in a database on -disk will be erased when semantic memory loads. Note that this affects soar init also. In -other words, if the append setting is off, all semantic facts stored to disk will be lost when -a soar init is performed. For semantic memory,append mode is on by default. - -Note: As of version 9.3.3, Soar used a new schema for the semantic memory database. -This means databases from 9.3.2 and below can no longer be loaded. A conversion utility is -available in Soar 9.4 to convert from the old schema to the new one. - -The **lazy-commit** parameter is a performance optimization. If set to on(default), disk -databases will not reflect semantic memory changes until the Soar kernel shuts down. This -improves performance by avoiding disk writes. The optimization parameter (see Section - +Semantic memory uses SQLite to facilitate efficient and standardized storage +and querying of knowledge. The semantic store can be maintained in memory or on +disk (per the database and path parameters; see +[`smem` command](../reference/cli/cmd_smem.md)). If the store is located on +disk, users can use any standard SQLite programs/components to access/query its +contents. However, using a disk- based semantic store is very costly +(performance is discussed in greater detail in Section +[Performance](#performance)), and running in memory is recommended for most +runs. + +Note that changes to storage parameters, for example database, path and append +will not have an effect until the database is used after an initialization. +This happens either shortly after launch (on first use) or after a database +initialization command is issued. To switch databases or database storage types +while running, set your new parameters and then perform an –init command. + +The **path** parameter specifies the file system path the database is stored +in. When path is set to a valid file system path and database mode is set to +file, then the SQLite database is written to that path. + +The **append** parameter will determine whether all existing facts stored in a +database on disk will be erased when semantic memory loads. Note that this +affects soar init also. In other words, if the append setting is off, all +semantic facts stored to disk will be lost when a soar init is performed. For +semantic memory,append mode is on by default. + +Note: As of version 9.3.3, Soar used a new schema for the semantic memory +database. This means databases from 9.3.2 and below can no longer be loaded. A +conversion utility is available in Soar 9.4 to convert from the old schema to +the new one. + +The **lazy-commit** parameter is a performance optimization. If set to +on(default), disk databases will not reflect semantic memory changes until the +Soar kernel shuts down. This improves performance by avoiding disk writes. The +optimization parameter (see Section [Performance](#performance)) will have an +affect on whether databases on disk can be opened while the Soar kernel is +running. ## Retrieving Semantic Knowledge -An agent retrieves knowledge from semantic memory by creating an appropriate command -(we detail the types of commands below) on the command link of a state’s smem structure. 
-At the end of the output of each decision, semantic memory processes each state’s smem -`^command` structure. Results, meta-data, and errors are added to the result structure of -that state’s smems tructure. +An agent retrieves knowledge from semantic memory by creating an appropriate +command (we detail the types of commands below) on the `command` link of a +state’s `smem` structure. At the end of the output of each decision, semantic +memory processes each state’s smem `^command` structure. Results, meta-data, +and errors are added to the result structure of that state’s `smem` structure. -Only one type of retrieval command (which may include optional modifiers) can be issued -per state in a single decision cycle. Malformed commands (including attempts at multiple -retrieval types) will result in an error: +Only one type of retrieval command (which may include optional modifiers) can +be issued per state in a single decision cycle. Malformed commands (including +attempts at multiple retrieval types) will result in an error: ```Soar ^smem.result.bad-cmd ``` -Where the `` variable refers to the command structure of the state. +Where the `` variable refers to the `command` structure of the state. -After a command has been processed, semantic memory will ignore it until some aspect of -the command structure changes (via addition/removal of WMEs). When this occurs, the -result structure is cleared and the new command (if one exists) is processed. +After a command has been processed, semantic memory will ignore it until some +aspect of the command structure changes (via addition/removal of WMEs). When +this occurs, the result structure is cleared and the new command (if one +exists) is processed. ### Non-Cue-Based Retrievals -A non-cue-based retrieval is a request by the agent to reflect in working memory the current -augmentations of an LTI in semantic memory. The command WME has a **retrieve** -attribute and an LTI-linked identifier value: +A non-cue-based retrieval is a request by the agent to reflect in working +memory the current augmentations of an LTI in semantic memory. The command WME +has a `retrieve` attribute and an LTI-linked identifier value: ```Soar ^smem.command.retrieve @@ -250,54 +279,60 @@ Otherwise, two new WMEs will be placed on the result structure: ^smem.result.retrieved ``` -All augmentations of the long-term identifier in semantic memory will be created as new -WMEs in working memory. +All augmentations of the long-term identifier in semantic memory will be +created as new WMEs in working memory. ### Cue-Based Retrievals -A cue-based retrieval performs a search for a long-term identifier in semantic memory whose -augmentations exactly match an agent-supplied cue, as well as optional cue modifiers. +A cue-based retrieval performs a search for a long-term identifier in semantic +memory whose augmentations exactly match an agent-supplied cue, as well as +optional cue modifiers. -A cue is composed of WMEs that describe the augmentations of a long-term identifier. A -cue WME with a constant value denotes an exact match of both attribute and value. A -cue WME with an LTI-linked identifier as its value denotes an exact match of attribute and -linked LTI. A cue WME with a short-term identifier as its value denotes an exact match of -attribute, but with any value (constant or identifier). +A cue is composed of WMEs that describe the augmentations of a long-term +identifier. A cue WME with a constant value denotes an exact match of both +attribute and value. 
A cue WME with an LTI-linked identifier as its value +denotes an exact match of attribute and linked LTI. A cue WME with a short-term +identifier as its value denotes an exact match of attribute, but with any value +(constant or identifier). -A cue-based retrieval command has a **query** attribute and an identifier value, the cue: +A cue-based retrieval command has a **query** attribute and an identifier +value, the cue: ```Soar ^smem.command.query ``` -For instance, consider the following rule that creates a cue-based retrieval command: +For instance, consider the following rule that creates a cue-based retrieval +command: ```Soar sp {smem*sample*query -(state ^smem.command -^lti -^input-link.foo ) ---> -( ^query ) -( ^name -^foo -^associate -^age 25) -} + (state ^smem.command + ^lti + ^input-link.foo ) + --> + ( ^query ) + ( ^name + ^foo + ^associate + ^age 25) + } ``` -In this example, assume that the `` variable will match a short-term identifier which is -linked to a long-term identifier and that the `` variable will match a constant. Thus, -the query requests retrieval of a long-term memory with augmentations that satisfy ALL of -the following requirements: +In this example, assume that the `` variable will match a short-term +identifier which is linked to a long-term identifier and that the `` +variable will match a constant. Thus, the query requests retrieval of a +long-term memory with augmentations that satisfy **ALL** of the following +requirements: -- Attribute name with ANY value -- Attribute foo with value equal to that of variable `` at the time this rule fires -- Attribute associate with value that is the same long-term identifier as that linked to -by the `` STI at the time this rule fires -- Attribute age with integer value 25 +- Attribute `name` with `ANY` value +- Attribute `foo` with value equal to that of variable `` at the time this + rule fires +- Attribute `associate` with value that is the same long-term identifier as + that linked to by the `` STI at the time this rule fires +- Attribute `age` with integer value 25 -If no long-term identifier satisfies ALL of these requirements, an error is returned: +If no long-term identifier satisfies **ALL** of these requirements, an error is returned: ```Soar ^smem.result.failure @@ -310,88 +345,102 @@ Otherwise, two WMEs are added: ^smem.result.retrieved ``` -The result `` will be a new short-term identifier linked to the result LTI. +The result `` will be a new short-term identifier linked to the +result LTI. -As with non-cue-based retrievals, all of the augmentations of the long-term identifier in -semantic memory are added as new WMEs to working memory. If these augmentations -include other LTIs in smem, they too are instantiated into new short-term identifiers in -working memory. +As with non-cue-based retrievals, all of the augmentations of the long-term +identifier in semantic memory are added as new WMEs to working memory. If these +augmentations include other LTIs in smem, they too are instantiated into new +short-term identifiers in working memory. -It is possible that multiple long-term identifiers match the cue equally well. In this case, se- -mantic memory will retrieve the long-term identifier that was most recently stored/retrieved. -(More accurately, it will retrieve the LTI with the greatest activation value. See below.) +It is possible that multiple long-term identifiers match the cue equally well. 
+In this case, semantic memory will retrieve the long-term identifier that was
+most recently stored/retrieved. (More accurately, it will retrieve the LTI with
+the greatest activation value. See below.)

The cue-based retrieval process can be further tempered using optional modifiers:

-The prohibit command requires that the retrieved long-term identifier is not equal
-to that linked with the supplied long-term identifier:
+- The `prohibit` command requires that the retrieved long-term identifier is not
+  equal to that linked with the supplied long-term identifier:

-```
- ^smem.command.prohibit
-```
+    ```Soar
+     ^smem.command.prohibit
+    ```
+    Multiple prohibit command WMEs may be issued as modifiers to a single cue-based
+    retrieval. This method can be used to iterate over all matching long-term
+    identifiers.

-Multiple prohibit command WMEs may be issued as modifiers to a single cue-based
-retrieval. This method can be used to iterate over all matching long-term identifiers.
+- The `neg-query` command requires that the retrieved long-term identifier does
+  NOT contain a set of attributes/attribute-value pairs:

-The neg-query command requires that the retrieved long-term identifier does NOT
-contain a set of attributes/attribute-value pairs:
+    ```Soar
+     ^smem.command.neg-query
+    ```

-```Soar
- ^smem.command.neg-query
-```
+    The syntax of this command is identical to that of the regular/positive query
+    command.

-The syntax of this command is identical to that of regular/ positive query command.
+- The `math-query` command requires that the retrieved long-term identifier
+  contains an attribute-value pair that meets a specified mathematical condition.
+  This condition can either be a conditional query or a superlative query.
+  Conditional queries are of the format:

-The math-query command requires that the retrieved long term identifier contains
-an attribute value pair that meets a specified mathematical condition. This condition
-can either be a conditional query or a superlative query.
-Conditional queries are of the format:
+    ```Soar
+     ^smem.command.math-query..
+    ```

-```Soar
- ^smem.command.math-query..
-```
+    Superlative queries do not use a value argument and are of the format:

-Superlative queries do not use a value argument and are of the format:
+    ```Soar
+     ^smem.command.math-query..
+    ```

-```Soar
- ^smem.command.math-query..
-```
+    Values used in math queries must be integer or float type values. Currently
+    supported condition names are:

-Values used in math queries must be integer or float type values. Currently supported
-condition names are:
+    - `less` A value less than the given argument
+    - `greater` A value greater than the given argument
+    - `less-or-equal` A value less than or equal to the given argument
+    - `greater-or-equal` A value greater than or equal to the given argument
+    - `max` The maximum value for the attribute
+    - `min` The minimum value for the attribute

-- less A value less than the given argument
-- greater A value greater than the given argument
-- less-or-equal A value less than or equal to the given argument
-- greater-or-equal A value greater than or equal to the given argument
-- max The maximum value for the attribute
-- min The minimum value for the attribute

-#### Activation
+#### Activation

-When an agent issues a cue-based retrieval and multiple LTIs match the cue, the LTI which
-semantic memory provides to working memory as the result is the LTI which not only
-matches the cue, but also has the highest **activation** value. Semantic memory has several
-activation methods available for this purpose.
+When an agent issues a cue-based retrieval and multiple LTIs match the cue, the
+LTI which semantic memory provides to working memory as the result is the LTI
+which not only matches the cue, but also has the highest `activation` value. 
+Semantic memory has several activation methods available for this purpose.

-The simplest activation methods are **recency** and **frequency** activation. Recency activa-
-tion attaches a time-stamp to each LTI and records the time of last retrieval. Using recency
-activation, the LTI which matches the cue and was also most-recently retrieved is the one
-which is returned as the result for a query. Frequency activation attaches a counter to each
-LTI and records the number of retrievals for that LTI. Using frequency activation, the LTI
-which matches the cue and also was most frequently used is returned as the result of the
-query. By default, Soar uses recency activation.
+The simplest activation methods are `recency` and `frequency` activation.
+Recency activation attaches a time-stamp to each LTI and records the time of
+last retrieval. Using recency activation, the LTI which matches the cue and was
+also most-recently retrieved is the one which is returned as the result for a
+query. Frequency activation attaches a counter to each LTI and records the
+number of retrievals for that LTI. Using frequency activation, the LTI which
+matches the cue and also was most frequently used is returned as the result of
+the query. By default, Soar uses recency activation.

-**Base-level activation** can be thought of as a mixture of both recency and frequency.
-Soar makes use of the following equation (known as the Petrov approximation^2 ) for calculating base-level activation:
+**Base-level activation** can be thought of as a mixture of both recency and
+frequency. Soar makes use of the following equation (known as the Petrov
+approximation) for calculating base-level activation:

+???+ info
+    Petrov, Alexander A. “Computationally efficient approximation of the base-level
+    learning equation in ACT-R.” Proceedings of the seventh international
+    conference on cognitive modeling. 2006.

+$$
+BLA = \log \left[ \sum\limits_{i=1}^{k} t_i^{-d} + \dfrac{(n-k)(t_n^{1-d} -
+t_k^{1-d})}{(1-d)(t_n-t_k)} \right]
+$$

-
-where n is the number of activation boosts, tnis the time since the first boost, tkis the time
-of the kth boost, dis the decay factor, and kis the number of recent activation boosts which
-are stored. (In Soar,kis hard-coded to 10.) To use base-level activation, use the following
-CLI command when sourcing an agent:
+where $n$ is the number of activation boosts, $t_n$ is the time since the first
+boost, $t_k$ is the time of the $k$th boost, $d$ is the decay factor, and $k$ is
+the number of recent activation boosts which are stored. (In Soar, $k$ is
+hard-coded to $10$.) To use base-level activation, use the following CLI command
+when sourcing an agent:

```bash
smem --set activation-mode base-level
```
@@ -402,29 +451,35 @@ activation beyond the previous methods.

First, spreading activation requires that
base-level activation is also being used. They are considered additive.

This value does not represent recency or frequency of use, but rather
context-relatedness. Spreading activation increases the activation of LTIs which
-are linked to by identifiers currently present in working memory.^3 Such LTIs
-may be thought of as spreading sources. 
-
-Spreading activation values spread according to network structure. That is, spreading sources
-will add to the spreading activation values of any of their child LTIs, according to the directed
-graph structure with in smem(not working memory). The amount of spread is controlled by
-the
-**spreading-continue-probability** parameter. By default this value is set to0.9.
-This would mean that90%of an LTI’s spreading activation value would be divided among
-its direct children (without subtracting from its own value). This value is multiplicative with
-depth. A "grandchild" LTI, connected at a distance of two from a source LTI, would receive
-spreading according to 0. 9 × 0 .9 = 0.81 of the source spreading activation value.
-
-Spreading activation values are updated each decision cycle only as needed for specific
-smem retrievals. For efficiency, two limits exist for the amount of spread calculated. The
-**spreading-limit** parameter limits how many LTIs can receive spread from a given
-spreading source LTI. By default, this value is ( 300 ). Spread is distributed in a magnitude-
-first manner to all descendants of a source. (Without edge-weights, this simplifies to breadth-
-first.) Once the number of LTIs that have been given spread from a given source reaches
-the max value indicated by spreading-limit, no more is calculated for that source that
-update cycle, and the next spreading source’s contributions are calculated. The maximum
-depth of descendants that can receive spread contributions from a source is similarly given
-by the **spreading-depth-limit** parameter. By default, this value is ( 10 ).
+are linked to by identifiers currently present in working memory. Such LTIs
+may be thought of as _spreading sources_.
+
+???+ info
+    Specifically, linked to by STIs that have augmentations.
+
+Spreading activation values spread according to network structure. That is,
+spreading sources will add to the spreading activation values of any of their
+child LTIs, according to the directed graph structure within smem (not working
+memory). The amount of spread is controlled by the
+`spreading-continue-probability` parameter. By default, this value is set to
+0.9. This would mean that $90\ \%$ of an LTI’s spreading activation value would
+be divided among its direct children (without subtracting from its own value).
+This value is multiplicative with depth. A "grandchild" LTI, connected at a
+distance of two from a source LTI, would receive spreading according to
+$0.9 \times 0.9 = 0.81$ of the source spreading activation value.
+
+Spreading activation values are updated each decision cycle only as needed for
+specific smem retrievals. For efficiency, two limits exist for the amount of
+spread calculated. The `spreading-limit` parameter limits how many LTIs can
+receive spread from a given spreading source LTI. By default, this value is
+300. Spread is distributed in a magnitude-first manner to all descendants of
+a source. (Without edge-weights, this simplifies to breadth-first.) Once the
+number of LTIs that have been given spread from a given source reaches the max
+value indicated by `spreading-limit`, no more is calculated for that source
+during that update cycle, and the next spreading source’s contributions are
+calculated. The maximum depth of descendants that can receive spread
+contributions from a source is similarly given by the `spreading-depth-limit`
+parameter. By default, this value is 10.
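+
+As an arithmetic illustration of the multiplicative depth factor described
+above (the numbers follow directly from the default settings already stated),
+a descendant at distance $k$ from a spreading source receives spread in
+proportion to $0.9^k$:
+
+$$
+0.9^1 = 0.9, \qquad 0.9^2 = 0.81, \qquad 0.9^3 = 0.729, \qquad \ldots
+$$
+
+so contributions fall off geometrically with distance and stop entirely beyond
+the default `spreading-depth-limit` of 10.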
In order to use spreading activation, use the following command: @@ -432,89 +487,89 @@ In order to use spreading activation, use the following command: smem --set spreading on ``` -(^2) Petrov, Alexander A. "Computationally efficient approximation of the base-level learning equation in -ACT-R."Proceedings of the seventh international conference on cognitive modeling.2006. -(^3) Specifically, linked to by STIs that have augmentations. - -Also, spreading activation can make use of working memory activation for adjusting edge -weights and for providing nonuniform initial magnitude of spreading for sources of spread. -This functionality is optional. To enable the updating of edge-weights, use the command: +Also, spreading activation can make use of working memory activation for +adjusting edge weights and for providing nonuniform initial magnitude of +spreading for sources of spread. This functionality is optional. To enable the +updating of edge-weights, use the command: ```bash smem --set spreading-edge-updating on ``` -and to enable working memory activation to modulate the magnitude of spread from sources, -use the command: +and to enable working memory activation to modulate the magnitude of spread +from sources, use the command: ```bash smem --set spreading-wma-source on ``` -For most use-cases, base-level activation is sufficient to provide an agent with relevant knowl- -edge in response to a query. However, to provide an agent with more context-relevant results -as opposed to results based only on historical usage, one must use spreading activation. +For most use-cases, base-level activation is sufficient to provide an agent +with relevant knowledge in response to a query. However, to provide an agent +with more context-relevant results as opposed to results based only on +historical usage, one must use spreading activation. ### Retrieval with Depth -For either cue-based or non-cue-based retrieval, it is possible to retrieve a long-term identifier -with additional depth. Using the **depth** parameter allows the agent to retrieve a greater -amount of the memory structure than it would have by retrieving not only the long-term -identifier’s attributes and values, but also by recursively adding to working memory the -attributes and values of that long-term identifier’s children. +For either cue-based or non-cue-based retrieval, it is possible to retrieve a +long-term identifier with additional depth. Using the **depth** parameter +allows the agent to retrieve a greater amount of the memory structure than it +would have by retrieving not only the long-term identifier’s attributes and +values, but also by recursively adding to working memory the attributes and +values of that long-term identifier’s children. Depth is an additional command attribute, like query: ```Soar ^smem.command.query -^smem.command.depth + ^smem.command.depth ``` For instance, the following rule uses depth with a cue-based retrieval: ```Soar - ^smem.command.query sp {smem*sample*query -(state ^smem.command -^input-link.foo ) ---> -( ^query -^depth 2) -( ^name -^foo -^associate -^age 25) + (state ^smem.command + ^input-link.foo ) + --> + ( ^query + ^depth 2) + ( ^name + ^foo + ^associate + ^age 25) } ``` -In the example above and without using depth, the long-term identifier referenced by +In the example above and without using depth, the long-term identifier +referenced by ```Soar ^associate ``` -would not also have its attributes and values be retrieved. 
With a depth of 2 or more, that -long-term identifier also has its attributes and values added to working memory. +would not also have its attributes and values be retrieved. With a depth of 2 +or more, that long-term identifier also has its attributes and values added to +working memory. -Depth can incur a large cost depending on the specified depth and the structures stored in -semantic memory. +Depth can incur a large cost depending on the specified depth and the +structures stored in semantic memory. ## Performance Initial empirical results with toy agents show that semantic memory queries -carry up to a 40% overhead as compared to comparable rete matching. However, the -retrieval mechanism implements some basic query optimization: statistics are -maintained about all stored knowledge. When a query is issued, semantic memory -re-orders the cue such as to minimize expected query time. Because only perfect -matches are acceptable, and there is no symbol variablization, semantic memory -retrievals do not contend with the same combinatorial search space as the rete. -Preliminary empirical study shows that semantic memory maintains sub-millisecond -retrieval time for a large class of queries, even in very large stores (millions -of nodes/edges). - -Once the number of long-term identifiers overcomes initial overhead (about 1000 WMEs), -initial empirical study shows that semantic storage requires far less than 1KB per stored -WME. +carry up to a $40\ \%$ overhead as compared to comparable rete matching. +However, the retrieval mechanism implements some basic query optimization: +statistics are maintained about all stored knowledge. When a query is issued, +semantic memory re-orders the cue such as to minimize expected query time. +Because only perfect matches are acceptable, and there is no symbol +variablization, semantic memory retrievals do not contend with the same +combinatorial search space as the rete. Preliminary empirical study shows that +semantic memory maintains sub-millisecond retrieval time for a large class of +queries, even in very large stores (millions of nodes/edges). + +Once the number of long-term identifiers overcomes initial overhead (about 1000 +WMEs), initial empirical study shows that semantic storage requires far less +than 1KB per stored WME. ### Math queries @@ -528,11 +583,12 @@ iterate over any memory that matches all other involved cues. ### Performance Tweaking -When using a database stored to disk, several parameters become crucial to performance. -The first is **lazy-commit** , which controls when database changes are written to disk. -The default setting (on) will keep all writes in memory and only commit to disk upon re- -initialization (quitting the agent or issuing the init command). The off setting will write -each change to disk and thus incurs massive I/O delay. +When using a database stored to disk, several parameters become crucial to +performance. The first is **lazy-commit** , which controls when database +changes are written to disk. The default setting (`on`) will keep all writes in +memory and only commit to disk upon re-initialization (quitting the agent or +issuing the init command). The `off` setting will write each change to disk and +thus incurs massive I/O delay. The next parameter is **thresh**. This has to do with the locality of storing/updating activation information with semantic augmentations. By default, @@ -542,7 +598,7 @@ identifiers on demand, and thus retrieval time is independent of cue selectivity. 
However, each activation update (such as after a retrieval) incurs an update
cost linear in the number of augmentations. If the number of augmentations for
a long-term identifier is large, this cost can dominate. Thus,
-the thresh parameter sets the upper bound of augmentations, after which
+the `thresh` parameter sets the upper bound of augmentations, after which
activation is stored with the long-term identifier. This allows the user to
establish a balance between cost of updating augmentation activation and the
number of long-term identifiers that must be pre-sorted during a cue-based
@@ -550,22 +606,26 @@ retrieval. As long as the threshold is greater than the number of augmentations
of most long-term identifiers, performance should be fine (as it will bound the
effects of selectivity).

-The next two parameters deal with the SQLite cache, which is a memory store used to speed
-operations like queries by keeping in memory structures like levels of index B+-trees. The
-first parameter, **page-size** , indicates the size, in bytes, of each cache page. The second
-parameter, **cache-size** , suggests to SQLite how many pages are available for the cache.
-Total cache size is the product of these two parameter settings. The cache memory is not pre-
-allocated, so short/small runs will not necessarily make use of this space. Generally speaking,
-a greater number of cache pages will benefit query time, as SQLite can keep necessary meta-
-data in memory. However, some documented situations have shown improved performance
-from decreasing cache pages to increase memory locality. This is of greater concern when
-dealing with file-based databases, versus in-memory. The size of each page, however, may be
-important whether databases are disk- or memory-based. This setting can have far-reaching
-consequences, such as index B+-tree depth. While this setting can be dependent upon a
-particular situation, a good heuristic is that short, simple runs should use small values of
-the page size (1k,2k,4k), whereas longer, more complicated runs will benefit from larger
-values (8k,16k,32k,64k). The episodic memory chapter (see Section 7.4 on page 163) has
-some further empirical evidence to assist in setting these parameters for very large stores.
+The next two parameters deal with the SQLite cache, which is a memory store
+used to speed operations like queries by keeping in memory structures like
+levels of index B+-trees. The first parameter, **page-size**, indicates the
+size, in bytes, of each cache page. The second parameter, **cache-size**,
+suggests to SQLite how many pages are available for the cache. Total cache size
+is the product of these two parameter settings. The cache memory is not
+pre-allocated, so short/small runs will not necessarily make use of this space.
+Generally speaking, a greater number of cache pages will benefit query time, as
+SQLite can keep necessary meta-data in memory. However, some documented
+situations have shown improved performance from decreasing cache pages to
+increase memory locality. This is of greater concern when dealing with
+file-based databases, versus in-memory. The size of each page, however, may be
+important whether databases are disk- or memory-based. This setting can have
+far-reaching consequences, such as index B+-tree depth. 
While this setting can
+be dependent upon a particular situation, a good heuristic is that short,
+simple runs should use small values of the page size (1k, 2k, 4k), whereas
+longer, more complicated runs will benefit from larger values (8k, 16k, 32k, 64k).
+The episodic memory [chapter on performance](07_EpisodicMemory.md#performance)
+has some further empirical evidence to assist in setting these parameters for
+very large stores.

The next parameter is **optimization**. The safety parameter setting will use
SQLite default settings. If data integrity is of importance, this setting is
@@ -581,10 +641,11 @@ lock to the database (locking mode pragma), thus other applications/agents
cannot make simultaneous read/write calls to the database (thereby reducing the
need for potentially expensive system calls to secure/release file locks).

-Finally, maintaining accurate operation timers can be relatively expensive in Soar. Thus,
-these should be enabled with caution and understanding of their limitations. First, they
-will affect performance, depending on the level (set via the **timers** parameter). A level
-of three, for instance, times every modification to long-term identifier recency statistics.
-Furthermore, because these iterations are relatively cheap (typically a single step in the
-linked-list of a b+-tree), timer values are typically unreliable (depending upon the system,
-resolution is 1 microsecond or more).
+Finally, maintaining accurate operation timers can be relatively expensive in
+Soar. Thus, these should be enabled with caution and understanding of their
+limitations. First, they will affect performance, depending on the level (set
+via the **timers** parameter). A level of three, for instance, times every
+modification to long-term identifier recency statistics. Furthermore, because
+these iterations are relatively cheap (typically a single step in the
+linked-list of a B+-tree), timer values are typically unreliable (depending
+upon the system, resolution is 1 microsecond or more).
diff --git a/docs/soar_manual/Images/rl-optrace.svg b/docs/soar_manual/Images/rl-optrace.svg
index f0bb0470..d72e53fb 100644
--- a/docs/soar_manual/Images/rl-optrace.svg
+++ b/docs/soar_manual/Images/rl-optrace.svg
@@ -1,9 +1,8 @@
[SVG markup condensed: figure content unchanged — an operator trace over states S1 and S2 with operators O1–O5 and rewards r2, r3, r4.]
diff --git a/docs/soar_manual/Images/smem-concept.svg b/docs/soar_manual/Images/smem-concept.svg
new file mode 100644
index 00000000..5898468c
--- /dev/null
+++ b/docs/soar_manual/Images/smem-concept.svg
@@ -0,0 +1,109 @@
[SVG markup condensed: new figure showing a semantic memory concept graph rooted at LTI @1 with augmentations digit1 6, digit2 7, sum 3, and carry-borrow 1.]
diff --git a/docs/tutorials/soar_tutorial/05.md b/docs/tutorials/soar_tutorial/05.md
index 96b2809f..70ad35c9 100644
--- a/docs/tutorials/soar_tutorial/05.md
+++ b/docs/tutorials/soar_tutorial/05.md
@@ -1,3 +1,4 @@
+ {{tutorial_wip_warning("Soar.Tutorial.Part.5.-.Planning.and.Learning.pdf")}}

# Part V: Planning and Learning

@@ -73,25 +74,25 @@ without the indifferent preference.

```Soar sp {water-jug*propose*fill - (state ^name water-jug - ^jug ) - ( ^empty > 0) ---> - ( ^operator +) - ( ^name fill - ^jug )} + (state ^name water-jug + ^jug ) + ( ^empty > 0) + --> + ( ^operator +) + ( ^name fill + ^jug )} ``` Once that indifferent preference is removed, rerun your program and step through the first decision. You will discover that Soar automatically generates a new substate in situations where the operator preferences are insufficient to pick a single operator. This is called a _tie -impasse._ +impasse. When a tie impasse arises, a new substate is created that is very similar to the substate created in response to an operator no-change. Print out the substate and examine the augmentations. The key -differences are that an operator tie has `^choices multiple `, `^impasse +differences are that an operator tie has `^choices multiple`, `^impasse tie`, and `^item` augmentations for each of the tied operators. For the operator no-change impasses you used in TankSoar, the goal was @@ -139,7 +140,7 @@ Soar operators/rules that carry out the evaluation and comparison of operators using this approach. This set of generic rules can be used in a wide variety of problems for simple planning and they are available as part of the Soar release. These rules are part of the default rules that -come with the Soar release in the file Agents/default/selection.soar. +come with the Soar release in the file `Agents/default/selection.soar`. You should load these rules when you load in the Water-jug rules. You can do this by adding the following commands to your files (assuming that your program is in a subdirectory of the Agents directory): @@ -184,7 +185,7 @@ but those do not have to be represented in the Selection problem. If the evaluations had only a single value, namely the evaluation, then they could be represented as simple augmentations of the state, such as: -`^evaluation` success. However, the evaluations must also include the task +`^evaluation success`. However, the evaluations must also include the task operator that the evaluation refers to. Moreover, we will find it useful to have different types of evaluations that can be compared in different ways, such as symbolic evaluations (success, failure) or numeric @@ -196,11 +197,12 @@ Therefore, the state should consist of a set of evaluation augmentations, with each evaluation object having the following structure: -- operator `` the identifier of the task operator being evaluated -- symbolic-value success/partial-success/partial-failure/failure/indifferent -- numeric-value `[number]` -- value true indicates that there is either a symbolic or numeric value. -- desired `` the identifier of a desired state to use for evaluation if there is one +- `operator ` the identifier of the task operator being evaluated +- `symbolic-value success/partial-success/partial-failure/failure/indifferent` +- `numeric-value [number]` +- `value true` indicates that there is either a symbolic or numeric value. +- `desired ` the identifier of a desired state to use for evaluation if + there is one Another alternative to creating evaluations on the state is to create evaluations on the operators themselves. The advantage of creating @@ -219,12 +221,11 @@ but will allow us to test for the name of the state instead of the impasse type in all of the remaining rules. 
```Soar -sp {default\*selection*elaborate*name - -:default -(state ^type tie) +sp {default*selection*elaborate*name + :default + (state ^type tie) --> -( ^name selection)} + ( ^name selection)} ``` This rule uses a new bit of syntax, the :default, which tells Soar that @@ -241,12 +242,12 @@ does not have an evaluation with a value. The operator will first create an evaluation and later compute the value. ```Soar -Selection*propose*evaluate-operator +selection*propose*evaluate-operator ``` -If the state is named selection and there is an item that does not have -an evaluation with a value, then propose the evaluate-operator for that -item. +> If the state is named selection and there is an item that does not have +> an evaluation with a value, then propose the evaluate-operator for that +> item. The tricky part of translating this into a rule is the test that there is an item without an evaluation with a value. The item can be matched @@ -259,16 +260,16 @@ match only if there does not exist an evaluate with a value, giving: ```Soar sp {selection*propose*evaluate-operator - :default - (state ^name selection - ^item ) - -{(state ^evaluation ) - ( ^operator - ^value true)} + :default + (state ^name selection + ^item ) + -{(state ^evaluation ) + ( ^operator + ^value true)} --> ( ^operator +, =) ( ^name evaluate-operator - ^operator )} + ^operator )} ``` Given these conditions, once an evaluate-operator operator is selected, @@ -296,14 +297,14 @@ application simpler. The end result is the following: ```Soar ( ^evaluation - ^operator ) + ^operator ) ( ^superoperator - ^desired ) + ^desired ) ( ^name evaluate-operator - ^superoperator - ^evaluation - ^superstate - ^superproblem-space ) + ^superoperator + ^evaluation + ^superstate + ^superproblem-space ) ``` The desired augmentation is an object that describes the desired state @@ -337,10 +338,7 @@ applied to that state. If that new state can be evaluated, then the evaluation is returned as a result and added to the appropriate evaluation object. If no evaluation is possible, then problem solving in the _task_ (such as the Water Jug) continues until an evaluation can be -made. That may lead to more tie impasses, more substates - - - +made. That may lead to more tie impasses, more substates, ... The key computations that have to be performed are: 1. Creating the initial state @@ -348,7 +346,7 @@ The key computations that have to be performed are: 1. Applying the selected operator 1. Evaluating the result -### Creating the Initial State +#### Creating the Initial State The initial state must be a copy of the state in which the tie impasse arose. Copying down all of the appropriate augmentations can be done by @@ -362,59 +360,59 @@ augmentations dealing with state copying and their meaning: - default-state-copy no: Do not copy any augmentations automatically. - one-level-attributes: copies augmentations of the state and preserves their value. -Example: + Example: -```Soar -(p1 ^one-level-attributes color) (s1 ^color c1) (c1 ^hue green) -> -(s2 ^color c1) -``` + ```Soar + (p1 ^one-level-attributes color) (s1 ^color c1) (c1 ^hue green) -> + (s2 ^color c1) + ``` - two-level-attributes: copies augmentations of the state and creates new identifiers for values. Shared identifiers replaced with same new identifier. 
-Example: + Example: -```Soar -(p1 ^two-level-attributes color) (s1 ^color c1) (c1 ^hue green) -> -(s2 ^color c5) (c5 ^hue green) -``` + ```Soar + (p1 ^two-level-attributes color) (s1 ^color c1) (c1 ^hue green) -> + (s2 ^color c5) (c5 ^hue green) + ``` - all-attributes-at-level one: copies all attributes of state as one-level-attributes (except dont-copy ones and Soar created ones such as impasse, operator, superstate) -Example: + Example: -```Soar -(p1 ^all-attributes-at-level one) (s1 ^color c1) (s1 ^size big) -> -(s2 ^color c1) (s2 ^size big) -``` + ```Soar + (p1 ^all-attributes-at-level one) (s1 ^color c1) (s1 ^size big) -> + (s2 ^color c1) (s2 ^size big) + ``` - all-attributes-at-level two: copies all attributes of state as two-level-attributes (except dont-copy ones and Soar created ones such as impasse, operator, superstate) -Example: + Example: -```Soar -(p1 ^all-attributes-at-level two) (s1 ^color c1) (c1 ^hue green) -> -(s2 ^color c5) (c5 ^hue green) -``` + ```Soar + (p1 ^all-attributes-at-level two) (s1 ^color c1) (c1 ^hue green) -> + (s2 ^color c5) (c5 ^hue green) + ``` - dont-copy: will not copy that attribute. -Example: + Example: -```Soar -(p1 ^dont-copy size) -``` + ```Soar + (p1 ^dont-copy size) + ``` - don’t-copy-anything: will not copy any attributes -```Soar -(p1 ^dont-copy-anything yes) -``` + ```Soar + (p1 ^dont-copy-anything yes) + ``` If no augmentations relative to copying are included, the default is to do all-attributes-at-level one. The desired state is also copied over, @@ -428,11 +426,11 @@ empty: ```Soar (s1 ^jug j1 j2) (j1 ^volume 3 - ^contents 0 - ^empty 3) + ^contents 0 + ^empty 3) (j2 ^volume 5 - ^contents 0 - ^empty 5) + ^contents 0 + ^empty 5) ``` Is it sufficient to copy the augmentations of the state, or do the @@ -444,20 +442,22 @@ is: ```Soar sp {water-jug*apply*fill - (state ^operator - ^jug ) - ( ^name fill - ^jug ) - ( ^volume - ^contents 0 - ^empty ) ---> - ( ^contents - 0 - - ^empty 0 - - )} + (state ^operator + ^jug ) + ( ^name fill + ^jug ) + ( ^volume + ^contents 0 + ^empty ) + --> + ( ^contents + 0 - + ^empty 0 + - ) # (1)} ``` +1. To remove a working memory element, use `-`. + This rule modifies the contents and empty augmentations. So if this is used to apply the fill operator to the three-gallon jug using the working memory elements above, `(j1 ^contents 0)` and `(j1 ^empty 3)` will @@ -473,12 +473,12 @@ of original state initialization rule for Water Jug: ```Soar sp {water-jug*elaborate*problem-space - (state ^name water-jug) ---> - ( ^problem-space

) - (

^name water-jug - ^default-state-copy yes - ^two-level-attributes jug)} + (state ^name water-jug) + --> + ( ^problem-space

) + (

^name water-jug + ^default-state-copy yes + ^two-level-attributes jug)} ``` You could use ^all-attributes-at-level two instead, but it is best to @@ -502,19 +502,19 @@ would be: ```Soar sp {water-jug*apply*fill - (state ^operator - ^jug ) - ( ^name fill - ^jug ) - ( ^volume - ^contents 0 - ^empty ) ---> - ( ^jug - - ^jug ) - ( ^volume - ^contents - ^empty 0)} + (state ^operator + ^jug ) + ( ^name fill + ^jug ) + ( ^volume + ^contents 0 + ^empty ) + --> + ( ^jug - + ^jug ) + ( ^volume + ^contents + ^empty 0)} ``` The disadvantage of this approach is that it requires more changes to @@ -525,7 +525,7 @@ jug is created as opposed to just modifying the contents of the existing jug. For these two reasons, the first approach (two-level-attribute copying) is preferred. -### Selecting the operator being evaluated +#### Selecting the operator being evaluated Once the initial state of the evaluate-operator operator is created, a copy of the operator being evaluated needs to be selected. The reason a @@ -533,19 +533,19 @@ copy is necessary is two fold. First, the original operator may have augmentations that refer to objects in the original (non-copied) state. Therefore, a copy of the operator must be made with those new objects. In the Water Jug, the operator for fill has an augmentation for the jug -being filled: ( ^name fill ^jug j1). The second reason is that in +being filled: (` ^name fill ^jug j1`). The second reason is that in course of applying some operators, augmentations are added to the operator, which in this case would modify the original operator before it is applied. For these reasons, a duplicate of the operator is -automatically made (unless ^default-operator-copy no is an augmentation +automatically made (unless `^default-operator-copy no` is an augmentation on the problem space). The copying replaces any identifiers in the operator structure that were changed in the state copying process (in -this case j1 would be replaced by j3). +this case `j1` would be replaced by `j3`). Once a copy is made, additional rules reject all of the other proposed operators so that the copy will be selected. -### Applying the selected operator +#### Applying the selected operator Once the operator is selected, the rules for applying it will fire and modify the copied state. You do not need to add any addition rules or @@ -558,11 +558,11 @@ be only approximate because the eater does not have access to the complete world map so it cannot update sensors for squares it has not seen. -### Evaluating the result +#### Evaluating the result Once the new state is created, an evaluation can be made. An augmentation is added to the state in parallel to the operator being -applied: ^tried-tied-operator . This augmentation can be tested by +applied: `^tried-tied-operator `. This augmentation can be tested by rules to ensure that they are evaluating the result of applying the operator as opposed to the copy of the original state – although this will not work for operators that apply as a sequence of rules. 
@@ -580,14 +580,14 @@ detecting success should be modified to be as follows: ```Soar sp {water-jug*evaluate*state*success -(state ^desired -^problem-space.name water-jug -^jug ) -( ^jug ) -( ^volume ^contents ) -( ^volume ^contents ) ---> -( ^success )} + (state ^desired + ^problem-space.name water-jug + ^jug ) + ( ^jug ) + ( ^volume ^contents ) + ( ^volume ^contents ) + --> + ( ^success )} ``` In addition to success and failure, the selection rules can process @@ -611,13 +611,13 @@ preferences that are processed are as follows: - partial-failure: All paths from this state lead to failure. This is translated into a worst preference. -For numeric evaluations, an augmentation named ^numeric-value should be +For numeric evaluations, an augmentation named `^numeric-value` should be created for the evaluation object for an operator. We will discuss numeric evaluations in more detail in a future section. If you include your original rules, the selection rules, and the two new -rules described above (water-jug*elaborate*problem-space, -water-jug*evaluate*state\*success), your system will start doing a +rules described above (`water-jug*elaborate*problem-space`, +`water-jug*evaluate*state*success`), your system will start doing a look-ahead search. You will notice that at each operator selection there is a tie and a recursive stack of substates is created. If the solution is ever found, it is only used locally to select the operator for the @@ -655,37 +655,37 @@ following: ```Soar sp {Impasse__Operator_Tie*elaborate*superstate-set -(state ^superstate ) ---> -( ^superstate-set )} + (state ^superstate ) + --> + ( ^superstate-set )} ``` ```Soar sp {Impasse__Operator_Tie*elaborate*superstate-set2 -(state ^superstate.superstate-set ) ---> -( ^superstate-set )} + (state ^superstate.superstate-set ) + --> + ( ^superstate-set )} ``` The rule to detect failure is then as follows: ```Soar sp {water-jug*evaluate*state*failure*duplicate -(state ^name water-jug -^superstate-set -^jug -^jug -^tried-tied-operator) -( ^volume 5 ^contents ) -( ^volume 3 ^contents ) -( ^name water-jug -^desired -^jug -^jug ) -( ^volume 5 ^contents ) -( ^volume 3 ^contents ) ---> -( ^failure )} + (state ^name water-jug + ^superstate-set + ^jug + ^jug + ^tried-tied-operator) + ( ^volume 5 ^contents ) + ( ^volume 3 ^contents ) + ( ^name water-jug + ^desired + ^jug + ^jug ) + ( ^volume 5 ^contents ) + ( ^volume 3 ^contents ) + --> + ( ^failure )} ``` With this rule added, the problem solving will be more directed, but it @@ -699,7 +699,7 @@ should be included in the rule are templates of the working memory elements that were tested in the processing that produced the result. This is exactly what chunking does. -### Chunking +#### Chunking Chunking is invoked when a result is produced in a substate. The result will be the action of a new rule. Chunking examines the working memory @@ -727,10 +727,10 @@ working memory elements that were not necessary to produce the result, then the chunk will also test for them and the new rule will be overly specific. -To invoke chunking, type in: `learn –-on` at the prompt of the interaction -window or add `learn –-on` as a line in the file containing your rules. +To invoke chunking, type in: `chunk always` at the prompt of the interaction +window or add `chunk always` as a line in the file containing your rules. You will also want to see when chunks are created and fire, which is -enabled using: `watch -–chunks –-print`. Before rerunning your program, +enabled using: `watch --chunks`. 
Before rerunning your program, try to predict what types of rules will be learned. There should be two different types of rules because there are two different types of results: preferences created by comparing evaluations in the selection @@ -759,7 +759,7 @@ into trouble, try to figure it out, and then read the rest of this section. Hint: Do not forget to alter your move-mac-boat proposals so that they no longer have an indifferent preference. -### Add selection rules +#### Add selection rules The first step is to add the following to your program so that the selection rules are added in: @@ -770,7 +770,7 @@ source selection.soar popd ``` -### Add problem space and state copying information +#### Add problem space and state copying information The second step is to add the rule that defines the problem space and specifies how the state copying should be done for evaluation. If you @@ -781,15 +781,15 @@ is necessary. ```Soar sp {mac*elaborate*problem-space - (state ^name mac) ---> - ( ^problem-space

) - (

^name missionaries-and-cannibals - ^default-state-copy yes - ^two-level-attributes right-bank left-bank)} + (state ^name mac) + --> + ( ^problem-space

) + (

^name missionaries-and-cannibals + ^default-state-copy yes + ^two-level-attributes right-bank left-bank)} ``` -### Modify goal detection +#### Modify goal detection The third step is to modify the goal detection rule so that it creates a symbolic evaluation of success instead of printing a message and @@ -797,18 +797,18 @@ halting. ```Soar sp {mac*detect*state*success -(state ^desired -^ ) -( ^missionaries -^cannibals ) -( ^{ << right-bank left-bank >> } ) -( ^missionaries -^cannibals ) ---> -( ^success )} + (state ^desired + ^ ) + ( ^missionaries + ^cannibals ) + ( ^{ << right-bank left-bank >> } ) + ( ^missionaries + ^cannibals ) + --> + ( ^success )} ``` -### Modify failure detection +#### Modify failure detection Although there were no failure states in the Water Jug, there are in the Missionaries and Cannibals problem. The action of the rule that detects @@ -816,15 +816,15 @@ failure must be modified to create a symbolic value of failure. ```Soar sp {mac*evaluate*state*failure*more*cannibals - (state ^desired - ^<< right-bank left-bank >> ) - ( ^missionaries { > 0 } - ^cannibals > ) ---> - ( ^failure )} + (state ^desired + ^<< right-bank left-bank >> ) + ( ^missionaries { > 0 } + ^cannibals > ) + --> + ( ^failure )} ``` -### Add duplicate state detection rule +#### Add duplicate state detection rule The fifth step is to add a rule that detects when there are duplicate states in the state stack and evaluates the most recent one as a @@ -832,23 +832,23 @@ failure. ```Soar sp {mac*evaluate*state*failure*duplicate - (state ^desired - ^right-bank - ^left-bank ) - ( ^missionaries ^cannibals ^boat ) - ( ^missionaries ^cannibals ^boat ) - (state { <> } - ^right-bank - ^left-bank - ^tried-tied-operator) - ( ^missionaries ^cannibals ^boat ) - ( ^missionaries ^cannibals ^boat ) - -(state ^superstate ) + (state ^desired + ^right-bank + ^left-bank ) + ( ^missionaries ^cannibals ^boat ) + ( ^missionaries ^cannibals ^boat ) + (state { <> } + ^right-bank + ^left-bank + ^tried-tied-operator) + ( ^missionaries ^cannibals ^boat ) + ( ^missionaries ^cannibals ^boat ) + -(state ^superstate ) --> - ( ^failure )} + ( ^failure )} ``` -### Remove last-operator rules +#### Remove last-operator rules The last-operator rules are no longer necessary because the planning and duplicate state detect replace (and improve) the processing they were @@ -868,9 +868,9 @@ current rules for operator implementation, chunking can learn some control rules that are overgeneral, rejecting an operator that is on the path to the goal. Specifically, chunking can learn a rule that states: -If the current state has the boat on the right bank and two missionaries -and two cannibals, reject an operator that moves one missionary and the -boat to the left. +> If the current state has the boat on the right bank and two missionaries +> and two cannibals, reject an operator that moves one missionary and the +> boat to the left. 
This rule is learned when the operator to move one missionary to the left is evaluated in a evaluation subgoal and discovered to lead to a @@ -886,26 +886,26 @@ application rules for the number of types being modified by the operator ```Soar sp {mac*apply*move-mac-boat -(state ^operator ) -( ^name move-mac-boat -^{ << missionaries cannibals boat >> } -^bank -^types ) -( ^ -^other-bank ) -( ^ ) ---> -( ^ - -(- )) -( ^ - -(+ ))} + (state ^operator ) + ( ^name move-mac-boat + ^{ << missionaries cannibals boat >> } + ^bank + ^types ) + ( ^ + ^other-bank ) + ( ^ ) + --> + ( ^ - + (- )) + ( ^ - + (+ ))} ``` Once this is included, the learned chunked changes to be: -If the current state has the boat on the right bank and two missionaries -and two cannibals, reject an operator that moves _only_ one missionary -and the boat to the left. +> If the current state has the boat on the right bank and two missionaries +> and two cannibals, reject an operator that moves _only_ one missionary +> and the boat to the left. Now the system should successfully solve the problem, and after chunking, solving it in the minimal number of steps. @@ -932,43 +932,45 @@ initialization rule: ```Soar sp {mac*apply*initialize-mac -(state ^operator.name initialize-mac) ---> -( ^right-bank -^left-bank -^desired ) -( ^missionaries 0 -^cannibals 0 -^boat 0 -^other-bank ) -( ^missionaries 3 -^cannibals 3 -^boat 1 -^other-bank ) -( ^right-bank

-^better higher) -(
^missionaries 3 -^cannibals 3 -^boat 1)} + (state ^operator.name initialize-mac) + --> + ( ^right-bank + ^left-bank + ^desired ) + ( ^missionaries 0 + ^cannibals 0 + ^boat 0 + ^other-bank ) + ( ^missionaries 3 + ^cannibals 3 + ^boat 1 + ^other-bank ) + ( ^right-bank
+ ^better higher) # (1) + (
^missionaries 3 + ^cannibals 3 + ^boat 1)} ``` +1. New augmentation + The rule that computes the evaluation must test the desired state to determine which bank is the desired one. It must also match the number of missionaries and cannibals on that bank, as well as test that the -operator being evaluated has applied (^tried-tied-operator). The action +operator being evaluated has applied (`^tried-tied-operator`). The action of the operator is to create an augmentation on the state with the computed evaluation. ```Soar sp {mac*evaluate*state*number -(state ^desired -^tried-tied-operator -^ ) -( ^missionaries -^cannibals ) -( ^{ << right-bank left-bank >> } ) ---> -( ^numeric-value (+ ))} + (state ^desired + ^tried-tied-operator + ^ ) + ( ^missionaries + ^cannibals ) + ( ^{ << right-bank left-bank >> } ) + --> + ( ^numeric-value (+ ))} ``` Now run your system. Without chunking, the solution is found much faster diff --git a/docs/tutorials/soar_tutorial/06.md b/docs/tutorials/soar_tutorial/06.md index 7bcc105d..796cb48f 100644 --- a/docs/tutorials/soar_tutorial/06.md +++ b/docs/tutorials/soar_tutorial/06.md @@ -1,3 +1,4 @@ + {{tutorial_wip_warning("Soar.Tutorial.Part.6.-.Reinforcement.Learning.pdf")}} # Part VI: Reinforcement Learning @@ -19,7 +20,7 @@ one-shot form of learning that increases agent execution performance by summarizing sub-goal results, RL is an incremental form of learning that probabilistically alters agent behavior. -### Reinforcement Learning in Action +## Reinforcement Learning in Action Before we get to the nuts and bolts of RL, consider first an example of its effects. Left-Right is a simple agent that can choose to move either @@ -40,119 +41,110 @@ these actions. As you are unfamiliar with the particulars of RL agent design, either type the following code into your favorite editor or open the VisualSoar *left-right* project in the *Agents* directory: -Initialization +#### Initialization The agent stores directions and associated reward on the state ```Soar -sp {propose\*initialize-left-right -(state ^superstate nil - --^name) ---> -( ^operator +) -( ^name initialize-left-right) - +sp {propose*initialize-left-right + (state ^superstate nil + -^name) + --> + ( ^operator +) + ( ^name initialize-left-right) } ``` ```Soar -sp {apply\*initialize-left-right -(state ^operator ) -( ^name initialize-left-right) ---> -( ^name left-right -^direction -^location start) -( ^name left ^reward -1) -( ^name right ^reward 1) - +sp {apply*initialize-left-right + (state ^operator ) + ( ^name initialize-left-right) + --> + ( ^name left-right + ^direction + ^location start) + ( ^name left ^reward -1) + ( ^name right ^reward 1) } ``` -Move +#### Move The agent can move in any available direction. The chosen direction is stored on the state. 
```Soar sp {left-right*propose*move -(state ^name left-right -^direction.name -^location start) ---> -( ^operator +) -( ^name move -^dir ) - + (state ^name left-right + ^direction.name + ^location start) + --> + ( ^operator +) + ( ^name move + ^dir ) } ``` ```Soar -sp {left-right\*rl\*left -(state ^name left-right -^operator +) -( ^name move -^dir left) ---> -( ^operator = 0) - +sp {left-right*rl*left + (state ^name left-right + ^operator +) + ( ^name move + ^dir left) + --> + ( ^operator = 0) } ``` ```Soar -sp {left-right\*rl\*right -(state ^name left-right -^operator +) -( ^name move -^dir right) ---> -( ^operator = 0) - +sp {left-right*rl*right + (state ^name left-right + ^operator +) + ( ^name move + ^dir right) + --> + ( ^operator = 0) } ``` ```Soar sp {apply*move -(state ^operator -^location start) -( ^name move -^dir ) ---> -( ^location start - ) -(write (crlf) |Moved: | ) - + (state ^operator + ^location start) + ( ^name move + ^dir ) + --> + ( ^location start - ) + (write (crlf) |Moved: | ) } ``` -Reward +#### Reward When an agent chooses a direction, it is afforded the respective reward. ```Soar -sp {elaborate\*reward -(state ^name left-right -^reward-link -^location -^direction ) -( ^name ^reward ) ---> -( ^reward.value ) - +sp {elaborate*reward + (state ^name left-right + ^reward-link + ^location + ^direction ) + ( ^name ^reward ) + --> + ( ^reward.value ) } ``` -Conclusion +#### Conclusion When an agent chooses a direction, the task is over and the agent halts. ```Soar -sp {elaborate\*done -(state ^name left-right -^location {<> start}) ---> -(halt) - +sp {elaborate*done + (state ^name left-right + ^location {<> start}) + --> + (halt) } ``` @@ -167,9 +159,10 @@ Start the Soar Debugger and load the source for the Left-Right agent. By default, reinforcement learning is disabled. To enable this learning mechanism, enter the following commands: +```shell rl --set learning on - indifferent-selection –-epsilon-greedy +``` Note that these commands have been added to the *_firstload* file of the included *left-right* project. The first command turns learning on, @@ -179,33 +172,37 @@ Next, click the “Step" button. This will run Soar through the first cycle. You will note initialization has been chosen, no surprise. In the debugger, execute the following command: +```shell print --rl +``` This command shows you the numerical indifferent preferences in procedural memory subject to RL updating. The output is presented here: -left-right\*rl\*right 0. 0 - -left-right\*rl\*left 0. 0 +```shell +left-right*rl*right 0. 0 +left-right*rl*left 0. 0 +``` This result shows that the preference for the two operator instances after 0 updates have a value of 0. Click “Step” two more times, then execute *print --rl* again, to see RL in action: -left-right\*rl\*right 1. 0.3 - -left-right\*rl\*left 0. 0 +```shell +left-right*rl*right 1. 0.3 +left-right*rl*left 0. 0 +``` After applying the move operator, the numerical indifference value for the rule associated with moving right has now been updated 1 time to a value of 0.3. Note that since the move preferences are indifferent, and thus the decision process is made probabilistically, your agent may have decided to move left instead of right. In this case the -*left-right\*rl\*left* preference would have been updated 1 time with a +`*left-right*rl*left*` preference would have been updated 1 time with a value of -0.3. Now click the “Init-soar” button. This will reinitialize the agent. -Execute *print --rl*. 
Notice that the numeric indifferent values have +Execute `print --rl`. Notice that the numeric indifferent values have not changed from the previous run. Storing these values between runs is the method by which RL agents learn. Run the agent 20 more times, clicking the “Init-soar” button after each halted execution. You should @@ -213,7 +210,7 @@ notice the numeric indifference value for moving right increasing, while the value for moving left decreases. Correspondingly, you should notice the agent choosing to move left less frequently. -### Building a Learning Agent +## Building a Learning Agent Conversion of most agents to take advantage of reinforcement learning takes part in three stages: (1) use RL rules, (2) implement one or more @@ -224,23 +221,25 @@ the *Agents* directory. ### RL Rules -Rules that are recognized as updateable by the RL mechanism must abide +Rules that are recognized as updatable by the RL mechanism must abide by a specific syntax: ```Soar -sp {my\*proposal\*rule -(state ^operator + -^condition ) ---> -( ^operator = 2.3) - +sp {my*proposal*rule + (state ^operator + + ^condition ) + --> + ( ^operator = 2.3) } ``` The name of the rule can be anything and the left-hand side (LHS) of the rule, the conditions, may take any form. However, the right-hand side (RHS) must take the following form: + +```Soar ( ^operator = number) +``` To be specific, the RHS can only have one action, asserting a single numeric indifferent preference, and *number* must be a numeric constant @@ -260,39 +259,39 @@ creating new RL rules. Modification of the existing proposal rule is trivial: simply remove the “=” (equal) sign from the operator preference assertion action on the RHS: -```Soar +```Soar hl_lines="6" sp {water-jug*propose*empty -(state ^name water-jug -^jug ) -( ^contents > 0) ---> -( ^operator +) -( ^name empty -^empty-jug )} + (state ^name water-jug + ^jug ) + ( ^contents > 0) + --> + ( ^operator +) + ( ^name empty + ^empty-jug )} ``` -```Soar +```Soar hl_lines="6" sp {water-jug*propose*fill -(state ^name water-jug -^jug ) -( ^empty > 0) ---> -( ^operator +) -( ^name fill -^fill-jug )} + (state ^name water-jug + ^jug ) + ( ^empty > 0) + --> + ( ^operator +) + ( ^name fill + ^fill-jug )} ``` -```Soar +```Soar hl_lines="7" sp {water-jug*propose*pour -(state ^name water-jug -^jug { <> }) -( ^contents > 0 ) -( ^empty > 0) ---> -( ^operator +) -( ^name pour -^empty-jug -^fill-jug )} + (state ^name water-jug + ^jug { <> }) + ( ^contents > 0 ) + ( ^empty > 0) + --> + ( ^operator +) + ( ^name pour + ^empty-jug + ^fill-jug )} ``` To be clear, these modified rules propose their respective operators @@ -310,81 +309,80 @@ storing 2 units) when the 5-unit jug has 4 units could be written as follows: ```Soar -sp {water-jug*empty\*3\*2\*4 -(state ^name water-jug -^operator + -^jug ) -( ^name empty -^empty-jug.volume 3) -( ^volume 3 -^contents 2) -( ^volume 5 -^contents 4) ---> -( ^operator = 0) - +sp {water-jug*empty*3*2*4 + (state ^name water-jug + ^operator + + ^jug ) + ( ^name empty + ^empty-jug.volume 3) + ( ^volume 3 + ^contents 2) + ( ^volume 5 + ^contents 4) + --> + ( ^operator = 0) } ``` For simple agents, like the Left-Right agent above, enumerating all -state-action pair as RL rules by hand is plausible. However, the -Water-Jug agent requires $$ (3 * 2 * 4 * 6) = 144\ RL $$ rules to fully -represent this space. 
Since we can express these rules as the output of -a simple combinatorial pattern, we will use the Soar *gp* command to -generate all the rules we need: - -gp {rl\*water-jug*empty -(state ^name water-jug -^operator + -^jug ) -( ^name empty -^empty-jug.volume \[3 5\]) -( ^volume 3 -^contents \[0 1 2 3\]) -( ^volume 5 -^contents \[0 1 2 3 4 5\]) ---> -( ^operator = 0) +state-action pair as RL rules by hand is plausible. However, the Water-Jug +agent requires $(3 \cdot 2 \cdot 4 \cdot 6) = 144$ RL rules to fully represent +this space. Since we can express these rules as the output of a simple +combinatorial pattern, we will use the Soar +[`gp`](../../reference/cli/cmd_gp.md) command to generate all the rules we +need: +```Soar +gp {rl*water-jug*empty + (state ^name water-jug + ^operator + + ^jug ) + ( ^name empty + ^empty-jug.volume [3 5]) + ( ^volume 3 + ^contents [0 1 2 3]) + ( ^volume 5 + ^contents [0 1 2 3 4 5]) + --> + ( ^operator = 0) } ``` -gp {rl\*water-jug*fill -(state ^name water-jug -^operator + -^jug ) -( ^name fill -^fill-jug.volume \[3 5\]) -( ^volume 3 -^contents \[0 1 2 3\]) -( ^volume 5 -^contents \[0 1 2 3 4 5\]) ---> -( ^operator = 0) - -} +```Soar +gp {rl*water-jug*fill + (state ^name water-jug + ^operator + + ^jug ) + ( ^name fill + ^fill-jug.volume [3 5]) + ( ^volume 3 + ^contents [0 1 2 3]) + ( ^volume 5 + ^contents [0 1 2 3 4 5]) + --> + ( ^operator = 0) + } ``` -gp {rl\*water-jug*pour -(state ^name water-jug -^operator + -^jug ) -( ^name pour -^empty-jug.volume \[3 5\]) -( ^volume 3 -^contents \[0 1 2 3\]) -( ^volume 5 -^contents \[0 1 2 3 4 5\]) ---> -( ^operator = 0) - +```Soar +gp {rl*water-jug*pour + (state ^name water-jug + ^operator + + ^jug ) + ( ^name pour + ^empty-jug.volume [3 5]) + ( ^volume 3 + ^contents [0 1 2 3]) + ( ^volume 5 + ^contents [0 1 2 3 4 5]) + --> + ( ^operator = 0) } ``` -Note that had the rules required a more complex pattern for generation, -or had we not known all required rules at agent design time, we would -have made use of rule *templates* (see the Soar Manual for more -details). +Note that had the rules required a more complex pattern for generation, or had +we not known all required rules at agent design time, we would have made use of +[rule templates](../../soar_manual/05_ReinforcementLearning.md#rule-templates). ### Reward Rules @@ -392,7 +390,7 @@ Reward rules are just like any other Soar rule, except that they modify the *reward-link* structure of the state to reflect reward associated with the agent’s current operator decision. Reward values must be stored on the *value* element of the *reward* attribute of the *reward-link* -identifier (state.reward-link.reward.value). +identifier (`state.reward-link.reward.value`). Of significant note, Soar does not remove or modify structures within the *reward-link*, including old reward values. It is the agent’s @@ -407,16 +405,16 @@ For the Water-Jug agent, we will provide reward only when the agent has achieved the goal. 
This entails making a minor modification to the goal-test rule: -```Soar -sp {water-jug*detect*goal\*achieved -(state ^name water-jug -^jug -^reward-link ) -( ^volume 3 ^contents 1) ---> -(write (crlf) |The problem has been solved.|) -( ^reward.value 10) -(halt)} +```Soar hl_lines="4 8" +sp {water-jug*detect*goal*achieved + (state ^name water-jug + ^jug + ^reward-link ) + ( ^volume 3 ^contents 1) + --> + (write (crlf) |The problem has been solved.|) + ( ^reward.value 10) + (halt)} ``` Now load this code into the debugger and run it a few times (if loading @@ -427,42 +425,33 @@ command to see the numeric indifferent values of the RL rules generated by the *gp* command. You can right-click and print any of these rules to see their conditions. -### Further Exploration +## Further Exploration -Consider the following output from a run (watch level 0) of the learning +Consider the following output from a run (`watch level 0`) of the learning left-right agent from section 1: +```shell run - Moved: right - This Agent halted. - An agent halted during the run. init-soar - Agent reinitialized. run - Moved: right - This Agent halted. - An agent halted during the run. init-soar - Agent reinitialized. run - Moved: left - This Agent halted. - An agent halted during the run. +``` You should notice that at run 3 moving left is selected. By this point moving right has an obvious advantage in numerical preference values, @@ -473,24 +462,21 @@ considered less-than-preferred may lead you down a useful path. Soar allows you to tune your level of exploring these alternate paths using the *indifferent-selection* command. -In the Soar Debugger, type “*indifferent-selection --stats”* (sans -quotes). The result should look like this: +In the Soar Debugger, type `indifferent-selection --stats`. The result should +look like this: +```shell Exploration Policy: epsilon-greedy - Automatic Policy Parameter Reduction: off epsilon: 0.1 - epsilon Reduction Policy: exponential - epsilon Reduction Rate (exponential/linear): 1/0 temperature: 25 - temperature Reduction Policy: exponential - temperature Reduction Rate (exponential/linear): 1/0 +``` This command prints the current exploration policy as well as a number of tuning parameters. There are five exploration policies: *boltzmann*, @@ -498,7 +484,9 @@ of tuning parameters. There are five exploration policies: *boltzmann*, exploration policy by issuing the following command (where “policy_name” should be replaced with one of the policies above): +```shell indifferent-selection --policy_name +``` This tutorial will only discuss the *epsilon-greedy* policy. For information on the other policies you should read the Soar manual. @@ -507,9 +495,11 @@ experimentation to allow parameter-controlled exploration of operators not currently recognized as most preferred. This policy is controlled by the *epsilon* parameter. The policy is summarized as such: +```shell With ( 1 - epsilon ) probability, the most preferred operator is to be chosen. With epsilon probability, a random selection of all indifferent operators is made. +``` When Soar is first started, the default exploration policy is *softmax*. However, the first time RL is enabled, the architecture automatically @@ -520,7 +510,9 @@ value is chosen, while the remaining 10% of the time a random selection is made from all acceptable proposed operators. 
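Taking the policy summary above at face value (and assuming the default *epsilon*
of 0.1 together with the Left-Right agent's two indifferent move operators), the
selection probabilities work out as follows; note that the random selection made
with probability *epsilon* can itself land on the most preferred operator:

$$
P(\text{best}) = (1 - \epsilon) + \frac{\epsilon}{2} = 0.9 + 0.05 = 0.95,
\qquad
P(\text{other}) = \frac{\epsilon}{2} = 0.05
$$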
You can change the *epsilon* value by issuing the following command: +```shell indifferent-selection --epsilon +``` Acceptable values for *epsilon* are numbers between 0 and 1 (inclusive). You may note, by the definition, that a value of 0 will eliminate the diff --git a/docs/tutorials/soar_tutorial/07.md b/docs/tutorials/soar_tutorial/07.md index c46f0214..978fc73c 100644 --- a/docs/tutorials/soar_tutorial/07.md +++ b/docs/tutorials/soar_tutorial/07.md @@ -1,3 +1,4 @@ + {{tutorial_wip_warning("Soar.Tutorial.Part.7.-.SMem.pdf")}} # Part VII: Semantic Memory @@ -15,11 +16,11 @@ example of preloading knowledge and viewing the contents of the memory. First, open the Soar Debugger. Then, execute the following command (this can be loaded from a source file just as any other Soar command): -```bash +```shell smem --add { -( ^name alice ^friend ) -( ^name bob ^friend ) -( ^name charley) + ( ^name alice ^friend ) + ( ^name bob ^friend ) + ( ^name charley) } ``` @@ -29,13 +30,16 @@ preload the contents of large knowledge bases in Soar. We can view the contents of semantic memory using the following command: -```bash +```shell print @ +``` Which will output the following result: -(@1 ^friend @2 ^name alice \[+0.000\]) -(@2 ^friend @1 ^name bob \[+0.000\]) -(@3 ^name charley \[+0.000\]) + +```Soar +(@1 ^friend @2 ^name alice [+0.000]) +(@2 ^friend @1 ^name bob [+0.000]) +(@3 ^name charley [+0.000]) ``` Note first that the variables from the _smem --add_ command have been @@ -54,22 +58,21 @@ To pictorially view the contents of semantic memory, we can use the visualize command to render the contents of semantic memory to an image. For example, execute the following command: -```bash +```shell visualize smem ``` -If you have graphviz and DOT installed (see for -more detail), it should launch your system viewer to show a diagram -similar to: +If you have [graphviz](http://graphviz.org) and DOT installed it should launch +your system viewer to show a diagram similar to: -![](Images/07/image1.png) +![Example visualization of SMem](Images/07/image1.png) Now that we have seen the contents of semantic memory, you can confirm that none of this knowledge is present in any of the agent’s other memories. For instance, execute the following commands to print the contents of working and procedural memories: -```bash +```shell print --depth 100 print ``` @@ -81,14 +84,14 @@ can access and modify the store over time. We are now done with this example and wish to clear the semantic store. To do this we issue a special command: -```bash +```shell smem --clear ``` The agent is now reinitialized, as you can verify by printing the contents of working memory, procedural memory, and now semantic memory. -### Agent Interaction +## Agent Interaction Agents interact with semantic memory via special structures in working memory. Soar automatically creates an _smem_ link on each state, and @@ -96,14 +99,14 @@ each _smem_ link has specialized substructure: a _command_ link for agent-initiated actions and a _result_ link for feedback from semantic memory. For instance, issue the following command: -```bash +```shell print --depth 10 ``` If you read the output carefully you will notice a WME that can be -generally represented as ( ^smem ) and two additional -WMEs that can be represented as ( ^command ) and -( ^result ). +generally represented as (` ^smem `) and two additional +WMEs that can be represented as (` ^command `) and +(` ^result `). 
As described in the following sections, the agent, via rules, populates and maintains the _command_ link and the architecture populates and @@ -113,11 +116,11 @@ For the agent to interact with semantic memory, this mechanism must be enabled. By default, all learning mechanisms in Soar are disabled. To enable semantic memory, issue the following command: -```bash +```shell smem --enable ``` -### Agent Storage and Modification +## Agent Storage and Modification An agent stores an object to semantic memory by issuing a _store_ command. The syntax of a store command is (` ^store `) where @@ -142,126 +145,115 @@ Let’s see an example. Source the following rules into the Soar Debugger directory). ```Soar -sp {propose\*init -(state ^superstate nil - --^name) ---> -( ^operator +) -( ^name init)} +sp {propose*init + (state ^superstate nil + -^name) + --> + ( ^operator +) + ( ^name init)} ``` ```Soar -sp {apply\*init -(state ^operator.name init -^smem.command ) ---> -( ^name friends) -( ^store ) -( ^name alice ^friend ) -( ^name bob ^friend ) -( ^name charley)} +sp {apply*init + (state ^operator.name init + ^smem.command ) + --> + ( ^name friends) + ( ^store ) + ( ^name alice ^friend ) + ( ^name bob ^friend ) + ( ^name charley)} ``` ```Soar -sp {propose\*mod -(state ^name friends -^smem.command ) -( ^store ) -( ^name alice) -( ^name bob) -( ^name charley) ---> -( ^operator +) -( ^name mod)} +sp {propose*mod + (state ^name friends + ^smem.command ) + ( ^store ) + ( ^name alice) + ( ^name bob) + ( ^name charley) + --> + ( ^operator +) + ( ^name mod)} ``` ```Soar -sp {apply\*mod -(state ^operator.name mod -^smem.command ) -( ^store ) -( ^name alice) -( ^name bob) -( ^name charley) ---> -( ^name alice -) -( ^name anna -^friend ) -( ^store -) -( ^store -)} +sp {apply*mod + (state ^operator.name mod + ^smem.command ) + ( ^store ) + ( ^name alice) + ( ^name bob) + ( ^name charley) + --> + ( ^name alice -) + ( ^name anna + ^friend ) + ( ^store -) + ( ^store -)} ``` -Now click the “Step” button to run till the decision phase and notice -that the _init_ operator is selected. Now, click the “Watch 5” button -and then the “Run 1 -p” button to watch as the operator is applied. +Now click the "Step" button to run till the decision phase and notice +that the _init_ operator is selected. Now, click the "Watch 5" button +and then the "Run 1 -p" button to watch as the operator is applied. Below is part of the trace that should be produced. If you do not see this part of this trace in your run, be sure that you enabled semantic memory (see section above). 
-```bash +```Soar --- apply phase --- --- Firing Productions (PE) For State At Depth 1 --- -Firing apply\*init -\+ (C3 ^name charley + :O ) (apply\*init) -\+ (B1 ^friend A1 + :O ) (apply\*init) -\+ (B1 ^name bob + :O ) (apply\*init) -\+ (A1 ^friend B1 + :O ) (apply\*init) -\+ (A1 ^name alice + :O ) (apply\*init) -\+ (C2 ^store C3 + :O ) (apply\*init) -\+ (C2 ^store B1 + :O ) (apply\*init) -\+ (C2 ^store A1 + :O ) (apply\*init) -\+ (S1 ^name friends + :O ) (apply\*init) ---- Change Working Memory (PE) --- -\=>WM: (25: C3 ^name charley) -\=>WM: (24: B1 ^friend A1) -\=>WM: (23: B1 ^name bob) -\=>WM: (22: A1 ^friend B1) -\=>WM: (21: A1 ^name alice) -\=>WM: (20: C2 ^store A1) -\=>WM: (19: C2 ^store B1) -\=>WM: (18: C2 ^store C3) -\=>WM: (17: S1 ^name friends) +Firing apply*init ++ (C3 ^name charley + :O ) (apply*init) ++ (B1 ^friend A1 + :O ) (apply*init) ++ (B1 ^name bob + :O ) (apply*init) ++ (A1 ^friend B1 + :O ) (apply*init) ++ (A1 ^name alice + :O ) (apply*init) ++ (C2 ^store C3 + :O ) (apply*init) ++ (C2 ^store B1 + :O ) (apply*init) ++ (C2 ^store A1 + :O ) (apply*init) ++ (S1 ^name friends + :O ) (apply*init) + --- Change Working Memory (PE) --- +=>WM: (25: C3 ^name charley) +=>WM: (24: B1 ^friend A1) +=>WM: (23: B1 ^name bob) +=>WM: (22: A1 ^friend B1) +=>WM: (21: A1 ^name alice) +=>WM: (20: C2 ^store A1) +=>WM: (19: C2 ^store B1) +=>WM: (18: C2 ^store C3) +=>WM: (17: S1 ^name friends) ``` -Notice that the _apply\*init_ rule fired and added 3 _store_ commands to -working memory, where the identifiers to be stored are, initially, not -long-term, and whose augmentations mirror the contents of the _smem ---add_ command in Part 1 of this tutorial. Then, at the end of the -elaboration phase, semantic memory processed the command, converted the -identifiers to long-term, and added status for each command. +Notice that the `apply*init` rule fired and added 3 _store_ commands to working +memory, where the identifiers to be stored are, initially, not long-term, and +whose augmentations mirror the contents of the `smem --add` command in +[Part 1](./01.md) of this tutorial. Then, at the end of the elaboration phase, +semantic memory processed the command, converted the identifiers to long-term, +and added status for each command. -Now, try printing the contents of semantic memory using the _print @_ +Now, try printing the contents of semantic memory using the `print @` command. You will see that semantic memory now has the same contents as -after using the _smem --add_ command in Part 1. +after using the `smem --add` command in Part 1. Application of the next operator modifies the contents of semantic memory by overriding the contents of an existing long-term identifier -(`@1`). Click the “Step” button to select the next operator (_mod_) and -then click the “Run 1 -p" button to apply the operator: - -``` -Firing apply\*mod - -- (A1 ^name alice + :O ) (apply\*init) - -- (C2 ^store B1 + :O ) (apply\*init) - -- (C2 ^store C3 + :O ) (apply\*init) - -\+ (A1 ^friend C3 + :O ) (apply\*mod) +(`@1`). 
Click the "Step" button to select the next operator (_mod_) and +then click the "Run 1 -p" button to apply the operator: -\+ (A1 ^name anna + :O ) (apply\*mod) +```Soar +Firing apply*mod +- (A1 ^name alice + :O ) (apply*init) +- (C2 ^store B1 + :O ) (apply*init) +- (C2 ^store C3 + :O ) (apply*init) ++ (A1 ^friend C3 + :O ) (apply*mod) ++ (A1 ^name anna + :O ) (apply*mod) --- Change Working Memory (PE) --- - -\=>WM: (33: A1 ^name anna) - -\=>WM: (32: A1 ^friend C3) - +=>WM: (33: A1 ^name anna) +=>WM: (32: A1 ^friend C3) <=WM: (21: A1 ^name alice) - <=WM: (18: C2 ^store C3) - <=WM: (19: C2 ^store B1) ``` @@ -270,24 +262,24 @@ removed by the application rule, and that augmentations of `@1` are removed and added. Then, at the end of the elaboration phase, semantic memory cleans up the status information for the old _store_ commands. -Now, print the contents of semantic memory using the _print @_ command: +Now, print the contents of semantic memory using the `print @` command: -``` -(@1 ^friend @2 @3 ^name anna \[+1.000\]) -(@2 ^friend @1 ^name bob \[+1.000\]) -(@3 ^name charley \[+1.000\]) +```Soar +(@1 ^friend @2 @3 ^name anna [+1.000]) +(@2 ^friend @1 ^name bob [+1.000]) +(@3 ^name charley [+1.000]) ``` Notice that the augmentations of `@1` have indeed changed in semantic memory to reflect the new _store_ command, while `@2` and `@3` remain unchanged. -### Non-Cue-Based Retrieval +## Non-Cue-Based Retrieval The first way an agent can retrieve knowledge from semantic memory is called a non-cue-based retrieval: the agent requests from semantic memory all of the augmentations of a known long-term identifier. The -syntax of the command is ( ^retrieve ) where is a +syntax of the command is (` ^retrieve `) where `` is a short-term identifier that is linked to a long-term identifier. In other words, it is a short-term identifier that was previously used in a store command or recalled via a retrieve or query command. @@ -297,39 +289,39 @@ this tutorial (these rules are already part of the _smem-tutorial.soar_ file in the _Agents_ directory): ```Soar -sp {propose\*ncb-retrieval -(state ^name friends -^smem.command ) -( ^store ) -( ^name anna -^friend ) ---> -( ^operator + =) -( ^name ncb-retrieval -^friend )} +sp {propose*ncb-retrieval + (state ^name friends + ^smem.command ) + ( ^store ) + ( ^name anna + ^friend ) + --> + ( ^operator + =) + ( ^name ncb-retrieval + ^friend )} ``` ```Soar -sp {apply\*ncb-retrieval\*retrieve -(state ^operator -^smem.command ) -( ^name ncb-retrieval -^friend ) -( ^store ) ---> -( ^store - -^retrieve )} +sp {apply*ncb-retrieval*retrieve + (state ^operator + ^smem.command ) + ( ^name ncb-retrieval + ^friend ) + ( ^store ) + --> + ( ^store - + ^retrieve )} ``` ```Soar -sp {apply\*ncb-retrieval\*clean -(state ^operator -^smem.command ) -( ^name ncb-retrieval -^friend ) -( ^ ) ---> -( ^ -)} +sp {apply*ncb-retrieval*clean + (state ^operator + ^smem.command ) + ( ^name ncb-retrieval + ^friend ) + ( ^ ) + --> + ( ^ -)} ``` These rules retrieve all the information about one of `@1`’s two friends @@ -340,15 +332,15 @@ Unlike _store_ commands, all retrievals are processed during the agent’s output phase and only one retrieval command can be issued per state per decision. -Now click the “Step” button and notice that one of the two _ncb_ -operators is selected. Click “Run 1 -p" to see the application rule +Now click the "Step" button and notice that one of the two _ncb_ +operators is selected. 
Click "Run 1 -p" to see the application rule create a _retrieve_ command, requesting information about one of the two friends, as well as remove that friend’s augmentations from working -memory. Then click the “Run 1 -p" button again to proceed through the +memory. Then click the "Run 1 -p" button again to proceed through the output phase. Finally, print the full contents of the _smem_ link -(_print --depth 10 L1_): +(`print --depth 10 L1`): -```bash +```Soar (L1 ^command C2 ^result R3) (C2 ^depth 3 ^retrieve B1 (@2)) (R3 ^retrieved L2 (@2) ^success B1 (@2)) @@ -369,7 +361,7 @@ retrieved knowledge is limited to the augmentations of the long-term identifier: like the _store_ command, the _retrieve_ command is not recursive. -### Cue-Based Retrieval +## Cue-Based Retrieval The second way an agent can retrieve knowledge from semantic memory is called a cue-based retrieval: the agent requests from semantic memory @@ -386,57 +378,57 @@ short-term identifier, then any retrieval is required to have an augmentation that has the same attribute, but the value is unconstrained. -As an example, add the following two rules to our agent from Part 4 of +As an example, add the following two rules to our agent from [Part 4](04.md) of this tutorial (these rules are already part of the _smem-tutorial.soar_ file in the _Agents_ directory): ```Soar -sp {propose\*cb-retrieval -(state ^name friends -^smem.command ) -( ^retrieve) ---> -( ^operator + =) -( ^name cb-retrieval)} +sp {propose*cb-retrieval + (state ^name friends + ^smem.command ) + ( ^retrieve) + --> + ( ^operator + =) + ( ^name cb-retrieval)} ``` ```Soar -sp {apply\*cb-retrieval -(state ^operator -^smem.command ) -( ^name cb-retrieval) -( ^retrieve ) ---> -( ^retrieve - -^query ) -( ^name -^friend )} +sp {apply*cb-retrieval + (state ^operator + ^smem.command ) + ( ^name cb-retrieval) + ( ^retrieve ) + --> + ( ^retrieve - + ^query ) + ( ^name + ^friend )} ``` These rules retrieve an identifier that meets two constraints: (1) it -has an augmentation where the attribute is “name”, but the value can be +has an augmentation where the attribute is "name", but the value can be any symbol, and (2) it has an augmentation where the attribute is -“friend” and the value is the long-term identifier retrieved as a -result of applying the operator in Part 3. +"friend" and the value is the long-term identifier retrieved as a +result of applying the operator in [Part 3](03.md). As a reminder, all retrievals are processed during the agent’s output phase and only one retrieval command can be issued per state per decision. -So now click the “Step” button and then click the “Run 1 -p" to see the +So now click the "Step" button and then click the "Run 1 -p" to see the application rule create a _query_ command, as well as remove the -previous _retrieve_ command from working memory. Then click the “Run 1 +previous _retrieve_ command from working memory. Then click the "Run 1 -p" button again to proceed through the output phase. 
Finally print the -contents of the _smem_ link (_print --depth 10 L1_): +contents of the _smem_ link (`print --depth 10 L1`): -``` +```Soar (L1 ^command C2 ^result R3) -(C2 ^depth 3 ^query C4) -(C4 ^friend B1 (@2) ^name A2) -(R3 ^retrieved L5 (@1) ^success C4) -(L7 ^name charley) -(L6 ^friend L5 (@1) ^name bob) -(L5 ^friend L6 (@2) ^friend L7 (@3) ^name anna) + (C2 ^depth 3 ^query C4) + (C4 ^friend B1 (@2) ^name A2) + (R3 ^retrieved L5 (@1) ^success C4) + (L7 ^name charley) + (L6 ^friend L5 (@1) ^name bob) + (L5 ^friend L6 (@2) ^friend L7 (@3) ^name anna) ``` We see that semantic memory has retrieved and added to working memory @@ -450,10 +442,10 @@ _failure_ and there would be no retrieved structure. Note also that retrieved knowledge is limited to the augmentations of the long-term identifier: like the store command, retrievals are not recursive. We see this in the outputs above as one friend has augmentations (as a result -of the _retrieve_ command in Part 4), whereas the other does not. +of the _retrieve_ command in [Part 4](04.md)), whereas the other does not. If multiple identifiers had satisfied the constraints of the cue (such -as if the cue had only a WME with “name” as the attribute and a +as if the cue had only a WME with "name" as the attribute and a short-term identifier as the value), then the long-term identifier with the largest bias value is returned. By default, the bias value is a monotonically increasing integer, reflecting the recency of the last diff --git a/docs/tutorials/soar_tutorial/08.md b/docs/tutorials/soar_tutorial/08.md index 9d2a478b..0e0ef95f 100644 --- a/docs/tutorials/soar_tutorial/08.md +++ b/docs/tutorials/soar_tutorial/08.md @@ -1,3 +1,4 @@ + {{tutorial_wip_warning("Soar.Tutorial.Part.8.-.EpMem.pdf")}} # Part VIII: Episodic Memory @@ -11,35 +12,42 @@ procedural memory. ## A Short Demonstration -Before we delve into how an agent can use episodic memory, let’s see an +Before we delve into how an agent can use episodic memory, let's see an example of capturing an episode and viewing the contents of the memory. First, open the Soar Debugger. Then, execute the following command (this can be loaded from a source file just as any other Soar command): +```shell epmem --set trigger dc - epmem --set learning on - watch --epmem +``` -Now, click the “Step” button twice. If we inspect the trace, and ignore +Now, click the "Step" button twice. If we inspect the trace, and ignore the state no-change impasses, we see the following message: +```shell NEW EPISODE: 1 +``` This is an indication that a new episode, with id 1, has been automatically stored by the architecture within the episodic store. -We can view the contents of episodic memory using the _epmem --print_ +We can view the contents of episodic memory using the `epmem --print` command, which expects an episode id as an argument. For example, execute the following command: +```shell epmem --print 1 +``` Which will output the following result: + +```Soar ( ^io ^reward-link ^superstate nil ^type state) ( ^input-link ^output-link ) +``` To pictorially view the contents of semantic memory, we can use the visualize command to render the contents of semantic memory to an image. 
@@ -49,11 +57,10 @@ For example, execute the following command: visualize epmem ``` -If you have graphviz and DOT installed (see for -more detail), it should launch your system viewer to show a diagram -similar to: +If you have [graphviz](http://graphviz.org) and DOT installed, it should launch +your system viewer to show a diagram similar to: -![](Images/08/08/image1.png) +![Example visualization of episodic memory](Images/08/image1.png) From both the trace output as well as the Graphviz rendering we can see that episodic memory has stored most of the top-state of the agent’s @@ -61,15 +68,15 @@ working memory at a particular moment in time. In the following sections we’ll examine in more detail how to control automatic storage and how agents can retrieve episodic knowledge. -### Episodic Storage +## Episodic Storage -As we saw in Part 1 of this tutorial, episodic storage is automatic and +As we saw in [Part 1](01.md) of this tutorial, episodic storage is automatic and captures the top state of the agent’s working memory. To enable storage, episodic memory must be enabled. By default, all learning mechanisms in Soar are disabled. To enable episodic memory, issue the following command: -```bash +```shell epmem --set learning on ``` @@ -77,11 +84,11 @@ There are a few architectural parameters that are important to control episodic storage. The first is the event that triggers storage. By default, episodic memory stores new episodes whenever a WME is added to working memory that has the _output-link_ as its identifier. However, -Soar also supports storing episodes each decision cycle (“dc”), which is -enabled using the following command (which we used in Part 1 of this +Soar also supports storing episodes each decision cycle ("dc"), which is +enabled using the following command (which we used in [Part 1](01.md) of this tutorial): -```bash +```shell epmem --set trigger dc ``` @@ -91,7 +98,7 @@ default, this processing occurs at the end of the _output_ phase. However, Soar also supports this processing occurring at the end of the _decision_ phase, which is enabled using the following command: -```bash +```shell epmem --set phase selection ``` @@ -103,34 +110,34 @@ working memory, it does not store that WME, nor any substructure if it was the case that the value of the WME was an identifier. To view the current excluded set, issue the following command: -```bash +```shell epmem --get exclusions ``` To change the excluded set, issue the following command: -```bash +```shell epmem --set exclusions ``` This command toggles the state of an _attribute_ within the set: thus if this command is executed with an _attribute_ that is already in the excluded set, it is removed from the set, otherwise it is added. By -default, “epmem” and “smem” are in the excluded set, which is why we do -not see these architectural links in the trace/visualization in Part 1 +default, "epmem" and "smem" are in the excluded set, which is why we do +not see these architectural links in the trace/visualization in [Part 1](01.md) of this tutorial. -In Part 1, we also enabled trace output that is useful for understanding +In [Part 1](01.md), we also enabled trace output that is useful for understanding episodic memory via the following command: -```bash +```shell watch --epmem ``` This trace option indicates when new episodes are recorded, as well as debugging information for retrievals, as discussed later. 
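Putting the storage settings from this section together, a typical
initialization for an agent that records an episode every decision cycle might
look like the following (a sketch assembled from the commands shown above;
adjust the trigger, phase, and exclusions to suit your agent):

```shell
# enable episodic memory (all learning mechanisms are off by default)
epmem --set learning on

# store a new episode every decision cycle rather than on output-link changes
epmem --set trigger dc

# trace episode storage and retrieval activity
watch --epmem
```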
-### Agent Interaction +## Agent Interaction Agents interact with episodic memory via special structures in working memory. Soar automatically creates an _epmem_ link on each state, and @@ -143,9 +150,9 @@ print --depth 10 ``` If you read the output carefully you will notice a WME that can be -generally represented as ( ^epmem ) and three -additional WMEs that can be represented as ( ^command ), -( ^result ), and ( ^present-id ) +generally represented as (` ^epmem `) and three +additional WMEs that can be represented as (` ^command `), +(` ^result `), and (` ^present-id `) As described in the following sections, the agent, via rules, populates and maintains the _command_ link and the architecture populates and @@ -154,12 +161,12 @@ augmentation updates to indicate the current episode id, the value of which is a positive integer. For the agent to interact with episodic memory, this mechanism must be -enabled. As mentioned in Part 2, by default, all learning mechanisms in +enabled. As mentioned in [Part 2](02.md), by default, all learning mechanisms in Soar are disabled and so you must enable episodic memory via the command -in Part 2. +in [Part 2](02.md). By default, all commands are processed during the agent’s output phase -(this can be changed using the _phase_ parameter, as described in Part 2 +(this can be changed using the _phase_ parameter, as described in [Part 2](02.md) of this tutorial) and only one command can be issued per state per decision. @@ -168,7 +175,7 @@ decision. The primary method that an agent can retrieve knowledge from episodic memory is called a cue-based retrieval: the agent requests from episodic memory an episode that most closely matches a cue of working-memory elements. The syntax of the -command is ( ^query ), where forms the root of the cue. +command is (` ^query `), where `` forms the root of the cue. Conceptually, episodic memory compares the cue to all episodes in the store, scoring each one, and returns the most recent episode with the maximal score. @@ -188,33 +195,38 @@ number of satisfied leaf WMEs. Let us consider an example cue, composed of the following WMEs, where N1 is the value of the _query_ command, as described above: + +```Soar (N1 ^feature value -^id N2) + ^id N2) (N2 ^sub-feature value2 -^sub-id N3) + ^sub-id N3) +``` Or, visually: -![](Images/08/image2.png) +![Example cue](Images/08/image2.png) -This cue has three leaf WMEs: (N1 ^feature value), (N2 ^sub-feature -value2), and (N2 ^id N3). Now consider the following episode: +This cue has three leaf WMEs: (`N1 ^feature value`), +(`N2 ^sub-feature value2`), and (`N2 ^id N3`). Now consider the following +episode: -![](Images/08/image3.png) +![Episode to match cue](Images/08/image3.png) -The first leaf WME of the cue, (N1 ^feature value), is not satisfied by this -episode, as there is no (E1 ^feature value) WME: (E1 ^feature2 value) has a -different attribute and (E1 ^feature value3) has a different value. Both other -leaf WMEs, however, are satisfied. (N2 ^sub-feature value2) is satisfied by -variablizing E1 as N1 and E2 as N2: (E1 ^id E2) and (E2 ^sub-feature value2). -(N2 ^id N3) is satisfied by variablizing E1 as N1, E3 as N2, and E5 as N3: (E1 -^id E3), (E3 ^sub-id E5). Note that the substructure of E4 in the episode -matches that of N2 in the cue, but there is no WME (E1 ^id E4), and so E4 is not -considered. Thus, this episode, with respect to the cue, has a score of 2. 
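In other words (restating the walk-through above as arithmetic), the
unsatisfied `feature` leaf contributes 0 while the two satisfied leaves
contribute 1 each:

$$
\text{score} = 0 + 1 + 1 = 2, \qquad \text{cue size} = 3
$$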
+The first leaf WME of the cue, (`N1 ^feature value`), is not satisfied by this +episode, as there is no (`E1 ^feature value`) WME: (`E1 ^feature2 value`) has a +different attribute and (`E1 ^feature value3`) has a different value. Both +other leaf WMEs, however, are satisfied. (`N2 ^sub-feature value2`) is +satisfied by variablizing E1 as N1 and E2 as N2: (`E1 ^id E2`) and (`E2 +^sub-feature value2`). (`N2 ^id N3`) is satisfied by variablizing `E1` as `N1`, +`E3` as `N2`, and `E5` as `N3`: (`E1 ^id E3`), (`E3 ^sub-id E5`). Note that the +substructure of E4 in the episode matches that of N2 in the cue, but there is +no WME (`E1 ^id E4`), and so `E4` is not considered. Thus, this episode, with +respect to the cue, has a score of 2. Note, however, that it is not possible to _unify_ the cue with the episode: -there is no single identifier in the episode that, when bound as N2 in the cue, -satisfies both (N2 ^sub-feature value2) and (N2 ^sub-id N3). If an episode gets +there is no single identifier in the episode that, when bound as `N2` in the cue, +satisfies both (`N2 ^sub-feature value2`) and (`N2 ^sub-id N3`). If an episode gets a perfect score, such that all leaf WMEs are satisfied, episodic memory attempts to graph match the cue with the episode (i.e. determine if there exists an isomorphism between the cue and the episode). So in response to a cue-based @@ -228,58 +240,57 @@ the following rules (these rules are already part of the epmem-tutorial.soar file in the _Agents_ directory): ```Soar -sp {propose\*init -(state ^superstate nil --^name) ---> -( ^operator + =) -( ^name init)} +sp {propose*init + (state ^superstate nil + -^name) + --> + ( ^operator + =) + ( ^name init)} ``` ```Soar -sp {apply\*init -(state ^operator ) -( ^name init) ---> -( ^name epmem -^feature2 value -^feature value3 -^id -^id -^other-id ) -( ^sub-feature value2) -( ^sub-id ) -( ^sub-id -^sub-feature value2)} +sp {apply*init + (state ^operator ) + ( ^name init) + --> + ( ^name epmem + ^feature2 value + ^feature value3 + ^id + ^id + ^other-id ) + ( ^sub-feature value2) + ( ^sub-id ) + ( ^sub-id + ^sub-feature value2)} ``` ```Soar sp {epmem*propose*cbr -(state ^name epmem - --^epmem.command.) ---> -( ^operator + =) -( ^name cbr)} + (state ^name epmem + -^epmem.command.) + --> + ( ^operator + =) + ( ^name cbr)} ``` ```Soar sp {epmem*apply*cbr-clean -(state ^operator -^feature2 -^feature -^id -^id -^other-id ) -( ^sub-feature value2) -( ^sub-id) -( ^name cbr) ---> -( ^feature2 - -^feature - -^id - -^id - -^other-id -)} + (state ^operator + ^feature2 + ^feature + ^id + ^id + ^other-id ) + ( ^sub-feature value2) + ( ^sub-id) + ( ^name cbr) + --> + ( ^feature2 - + ^feature - + ^id - + ^id - + ^other-id -)} ``` ```Soar @@ -303,65 +314,65 @@ epmem --set learning on watch --epmem ``` -Then click the “Step” button and then the “Run 1 -p” button. Now print -out the top state of working memory (_print --depth 10 s1_). Notice that +Then click the "Step" button and then the "Run 1 -p" button. Now print +out the top state of working memory (`print --depth 10 s1`). Notice that the top state contains the structures of the sample episode above (such -as _^feature value_), as well as other WMEs (such as _^superstate nil_). +as `^feature value`), as well as other WMEs (such as `^superstate nil`). -Now click the “Step” button. You should notice in the trace that episode -\#1 was stored. Click the “Run 1 -p” button to apply the _cbr_ operator -and print the top state of working memory (_print --depth 10 s1_). 
+Now click the "Step" button. You should notice in the trace that episode +`#1` was stored. Click the "Run 1 -p" button to apply the _cbr_ operator +and print the top state of working memory (`print --depth 10 s1`). Notice that the structures of the sample episode have been removed and that the sample cue has been added to the _command_ structure of the _epmem_ link. -Now click the “Run 1 -p” button. Episodic memory stored another episode -(\#2) and then processed the cue-based query. The trace contains the +Now click the "Run 1 -p" button. Episodic memory stored another episode +(`#2`) and then processed the cue-based query. The trace contains the following text: +```shell CONSIDERING EPISODE (time, cardinality, score): (1, 2, 2.000000) - NEW KING (perfect, graph-match): (false, false) +``` The first line indicates that episodic memory compared the cue to -episode \#1 (i.e. time=1), found that the cardinality of the set of +episode `#1` (i.e. $time=1$), found that the cardinality of the set of satisfied leaf WMEs was 2, and thus the episode was scored as 2. Since -this was the first considered episode, it is indicated as “king” \[of +this was the first considered episode, it is indicated as "king" \[of the mountain\]. However, since the episode did not have a perfect score (2 out of 3), graph-match was not attempted and was thus not successful. -Since episode \#2 did not have any features in common with the cue +Since episode `#2` did not have any features in common with the cue (application of the _cbr_ operator removed these structures), episodic memory did not consider it as a performance optimization. -Now print the full contents of the episodic memory link: +Now print the full contents of the episodic memory link (`print --depth 10 e1`): -```bash -$ print --depth 10 e1 +```Soar (E1 ^command C1 ^present-id 3 ^result R2) -(C1 ^query N1) -(N1 ^feature value ^id N2) -(N2 ^sub-feature value2 ^sub-id N3) -(R2 ^cue-size 3 ^graph-match 0 ^match-cardinality 2 -^match-score 2.^memory-id 1 -^normalized-match-score 0.6666666666666666 ^present-id 3 -^retrieved R4 ^success N1) -(R4 ^feature value3 ^feature2 value -^id I5 ^id I6 ^io I4 ^name epmem -^operator\* O5 ^other-id O4 ^reward-link R5 -^superstate nil ^type state) -(I5 ^sub-feature value2) -(I6 ^sub-id S3) -(I4 ^input-link I7 ^output-link O6) -(O5 ^name cbr) -(O4 ^sub-feature value2 ^sub-id S4) + (C1 ^query N1) + (N1 ^feature value ^id N2) + (N2 ^sub-feature value2 ^sub-id N3) + (R2 ^cue-size 3 ^graph-match 0 ^match-cardinality 2 + ^match-score 2.^memory-id 1 + ^normalized-match-score 0.6666666666666666 ^present-id 3 + ^retrieved R4 ^success N1) + (R4 ^feature value3 ^feature2 value + ^id I5 ^id I6 ^io I4 ^name epmem + ^operator* O5 ^other-id O4 ^reward-link R5 + ^superstate nil ^type state) + (I5 ^sub-feature value2) + (I6 ^sub-id S3) + (I4 ^input-link I7 ^output-link O6) + (O5 ^name cbr) + (O4 ^sub-feature value2 ^sub-id S4) ``` The _result_ structure indicates that the retrieval was successful, has a link to the full episode contents (rooted at R4), and has meta-data about the cue-matching process, with respect to the retrieved episode. Details of these augmentations are in the Episodic Memory chapter of the -Soar Manual. Note that a WME with an “operator\*” attribute (such as: R4 -^operator\* R5) in a retrieved episode represents an acceptable +Soar Manual. Note that a WME with an `operator*` attribute (such as: `R4 +^operator* R5`) in a retrieved episode represents an acceptable preference WME in the original episode. 
There are optional modifiers to cue-based queries, including the ability @@ -374,56 +385,59 @@ the Soar Manual. Another way the agent can gain access to episodes is by retrieving the episode that came temporally before/after the last episode that was -retrieved. The syntax of these commands, respectively, are ( -^previous ) and ( ^next ), where is any +retrieved. The syntax of these commands, respectively, are +(` ^previous `) and (` ^next `), where `` is any identifier. -As an example, add the following rules to our agent from Part 4 of this +As an example, add the following rules to our agent from [Part 4](04.md) of this tutorial (these rules are already part of the _epmem-tutorial.soar_ file in the _Agents_ directory): ```Soar sp {epmem*propose*next -(state ^name epmem -^epmem.command.query) ---> -( ^operator + =) -( ^name next)} + (state ^name epmem + ^epmem.command.query) + --> + ( ^operator + =) + ( ^name next)} ``` ```Soar sp {epmem*apply*next -(state ^operator -^epmem.command ) -( ^name next) -( ^query ) ---> -( ^query - -^next )} + (state ^operator + ^epmem.command ) + ( ^name next) + ( ^query ) + --> + ( ^query - + ^next )} ``` These rules will retrieve the episode that temporally proceeds the episode retrieved in the previous part of this tutorial. -Click the “Step” button, then the “Run 1 -p” button. Now print the -episodic memory link (_print --depth 10 e1_). Notice that the _query_ +Click the "Step" button, then the "Run 1 -p" button. Now print the +episodic memory link (`print --depth 10 e1`). Notice that the _query_ command has been replaced with a _next_ command. Note that the results of the previous commands are still in working memory: these will be automatically cleaned by episodic memory when the _next_ command is processed. -Now click the “Run 1 -p” button and print the episodic memory link -(_print --depth 10 e1_): +Now click the "Run 1 -p" button and print the episodic memory link +(`print --depth 10 e1`): + +```Soar (E1 ^command C1 ^present-id 4 ^result R2) -(C1 ^next N4) -(R2 ^memory-id 2 ^present-id 4 ^retrieved R6 ^success N4) -(R6 ^io I8 ^name epmem ^operator\* O7 ^reward-link R -^superstate nil ^type state) -(I8 ^input-link I9 ^output-link O8) -(O7 ^name next) + (C1 ^next N4) + (R2 ^memory-id 2 ^present-id 4 ^retrieved R6 ^success N4) + (R6 ^io I8 ^name epmem ^operator* O7 ^reward-link R + ^superstate nil ^type state) + (I8 ^input-link I9 ^output-link O8) + (O7 ^name next) +``` The result structure has been cleaned of old structures and now shows -that the command was successful and episode \#2 was retrieved (with all +that the command was successful and episode `#2` was retrieved (with all of its original contents). You now have some basic understanding of using episodic memory. 
Read the diff --git a/includes/abbreviations.md b/includes/abbreviations.md index 751e7e2c..083368b1 100644 --- a/includes/abbreviations.md +++ b/includes/abbreviations.md @@ -5,3 +5,12 @@ *[DARPA]: Defense Advanced Research Projects Agency *[SML]: Soar Markup Language *[GDS]: Goal Dependency Set +*[RL]: Reinforcement Learning +*[LHS]: Left-Hand Side +*[RHS]: Right-Hand Side +*[WME]: Working Memory Element +*[WMEs]: Working Memory Elements +*[SMem]: Semantic Memory +*[EpMem]: Episodic Memory +*[ECR]: Expected Current Reward +*[EFR]: Expected Future Reward diff --git a/mkdocs.yml b/mkdocs.yml index 00f8cdcb..ae984d86 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -11,6 +11,21 @@ theme: logo: Images/soar.png favicon: Images/soar.png custom_dir: overrides + icon: + admonition: + note: octicons/tag-16 + abstract: octicons/checklist-16 + info: octicons/info-16 + tip: octicons/squirrel-16 + success: octicons/check-16 + question: octicons/question-16 + warning: octicons/alert-16 + failure: octicons/x-circle-16 + danger: octicons/zap-16 + bug: octicons/bug-16 + example: octicons/beaker-16 + quote: octicons/quote-16 + features: - content.code.copy - content.code.annotate @@ -46,6 +61,8 @@ extra_javascript: markdown_extensions: - extra + - admonition + - pymdownx.details - tables - toc: permalink: true