'Humans' is a project consisting of a series of studies on semantics and meaning. The topic has two major components:
- The structure of word (linguistic) semantics
- The relationship and interactions between world semantics and word semantics
Word semantics and world semantics are the two paramount constituents of the study of meaning. While the former concerns the meanings embedded and represented within language (a formal system), the latter concerns the reality of the 'physical' world (and its mental representation in other modules, e.g. visual representation) that gives rise to word semantics. The relation between these two components of meaning can be summarized as:
- World semantics gives rise to word semantics; word semantics is a 'formal' representation (here 'formal' broadly means something having a form, not the canonical sense of 'rule-based'), and also an approximate 'homomorphism' with some specific concentration.
- Word semantics, while founded on world semantics, does not always adhere to it. Being embedded in language, word semantics can be combined and composed recursively, creating infinite meanings that do not, or may not, exist in the real world.
In this project, we investigate both components of meaning, and speak to both of the relations between world and word semantics mentioned above.
Corresponding to the first relation, the 'Humans' project collects all its data from a simulated world, on which an artificial language is built. More details are provided in the corresponding sections; here is a brief introduction. We create a world consisting of creatures (humans, animals, plants) and 'physical' rules (the laws of the world), and generate events as the world runs automatically under a designed algorithm. Such a simulated running world, equipped with physical rules, provides a 'world semantics'. A pseudo-language with a preposed syntax is then generated as a description of the world events. As a description of the simulated world, the language should capture the preposed world rules while featuring what actually happened in the world.
Speaking to the second relation, the artificially generated language, by virtue of being a language, should not merely serve as a record of the world events, but should also have the potential to generate new phrases, sentences, and concepts. Moreover, these new linguistic creations (generated by the language, not records of what actually happened in the world) should not only follow the syntax but also 'make sense'. That is, the language, created as a description of what happened in the world, should be able to make analogies and to infer what may be possible, less possible, and impossible; in other words, to create 'new meanings'.
Whether such 'new meanings' can be generated, and in what 'form', depends on how we model word semantics, which is one of the main themes of the Humans project: how do different semantic models differ in representational power and characteristics? The answer may be affected by the initial world semantics, by the form of the input (linguistic/non-linguistic), by what information is encoded and how, and by the data structure used to represent the concepts (spatial, in terms of a vector space; graphical, like a semantic network; or connectionist, like an SRN). These factors are the main aspects to consider when discussing semantic models.
By creating a world and deriving a language from it, we explicitly bind the world semantics to the word semantics, making the relationship between the two controlled, and as a result controlling many of the factors mentioned above. The goal of the study is then: given an artificial corpus derived from a world, with the embedded world semantics controlled, which (or what type of) semantic models can:
- Capture the world semantics
- Based on their semantic representations, make 'reasonable' inferences that provide possibilities beyond the physical reality of the world (this is the critical representational power that we examine here).
Notice that this project goal seems to coincide with the second relation between world semantics and word semantics; there is, however, a minor distinction. In the project goal, since we have controlled the relevant factors, including the input and the underlying world semantics, any resulting representational difference can only be attributed to the representational characteristics of the semantic model under investigation.
The following sections provide more detailed information about the project, organized as follows:
- Project goals in detail: the current project goals, parsed into steps.
- World: versions of the simulated world corresponding to different steps of the project goals.
- Corpus: how the corpora correspond to the simulated world, and the theoretical and empirical validity of the corpora as input for the semantic models.
- Semantic models: the different semantic models and representations investigated in this series of studies.
The current goal of the project is to find a semantic representational model that accounts for the compositionality of language and concepts. We do not give a rigorous definition of 'compositionality' here; it roughly refers to the notion of concepts operating on each other to form new concepts, e.g. 'apple' and 'red' compose 'red apple', 'chase' and 'dog' form 'chase dog'. The goal of the project is to find semantic models which:
- Provide a compatible semantic representation for both lexical items (words) and composed structures (phrases, sentences).
- Provide a semantic representation which articulates the relations among linguistic structures of all levels (word, phrase, sentence, morpheme?).
For example, the model should be able to:
- represent the meanings of 'dog', 'chase', 'cat', 'cheat', 'mouse'
- represent the meanings of the composed structures 'chase cat', 'chase mouse', 'cheat dog', 'dog chase cat'
- represent the semantic relations between the (meanings of) 'cat' and 'dog', 'chase cat' and 'chase dog', 'chase' and 'cheat', 'chase cat' and 'cheat cat', 'cat cheat dog' and 'cat chase mouse', 'dog chase cat' and 'cat chase mouse'. The semantic relations here can refer to 'similarity', or to 'semantic relatedness' in a more general sense. For example, for 'cat' and 'dog', both relatedness and similarity can be measured, and the same holds for 'chase cat' and 'chase mouse', while it makes more sense to measure the general relatedness between 'dog chase cat' and 'chase mouse'.
Such a goal concerning the representation of compositionality can be parsed into two steps:
- first-order dependency (D1): combinatorial properties of the semantics of concepts
- second-order dependency (D2): representation of real composed concepts.
We spell out D1 and D2 in the example below:
D1 concerns the combinatorial properties of words. For example, we may say 'drink coffee' and 'eat an apple', but not the other way around. While 'drink an apple' and 'eat coffee' make little sense, i.e. are semantically bad, they are grammatically plausible; 'drink eat', on the other hand, is syntactically bad. Throughout the discussion, we only consider linguistic structures with correct syntax that vary in semantic soundness.
Within the scope of grammatical structures, semantic soundness may not be a binary yes-or-no standard, but a spectrum. For example, while it is weird to say 'eat coffee', it is even worse to say 'eat a table', and 'eat the song' or 'eat concentration' sound extremely weird; on the other hand, 'eat a feast' is better than 'eat coffee', though not as good as 'eat an apple'.
This shows that some word pairs are semantically more plausible to couple than others, which results in the generic combinatorial properties of lexical items: while the verb 'drink' takes drinks as its canonical arguments, 'eat' takes solid foods. What follows is a generic classification (grouping, distancing) of words (usually within one grammatical category) by their environments (the words they couple with). This general philosophy of carving out the meaning of a word from its context is followed by a wide range of semantic models, including feature norms and spatial models encoding contextual information (LSA, topic models).
This combinatorial property of words, i.e. how likely a word is to couple with another word (under the condition of grammaticality), is what we refer to as D1: the default semantic dependency, where 'eat' goes with food and 'drink' goes with liquid. Going back to the general goal discussed in the introduction, a semantic representation should be able to:
- represent D1 embedded in the input (world semantics)
- infer other D1 relations which are not observed in the input
For example, suppose that in the corpus we observe lots of 'eat apple' and 'eat banana', plus another noun, 'lingou', which never couples with 'eat' in the corpus. The semantic model should first account in its representation for the 'fact' that 'eat' can go with 'apple' and 'banana', and then infer whether 'eat lingou' makes sense, based on how similar 'lingou' is to 'apple' and 'banana'. If, in the corpus, 'lingou' always appears in exactly the same position as 'apple' and 'banana' within a sentence, and it couples with many of the other verbs and adjectives that also go with the two fruits, the semantic representation should be able to form the induction that 'lingou' might be something similar to a fruit, therefore edible, and thus a possible argument of 'eat'.
While the description above looks like a specific type of analogy task, it is actually a paramount property that gives rise to the fruitfulness of word semantics and language. First, although the example uses a verb-noun pair, it generalizes to all grammatical word pairs (verb-prepositional phrase, adjective-noun, verb-adverb). Second, the capability of semantic models to form novel combinations (compositions) based on observed data, equipped with a recursive syntax, leads to the creation of infinite concepts, and therefore infinite plausible meanings, from a finite lexicon. Notice that the creation of new concepts is not free combination driven by generative syntactic rules alone; it is also under the semantic constraints provided by the semantic representations, so the new concepts are not only grammatical but also 'meaningful' (e.g. 'eat lingou').
Thus, the first goal of the project is to figure out the representational power of semantic models in representing D1: what types of semantic models, and to what extent, can first represent the couplings observed in the corpus, and then form inferences about possible couplings which are not observed? Such representational power has the benefits of:
- Forming semantic classification for words
- Giving the potential to equip the generative language with generative meanings (in other words, making the generative language meaningful; otherwise, the resultant generative language is only a set of formal symbols)
While D1 is about the combinatorial properties of words, D2 is about the same properties under constraint. In D1, while 'drink an apple' is bad, both 'drink coffee' and 'drink wine' are fine. In D2, we consider 'drink wine' and 'drink coffee' under a constraint: suppose that the location of the drinking is a bar; then 'drink wine' makes more sense than 'drink coffee'. To spell it out, the phrase 'drink wine in the bar' makes more sense than the phrase 'drink coffee in the bar'. Here we see that the argument of the predicate 'drink' depends not only on the verb itself, but also on the prepositional phrase 'in the X' modifying the verb. We therefore call this 'second-order dependency', referring to a choice of coupling that depends on more than one factor.
Compared to D1, what D2 brings in is the representation of composed concepts and meanings. In D1, only lexical items and the relations between them are investigated; e.g. 'drink coffee' concerns the semantic soundness of, or relation between, 'drink' and 'coffee'. In D2, the items to be investigated are not only words but also phrases, which are compositional structures. In D2, we consider the semantic soundness of 'drink coffee in the bar', that is, the relation among 'drink', 'coffee', and 'bar', or more precisely, the relation between 'drink coffee' and 'bar'. In this case, models have to come up with representations not only for the words, but also for the phrase 'drink coffee', derived from the representations of the two words composing it; this is the problem of 'representing compositional structure'.
In this scenario, there are approaches which leave compositionality aside, such as certain Bayesian models that consider the likelihood of pairing 'drink' and 'coffee' conditioned on the constraint 'bar'. While this type of model might work for this example, it comes at the cost of discarding syntactic information, which could be paramount to representing semantics. And if we take syntactic structure seriously, it is unavoidable to deal with the problem of representing composed structures like 'drink coffee' or 'chase cat'.
Similar to the first goal, the second goal of the project is to figure out the representational power of semantic models accounting for D2: what types of semantic models, and to what extent, can first represent the 'constrained couplings' observed in the corpus, and then form inferences about possible couplings which are not observed? For example, suppose that in the corpus both 'coffee' and 'wine' couple with 'drink', while the former happens 'in the cafe' and the latter 'in the bar'. A semantic model needs to:
- accept both 'drink coffee' and 'drink wine', but take 'drink wine in the bar' while rejecting 'drink coffee in the bar';
- for 'drink tea', where the corpus does not specify whether tea is drunk in a bar or a cafe, infer which location is more likely, based on how similar 'drink tea' is to 'drink coffee' and 'drink wine'.
What makes D2 a much more difficult problem than D1 is the issue of compositionality: we must take care not only of the words themselves, but also of the composition of the words into a complex structure. Moreover, the representation we are after is not anything like Montague-style formal semantics, which only provides a mechanism to formally compose the meaning of 'drink coffee in the bar', while having no answer for why that makes less sense than 'drink wine in the bar' or 'drink coffee in the cafe', let alone for how similar the notion of 'drink coffee in the bar' is to that of 'reading newspaper in the stadium'. We need representations that give a decent account of compositionality while also articulating the relations between composed structures, just as the relations between words are articulated in D1.
In this project, all semantic models are trained on artificially generated linguistic corpora. We will discuss in the next section why we need a corpus rather than material in some other form, and why the corpus needs to be artificial rather than naturalistic (e.g. Wikipedia); here, we assume an artificial corpus is what we need. The question, then, is how to generate the corpus.
Our choice is to generate corpora backed by a simulated world. The reasons are twofold:
- to ground the linguistic corpus in a 'physical world' reality and provide the 'world knowledge' (world semantics) giving rise to the language (corpus, word semantics);
- since we also want to look at the relationship and interactions between world semantics and word semantics, a world behind the words becomes necessary.
Going back to the general goal mentioned in the introduction, we are interested both in the relation between world and word semantics and in the structure of word semantics. The former requires a simulated world, which also benefits the latter: when we look at how different semantic models work on a given corpus, we can fix the world so that all models work on the same corpora.
However, if the purpose were only the study of word semantic structures, one might argue that all that is needed is the corpus itself, making a backing world look redundant. The classic way to generate a corpus is to use a generative grammar, for example a CFG or PCFG (Probabilistic Context-Free Grammar). However, such generative grammars are not powerful enough when it comes to generating corpora with rich semantics, and they also lack the sense of reality that a world-based corpus can provide.
For example, consider 'drink coffee' and 'eat an apple'. To mimic reality, we need a corpus in which 'drink' goes with 'coffee' and 'eat' goes with 'apple'. With a PCFG, the only way to realize such semantics is to write it explicitly into the generative rules. Suppose we only have the following PCFG:
Non Terminals: {VP, V, N}
Terminals: {apple, coffee, drink, eat}
Rules:
VP -> V N, 1
V -> drink, p1/ eat, 1-p1
N -> coffee, p2/ apple, 1-p2
Such a PCFG generates a language with four types of phrases, with the following probability distribution:
drink coffee: p1p2
drink apple: p1(1-p2)
eat coffee: p2(1-p1)
eat apple: (1-p1)(1-p2)
Suppose we need a corpus in which 'drink coffee' and 'eat apple' must both appear, but not the other two phrases. It turns out that this is mathematically unrealizable with the PCFG above: since we don't want 'drink apple', either p1 must be 0 or p2 must be 1; p1 can't be 0, otherwise 'drink' would not appear in the corpus, so p2 must be 1. But if p2 = 1 and we don't want 'eat coffee', then p1 must be 1, which implies 'eat' would not appear in the corpus, which is not what we want. Therefore, such a corpus cannot be realized by the PCFG above.
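To make the infeasibility concrete, here is a small brute-force check in Python (a hypothetical sketch, not part of the project code) that searches a grid of (p1, p2) values for a setting producing 'drink coffee' and 'eat apple' but neither 'drink apple' nor 'eat coffee':

```python
import itertools

def phrase_probs(p1, p2):
    # Phrase probabilities under the context-free PCFG above.
    return {
        "drink coffee": p1 * p2,
        "drink apple": p1 * (1 - p2),
        "eat coffee": (1 - p1) * p2,
        "eat apple": (1 - p1) * (1 - p2),
    }

grid = [i / 100 for i in range(101)]  # p in {0.00, 0.01, ..., 1.00}
feasible = []
for p1, p2 in itertools.product(grid, grid):
    probs = phrase_probs(p1, p2)
    if (probs["drink coffee"] > 0 and probs["eat apple"] > 0
            and probs["drink apple"] == 0 and probs["eat coffee"] == 0):
        feasible.append((p1, p2))

print(feasible)  # [] -- no (p1, p2) realizes the target corpus
```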
A critical problem with the PCFG above is that it makes the distribution of argument nouns independent of the choice of verb (context-free), which is in most cases unacceptable for a language embedded with realistic semantics. In realistic semantics, combination is interdependent: we drink liquids and eat foods, so which nouns are taken as arguments depends on the verb (predicate, event). This makes it very difficult for a PCFG to generate a semantically rich corpus.
However, it is still 'possible' to realize realistic semantics through a PCFG, by spelling out all the rules:
Non Terminals: {VP, V1, V2, N1, N2}
Terminals: {apple, coffee, drink, eat}
Rules:
VP -> V1 N1, p/ V2 N2, 1-p
V1 -> drink 1
V2 -> eat, 1
N1 -> coffee, 1
N2 -> apple, 1
In this way, we are explicitly writing down the two legal phrases in the rules, and we have to divide the sets of verbs and nouns into discrete singletons, because they all have different coupling behaviors.
A corollary follows: to generate a corpus with realistic semantics this way, one has to spell out every single sentence (one rule per sentence) and make a grammatical category for almost every single word, because the richness of semantics implies that (almost) every word has its own meaning, reflected in its couplings with other words. Consequently, there can be no rule like A -> B C where B and C stand for non-singleton categories; otherwise, any B1 and B2 in the set B would have the exact same meaning.
This corollary implies that a PCFG-style grammar, short of spelling out every single sentence it is supposed to create, cannot generate corpora embedded with realistic semantics. Since we have to spell out the sentences anyway, we do it in a more realistic way: we create a world which 'grounds' every sentence we want in the corpus as a 'world event'. As these events take place, sentences describing them are generated, making up the corpus we need.
The world is designed as an 'agent-focused world', consisting of agents and an environment. The agents are drive-motivated, moving around and acting to fulfill their drives. Corresponding to the project goals concerning D1 and D2, there are two versions of the world: version 1.0 (W1) and version 2.0 (W2), the latter built on the first. Both W1 and W2 are designed to generate corpora meeting the study goals of D1 and D2, respectively. In this section, we first introduce W1, along with the basic setup shared by all versions of the simulated world, and then move on to some plans for W2.
There are 10 types of 'entities' in W1:
Agents: humans, carnivores, herbivores_1 (H1), herbivores_2 (H2), herbivores_3 (H3)
Resources: drinks, plants, fruits, nuts
Locations: locations
Each of the 10 types includes 3 different categories:
Humans: three names randomly chosen from a name list
Carnivores: {tiger, wolf, hyena}
H1: {squirrel, rabbit, fox}
H2: {ibex, mouflon, boar}
H3: {bison, buffalo, auroch}
drinks: {water, juice, milk}
fruits: {apple, peach, pear}
nuts: {walnut, cashew, almond}
plants: {flower, grass, leaf}
locations: {river, tent, fire}
All agents have two major systems working together so that the agents take actions automatically: Drive and Event Tree.
Drive
In W1, all agents have 3 drives: hunger, thirst, and sleepiness, which determine their behaviors. The levels of the 3 drives change over time. When the drives are below some presupposed thresholds, an agent may do nothing; otherwise, it will act to satisfy its needs. For example, if a tiger's hunger is at level 10 while the hunger threshold is 5, it will do something (hunting, eating) to soothe the hunger. Once it has had its meal, the hunger level drops, while other drives have probably leveled up in the meantime; if, say, sleepiness has, the tiger may go to sleep.
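As an illustration, here is a minimal sketch of the drive mechanism in Python; the class layout, update rule, and threshold values are assumptions for exposition, not the actual W1 parameters.

```python
import random

class Agent:
    def __init__(self):
        # Drive levels rise over time and are reset by the satisfying actions.
        self.drives = {"hunger": 0.0, "thirst": 0.0, "sleepiness": 0.0}
        self.thresholds = {"hunger": 5.0, "thirst": 4.0, "sleepiness": 8.0}

    def update_drives(self):
        for name in self.drives:
            self.drives[name] += random.uniform(0.0, 1.0)

    def choose_drive(self):
        # Pick the drive exceeding its threshold by the largest margin,
        # or None if all drives are below threshold (the agent rests).
        over = {name: level - self.thresholds[name]
                for name, level in self.drives.items()
                if level > self.thresholds[name]}
        return max(over, key=over.get) if over else None
```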
Event Tree
When agents need to fulfill their drives, they don't act randomly, but follow their embedded 'event tree'. An event tree is an internal system that determines the order and structure of the actions an agent takes.
For example, when ordering and eating at Subway, one may go through ordering, paying, and then eating. The whole 'eating at Subway' event can thus be decomposed into a sequence of sub-events, and we can say that the 'grand' event 'eating at Subway' is a serial combination of the three sub-events. Looking closer, however, we notice that between paying and eating there is an optional action of getting your soda, which depends on your order: if a soda was ordered, one needs to go get it; otherwise there is no such action. In this case, we can't constitute the grand event as a mere sequence of sub-events, since a later sub-event may differ based on a choice made in an earlier one. Wherever there is a choice to make, we need a 'parallel design' that decomposes the larger event into several options. For the 'eating at Subway' event, after 'ordering', the event splits into two 'branches', i.e. two parallel sub-events, one representing the option with soda and the other without. The option without soda is followed by paying and eating, while the one with soda adds a 'fetching drink' event between paying and eating.
The example above summarizes the basic organization of our daily-life events: almost every event with a procedure can be decomposed into a combination of serially combined and parallelly combined events. Serial decomposition applies when the event divides into sequential steps with a linear order, while parallel decomposition comes in whenever there is a choice to make.
Here, we use the event tree structure to organize how the agents act in the world. The leaves of an event tree are atomic events, which correspond to the verbs in the target corpus. The atomic events are combined in serial or in parallel to form complex events, represented by the parent branching nodes in the tree, whose children are the sub-events. If the sub-events follow a serial order, the parent node is called a serial combination of its children; if they are combined in parallel, it is called a parallel combination. These parent nodes, while abstract, represent intermediate events in the tree structure, and the complex events are themselves composed serially or in parallel, recursively, up to the drive nodes representing the drives. Finally, the drive nodes are combined in parallel to make up the full event tree of the agent.
In W1, there are only three kinds of event trees: one for humans, one for carnivores, and one for herbivores. Here is the carnivore event tree in W1:
{()}
{}
'searching',(0,0)
'going_to',(0,1)
'chasing',(0,2)
'biting',(0,3)
'eating',(0,4)
'laying_down',(1,0)
'sleeping',(1,1)
'waking_up',(1,2)
'getting_up',(1,3)
'going_to',(2,0)
'drinking',(2,1)
'resting',(3,)
The first two lines give the parallel branching nodes ({()}: here, only the root) and the probabilistic parallel branching nodes (none in this tree); all the other nodes are either terminal nodes or serial branching nodes. A probabilistic parallel node chooses among its children according to some probability distribution, while a deterministic parallel node chooses its child by some deterministic algorithm.
From the third line to the last are the terminal nodes, each listed with its coordinate in the tree. '()' is the coordinate of the root node, and for the remaining nodes, the length of the coordinate indicates the node's level. Given the terminal nodes and their coordinates, one can recover the whole tree. For the carnivores' event tree, we can tell that they have three drive nodes, (0,), (1,), and (2,), each a serial combination of several atomic events; (3,), marked 'resting', is a void event which takes place when all three drives are below their thresholds.
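Since every node's parent is obtained by dropping the last index of its coordinate, the tree can be rebuilt mechanically. A minimal sketch (an assumed representation, not the project's actual code):

```python
# Terminal nodes of the carnivore event tree, as listed above.
terminals = [
    ("searching", (0, 0)), ("going_to", (0, 1)), ("chasing", (0, 2)),
    ("biting", (0, 3)), ("eating", (0, 4)),
    ("laying_down", (1, 0)), ("sleeping", (1, 1)),
    ("waking_up", (1, 2)), ("getting_up", (1, 3)),
    ("going_to", (2, 0)), ("drinking", (2, 1)),
    ("resting", (3,)),
]

children = {}  # coordinate -> ordered child coordinates
for _, coord in terminals:
    node = coord
    while node:  # walk up to the root (), registering each edge once
        parent = node[:-1]
        siblings = children.setdefault(parent, [])
        if node not in siblings:
            siblings.append(node)
        node = parent

print(children[()])    # drive nodes: [(0,), (1,), (2,), (3,)]
print(children[(0,)])  # the hunger branch: (0, 0) ... (0, 4)
```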
At each time point, the agent takes its turn based on its current drive levels, according to its event tree. At each moment, the agent has an inner state, which is a node position in the event tree; an agent taking turns amounts to its inner state moving over the event tree. Since all nodes in the event tree represent some 'event' (terminals are verbs, i.e. atomic events; non-terminal branching nodes are intermediate events made of smaller sub-events; and the root is the full event), each node (event) in the structure has a 'complete/incomplete' attribute. A non-terminal serial combination is complete if and only if all its children are complete, and to complete a serial node, one finishes all its children in left-to-right order on the tree. A parallel combination is complete as long as one of its children is complete: to complete a parallel node, one child is chosen among the children, and only that chosen child event needs to be completed. For terminal nodes, which are atomic events corresponding to real actions, the completeness attribute is determined by an action-specific function.
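The completeness logic can be summarized in a short sketch; the class and method names are assumptions for illustration, not the project's code.

```python
class Node:
    def __init__(self, kind, children=(), done_fn=None):
        self.kind = kind          # 'serial', 'parallel', or 'terminal'
        self.children = list(children)
        self.done_fn = done_fn    # action-specific completion test (terminals)
        self.chosen = None        # the child picked at a parallel node

    def complete(self):
        if self.kind == "terminal":
            return self.done_fn()  # the action-specific function decides
        if self.kind == "serial":
            # complete iff all children are complete (finished left to right)
            return all(child.complete() for child in self.children)
        # parallel: complete as soon as the chosen child event is complete
        return self.chosen is not None and self.chosen.complete()
```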
At the very beginning, the agent starts from the root, with all nodes in the tree 'incomplete'. It first evaluates its current drive levels and decides which drive to satisfy; once the drive has been chosen, the inner state moves to the drive node. On each non-terminal branching node, it decides where to go for this step: if the event represented by the current node has been finished, it goes back to its parent node (its parent event); otherwise, it continues on the event represented by the current node, which means going to one of its children (the next step if serial, a chosen child if parallel). When it is on a terminal node, which is an action corresponding to a verb in the corpus, the corresponding action function is called, representing the execution of this action, and whether the action finishes depends on that function: some actions finish at once, while others may take a few more turns. Once the action is complete, a sentence describing the event is generated. The sentence always includes the agent and the verb, and in W1 it may have a patient if the verb is transitive, as we have both transitive (chase) and intransitive (sleep) events in W1. When the root node turns 'complete', an epoch is over, indicating that the agent has gone through a series of actions satisfying the drive chosen at the beginning. The event tree of the agent is then rebooted, so that all nodes in the tree become 'incomplete' again, and the agent looks at its current drive levels and starts another epoch.
Actions
In the simulated worlds, the actions are the predicates of the world events. In W1, there are 20 actions:
Intransitives: search/ lay_down / sleep / wake_up / get_up / rest
Transitives: go_to / chase / bite / eat / drink / trap / stab / shoot / throw_at / gather / butcher / crack / peel / cook
These actions are carried out by specific functions defined in the program, and when an action finishes, a simple sentence describing the event is generated and contributed to the corpus. For example, if a bison finishes eating grass, the sentence 'bison eat grass' is generated and added to the corpus. We will talk more about the corpus in the next section.
One important point is that the verbs are not shared by all agents: each verb is specific to a constrained set of agent types, which makes up part of the world semantics and will be discussed later in this section.
Other Entities
Apart from the agents, we have resources and locations, which make up the remaining nouns in the corpus. In W1, locations are treated like other resources, i.e. as patients of actions, e.g. 'go_to river'. In later versions of the world, they may become real locations, serving as adverbial components of sentences. Unlike the animals and humans, which can fill either the agent or the patient role in an event (e.g. a bison can eat grass while being chased by a tiger), the resources and locations only appear in the patient role of a sentence.
How the world runs
The world is agent-focused. At each moment, all existing agents take their turns, updating their inner states on their event trees. If an agent completes a terminal node, a world event is generated, and a simple sentence describing it is added to the corpus. Since all agents act to fulfill their drives, the system goes on automatically until the end of the program (a set number of turns). In W1, the agents do not directly interact with each other, except that humans and carnivores may hunt herbivores. No communal action is allowed: every individual's behavior depends only on its own current drive levels and its inner event tree, and is therefore maximally independent of other agents.
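The main loop can be pictured as follows (a sketch with assumed interfaces: take_turn, the event fields, and the sentence format are illustrative, not the actual program):

```python
def run_world(agents, num_turns):
    corpus = []
    for _ in range(num_turns):
        for agent in agents:
            event = agent.take_turn()   # advance inner state on the event tree
            if event is not None:       # a terminal action just completed
                words = [event.agent_name, event.verb]
                if event.patient_name:  # transitive verbs carry a patient
                    words.append(event.patient_name)
                corpus.append(" ".join(words))
    return corpus
```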
World semantics
In W1, there are entities and actions. In terms of the target language, the actions become the predicates (verbs) of sentences, while the entities become the agents/patients of the predicates. When creating the event trees, agent-predicate relations are stipulated: an agent will only perform the actions appearing in its event tree. Similarly, patient-predicate relations are stipulated for every single verb, which results in a noun-verb relation table:
| | crack(a) | chase(a) | eat(a) | drink(a) | crack(p) | chase(p) | eat(p) | drink(p) |
|---|---|---|---|---|---|---|---|---|
| Mary | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
| Tiger | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| Bison | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
| Walnut | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| apple | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Water | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Above is part of the syntagmatic rule table of W1, which stipulates the noun-verb relations. For example, the first row stipulates that in W1, Mary, as a human, may crack, chase, eat, drink, and be chased, but may not be cracked, eaten, or drunk. These combinatorial rules are considered the main component of the world semantics in W1, and since the corpus is a description of world events generated under these rules, the corpus inherits them, which shows up in the distributional patterns of the words. For example, the table allows Mary to be the agent of 'crack' but not the patient of 'eat'; therefore, the corpus may contain sentences like 'Mary crack X', but never 'Y eat Mary'. In other words, 'Mary' may appear immediately before the verb 'crack', but never immediately after 'eat'. This distributional information is, in turn, critical to the semantic representation, and can be considered a major component of the word semantics.
Also in W1, the design differentiates the agents by what they eat: humans eat herbivores, fruits, and nuts; carnivores eat herbivores; and herbivores eat plants. These agent-patient syntagmatic relations, although not included in the table above, also contribute to the world semantics, and are likewise reflected as distributional patterns in the corresponding corpora.
Binding the verb-noun and agent-patient relations together, we already get some D2 into the picture. For example, there are foods like fruits and plants that can be eaten, as opposed to drinks, which can be drunk but not eaten. When comparing fruits and plants, however, the distinction lies not only in the action but also in the agents they couple with: although both fruits and plants can be eaten, fruits are eaten only by humans, while plants are eaten only by herbivores, and this is exactly D2. While W1 is not aimed at studying D2, we see this possibility in the design.
Another component of world semantics embedded in W1 is event order. In reality, events are not just a collection of unordered happenings; they have temporal order and causal relations. W1 captures the temporal linearity of events, which is stipulated in the event trees. Recall that the event trees contain serial combinations, which arrange sub-events in a serial order: for example, to bite a game animal, a carnivore first has to chase and catch it. This generic temporal linearity in the world leads to the orderedness of sentences in the corresponding corpus, which differentiates it from corpora generated by generative grammars. We will discuss the effect of this later.
To be updated.
In the previous section, we assumed that an artificial corpus is what we need. In this section, we answer two questions:
- Why use linguistic material (a corpus)?
- Why do we need artificial corpora?
After answering these, we take a closer look at the corpora generated in the project.
There are two types of 'material/input' used to form semantic representations: linguistic corpus data (LSA, HAL, BEAGLE, Word2Vec, topic models, etc.) and 'non-linguistic' data like word association norms (Nelson 2004) or feature norms (McRae et al. 2005). By 'linguistic corpus', we refer to linguistic data in the form of natural language, which could be articles or dialogues. They should be sets of paragraphs (where sentences are lined up in a logical or temporal order), or at least sets of sentences (where words are lined up by syntax and meaning). Non-linguistic data, on the other hand, refer to other types of construction built on lexical items (mostly words), where these lexical items are better considered 'concepts' than 'words' (as linguistic units).
There are also other types of 'non-linguistic' input, e.g. visual data, which have critical connections to semantic representation, and there is a considerable amount of research on the topic. While these non-linguistic inputs are important, this project does not focus on them, so they are not discussed further.
The most critical distinction between the two types of input is this: linguistic input has a generic connection to the cognitive process, while most non-linguistic data to date are formed empirically, by asking humans to generate 'answers' given some stimuli. As a result, semantic models and representations built on linguistic material have the potential (and the validity) to answer the question of where the representation, or the semantics, comes from, while models and representations built on non-linguistic data cannot speak to this; they merely provide a representational form.
To spell it out, semantic models built on linguistic input rely on the following fact: in the real world, humans take in all sorts of information (including linguistic information), and from it they form semantic memory. Using linguistic data to train the models and form representations is a simulation of forming semantic representations 'in the wild'. The linguistic data, in most current models, are treated as sequences of symbols, and the very first encoding step is in most cases a 'word count', which is a cognitively plausible learning process. In this way, training models on linguistic input is not only about the representation, but also about the cognitive process.
The non-linguistic data, on the other hand, being collected empirically, are 'resultant' from the beginning. Artifacts aside, human judgments are essentially the 'output' of some form of semantic representation, not the 'input'. Therefore, semantic representations formed from these data, even if they predict human data well, at most reflect the representational power of some data structure (graph, vector space). They are informative about what a semantic representation is probably like, but say nothing about how learners (or models) build up such a representation; after all, humans do not form their semantic representations by knowing that other humans utter 'dog' as an associate of the stimulus 'cat'.
We choose linguistic over non-linguistic data because we care about both the representation and the learning process: we not only look at the form of the semantic representations that result from training, but also try to connect the representations with cognitive processes.
Having decided on a linguistic corpus as the input, the next question is: what linguistic corpus do we need? Most models to date use naturalistic corpora like Wikipedia, WSJ, and CHILDES. While models trained on such real-world linguistic data have shown considerable performance on a wide range of semantic tasks, here we reject natural data from two perspectives:
- the bias of real-world data in terms of the information it includes;
- the project goals.
While real-world linguistic data do encode rich semantic information, they carry a critical 'linguistic bias'. For example, in a norming task, human participants will probably associate 'carrot' with the color 'orange' and no other color, so 'carrot is orange' should, with little doubt, be part of semantic knowledge. However, in real-world language, the color word most often coupled with 'carrot' is 'yellow' rather than 'orange', since it is redundant to state the default 'orange carrot'. In short, what we know is, in many cases, not what we say: because real-world language is affected by pragmatic factors, it may not reflect what people actually know (semantic memory).
Meanwhile, not only is what we produce different from what we know; often, how we produce it differs from what we know as well. Going back to the phrase 'drink coffee': we know we drink coffee, but do we actually say 'drink coffee' when we mean it? Consider the following sentences:
- I had a cup of coffee in the morning, and that made my day.
- A good day starts from a cup of coffee in the morning.
Neither sentence includes 'drink coffee', but we know both of them mean it. The verb 'have', rather than 'drink', is the word we use when we want to say 'drink coffee'; and in the second sentence there is not even a verb for the action of drinking, yet we know you start your day by 'drinking' that cup of coffee, not by 'pouring' or 'discussing' it. While both sentences are the more likely ones to appear in natural data, a superficial encoding may not reveal what they actually mean.
Going back to the project goals: whether investigating D1 or D2, we need a language with minimal pragmatic add-ons and more literal semantics. To encode 'drink coffee', we need exactly 'drink coffee', not 'have coffee' or 'start with coffee'.
In D1, we look at the combinatorial properties of content words (noun-verb pairs), and in D2, we investigate the higher-order dependency. Both goals can be abstracted as investigating formal properties of semantics; therefore, we want minimal other linguistic components in the data.
Corpora in the wild include too many elements, e.g. punctuation and all types of inflection (aspect, tense, voice), which are critical to word semantics and to the meaning of the language itself, but less so to semantic memory. For training a semantic model to speak to formal semantic properties such as D1 and D2, it is hard to say whether these linguistic elements have any significance, and they are hard to control as variables.
Therefore, we need a corpus that is less 'linguistic', in the sense that it carries the meanings relevant to semantic memory with less of the language itself (for example, 'A dog is chasing the cat' and 'The dog chased a cat' certainly have different meanings, but it is hard to say how differently they affect semantic memory). The easiest way to get such a corpus is to construct an artificial corpus containing exactly what we need.
In the project, we start by generating a corpus which includes the 'gist' of the part of semantics under investigation. The corpus looks 'conceptual', and is not even grammatical to native English speakers, as it is lemmatized and carries minimal punctuation. Here is an example of what the language looks like:
Mary search.
Mary go_to rabbit.
Mary trap rabbit.
Mary stab rabbit.
Mary gather rabbit.
We take out anything not relevant to the event (conceptual) structure and leave only the 'content stem', so that the language looks more conceptual and less like a language. The assumption is that a semantic model, when doing its word counts for the representation of semantic memory, could work on the stem part of the language instead of the raw form. In other words, we skip the pre-processing and feed the model the meat.
Still, one may point out that a language consisting of mere simple sentences with such simple event structure is too far from everyday input (too simple), so the semantic representation formed on such a corpus may miss critical components contributing to semantic memory. While this concern is reasonable to some extent, note that this language captures the most ubiquitous event structures of all naturalistic language: agent-predicate-patient and agent-predicate. These structures carry the combinatorial properties which are universal in natural language and paramount to semantic representation.
With this infrastructure, we can scale up the complexity of the language as further research interests demand: adjuncts such as modifiers and adverbial components can be added to the picture, and the same goes for embedded clauses and functional units (inflections, function words). While it would be interesting to develop the language to speak to more research questions, here we start with the minimal setup, with just what we need for the current research goals.
The semantic models are trained on the corpus described in the previous sections, each forming its model-specific semantic representation. 'Semantic representation' here refers to the semantic space a model constructs from the corpus. More specifically, all models in the project encode the distributional information of the language by performing a word-by-word co-occurrence count, forming the WW co-occurrence matrix. On top of this common co-occurrence matrix, each model applies its own further processing, ending up with its own semantic representation.
In general, we look at four types of semantic models, summarized in the following table:
| | co-occurrence | similarity |
|---|---|---|
| Space | co-occurrence matrix | similarity matrix |
| Graph | co-occurrence graph | similarity graph |
We have a 2 X 2 design for the study. First, we distinguish the representational data structures: 'space' refers to encoding words as vectors in a feature vector space, and 'graph' refers to the graph generated from the corresponding spatial matrix (used as the adjacency matrix). Second, we differentiate co-occurrence and similarity encodings: co-occurrence refers to the raw co-occurrence counts, while similarity refers to word-word similarity scores formed by correlating (or otherwise comparing) the word vectors in the co-occurrence space. Correspondingly, the co-occurrence graph is the graph formed from the co-occurrence matrix, which amounts to representing sentences as chains of word nodes joined by shared words, while the similarity graph is derived from the similarity matrix, with nodes representing words and edges weighted by the similarity scores between word pairs.
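For concreteness, here is a minimal numpy sketch of deriving the similarity half of the design from the co-occurrence half (cosine is used here; correlation or 2-distance would slot in the same way):

```python
import numpy as np

def similarity_matrix(cooc):
    # Cosine similarity between the row (word) vectors of the
    # word-by-word co-occurrence matrix.
    norms = np.linalg.norm(cooc, axis=1, keepdims=True)
    unit = cooc / np.maximum(norms, 1e-12)
    return unit @ unit.T

cooc = np.array([[2., 0., 1.],   # toy word-by-word co-occurrence counts
                 [0., 3., 1.],
                 [1., 1., 0.]])
sim = similarity_matrix(cooc)    # adjacency matrix for the similarity graph
```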
Apart from the basic 2 X 2 design, we further divide the similarity half by how the similarity score is computed (cosine, 2-distance, correlation) and by whether an SVD reduction is applied when forming the similarity matrix (svd, no svd), which yields 6 variations for each similarity model. Therefore, instead of a 2 X 2 design, we end up with a 2 X 7 design.
And for each of the 14 conditions, we have 216 variations of models on 6 dimensions:
When forming the corpus:
period: {yes, no}
boundary: {yes, no}
When forming co-oc matrix:
Window size:{1,2,7}
Window weight:{flat, linear}
Window type: {flat, linear}
normalization: {no, row-log, ppmi}
Here, 'period' refers to whether a period token is added at the end of each sentence, and 'boundary' refers to whether each sentence is treated as an independent sequence. If not, the word window can cross sentence boundaries, and words from different sentences are treated as part of the same sequence (bag of words); if so, the word window stops at the sentence boundary, so words from different sentences are never counted as co-occurring.
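A minimal sketch of the shared first step, the word-by-word co-occurrence count, showing how the window size, boundary, and period options enter (window weighting and normalization are omitted for brevity; the function and parameter names are illustrative, not the project's code):

```python
from collections import defaultdict

def cooc_counts(sentences, window_size=2, boundary=True, period=False):
    counts = defaultdict(lambda: defaultdict(float))
    if period:   # optionally append a period token to every sentence
        sentences = [s + ["."] for s in sentences]
    # with boundary=False, sentences merge into one long sequence
    sequences = sentences if boundary else [[w for s in sentences for w in s]]
    for seq in sequences:
        for i, word in enumerate(seq):
            for j in range(max(0, i - window_size), i):
                counts[word][seq[j]] += 1.0   # symmetric count
                counts[seq[j]][word] += 1.0
    return counts

# e.g. cooc_counts([["Mary", "trap", "rabbit"], ["Mary", "eat", "rabbit"]])
```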
In all the semantic models above, a model-specific semantic space is formed, from which the semantic relations (relatedness) between words can be derived. For the spatial representations, the relatedness between a word pair is the corresponding entry in the matrix. For the graphical models, the relatedness is computed by a spreading-activation style measure: to compute the relatedness from word A to word B, we activate word A with activation level 1 and measure the amount of activation flowing to B in the first step, where activation flows from a node to its neighbors in proportion to the weights of the connecting edges. Say A links to two nodes, B and C, with weights 0.3 and 0.7; then 0.3 of the activation flows to B, while 0.7 flows to C.
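In code, this one-step spreading-activation measure reduces to normalizing the source node's edge weights (a sketch over a weighted graph given as nested dicts):

```python
def relatedness(graph, source, target):
    # Activation 1.0 at the source flows to its neighbors in
    # proportion to the edge weights; return what reaches the target.
    total = sum(graph[source].values())
    if total == 0.0:
        return 0.0
    return graph[source].get(target, 0.0) / total

g = {"A": {"B": 0.3, "C": 0.7}, "B": {"A": 0.3}, "C": {"A": 0.7}}
print(relatedness(g, "A", "B"))  # 0.3
print(relatedness(g, "A", "C"))  # 0.7
```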
Finally, each model variation in the design ends up with a semantic relatedness table, containing the semantic relatedness score for every word pair in the corpus.
As mentioned above, each model forms a semantic representation based on the given corpus, namely the semantic relatedness scores among all word pairs. We evaluate a model's representation by whether it captures the world semantics reflected in the corpus. The world is set up with the following rules stipulating what may and may not happen in the world:
| | crack(a) | chase(a) | eat(a) | drink(a) | crack(p) | chase(p) | eat(p) | drink(p) |
|---|---|---|---|---|---|---|---|---|
| Mary | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
| Tiger | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| Bison | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
| Walnut | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| apple | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Water | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
After running the simulation, a corresponding world event occurrence table is generated, in which the entries for 'may not happen' events stay 0, while the 'may happen' entries hold the actual frequencies of those events in the simulation. We call this the 'event frequency' table:
| | crack(a) | chase(a) | eat(a) | drink(a) | crack(p) | chase(p) | eat(p) | drink(p) |
|---|---|---|---|---|---|---|---|---|
| Mary | 2 | 7 | 9 | 10 | 0 | 2 | 0 | 0 |
| Tiger | 0 | 9 | 5 | 12 | 0 | 0 | 0 | 0 |
| Bison | 0 | 0 | 8 | 6 | 0 | 3 | 2 | 0 |
| Walnut | 0 | 0 | 0 | 0 | 2 | 0 | 2 | 0 |
| apple | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 |
| Water | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 |
For each model, corresponding to the tables above, we also have a table of the semantic relatedness between the nouns and the verbs:
| | crack(a) | chase(a) | eat(a) | drink(a) | crack(p) | chase(p) | eat(p) | drink(p) |
|---|---|---|---|---|---|---|---|---|
| Mary | 0.3 | 0.5 | 0.3 | 0.6 | 0.001 | 0.1 | 0.01 | 0.001 |
| Tiger | 0.05 | 0.6 | 0.4 | 0.5 | 0.001 | 0.01 | 0.001 | 0.001 |
| Bison | 0.01 | 0.05 | 0.6 | 0.7 | 0.001 | 0.3 | 0.2 | 0.01 |
| Walnut | 0.002 | 0.001 | 0.001 | 0.1 | 0.8 | 0.001 | 0.7 | 0.1 |
| apple | 0.001 | 0.002 | 0.001 | 0.1 | 0.1 | 0.001 | 0.3 | 0.1 |
| Water | 0.0001 | 0.0001 | 0.001 | 0.01 | 0.02 | 0.01 | 0.1 | 0.8 |
We evaluate the models by how well the model's relatedness-based rankings of nouns per verb correlate with the corresponding rankings from the event frequency table. The model ranks are formed directly from the verb column vectors. For example, in the table above, under crack(a), 'Mary' has the highest relatedness score, which means this model 'thinks' that, among all the nouns, 'Mary' is the most likely agent of the verb 'crack'; 'Mary' is therefore ranked top under crack(a). Then, following the magnitude of the numbers in the column, 'Mary' is followed by 'Tiger', then 'Bison', and so on, which ends up in the following model rank table (a computational sketch follows the table):
| | crack(a) | chase(a) | eat(a) | drink(a) | crack(p) | chase(p) | eat(p) | drink(p) |
|---|---|---|---|---|---|---|---|---|
| Mary | 1 | 2 | 3 | 2 | 5 | 2 | 5 | 5.5 |
| Tiger | 2 | 1 | 2 | 3 | 5 | 3.5 | 6 | 5.5 |
| Bison | 3 | 3 | 1 | 1 | 5 | 1 | 2 | 3 |
| Walnut | 4 | 4 | 5 | 5 | 1 | 5.5 | 1 | 3 |
| apple | 5 | 5 | 5 | 5 | 2 | 5.5 | 3 | 3 |
| Water | 6 | 6 | 5 | 5 | 3 | 3.5 | 4 | 1 |
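Computing a rank column from a relatedness column is straightforward, with ties receiving averaged ranks as in the table; a sketch using scipy:

```python
from scipy.stats import rankdata

# crack(a) column of the relatedness table above
crack_a = {"Mary": 0.3, "Tiger": 0.05, "Bison": 0.01,
           "Walnut": 0.002, "apple": 0.001, "Water": 0.0001}
nouns = list(crack_a)
ranks = rankdata([-crack_a[n] for n in nouns])  # negate: highest score -> rank 1
print(dict(zip(nouns, ranks)))  # {'Mary': 1.0, 'Tiger': 2.0, ...}
```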
To form the corresponding ranks from the event frequency table, we adjust the method a little. Notice that the event frequency table contains many zeros, reflecting the rules of the world: those events may not happen, and therefore never occur in the simulation. Remember that in D1, the second goal is to form reasonable inferences about what has not been observed in the corpus. In the event frequency table, no noun other than 'Mary' has ever been the agent of the verb 'crack', but we still want to judge which nouns are more likely to couple with the verb despite the absence of such couplings in the corpus (in other words, which combinations make more sense). We base this judgment on how 'similar' a noun is to 'Mary', the actual agent of 'crack' in the simulation: the more similar a noun is to 'Mary', the more likely we judge it to be an agent of 'crack'. To get the similarity, we compare the 'agent half' of each noun's row vector to that of 'Mary', order the resulting similarity scores, and rank the nouns after 'Mary' in the crack(a) column. Doing this for every verb (and thematic role) gives the following corpus rank table (a sketch of the procedure follows the table):
| | crack(a) | chase(a) | eat(a) | drink(a) | crack(p) | chase(p) | eat(p) | drink(p) |
|---|---|---|---|---|---|---|---|---|
| Mary | 1 | 2 | 1 | 2 | 5 | 2 | 4 | 4 |
| Tiger | 2 | 1 | 3 | 1 | 5 | 5.5 | 5.5 | 4 |
| Bison | 3 | 3 | 2 | 3 | 3 | 1 | 2.5 | 4 |
| Walnut | 5 | 5 | 5 | 5 | 1 | 4 | 2.5 | 4 |
| apple | 5 | 5 | 5 | 5 | 2 | 3 | 1 | 4 |
| Water | 5 | 5 | 5 | 5 | 5 | 5.5 | 5.5 | 1 |
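A sketch of this corpus-rank procedure for one column (cosine similarity and the helper names are assumptions; tie handling, which produces the averaged ranks in the table, is omitted for brevity):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def corpus_ranks(freq_col, agent_rows, observed="Mary"):
    # freq_col: noun -> event frequency for one verb/role, e.g. crack(a)
    # agent_rows: noun -> 'agent half' row vector from the frequency table
    ranks, next_rank = {}, 1
    # observed nouns first, ordered by descending frequency
    for noun in sorted((n for n in freq_col if freq_col[n] > 0),
                       key=freq_col.get, reverse=True):
        ranks[noun] = next_rank
        next_rank += 1
    # unobserved nouns next, ordered by similarity to the observed agent
    unseen = [n for n in freq_col if freq_col[n] == 0]
    for noun in sorted(unseen,
                       key=lambda n: cosine(agent_rows[n], agent_rows[observed]),
                       reverse=True):
        ranks[noun] = next_rank
        next_rank += 1
    return ranks
```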
Finally, for each semantic model (there are 3024 of them in total), we correlate the columns of the model rank table with those of the corpus rank table, resulting in model correlation scores. The higher the score, the more similar the model's semantic representation is to the corpus ranks, and the more capable the model is considered of inferring unobserved but possible semantic relations from the corpus.
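A sketch of this final step for one verb column, using Spearman correlation between the two rank columns (how the per-column scores are aggregated into a single model score is an assumption here):

```python
from scipy.stats import spearmanr

model_col  = [1, 2, 3, 4, 5, 6]   # crack(a) ranks from the model rank table
corpus_col = [1, 2, 3, 5, 5, 5]   # crack(a) ranks from the corpus rank table
rho, _ = spearmanr(model_col, corpus_col)
print(rho)  # the model's correlation score on this column
```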