\chapter{Introduction}
\label{background-chapter}
This dissertation establishes the utility and reliability of a
statistical distance measure for syntactic dialectometry, expanding
dialectometry's methods to include syntax as well as phonology and the
lexicon. It continues my previous work \cite{sanders07,sanders08b} and
the earlier work of \namecite{nerbonne06}, which introduced the first
statistical measure of syntax distance. These pioneering studies
explored the measure, but did not compare it to established results in
dialectology to see whether the new method reproduces them. This
dissertation does so, and additionally investigates a number of
variant measures and feature sets, using Swedish dialect data as the
basis for investigation.

Dialectology is the study of linguistic variation \cite{chambers98}.
% distance / other variables.
Its goal is to characterize the linguistic features that separate
language varieties. Its tools include isoglosses---geographic
delineations of a particular linguistic variable---as well as
traditional phonological and syntactic analyses of dialect
phenomena. Traditional dialectology predates sociolinguistics, but has
adopted so many of its tools that it has in some ways become a
subfield of sociolinguistics.

Dialectometry is a subfield of dialectology that uses mathematically
sophisticated methods to extract and combine linguistic features
\cite{seguy73}. Its focus is the manipulation of large data sets in a
uniform way, characterizing the differences between regions in a
gradient and statistically sound manner. As a result, most recent work
in the field has been computational, largely focused on phonology,
starting with \namecite{kessler95}, followed by \namecite{nerbonne97}
and \namecite{nerbonne01}. \namecite{heeringa04} provides a
comprehensive review of phonological distance in dialectometry as well
as some new methods.

This dissertation compares the results of the syntactic distance
measure with syntactic dialectology, both in the form of the
traditional Swedish dialect regions and in analyses of syntactic
dialect features. It also compares the results to phonological
dialectometry's results on Swedish.

\section{Overview of Dialectometry}
In dialectometry, a distance measure can be defined in two parts:
first, a method of decomposing the data into features that capture
linguistic properties, and second, a method of combining the features
from two corpora to produce a single number. The decomposition method
can be thought of as ``feature extraction'' for any number of feature
sets, while the combination method can be thought of as the
``distance'', since it represents differences as a single
number. Figure \ref{abstract-distance-measure-model} gives an overview
of the model. The input consists of two corpora; each item in each
corpus is decomposed into a set of features extracted by $f$. The
resulting sets of features are then compared by $d$, which combines
them into a single number: the distance.
\begin{figure}
\[\xymatrix@C=1pc{
\textrm{Corpus} \ar@{>}[d]|{} &
S = s_0,s_1,\ldots
\ar@{>}[dd]|{f}
&&
T = t_0,t_1,\ldots
\ar@{>}[dd]|{f}
\\
\textrm{Decomposition}\ar@{>}[dd] &&&\\
&
*{\begin{array}{c}
    \left[ + f_0, +f_1 \ldots \right], \\
    \left[ - f_0, +f_1 \ldots \right], \\
    \ldots \\ \end{array}}
\ar@{>}[ddr]
&&
*{\begin{array}{c}
    \left[ + f_0, -f_1 \ldots \right], \\
    \left[ + f_0, -f_1 \ldots \right], \\
    \ldots \\ \end{array}}
\ar@{>}[ddl] \\
\textrm{Combination} && d & \\
& & \textrm{Distance} & \\
} \]
\caption{Abstract Distance Measure Model: $d \circ f$}
\label{abstract-distance-measure-model}
\end{figure}
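
Concretely, the whole model is the composition $d \circ f$. The
following Python sketch shows this skeleton under minimal assumptions:
corpora are lists of sentences, $f$ maps a sentence to a bag of
features, and $d$ compares two bags of features. The particular $f$
and $d$ shown (word unigrams and a sum of relative-frequency
differences) are placeholders for the richer choices discussed below.
\begin{verbatim}
from collections import Counter

def extract_features(corpus, f):
    """Apply the decomposition f to every item in a corpus and
    pool the resulting features into one frequency table."""
    features = Counter()
    for sentence in corpus:
        features.update(f(sentence))
    return features

def corpus_distance(corpus_s, corpus_t, f, d):
    """The abstract model: d applied to the f-decomposed corpora."""
    return d(extract_features(corpus_s, f),
             extract_features(corpus_t, f))

# Placeholder f: word unigrams.  Placeholder d: sum of differences
# in relative frequency (one simple combination method).
f = lambda sentence: sentence.split()

def d(p, q):
    total_p, total_q = sum(p.values()), sum(q.values())
    return sum(abs(p[w] / total_p - q[w] / total_q)
               for w in set(p) | set(q))

print(corpus_distance(["give it to me"], ["give me it"], f, d))
\end{verbatim}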

Dialectometry has focused on phonological distance measures, while
syntactic measures have remained undeveloped. The most important
reason for this focus is that it is easier to define a distance
measure on phonology. In phonology, it is easy to collect corpora
consisting of identical word sets. These words decompose to segments
and, if necessary, segments decompose further to phonological
features. This decomposition is straightforward and based on
\namecite{chomsky68}. For combination, string alignment, or
Levenshtein distance \cite{lev65}, is a well-understood algorithm for
measuring the changes between any two sequences of characters taken
from a common alphabet. Levenshtein distance is mathematically simple,
and has the additional advantage that its intermediate data structures
are easy to interpret as the linguistic processes of epenthesis,
deletion and substitution. Other methods have been proposed; several
simpler alternatives to Levenshtein distance are given by
\namecite{kondrak02}, and \namecite{sanders06} and \namecite{hinrichs07}
give two different statistical measures. However, Levenshtein distance
remains dominant for currently available corpora because it maintains
the best balance between required corpus size and quality of results.
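
To illustrate, here is a minimal dynamic-programming sketch of
Levenshtein distance over segment strings. The unit costs for
insertion, deletion, and substitution are the simplest possible
choice; phonological applications typically refine them with
segment-based or feature-based costs.
\begin{verbatim}
def levenshtein(s, t):
    """Minimum number of insertions (epenthesis), deletions, and
    substitutions needed to turn sequence s into sequence t."""
    m, n = len(s), len(t)
    # table[i][j] = distance between s[:i] and t[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        table[i][0] = i
    for j in range(n + 1):
        table[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            table[i][j] = min(table[i - 1][j] + 1,        # deletion
                              table[i][j - 1] + 1,        # insertion
                              table[i - 1][j - 1] + cost) # substitution
    return table[m][n]

print(levenshtein("newt", "eft"))  # 2
\end{verbatim}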

These advantages are not available for syntactic distance: neither
matched sentences nor a single obvious pair of functions for
decomposition and combination exist. Matched sentences could in theory
be collected, but the number of possible interview responses in syntax
is so much larger that the required time and number of informants
would be correspondingly much greater. For example, in the Survey of
English Dialects \cite{orton63}, phonological items were elicited by
asking the interviewee to answer a question: ``cow'' is the standard
English answer to ``What is the animal that you get milk from?''. This
method avoids priming the interviewee with the interviewer's
pronunciation. However, it does not always have the desired effect,
even for phonology: for the item ``newt'', the responses ``newt'',
``ewt'' and ``eft'' are all comparable phonologically, but a response
like ``salamander'' is not. The problem is far worse for syntax: an
interview question that is sufficiently abstract to avoid priming a
particular structure has a low chance of eliciting that very
structure. For example, a prompt such as ``I took your drink. What do
you say to me?'' has little chance of eliciting sentences that
exemplify the differing English orders of direct and indirect objects,
such as ``give me it'' versus ``give it to me'' or ``give it me''.
% for example, in many parts of England, the answer will consist
% entirely of expletives.

Specifying the decomposition and combination functions for syntax
faces another problem: although many candidate methods can be
proposed, the standard syntactic theories
% this is what chomsky calls it so shut up. he's always right
cannot be practically used. For example, the parsers in this
dissertation use probabilistic phrase-structure rules or dependencies
to represent the grammar of a language. This is typical of parsers in
computational linguistics, but it means that their output, and by
extension the features based on their output, is quite different from
the lexical representations of minimalism. To decompose sentences into
minimalist features, a broad-coverage minimalist parser would be
required. Since no such parser yet exists, it is impossible to use
minimalist syntactic structure in the way that phonological
dialectometry uses distinctive features, for example.

For similar reasons, methods from lexical dialectology are not simple
to adapt to syntax. There are two problems: first, lexical feature
extraction ranges from trivial to easy, so it offers no applicable
techniques for syntactic feature extraction. Second, there has been
little work on distance measures specifically for lexical
dialectometry. Lexical distance typically uses a general-purpose
method such as Goebl's Weighted Identity Value (WIV), described in
chapter \ref{methods-chapter}; see for example \namecite{spruit06}.
%% I'm cutting this paragraph because it makes this section too long
%% and it's more of a pet theory of mine than anything I can back up.
% A secondary reason for dialectometry's focus on phonology is that it
% is inherited from dialectology's focus on phonology.
% % (TODO:Cite?)
% This might be solely due to the history of dialectology as a field, but it is
% likely that more phonological than syntactic differences exist between
% dialects, due to historically greater standardization
% of syntax via the written form of language. Phonological
% dialect features are less likely to be stigmatized and suppressed by a
% standard dialect than syntactic ones.
% % (TODO:Cite, probably
% % Trudgill and Chambers something like '98, maybe where they talk about
% % what aspects of dialects are noticed and stigmatized).
% Whatever the reason, much less dialectology work on syntax is
% available for comparison with new dialectometry results.

\subsection{Syntax and Dialectometry}
For these two reasons, syntax is a relatively undeveloped area of
dialectometry, and the literature currently lacks a generally accepted
syntax measure. Unfortunately, simply copying phonology is not a good
solution; there are real differences between syntax and phonology that
keep phonological approaches from applying directly. For example,
there are fewer differences to be found in syntax, and they occur more
sparsely. Because dialectology has traditionally worked with fairly
small corpora, and because of the difficulty of collecting syntactic
data, most surveys cover even fewer syntactic variables than
phonological ones. There are two approaches to remedy this. The first
manually enhances the differences that do exist in small, carefully
collected corpora; the second switches to larger, non-survey corpora
and uses statistical methods to find differences.

The first approach, proposed by \namecite{spruit08} for analyzing the
Syntactic Atlas of the Dutch Dialects \cite{barbiers05}, is to
continue using small dialectology corpora and to extract features
manually, so that only the most salient features are used. Then a
sophisticated method of combination such as Goebl's WIV, described in
chapter \ref{methods-chapter} and by \namecite{goebl06}, can be used
to produce a distance. WIV is mathematically more complex than
Levenshtein distance, and operates on any type of linguistic
feature. However, manual feature extraction requires that the dialect
situation be understood first. In other words, traditional
dialectology methods must be used to find interesting features before
dialectometry can proceed. This negates the usual advantage of
dialectometric methods: rapid analysis in knowledge-poor
environments. Manual feature extraction is also subject to the
dialectologist's bias: the best-known features are the most likely to
become manual features, passing over the rarely occurring and
previously unknown features that might actually be the best indicators
of a particular dialect.
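
To give a feel for this family of combination methods, the sketch
below implements the core idea of frequency-weighted identity:
agreement between two sites on a rare feature value counts for more
than agreement on a common one. This is an illustration of the
principle only, not Goebl's exact formulation, which is given in
chapter \ref{methods-chapter}.
\begin{verbatim}
def weighted_identity(site_a, site_b, all_sites):
    """Illustrative frequency-weighted identity over manually
    extracted features.  Sites are dicts mapping feature names to
    values.  NOT Goebl's exact WIV formula, just the idea."""
    score, shared = 0.0, 0
    for feat in site_a.keys() & site_b.keys():
        shared += 1
        if site_a[feat] == site_b[feat]:
            # Weight agreement by the inverse of how many sites
            # share this value: rare agreements count for more.
            sharers = sum(1 for s in all_sites
                          if s.get(feat) == site_a[feat])
            score += 1.0 / sharers
    return score / shared if shared else 0.0
\end{verbatim}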

This first approach also ignores the specific properties of the syntax
distance problem. Given a large corpus, manually defined features will
have less coverage than the automatically extracted features used by
the second, statistical approach. Furthermore, automatically extracted
features are easy to define for syntax. This dissertation covers
part-of-speech trigrams, leaf-ancestor paths, and leaf-head paths over
nodes, but many variations on these features are possible, such as
lexical trigrams, lexicalized leaf-ancestor paths, or arc-head
paths. Methods from other syntactic work in computational linguistics
could apply too: supertags \cite{joshi94}, convolution kernels
\cite{collins01}, or any number of simpler features such as tree
height, number of nodes, or number of words.
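
As a small illustration of how automatic this extraction can be, the
following sketch derives two of these feature sets from a toy parse
tree, represented here as nested lists; details such as sentence-edge
padding and the tag set vary between implementations.
\begin{verbatim}
def pos_trigrams(tags):
    """Part-of-speech trigrams: every window of three adjacent tags."""
    return [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]

def leaf_ancestor_paths(tree, path=()):
    """For each leaf, the chain of node labels from root to leaf.
    A tree is [label, child, ...]; a leaf is a plain string."""
    if isinstance(tree, str):          # reached a word
        return [path + (tree,)]
    paths = []
    for child in tree[1:]:
        paths.extend(leaf_ancestor_paths(child, path + (tree[0],)))
    return paths

tree = ["S", ["NP", ["PRP", "you"]],
             ["VP", ["VB", "give"], ["NP", ["PRP", "it"]],
                    ["PP", ["TO", "to"], ["NP", ["PRP", "me"]]]]]

print(pos_trigrams(["PRP", "VB", "PRP", "TO", "PRP"]))
print(leaf_ancestor_paths(tree))
\end{verbatim}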

The problem for the statistical approach is not defining a feature
set; it is defining a good feature set. This is why the statistical
approach uses large corpora: with enough data, statistically
significant comparisons can be made between the different features,
and the highest-ranked ones can be discovered automatically rather
than manually. Fortunately, the typical syntactic corpus is larger
than a phonological corpus because the annotation work is easier; much
of the syntactic annotation can be generated automatically.

Even with a feature set defined, a distance measure still requires a
method of combining features to produce a distance. One such method, a
simple statistical measure called $R$, has been proposed by
\namecite{nerbonne06}, based on work by \namecite{kessler01}. At
present, however, $R$ has not been adequately shown to detect dialect
differences. A small body of work suggests that it does, but as yet
there has been no satisfying correlation of its results with existing
results from the dialectology literature on syntax.

Nerbonne \& Wiersma's first paper used part-of-speech trigram features
as a proxy for syntactic information and $R$ for syntax distance,
together with a test for statistical significance \cite{nerbonne06}.
Their experiment compared two generations of Norwegian L2 speakers of
English. They found that the two generations were significantly
different, although they had to normalize the trigram counts to
account for differences in sentence length and complexity. However,
showing that two generations of speakers are significantly different
with respect to $R$ does not necessarily imply that the same will be
true for other types of language varieties. Specifically, for this
dissertation, the success of $R$ on generational differences does not
imply success on dialect differences.
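
In outline, and leaving aside their normalization steps, $R$ is a sum
of per-feature frequency differences whose significance is assessed by
a permutation test. The sketch below is a simplified rendering of that
scheme, not a reimplementation of Nerbonne \& Wiersma's exact
procedure.
\begin{verbatim}
import random
from collections import Counter

def r_distance(a, b):
    """Simplified R: sum over all features of the absolute
    difference in relative frequency between two corpora."""
    na, nb = sum(a.values()), sum(b.values())
    return sum(abs(a[f] / na - b[f] / nb) for f in set(a) | set(b))

def permutation_test(sents_a, sents_b, featurize, trials=1000):
    """Significance of R: how often does a random split of the
    pooled sentences score at least the observed distance?"""
    def counts(sents):
        return Counter(f for s in sents for f in featurize(s))
    observed = r_distance(counts(sents_a), counts(sents_b))
    pooled, hits = sents_a + sents_b, 0
    for _ in range(trials):
        random.shuffle(pooled)
        a, b = pooled[:len(sents_a)], pooled[len(sents_a):]
        if r_distance(counts(a), counts(b)) >= observed:
            hits += 1
    return observed, hits / trials  # distance and p-value
\end{verbatim}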

I addressed this problem \cite{sanders08b} by measuring $R$ between
the nine Government Office Regions of England, using the British
component of the International Corpus of English \cite{nelson02}; see
the discussion in section
\ref{methods-chapter-dialectometry-section}. Speakers were classified
by birthplace. I also introduced Sampson's leaf-ancestor paths as a
feature set \cite{sampson00}. I found statistically significant
differences between most regions, using both trigrams and
leaf-ancestor paths as features. However, $R$'s distances were not
significantly correlated with Levenshtein distances, nor did I show
any qualitative similarities between known syntactic dialect features
and the high-ranked features used by $R$ in producing its distance. As
a result, it is not clear whether the significant $R$ distances
correlate either with dialectometric phonological distance or with
known features found by dialectologists.
% NOTE: 2-d stuff is not the primary problem, since we can't compare
% trees to trees anyway. The primary problem is comparing two corpora
% full of differing sentences. A secondary problem arises to make sure
% that the 2-d-extracted features aren't skewed one way or another. I
% guess I need to come up with a general justification for the
% normalizing and smoothing code from Nerbonne & Wiersma
% Additional problems: phonology is 1-dimensional, with one obvious way
% to decompose words into segments and segments into features. Syntax is
% 2-dimensional, so the decomposition must take several more factors
% into account so that the features it produces are
% useful and comparable to each other. And those features are \ldots
% Overview : Goal, Variables, Method
% Contribution
% Literature Review
% : (including theoretical background)
% Draw hypotheses from earlier studies
% Method
% :
% Experiment section as 'Corpus' section
% Goal: To extend existing measurement methods. To measure them
% better. To measure them on more complete data.

\section{Overview of the Dissertation}
The problem outlined in the previous section is that dialectometry
lacks a statistical method designed for syntax that does not require
the linguist to specify ad-hoc features manually. This dissertation
addresses that lack directly by applying such a method to a dialect
corpus, then comparing the results to the existing syntactic
dialectology literature on Swedish, as well as to phonological work
using established dialectometry methods. In addition, it tests
variations of the experimental parameters in order to identify the
highest-performing ones. In summary, this analysis allows future
dialectometry studies to include syntactic as well as phonological
analyses, with an idea of the best method and parameters to use.

There are three research questions, given in chapter
\ref{questions-chapter}, that must be answered to determine the
reliability of this measure. First, does the measure agree with the
results of dialectology? Previous work has not addressed this
question, but it is crucial that a new measure reproduce the results
of previous linguistic work. To answer this question, the Swedish
dialect distance results will be processed in a number of ways so that
they are comparable to previous dialect work on Swedish.

Second, which parameter variations produce the best agreement with
dialectology work? Both the distance measure and the feature set can
be varied, as well as a number of other parameter settings, mostly
dealing with controlling for the effects of corpus size. The distance
measures include simple measures like $R$, which is a sum of
differences; more complex variants such as Jensen-Shannon divergence,
which is a sum of logarithmic differences; and cosine similarity,
which models each corpus as a vector in high-dimensional space and
finds the angle between two corpus vectors. Feature sets can be even
more varied, although all the feature sets discussed here assume that
the word is the basic unit of syntactic analysis and that words are
naturally grouped into sentences. One example feature set is
part-of-speech trigrams, which are simply triples of adjacent parts of
speech. Leaf-ancestor paths and leaf-head paths use the syntactic
structure of the sentence, with leaf-ancestor paths based on
constituent (phrase-structure) grammars and leaf-head paths based on
dependency grammars.
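
For concreteness, a minimal sketch of these three combination methods
over relative-frequency feature vectors follows; the exact smoothing
and normalization used in the experiments is described in chapter
\ref{methods-chapter}.
\begin{verbatim}
from math import log, sqrt

def r_sum(p, q):
    """R: sum of absolute differences in relative frequency."""
    return sum(abs(p.get(f, 0) - q.get(f, 0)) for f in set(p) | set(q))

def jensen_shannon(p, q):
    """JSD: mean KL divergence of p and q from their midpoint."""
    def kl(a, mid):
        return sum(v * log(v / mid[f]) for f, v in a.items() if v > 0)
    mid = {f: (p.get(f, 0) + q.get(f, 0)) / 2 for f in set(p) | set(q)}
    return (kl(p, mid) + kl(q, mid)) / 2

def cosine_similarity(p, q):
    """Cosine of the angle between the two corpus vectors."""
    dot = sum(p[f] * q[f] for f in set(p) & set(q))
    norm = lambda v: sqrt(sum(x * x for x in v.values()))
    return dot / (norm(p) * norm(q))

p = {("PRP", "VB", "PRP"): 0.6, ("VB", "PRP", "TO"): 0.4}
q = {("PRP", "VB", "PRP"): 0.5, ("VB", "PRP", "PRP"): 0.5}
print(r_sum(p, q), jensen_shannon(p, q), cosine_similarity(p, q))
\end{verbatim}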

Third, does the measure agree with the results of phonological
dialectometry? Agreement is not required; phonological and syntactic
dialect boundaries may disagree. But they are more likely to agree
than to disagree, so if the two dialectometric measurements agree,
this inspires confidence in the new method based on the old method's
reliability.

To answer the three research questions, I start with the statistical
method outlined in the previous section, along with the parameter
variations described above; both are detailed in chapter
\ref{methods-chapter}. To make sure that the results are comparable to
previous dialectology, I use the dialect corpus Swediasyn, a
transcription of interviews recorded in villages throughout Sweden;
the interviewees were balanced between older and younger men and
women. Generating features from the Swediasyn requires a good deal of
processing, because the corpus is a transcription with no syntactic
annotation. To annotate it, I use a number of automatic annotators
trained on Talbanken, a corpus of spoken and written
Swedish. However, Talbanken does not include dialect sources, so some
error is expected during the annotation process. After annotation,
feature generation is straightforward: transformation of parse trees
and other annotations. Because automatic annotators should make
identical mistakes when annotating identical dialect structures, the
resulting features should still contribute usefully to distance,
despite being linguistically incorrect.

After measuring distances between the interview sites, a number of
analytic methods are applied to the distances so that they can be
compared to dialectology work. The methods are a test of significance,
a test of correlation, cluster dendrograms and consensus trees,
composite cluster maps, multi-dimensional scaling, and feature
ranking. The tests of significance and correlation establish the
distances' trustworthiness and their fit with dialectology's
assumptions, respectively. The consensus trees and multi-dimensional
scaling both produce maps, which allow the linguist to visually
compare the results with traditional region maps. In the same way,
composite cluster maps allow visual comparison of the results to
isogloss bundles from dialectology. Finally, feature ranking allows
the linguist to view the features that contribute most to separating
two regions; these features can be compared to the dialect phenomena
cataloged by dialectologists.
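
As an indication of how these analyses consume the distances, the
sketch below feeds a site-by-site distance matrix to off-the-shelf
hierarchical clustering and multi-dimensional scaling routines; the
site names and matrix here are invented for illustration.
\begin{verbatim}
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.manifold import MDS

sites = ["A", "B", "C", "D"]             # hypothetical sites
dists = np.array([[0.0, 0.2, 0.5, 0.6],  # invented distances
                  [0.2, 0.0, 0.4, 0.5],
                  [0.5, 0.4, 0.0, 0.1],
                  [0.6, 0.5, 0.1, 0.0]])

# Cluster dendrogram: group-average clustering over the distances.
tree = linkage(squareform(dists), method="average")
dendrogram(tree, labels=sites)

# Multi-dimensional scaling: embed the sites in two dimensions so
# that map distance approximates measured distance.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dists)
print(dict(zip(sites, coords.round(2).tolist())))
\end{verbatim}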
% Note: There are no isogloss bundles in Swedish dialectology. But
% whatever. (Also I don't really use feature ranking to compare to
% dialectology)

The results in chapter \ref{results-chapter} are presented in the same
order as the corresponding analyses appear in chapter
\ref{methods-chapter}. The dissertation concludes with discussion in
chapter \ref{discussion-chapter}, where I compare the results to the
dialectology and phonological dialectometry of Swedish. I then discuss
the relation of this work to previous work in syntactic dialectology,
detailing its contribution to the field. I finish by presenting
avenues for future work: with a statistical measure of dialect
distance, dialectometry can analyze syntactic features as well as
phonological and lexical ones, producing more complete analyses.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "dissertation.tex"
%%% End: