@@ -18,11 +18,18 @@ <h1>RFC on Sparse matrices in R</h1>
distributed as part of PCx by Czyzyk, Mehrotra, Wagner, and
Wright and is copyrighted by the University of Chicago.
</p>
+ <p>
+ Recently I have become very interested in certain sparse matrix
+ calculations myself and have looked at some of the available
+ Open Source software for the sparse Cholesky decomposition.
+ While I certainly appreciate the work that Roger and Pin have
+ done, I will propose a slightly different implementation.
+ </p>
<h2>Representations of sparse matrices</h2>
<p>
Conceptually, the simplest representation of a sparse matrix is
as a triplet of an integer vector <i>i</i> giving the row
- numbers, an integer vector <i>j</i> giving the column numbers
+ numbers, an integer vector <i>j</i> giving the column numbers,
and a numeric vector <i>x</i> giving the non-zero values in the
matrix. An S4 class definition might be
</p>
@@ -33,10 +40,11 @@ <h2>Representations of sparse matrices</h2>
Dim = "integer"))
</pre>
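The triplet idea is easy to sketch outside R as well. Here is a minimal pure-Python stand-in (an illustration only, not part of the proposal), with plain lists playing the role of the i/j/x/Dim slots:

```python
# Triplet (coordinate) representation of a sparse matrix: parallel
# vectors of row indices, column indices, and values, plus the shape.
# 0-based indices here; the R class would use 1-based indices.
i = [0, 1, 3, 2]                 # row numbers
j = [0, 1, 1, 3]                 # column numbers
x = [10.0, 20.0, 30.0, 40.0]     # non-zero values
Dim = (4, 4)                     # matrix dimensions

def triplet_to_dense(i, j, x, Dim):
    """Expand a triplet representation into a dense list-of-rows matrix."""
    m = [[0.0] * Dim[1] for _ in range(Dim[0])]
    for r, c, v in zip(i, j, x):
        m[r][c] += v             # += so redundant triplets are summed
    return m
```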
<p>
- The triplet representation would be row-oriented if elements in
- the same row were adjacent or column-oriented if elements in the
+ The triplet representation is row-oriented if elements in
+ the same row are adjacent and column-oriented if elements in the
same column are adjacent. The compressed sparse row (csr)
- and compressed sparse column (csc) representations are similar
+ (or compressed sparse column - csc) representation is similar
to row-oriented triplet (column-oriented triplet) except that
<i>i</i> (<i>j</i>) just stores the index of the first element
in the row (column). (There are a couple of other details but
@@ -49,15 +57,15 @@ <h2>Representations of sparse matrices</h2>
The preferred representation of sparse matrices in the SparseM
package is csr. <a href="http://www.mathworks.com/">Matlab</a>
uses csc. We hope that <a
- href="http://www.octave.org/">Octave</a> will also do at some
- time. There are certain advantages to csc in systems like R and
- Matlab where dense matrices are stored in column-major order.
- For example, Sivan Toledo's <a
+ href="http://www.octave.org/">Octave</a> will also use this
+ representation. There are certain advantages to csc in systems
+ like R and Matlab where dense matrices are stored in
+ column-major order. For example, Sivan Toledo's <a
href="http://www.tau.ac.il/~stoledo/taucs">TAUCS</a> library and
Tim Davis's <a
href="http://www.cise.ufl.edu/research/sparse/umfpack">UMFPACK</a>
- library both use level-3 BLAS in certain sparse matrix
- computations.
+ library are both based on csc and can both use level-3 BLAS in
+ certain sparse matrix computations.
</p>
<p>
I feel that compatibility with Matlab (and, we hope, Octave), the
@@ -74,11 +82,12 @@ <h2>Applications of sparse matrices</h2>
large sparse contingency tables.
</p>
<p>
- As Roger and Pin have pointed out, the key to solving large
- linear models quickly and with a minimum of storage requirements
- will be in providing a way for <code>model.matrix</code> to
- generate a sparse model matrix <code>X</code> or a sparse symmetric
- representation of <code>X'X</code> and <code>X'y</code>.
+ As Roger and Pin have pointed out, the key to estimating
+ parameters in large linear models quickly and with minimal
+ storage requirements will be in providing a way for
+ <code>model.matrix</code> to generate a sparse model matrix
+ <code>X</code> or a sparse symmetric representation of
+ <code>X'X</code> and <code>X'y</code>.
</p>
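The follow-on computation is easy to state: given X'X and X'y, the coefficients solve (X'X) b = X'y, typically through a Cholesky factorization X'X = LL'. A dense, pure-Python sketch of that step (a stand-in for the sparse solvers discussed in the surrounding text; the function names are mine):

```python
# Solve (X'X) b = X'y via Cholesky: factor X'X = L L', then do a
# forward solve (L y = X'y) and a backward solve (L' b = y).
# Dense illustration only; a sparse factorization avoids storing zeros.
import math

def cholesky(A):
    """Lower-triangular L with A = L L' for symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_chol(L, b):
    """Solve L L' x = b by forward then backward substitution."""
    n = len(L)
    y = [0.0] * n
    for i in range(n):                        # forward: L y = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):              # backward: L' x = y
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

# tiny example with X'X = [[4, 2], [2, 3]] and X'y = [10, 8]
b = solve_chol(cholesky([[4.0, 2.0], [2.0, 3.0]]), [10.0, 8.0])
```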
<p>
Assuming that we have a sparse representation of the model
@@ -99,24 +108,24 @@ <h2>Applications of sparse matrices</h2>
</p>
<p>
For statistical analysis of a linear model we probably also want
- at least the standard errors of the coefficient estimates and
- for that we want an inverse of the Cholesky factor. TAUCS has
- an inverse factorization routine
- <code>taucs_ccs_factor_xxt</code> that can provide a sparse
- representation of the inverse. I think that we want to use that
- for a linear model analysis. We can use the multifrontal solver
- if we only want coefficients.
- </p>
- <p>
- There will be a tradeoff between speed from reordering and
- information available in the original ordering of the rows and
- columns. The simplest way to determine the
- sequential sums of squares for the terms in the model is to
- maintain the column ordering in <code>X</code> but that could
- result in dramatic amounts of fill-in for the sparse Cholesky
- and especially for the inverse factorization. I think it is
- best to compromise and get the inverse factorization of the
- reordered matrix. This can provide standard errors and
+ at least the standard errors of the coefficient estimates, which
+ means we want an inverse of the Cholesky factor. TAUCS has an
+ inverse factorization routine <code>taucs_ccs_factor_xxt</code>
+ that can provide a sparse representation of the inverse. I
+ think that we want to use that for a linear model analysis. We
+ can use the multifrontal solver if we only want coefficients.
+ </p>
+ <p>
+ When working with linear models there will be a tradeoff between
+ the speed boost available by reordering rows and columns and the
+ statistical information available in the original ordering of
+ the rows and columns. For example, the simplest way to
+ determine the sequential sums of squares of the terms in the
+ model is to maintain the column ordering in <code>X</code> but
+ that could result in dramatic amounts of fill-in for the sparse
+ Cholesky and especially for the inverse factorization. I think
+ it is best to compromise and obtain the inverse factorization of
+ the reordered matrix. This can provide standard errors and
correlations of coefficients but not the sequential sums of
squares. (At least I don't know how to get them from the
reordered matrix.)
@@ -132,7 +141,7 @@ <h2>Applications of sparse matrices</h2>
with <a
href="http://www.stat.wisc.edu/~bates/reports/MixedComp.pdf">partially
crossed grouping factors</a>. For these I need to manipulate
- both sparse contingency tables and associated sparse positive
+ both sparse contingency tables and some associated sparse positive
definite matrices.
</p>
<h2>Utilities for sparse matrices</h2>
@@ -148,11 +157,16 @@ <h2>Utilities for sparse matrices</h2>
UMFPACK is a set of routines for solving unsymmetric sparse
linear systems with the Unsymmetric MultiFrontal method. It has
a couple of very convenient routines for switching between csc
- and a triplet representation. As described in the UMFPACK
- documentation, a general triplet to csc converter allows simple
- ways to write operations like transposition of matrices
- (convert csc to triplet, interchange <i>i</i> and <i>j</i>,
- convert back to csc).
+ and a triplet representation. The triplet to csc converter is
+ quite general in that it allows redundant triplet
+ representations (more than one entry for the same position;
+ multiple entries have their values summed) and arbitrary
+ ordering. This allows convenient creation of sparse contingency
+ tables (build up the triplet representation, then compress it).
+ As described in the UMFPACK documentation, it also allows simple
+ ways to write operations like transposition of matrices (convert
+ csc to triplet, interchange <i>i</i> and <i>j</i>, convert back
+ to csc).
</p>
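A sketch of the behavior being described, in pure Python (an illustration of the semantics, not UMFPACK's actual code): triplets arrive in arbitrary order, duplicates are summed, and the three csc arrays come out with rows sorted within each column.

```python
# Convert triplets (ti, tj, tx) to csc arrays: column pointers p,
# row indices i, values x.  Entries that repeat a (row, col) position
# are summed, which is what makes building contingency tables easy.
def triplet_to_csc(ti, tj, tx, ncol):
    cols = [dict() for _ in range(ncol)]       # one accumulator per column
    for r, c, v in zip(ti, tj, tx):
        cols[c][r] = cols[c].get(r, 0.0) + v   # sum redundant entries
    p, i, x = [0], [], []
    for c in range(ncol):
        for r in sorted(cols[c]):              # rows sorted within a column
            i.append(r)
            x.append(cols[c][r])
        p.append(len(i))                       # start of the next column
    return p, i, x

# the two (0, 1) triplets are summed into a single entry of 3.0
p, i, x = triplet_to_csc([0, 2, 0, 1], [1, 0, 1, 2], [1.0, 5.0, 2.0, 7.0], 3)
```

Transposition then amounts to calling the converter with the row and column vectors interchanged, exactly the csc-to-triplet-to-csc round trip described above.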
<p>
As a side note, it appears that the UMFPACK/AMD form of the csc
@@ -161,7 +175,8 @@ <h2>Utilities for sparse matrices</h2>
csc matrix (in TAUCS these are called ccs) the result does not
have the rows in increasing order within each column. AMD
doesn't like this and I find it confusing when trying to examine
- the matrix.
+ the matrix. Again, the csc to triplet to csc conversion can be
+ used to remove this problem.
</p>
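The same cleanup can also be written directly, without the round trip: sort each column's (row, value) pairs in place. A hypothetical helper, equivalent in effect to the csc-to-triplet-to-csc conversion:

```python
# Given csc arrays (column pointers p, row indices i, values x) whose
# row indices are not in increasing order within a column, sort each
# column's entries by row index.
def sort_csc_columns(p, i, x):
    i2, x2 = [], []
    for c in range(len(p) - 1):
        pairs = sorted(zip(i[p[c]:p[c + 1]], x[p[c]:p[c + 1]]))
        i2.extend(r for r, _ in pairs)
        x2.extend(v for _, v in pairs)
    return p, i2, x2

# column 0 stored with rows [2, 0]; after cleanup the rows read [0, 2]
p, i, x = sort_csc_columns([0, 2, 3], [2, 0, 1], [5.0, 1.0, 7.0])
```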
166
181
< h2 > Licenses</ h2 >
167
182
< p >
@@ -206,7 +221,7 @@ <h2>Proposed plan</h2>
<address><a href="mailto:[email protected]">Douglas Bates</a></address>
<!-- Created: Tue Oct 21 13:45:49 CDT 2003 -->
<!-- hhmts start -->
- Last modified: Tue Oct 21 15:47:25 CDT 2003
+ Last modified: Tue Oct 21 16:11:58 CDT 2003
<!-- hhmts end -->
</body>
</html>