Skip to content

Commit d732cbe

Browse files
author
bates
committed
*** empty log message ***
git-svn-id: https://svn.r-project.org/R-dev-web/trunk@1522 c52295ea-58df-0310-926a-d16021944841
1 parent 8380856 commit d732cbe

File tree

1 file changed

+56
-41
lines changed

1 file changed

+56
-41
lines changed

Sparse.html

+56-41
Original file line numberDiff line numberDiff line change
@@ -18,11 +18,18 @@ <h1>RFC on Sparse matrices in R</h1>
1818
distributed as part of PCx by Czyzyk, Mehrotra, Wagner, and
1919
Wright and is copywrite by the University of Chicago.
2020
</p>
21+
<p>
22+
Recently I become very interested in certain sparse matrix
23+
calculations myself and have looked at some of the available
24+
Open Source software for the sparse Cholesky decomposition.
25+
While I certainly appreciate the work that Roger and Pin have
26+
done I will propose a slightly different implementation.
27+
</p>
2128
<h2>Representations of sparse matrices</h2>
2229
<p>
2330
Conceptually, the simplest representation of a sparse matrix is
2431
as a triplet of an integer vector <i>i</i> giving the row
25-
numbers, an integer vector <i>j</i> giving the column numbers
32+
numbers, an integer vector <i>j</i> giving the column numbers,
2633
and a numeric vector <i>x</i> giving the non-zero values in the
2734
matrix. An S4 class definition might be
2835
</p>
@@ -33,10 +40,11 @@ <h2>Representations of sparse matrices</h2>
3340
Dim = "integer"))
3441
</pre>
3542
<p>
36-
The triplet representation would be row-oriented if elements in
37-
the same row were adjacent or column-oriented if elements in the
43+
The triplet representation is row-oriented if elements in
44+
the same row were adjacent and column-oriented if elements in the
3845
same column were adjacent. The compressed sparse row (csr)
39-
and compressed sparse column (csc) representations are similar
46+
(or compressed sparse column - csc) representation is
47+
similar
4048
to row-oriented triplet (column-oriented triplet) except that
4149
<i>i</i> (<i>j</i>) just stores the index of the first element
4250
in the row (column). (There are a couple of other details but
@@ -49,15 +57,15 @@ <h2>Representations of sparse matrices</h2>
4957
The preferred representation of sparse matrices in the SparseM
5058
package is csr. <a href="http://www.mathworks.com/">Matlab</a>
5159
uses csc. We hope that <a
52-
href="http://www.octave.org/">Octave</a> will also do at some
53-
time. There are certain advantages to csc in systems like R and
54-
Matlab where dense matrices are stored in column-major order.
55-
For example, Sivan Toledo's <a
60+
href="http://www.octave.org/">Octave</a> will also use this
61+
representation. There are certain advantages to csc in systems
62+
like R and Matlab where dense matrices are stored in
63+
column-major order. For example, Sivan Toledo's <a
5664
href="http://www.tau.ac.il/~stoledo/taucs">TAUCS</a> library and
5765
Tim Davis's <a
5866
href="http://www.cise.ufl.edu/research/sparse/umfpack">UMFPACK</a>
59-
library both use level-3 BLAS in certain sparse matrix
60-
computations.
67+
library are both based on csc and can both use level-3 BLAS in
68+
certain sparse matrix computations.
6169
</p>
6270
<p>
6371
I feel that compatibility with Matlab (and, we hope, Octave), the
@@ -74,11 +82,12 @@ <h2>Applications of sparse matrices</h2>
7482
large sparse contingency tables.
7583
</p>
7684
<p>
77-
As Roger and Pin have pointed out, the key to solving large
78-
linear models quickly and with a minimum of storage requirements
79-
will be in providing a way for <code>model.matrix</code> to
80-
generate a sparse model matrix <code>X</code> or a sparse symmetric
81-
representation of <code>X'X</code> and <code>X'y</code>.
85+
As Roger and Pin have pointed out, the key to estimating
86+
parameters in large linear models quickly and with minimal
87+
storage requirements will be in providing a way for
88+
<code>model.matrix</code> to generate a sparse model matrix
89+
<code>X</code> or a sparse symmetric representation of
90+
<code>X'X</code> and <code>X'y</code>.
8291
</p>
8392
<p>
8493
Assuming that we have a sparse representation of the model
@@ -99,24 +108,24 @@ <h2>Applications of sparse matrices</h2>
99108
</p>
100109
<p>
101110
For statistical analysis of a linear model we probably also want
102-
at least the standard errors of the coefficient estimates and
103-
for that we want an inverse of the Cholesky factor. TAUCS has
104-
an inverse factorization routine
105-
<code>taucs_ccs_factor_xxt</code> that can provide a sparse
106-
representation of the inverse. I think that we want to use that
107-
for a linear model analysis. We can use the multifrontal solver
108-
if we only want coefficients.
109-
</p>
110-
<p>
111-
There will be a tradeoff between speed from reordering and
112-
information available in the original ordering of the rows and
113-
columns. The simplest way to determine the
114-
sequential sums of squares for the terms in the model is to
115-
maintain the column ordering in <code>X</code> but that could
116-
result in dramatic amounts of fill-in for the sparse Cholesky
117-
and especially for the inverse factorization. I think it is
118-
best to compromise and get the inverse factorization of the
119-
reordered matrix. This can provide standard errors and
111+
at least the standard errors of the coefficient estimates which
112+
means we want an inverse of the Cholesky factor. TAUCS has an
113+
inverse factorization routine <code>taucs_ccs_factor_xxt</code>
114+
that can provide a sparse representation of the inverse. I
115+
think that we want to use that for a linear model analysis. We
116+
can use the multifrontal solver if we only want coefficients.
117+
</p>
118+
<p>
119+
When working with linear models there will be a tradeoff between
120+
the speed boost available by reordering rows and columns and the
121+
statistical information available in the original ordering of
122+
the rows and columns. For example, the simplest way to
123+
determine the sequential sums of squares of the terms in the
124+
model is to maintain the column ordering in <code>X</code> but
125+
that could result in dramatic amounts of fill-in for the sparse
126+
Cholesky and especially for the inverse factorization. I think
127+
it is best to compromise and obtain the inverse factorization of
128+
the reordered matrix. This can provide standard errors and
120129
correlations of coefficients but not the sequential sums of
121130
squares. (At least I don't know how to get them from the
122131
reordered matrix.)
@@ -132,7 +141,7 @@ <h2>Applications of sparse matrices</h2>
132141
with <a
133142
href="http://www.stat.wisc.edu/~bates/reports/MixedComp.pdf">partially
134143
crossed grouping factors</a>. For these I need to manipulate
135-
both sparse contingency tables and associated sparse positive
144+
both sparse contingency tables and some associated sparse positive
136145
definite matrices.
137146
</p>
138147
<h2>Utilities for sparse matrices</h2>
@@ -148,11 +157,16 @@ <h2>Utilities for sparse matrices</h2>
148157
UMFPACK is a set of routines for solving unsymmetric sparse
149158
linear systems with the Unsymmetric MultiFrontal method. It has
150159
a couple of very convenient routines for switching between csc
151-
and a triplet representation. As described in the UMFPACK
152-
documentation, a general triplet to csc converter allows simple
153-
ways to write operations like transposition of matrices
154-
(convert csc to triplet, interchange <i>i</i> and <i>j</i>,
155-
convert back to csc).
160+
and a triplet representation. The triplet to csc converter is
161+
quite general in that it allows redundant triplet
162+
representations (more than one entry for the same position -
163+
multiple entries have their values summed) and arbitrary
164+
ordering. This allows convenient creation of sparse contingency
165+
tables (build up the triplet representation then compress it).
166+
As described in the UMFPACK documentation, it also allows simple
167+
ways to write operations like transposition of matrices (convert
168+
csc to triplet, interchange <i>i</i> and <i>j</i>, convert back
169+
to csc).
156170
</p>
157171
<p>
158172
As a side note, it appears that the UMFPACK/AMD form of the csc
@@ -161,7 +175,8 @@ <h2>Utilities for sparse matrices</h2>
161175
csc matrix (in TAUCS these are called ccs) the result does not
162176
have the rows in increasing order within each column. AMD
163177
doesn't like this and I find it confusing when trying to examine
164-
the matrix.
178+
the matrix. Again the csc to triplet to csc conversion can be
179+
used to remove this problem.
165180
</p>
166181
<h2>Licenses</h2>
167182
<p>
@@ -206,7 +221,7 @@ <h2>Proposed plan</h2>
206221
<address><a href="mailto:[email protected]">Douglas Bates</a></address>
207222
<!-- Created: Tue Oct 21 13:45:49 CDT 2003 -->
208223
<!-- hhmts start -->
209-
Last modified: Tue Oct 21 15:47:25 CDT 2003
224+
Last modified: Tue Oct 21 16:11:58 CDT 2003
210225
<!-- hhmts end -->
211226
</body>
212227
</html>

0 commit comments

Comments
 (0)