-
Let me try to understand Jax more carefully. There is a class of PRNGs (splittable PRNGs) that have a number of nice properties.
I'm not exactly sure how the splitting works, but I can ignore those details for the moment. What this essentially means is that the sample space … In Jax, suppose I wanted to create two independent uniforms, I would do:

```python
def X(key):
    k1, k2 = split(key)
    return uniform(k1)

def Y(key):
    k1, k2 = split(key)
    return uniform(k2)
```
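The key-splitting idea can be made concrete with a toy sketch (this is not Jax's actual algorithm; `split` and `uniform` here are hash-based stand-ins I invented for illustration). `X` and `Y` receive the same key and perform the same split, but consume different halves, so both are deterministic in the key yet decorrelated from each other.

```python
import hashlib

# Toy splittable PRNG: a "key" is an integer; split derives two
# new keys deterministically by hashing the old key with a tag.
def split(key):
    def h(tag):
        digest = hashlib.sha256(f"{key}:{tag}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return h("left"), h("right")

def uniform(key):
    # Map a key deterministically to a float in [0, 1).
    return (key % 10**9) / 10**9

def X(key):
    k1, k2 = split(key)
    return uniform(k1)   # X consumes the first half of the split

def Y(key):
    k1, k2 = split(key)
    return uniform(k2)   # Y consumes the second half
```

Calling `X(key)` twice gives the same value (determinism), while `X(key)` and `Y(key)` give unrelated values even though they were handed the same key.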
-
After thinking, I have come to a few conclusions. First, to clear up some confusion about what Jax is doing with its keys: essentially, in Jax, the Omega space is the set of 64-bit integers. Splitting takes one key and deterministically derives two new keys from it. What do we actually need? I think there are a few things.
```julia
function f(ω)
  a = (1 ~ Normal(0, 1))(ω)
end
```

This random variable could clash with some other random variable in the same file, another file, or some other person's module. That's bad for composability. Fundamentally, if we have an addressing system, then there needs to be some mechanism by which the addresses I choose do not clash with the ones you choose. One idea that follows from this is to use the addressing scheme that Julia already provides:

```julia
function f(ω)
  is = ids(f)
  a = (is[1] ~ Normal(0, 1))(ω)
end
```

Here `ids(f)` derives addresses unique to `f` itself. Would there be any weird consequences of this? Whether you decide to inline a function or not could change the ids, but I don't think there's any observable effect of this; I can't think of any... A second requirement is that it not be so cumbersome to write. If we were using Cassette, you can imagine tracking the current function in the context, and when someone does …
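To make the clash problem concrete, here is a hypothetical Python sketch (the `sample` helper and the dict-as-ω are invented for illustration, not Omega's API): two functions that both pick the bare address 1 share a value unintentionally, while scoping addresses by the function keeps them distinct.

```python
# ω modeled as a table from addresses to values.
def sample(omega, addr):
    # Lazily fill ω at addr with a placeholder "random" value.
    return omega.setdefault(addr, hash(addr) % 100)

def f(omega):
    return sample(omega, 1)          # bare address 1

def g(omega):
    return sample(omega, 1)          # same bare address -> same value!

# Fix: scope addresses by the enclosing function, like ids(f) above.
def f2(omega):
    return sample(omega, ("f2", 1))  # address scoped to f2

def g2(omega):
    return sample(omega, ("g2", 1))  # distinct scoped address
```

With a fresh `omega`, `f` and `g` alias the same slot, whereas `f2` and `g2` populate two different slots.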
-
Another thought regards projection. This is somewhat inspired by the Jax splittable-rng approach. I think I've been entertaining two separate ideas of a projection which are not quite the same. Example:

```julia
function f(ω)
  a = (1 ~ StdNormal)(ω)
end
```

This currently will be equivalent to … The idea with a projection is that if … Does this buy us anything?

```julia
function f(ω)
  ω1, ω2 = split(ω)
  a = uniform(ω1)
  b = uniform(ω2)
end
```

This projection is what I was doing before, but: …
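A minimal sketch of split-as-projection, assuming ω is a function from ids to uniforms and ids compose by pairing (this mirrors `proj(ω, id) = id2 -> ω(pair(id, id2))` from the semantics below; the hash-based ω is my own stand-in):

```python
import hashlib

def make_omega(seed):
    # ω: maps an id (any printable value) to a uniform in [0, 1).
    def omega(id_):
        digest = hashlib.sha256(f"{seed}:{id_}".encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64
    return omega

def proj(omega, id_):
    # proj(ω, id) = id2 -> ω(pair(id, id2))
    return lambda id2: omega((id_, id2))

def split(omega):
    # Splitting is then just projection onto two fixed ids.
    return proj(omega, 1), proj(omega, 2)

omega = make_omega(0)
w1, w2 = split(omega)
a = w1(0)   # a uniform from the first projected subspace
b = w2(0)   # same inner id 0, but a disjoint subspace of ω
```

Because the two projections pair their ids with 1 and 2 respectively, `w1` and `w2` never look up the same slot of ω, which is exactly what the two halves of a key split achieve in Jax.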
-
Another axis is the semantics of classes / plates and so on. I've wavered between two different approaches.

Approach 1

Here, things are simple: the sample space is essentially an infinite sequence of primitives. The main nice thing is that it's very simple, with straightforward semantics. It also allows us to easily model plates. The main downside is about composition. If I want to make a new class, essentially a plate, anything that I want to vary (include within the plate) has to already be a class. In other words, I can't make independent copies of a random variable.

Approach 2

In approach two: … This is very flexible. We'd have to implement it with the projection mechanism. What worries me is that: … So does the …

The problems this brings, however (and it's all coming back to me now), are in things like interventions. What does … So we can't intervene on … The same problem arises in memoization and around proposals.
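The independent copies that Approach 1 cannot express can be sketched with the id-prefixing operator `id1 <| class` from the current semantics described below (a hypothetical Python rendering, not Omega's implementation; the hash-based ω is an assumption):

```python
import hashlib

def make_omega(seed):
    # ω: maps a (possibly nested) id to a uniform in [0, 1).
    def omega(id_):
        digest = hashlib.sha256(f"{seed}:{id_}".encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64
    return omega

def std_unif(id_, omega):
    # A primitive class: member id -> uniform sample.
    return omega(("StdUnif", id_))

def prefix(id1, cls):
    # id1 <| class = (id2, ω) -> class(pair(id1, id2), ω)
    return lambda id2, omega: cls((id1, id2), omega)

U_a = prefix("a", std_unif)   # one independent copy of the class
U_b = prefix("b", std_unif)   # another independent copy

omega = make_omega(0)
x = U_a(1, omega)   # member 1 of copy "a"
y = U_b(1, omega)   # member 1 of copy "b": same member id, disjoint slots of ω
```

Prefixing keeps every address a copy touches disjoint from every other copy's addresses, so copies can be made after the fact, without the inner thing having been declared a class up front.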
-
Another question: should the plate ids be the same kind of ids as the projection ids? I wrote this linear regression model:

```julia
xs = rand(N)
plate(i, ω) = linear_model(xs[i], M(ω), C(ω)) + ((i, 1) ~ Normal(0, 0.1))(ω)
Y = (1:N) .~ plate
```

The issue is that … On the other hand, it seems like the indices for a plate should be integers or Cartesian indexes and not tied to the indices we use for projection. Another way to think about it is that for the primitives, we need some indexing system to determine one from another, which is currently … Could we? If we interpret the final index as a mix …

Option 2: `plate(i, ω) = linear_model(xs[i], M(ω), C(ω)) + ((i, 1) ~ Normal(0, 0.1))(ω)`
Option 3: …

Option 2 seems most appealing.
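The plate model above can be sketched in Python, with the per-datapoint noise addressed by `(i, 1)` tuples as in the snippet (a toy: `linear_model`, `M`, and `C` are stand-ins for the latent slope and intercept, and ω is replaced by a deterministic seeded lookup):

```python
import random

N = 5
xs = [i / N for i in range(N)]

def omega_normal(id_, seed=0):
    # Toy ω: a deterministic standard normal looked up at id `id_`.
    return random.Random(f"{seed}:{id_}").gauss(0.0, 1.0)

def normal(id_, mu, sigma):
    return omega_normal(id_) * sigma + mu

# Stand-ins for the latent slope M and intercept C.
M = normal("M", 0.0, 1.0)
C = normal("C", 0.0, 1.0)

def linear_model(x, m, c):
    return m * x + c

def plate(i):
    # Per-datapoint noise is addressed by the pair (i, 1), as in option 2.
    return linear_model(xs[i], M, C) + normal((i, 1), 0.0, 0.1)

Y = [plate(i) for i in range(N)]   # analogue of Y = (1:N) .~ plate
```

Every datapoint shares the same `M` and `C` (one lookup each), while the noise terms get distinct `(i, 1)` addresses, so the `Y` entries are conditionally independent given the latents.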
-
Riffing off #122 and #120 and #6
This is to discuss the nature of independence, conditional independence and related matters.
There have been many iterations of the semantics. Currently, it is as follows:

- There are primitive classes `StdUnif1`, `StdUnif2`, ..., `StdNormal1`, `StdNormal2`, ...
- A class is applied as `class(id, ω)`; a member of the class is `ω -> class(id, ω)`.
- For example, `normal(μ, σ) = (id, ω) -> StdNormal(id, ω) * σ + μ`. If `X = normal(2, 3)` then `X` is a class of normally distributed randvars with mean 2 and stdv 3.
- If also `Y = normal(10, 3)`, it's not the case that `X` and `Y` are independent classes. The independence only follows between elements in `X` (and between elements in `Y`).
- To get an independent class from `Y_ = normal(10, 3)`, pair the ids: `Y(id, ω) = Y_(pair(id, 2), ω)`. More generally, `id1 <| class = (id2, ω) -> class(pair(id1, id2), ω)`, so `Y = 2 <| normal(10, 3)`.
- `ciid` / `~`: `id ~ class = ω -> class(id, ω)`, and `id ~ X = ω -> X(proj(ω, id))` where `proj(ω, id) = id2 -> ω(pair(id, id2))`.
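These definitions can be rendered as a small Python sketch (an illustration of the semantics only, not Omega's implementation; the hash-based ω is my own assumption):

```python
import hashlib
import math

def make_omega(seed):
    # ω: maps a (possibly paired) id to a uniform strictly inside (0, 1).
    def omega(id_):
        d = hashlib.sha256(f"{seed}:{id_}".encode()).digest()
        return (int.from_bytes(d[:8], "big") + 1) / (2**64 + 2)
    return omega

def StdNormal(id_, omega):
    # Primitive class: two paired lookups into ω, Box-Muller transform.
    u1, u2 = omega((id_, "u1")), omega((id_, "u2"))
    return math.sqrt(-2 * math.log(u1)) * math.cos(2 * math.pi * u2)

def normal(mu, sigma):
    # normal(μ, σ) = (id, ω) -> StdNormal(id, ω) * σ + μ
    return lambda id_, omega: StdNormal(id_, omega) * sigma + mu

def tilde(id_, cls):
    # id ~ class = ω -> class(id, ω): pick the id-th member of the class.
    return lambda omega: cls(id_, omega)

X = normal(2, 3)    # a class of Normal(2, 3) random variables
x1 = tilde(1, X)    # two distinct members of the class
x2 = tilde(2, X)

omega = make_omega(0)
```

Applying `x1` and `x2` to the same ω gives two different draws from the same distribution, and averaging many members recovers the class's mean of 2.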
The Jax/Dex approach is a little different:

- There are primitive samplers such as `bernoulli(key, p)`, similar to as described above.
- There is `split_key` to turn a key into two keys.

The difference between Omega's `ω` and a Jax key is that a key can only be split, not indexed into: rather than say you want the 10th element, you'd say you want 10 copies and take the 10th one. I'm not sure if that is a fundamental difference.