I've just started having a play with the tomo_challenge data, and I'm curious about some quirks of the dataset, so I've collected a bunch of observations & questions here. I understand that the challenge is designed to be "idealised", but it occurs to me that many of these quirks might favour machine-learning codes, for example, while disadvantaging other methods (I give examples of what I mean below).
My main observations are below; apologies in advance for the long list (and for any silly observations...). I've also included a handful of figures to show what I mean. Each figure includes data from both the (noisy) tomo_challenge dataset and the (noiseless!) MICE2 dataset, for comparison. Full disclosure: I am using MICE2 as my benchmark for 'what I roughly expect to see'.
The colour-redshift relation traced by individual SED tracks appears chaotic. The redshift axis is clearly discretised (due to snapshots in the simulation?), which is not an issue per se, but between adjacent snapshots the colours appear uncorrelated, and colour-redshift tracks abruptly end and restart. This behaviour is mirrored exactly in both the training and validation datasets, so it may not be overly problematic for machine-learning classifiers, but it could unfairly hamper template-based or hybrid approaches that (I would say fairly) assume coherent colour-redshift evolution?
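For concreteness, here's roughly the check I'm doing; just a sketch, and the file name (`training.hdf5`) and column names (`g`, `r`, `redshift_true`) are my guesses rather than the actual schema:

```python
# Sketch: how discretised is the redshift axis, and do colour distributions
# decorrelate between adjacent redshift snapshots? Names are assumptions.
import h5py
import numpy as np

with h5py.File("training.hdf5", "r") as f:   # hypothetical filename
    z = f["redshift_true"][:]
    gr = f["g"][:] - f["r"][:]

# Count distinct redshift values (rounded to absorb float jitter).
z_grid = np.unique(np.round(z, 4))
print(f"{z_grid.size} distinct redshift values for {z.size} galaxies")

# For coherent SED evolution I'd expect neighbouring snapshots to have
# similar colour distributions; flag adjacent slices that don't.
bins = np.linspace(gr.min(), gr.max(), 50)
hists = [np.histogram(gr[np.isclose(z, zc, atol=2e-4)], bins=bins, density=True)[0]
         for zc in z_grid]
for h0, h1, zc in zip(hists[:-1], hists[1:], z_grid[1:]):
    r = np.corrcoef(h0, h1)[0, 1]
    if r < 0.5:  # arbitrary threshold, just to flag suspicious jumps
        print(f"colour distribution decorrelates across z = {zc:.3f} (r = {r:.2f})")
```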
The template discrimination (i.e. the ease of discerning the underlying templates in colour-redshift space) appears to improve with redshift, rather than degrade: at a given magnitude, the colour-redshift space becomes increasingly discretised as a function of redshift.
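This is the kind of plot I mean, again with the same assumed file/column names:

```python
# Sketch: colour vs redshift in a narrow magnitude slice, where the
# discrete template loci become visible. Names are assumptions.
import h5py
import matplotlib.pyplot as plt

with h5py.File("training.hdf5", "r") as f:
    z = f["redshift_true"][:]
    i_mag = f["i"][:]
    ri = f["r"][:] - f["i"][:]

sel = (i_mag > 22.0) & (i_mag < 22.5)        # a fixed-magnitude slice
plt.scatter(z[sel], ri[sel], s=1, alpha=0.2)
plt.xlabel("redshift")
plt.ylabel("r - i")
plt.title("22.0 < i < 22.5")
plt.show()
```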
In the above figure, you can also see that the range of galaxy colours in the tomo_challenge data is considerably smaller than in MICE2, especially in the region 0.3 ≤ z ≤ 0.7, where lots of blue sources are missing. Is this because the tomo_challenge templates are missing low-mass (blue) sources? This could be important prior information for Bayesian codes?
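A quick way to quantify the difference, under the same naming assumptions (the MICE2 filename below is likewise hypothetical):

```python
# Sketch: compare the spread of g-r colours in 0.3 <= z <= 0.7 between the
# two catalogues. File and column names for both datasets are assumptions.
import h5py
import numpy as np

def colour_range(path, gcol="g", rcol="r", zcol="redshift_true"):
    with h5py.File(path, "r") as f:
        z = f[zcol][:]
        gr = f[gcol][:] - f[rcol][:]
    sel = (z >= 0.3) & (z <= 0.7)
    return np.percentile(gr[sel], [1, 99])

for path in ["training.hdf5", "mice2.hdf5"]:   # hypothetical filenames
    lo, hi = colour_range(path)
    print(f"{path}: 1st-99th percentile g-r range = [{lo:.2f}, {hi:.2f}]")
```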
There appear to be small gaps in redshift at low z, in between (what I'm guessing are) snapshots of the simulation. These aren't themselves a problem, but they may be symptomatic of an underlying bug.
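A sketch of how one might locate these gaps, same naming assumptions as above:

```python
# Sketch: flag gaps between consecutive unique redshift values that are much
# wider than the typical snapshot spacing. Column name is an assumption.
import h5py
import numpy as np

with h5py.File("training.hdf5", "r") as f:
    z = np.sort(np.unique(f["redshift_true"][:]))

dz = np.diff(z)
typical = np.median(dz)
wide = dz > 5 * typical                      # arbitrary "suspicious" threshold
for zlo, gap in zip(z[:-1][wide], dz[wide]):
    print(f"gap of {gap:.4f} starting at z = {zlo:.4f} (median dz = {typical:.5f})")
```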
In the above figures, you can also see that the tomo_challenge data appear to lack any large-scale structure? This would bias against hybrid approaches that wish to invoke cross-correlations.
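As a crude check (a sketch only, and it assumes `ra`/`dec` columns are present in the files, which I'm not sure is actually the case):

```python
# Sketch: counts-in-cells test. With real large-scale structure the variance
# of cell counts exceeds the Poisson expectation (variance = mean); a pure
# random catalogue gives variance/mean ~ 1. Assumes ra/dec columns exist.
import h5py
import numpy as np

with h5py.File("training.hdf5", "r") as f:   # hypothetical filename
    ra, dec = f["ra"][:], f["dec"][:]

counts, _, _ = np.histogram2d(ra, dec, bins=100)
counts = counts[counts > 0]                  # ignore cells off the footprint
print(f"variance / mean of cell counts = {counts.var() / counts.mean():.2f} "
      "(~1 would mean no clustering)")
```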
At a given magnitude, the noisy tomo_challenge data extend to much higher redshift than the sources in MICE2. I'm not sure if this is a problem (?), it's mostly just an observation, but it could hamper non-machine-learning codes that invoke a data-driven redshift-magnitude prior.
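Something like the following would quantify this, again with hypothetical file/column names:

```python
# Sketch: upper envelope of the redshift-magnitude relation, here the 95th
# percentile of z per i-band magnitude bin, for each catalogue.
import h5py
import numpy as np

def z_envelope(path, zcol="redshift_true", magcol="i"):
    with h5py.File(path, "r") as f:
        z, mag = f[zcol][:], f[magcol][:]
    edges = np.arange(18.0, 25.5, 0.5)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (mag >= lo) & (mag < hi)
        if sel.sum() > 100:                  # skip poorly populated bins
            print(f"{path}: i in [{lo:.1f}, {hi:.1f}): "
                  f"z95 = {np.percentile(z[sel], 95):.2f}")

for path in ["training.hdf5", "mice2.hdf5"]:   # hypothetical filenames
    z_envelope(path)
```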