\documentclass{article}
\usepackage[de]{ukon-infie}
\usepackage[utf8]{inputenc}
\usepackage{algorithm2e}
\usepackage{amsmath}
\usepackage{graphicx}
% can be de or en
% can be bubble break, topexercise
\Names{Jonas Probst, Simon Giebenhain}
\Lecture[AnaVis]{Analyse und Visualisierung von Informationen}
\Term{WS 2017/18}
\begin{document}
\begin{ukon-infie}[08.11.17]{2}
\begin{exercise}[p=9.5]{}
\question{}{Normalization brings data to a common scale, typically between 0 and 1, which makes the data easier to compare, for both humans and algorithms. \\
One example where normalization is very useful is a scenario in which the data is exponentially distributed (the vast majority of values cluster in a narrow range, and the number of occurrences drops exponentially with increasing distance from it). Here a logarithmic normalization can bring more insight to the visualization, since otherwise the vast majority of data points would look almost identical in the visualization (see the example of the US counties from the lecture).\\
Another example is the visualization of income in Germany (x-axis: income, y-axis: number of people earning that amount per year). Here one could normalize in such a way that all people earning more than 200,000 per year are represented by the same value. This shrinks the x-axis, so that the details of the less wealthy people are not lost.}
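As a sketch, the capping in the income example can be written as follows. The threshold symbol $c$ and the function name $f_{cap}$ are assumptions introduced only for illustration, and incomes are assumed to be non-negative:
\[
f_{cap}(x)=\frac{\min(x,c)}{c}, \qquad c=200000
\]
Every income $x \geq c$ is then mapped to the same value $1$, while incomes below $c$ keep their relative spacing.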
\question{}{Normalization only makes sense for `Grösse' and `Gewicht'; all other columns contain either ordinal or nominal values. Normalizing makes it easier to compare the data between columns, so it is useful in this case.}
\question{}{$f_{lin}(x)=\frac{x-\min}{\max-\min}$, $\min=165$, $\max=203$\\
$f_{lin}(167) \approx 0.0526$\\
$f_{lin}(181) \approx 0.4211$\\
$f_{lin}(192) \approx 0.7105$\\}
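As a check, the three values above follow from substituting into $f_{lin}$. Only the intermediate fractions are added here, and the results are rounded to four decimal places:
\begin{align*}
f_{lin}(167) &= \frac{167-165}{203-165} = \frac{2}{38} \approx 0.0526\\
f_{lin}(181) &= \frac{181-165}{203-165} = \frac{16}{38} \approx 0.4211\\
f_{lin}(192) &= \frac{192-165}{203-165} = \frac{27}{38} \approx 0.7105
\end{align*}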
\question{}{$f_{ln}(x)=\frac{\ln(x)-\ln(\min)}{\ln(\max)-\ln(\min)}$, $\min=52$, $\max=72$\\
$f_{ln}(57) \approx 0.282$\\
$f_{ln}(63) \approx 0.590$\\
$f_{ln}(68) \approx 0.824$\\}
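The logarithmic values can be checked in the same way. The sketch below uses the identity $\ln(a)-\ln(b)=\ln(a/b)$; the intermediate logarithms are rounded to four decimal places, so the last digit of each result may differ slightly depending on rounding:
\begin{align*}
f_{ln}(57) &= \frac{\ln(57/52)}{\ln(72/52)} \approx \frac{0.0918}{0.3254} \approx 0.282\\
f_{ln}(63) &= \frac{\ln(63/52)}{\ln(72/52)} \approx \frac{0.1919}{0.3254} \approx 0.590\\
f_{ln}(68) &= \frac{\ln(68/52)}{\ln(72/52)} \approx \frac{0.2683}{0.3254} \approx 0.824
\end{align*}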
\end{exercise}
\begin{exercise}[p=3]{}
\textbf{Data Cleaning:} Remove noise and deal with inconsistencies and missing values.\\
\textbf{Normalization:} Transform data to a standardized scale which is easier to work with.\\
\textbf{Data Reduction:} Reduce the number of data points by sampling and the number of dimensions by removing redundant (correlated) columns.
\end{exercise}
\begin{exercise}[p=6]{}
\question{}{The goal of sampling is to reduce the size of the data set while preserving as much of its information content as possible. \\
Another reason is the following: when one wants to gain information about a small group of the population, one can sample in such a way that the group of interest is likely to make up a high percentage of the sampled data. This group can then be investigated more easily.}
\question{}{In probabilistic sampling, every data point can randomly end up in the sample. In non-probabilistic sampling, some data points are chosen first and the sample is drawn only from those, so the remaining data points have a probability of zero of ending up in the sample.}
\question{}{\textbf{Systematic Random Sampling:} Advantage: Easy to implement; Disadvantage: Problems with periodicities in the data\\
\textbf{Cluster Random Sampling:} Advantage: Cheap method when it is geographically convenient; Disadvantage: Least representative of the population\\
\textbf{Random Sampling:} Should be used when there is only one column, so no categories are possible, and when there is no danger of missing a characteristic. Examples:\\ 1. List of heights of people.\\ 2. List of employee incomes without any other information given.\\
\textbf{Stratified Random Sampling:} Should be used when there are multiple columns, one of which puts the data into different categories. Examples:\\ 1. Income of people grouped by level of education\\ 2. Weight of people grouped by gender\\}
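Two of the schemes above can be sketched in formulas. The symbols are assumptions introduced only for illustration: $N$ is the number of data points, $n$ the desired sample size, $k$ the step width, $s$ a random start index, and $N_h$ the size of category (stratum) $h$. Systematic sampling then selects the indices
\[
s,\; s+k,\; s+2k,\; \dots,\; s+(n-1)k, \qquad k=\left\lfloor \frac{N}{n} \right\rfloor, \quad s \in \{1,\dots,k\}.
\]
If the data repeats with a period $p$ that divides $k$, every selected index falls on the same position within the period, which is the periodicity problem named above. For stratified sampling, one common choice (assumed here, not prescribed by the exercise) is proportional allocation, where category $h$ contributes
\[
n_h \approx n \cdot \frac{N_h}{N}, \qquad \sum_h N_h = N
\]
data points to the sample, so that each category is represented according to its share of the population.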
\end{exercise}
\begin{exercise}[p=2]{Data Mining and Visualization}
\question{}{
The subway map or bus map is an example of visualization in my everyday life. It conveys the most important information very well while ignoring details. Thus the visualization is very easy and quick to understand.
}
\question{}{
The Google advertisements are the most prominent example of data mining in my everyday life. Data is mined from my browser history in order to display fitting ads.
}
\end{exercise}
\begin{exercise}[p=4]{Visualization: Human vs. Computer}
\question{}{
It makes sense because visualizing huge data sets by hand can be very tedious. Automation solves this problem well. \\
Moreover, digital visualizations can be updated (e.g. to improve the visualization or to incorporate new data) at any time without much effort, whereas non-digital visualizations have to be redrawn from scratch.
}
\question{}{
The raw data extracted by the computer is hard to interpret. \\
The visualization makes it much easier to interpret the result.\\
Additionally, this can lead to new insights or the discovery of patterns, and can inspire a change in the previously applied algorithm.
}
\end{exercise}
\end{ukon-infie}
\end{document}