Commit d580b5f

Initialize Git repository
0 parents  commit d580b5f

5 files changed, +252 -0 lines changed

.devcontainer.json

+9
@@ -0,0 +1,9 @@
+{
+  "name": "example-get-started",
+  "image": "mcr.microsoft.com/devcontainers/python:3.10",
+  "extensions": ["Iterative.dvc", "ms-python.python", "redhat.vscode-yaml"],
+  "features": {
+    "ghcr.io/iterative/features/dvc:1": {}
+  },
+  "postCreateCommand": "pip3 install --user -r src/requirements.txt"
+}

.gitattributes

+2
@@ -0,0 +1,2 @@
+*.dvc linguist-language=YAML
+dvc.lock linguist-language=YAML

.gitignore

+1
@@ -0,0 +1 @@
+.venv/

.gitlab-ci.yml

+62
@@ -0,0 +1,62 @@
+report:
+  rules:
+    - if: $CI_PIPELINE_SOURCE == 'merge_request_event'
+    - if: $CI_COMMIT_BRANCH == 'main'
+  image: dvcorg/cml:0-dvc3-base1
+  before_script:
+    - cml ci && cml --version
+    - npm install -g json5
+  script: |
+    if [ $CI_COMMIT_REF_NAME = main ]; then
+      PREVIOUS_REF=HEAD~1
+      COMMIT_HASH1=$CI_COMMIT_BEFORE_SHA
+      COMMIT_HASH2=$CI_COMMIT_SHA
+    else
+      PREVIOUS_REF=main
+      git fetch --depth=1 origin main:main
+      COMMIT_HASH1=$CI_MERGE_REQUEST_DIFF_BASE_SHA
+      COMMIT_HASH2=$CI_COMMIT_SHA
+    fi
+
+    dvc pull eval
+    dvc plots diff $PREVIOUS_REF workspace \
+      --show-vega --targets ROC | json5 > vega.json
+    vl2svg vega.json roc.svg
+
+    dvc plots diff $PREVIOUS_REF workspace \
+      --show-vega --targets Precision-Recall | json5 > vega.json
+    vl2svg vega.json prc.svg
+
+    dvc plots diff $PREVIOUS_REF workspace \
+      --show-vega --targets Confusion-Matrix | json5 > vega.json
+    vl2svg vega.json confusion.svg
+
+    cp eval/plots/images/importance.png importance_workspace.png
+
+    git checkout $PREVIOUS_REF -- dvc.lock
+    cp eval/plots/images/importance.png importance_previous.png
+
+    dvc_report=$(dvc exp diff $PREVIOUS_REF --md)
+
+    cat <<EOF > report.md
+    # CML Report
+    [![DVC](https://img.shields.io/badge/-Open_in_Studio-grey?style=flat-square&logo=dvc)](https://studio.iterative.ai/team/Iterative/views/example-get-started-2gpv7kdqx2?panels=plots%2C%3Bcompare%2C&commits=${COMMIT_HASH2}%3B${COMMIT_HASH1}&activeCommits=${COMMIT_HASH1}%3Aprimary%3B${COMMIT_HASH2}%3Apurple)
+    ## Plots
+    ![ROC](./roc.svg)
+    ![Precision-Recall](./prc.svg)
+    ![Confusion Matrix](./confusion.svg)
+    #### Feature Importance: ${PREVIOUS_REF}
+    ![Feature Importance: ${PREVIOUS_REF}](./importance_previous.png)
+    #### Feature Importance: workspace
+    ![Feature Importance: workspace](./importance_workspace.png)
+
+    ## Metrics and Params
+    ### ${PREVIOUS_REF} → workspace
+    ${dvc_report}
+    EOF
+
+    if [ $CI_COMMIT_REF_NAME = main ]; then
+      cml comment create --target=commit report.md
+    else
+      cml comment update --target=pr report.md
+    fi
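
Note on the `report` job above: the first `if`/`else` in its `script` block decides which Git ref the plots and metrics are compared against, and which two commit hashes go into the Studio link. The following is a minimal Python sketch of that selection only, assuming the GitLab CI variables are passed in as a plain mapping. The names `CompareRefs` and `select_refs` are hypothetical and do not appear in the commit, and the sketch leaves out the `git fetch` that the merge-request branch also runs.

```python
from typing import Mapping, NamedTuple


class CompareRefs(NamedTuple):
    """Hypothetical container for the three values the job script sets."""

    previous_ref: str  # ref passed to `dvc plots diff` and `dvc exp diff`
    commit_hash1: str  # COMMIT_HASH1 in the script
    commit_hash2: str  # COMMIT_HASH2 in the script


def select_refs(env: Mapping[str, str]) -> CompareRefs:
    """Mirror the if/else at the top of the job's script block."""
    if env["CI_COMMIT_REF_NAME"] == "main":
        # Pipeline on main: compare against the parent commit.
        return CompareRefs(
            "HEAD~1", env["CI_COMMIT_BEFORE_SHA"], env["CI_COMMIT_SHA"]
        )
    # Any other ref (merge request pipelines): compare against main.
    return CompareRefs(
        "main", env["CI_MERGE_REQUEST_DIFF_BASE_SHA"], env["CI_COMMIT_SHA"]
    )
```

Calling `select_refs(os.environ)` inside a pipeline would yield the same three values that the shell script stores in `PREVIOUS_REF`, `COMMIT_HASH1`, and `COMMIT_HASH2`.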

README.md

+178
@@ -0,0 +1,178 @@
+[![DVC](https://img.shields.io/badge/-Open_in_Studio-grey.svg?style=flat-square&logo=dvc)](https://studio.iterative.ai/team/Iterative/views/example-get-started-zde16i6c4g)
+
+# DVC Get Started
+
+This is an auto-generated repository for use in [DVC](https://dvc.org)
+[Get Started](https://dvc.org/doc/get-started). It is a step-by-step quick
+introduction into basic DVC concepts.
+
+![](https://static.iterative.ai/img/example-get-started/readme-head.png)
+
+The project is a natural language processing (NLP) binary classifier problem of
+predicting tags for a given StackOverflow question. For example, we want one
+classifier which can predict a post that is about the R language by tagging it
+`R`.
+
+🐛 Please report any issues found in this project here -
+[example-repos-dev](https://github.com/iterative/example-repos-dev).
+
+## Installation
+
+Python 3.9+ is required to run code from this repo.
+
+```console
+$ git clone https://github.com/iterative/example-get-started
+$ cd example-get-started
+```
+
+Now let's install the requirements. But before we do that, we **strongly**
+recommend creating a virtual environment with a tool such as
+[virtualenv](https://virtualenv.pypa.io/en/stable/):
+
+```console
+$ virtualenv -p python3 .venv
+$ source .venv/bin/activate
+$ pip install -r src/requirements.txt
+```
+
+> This instruction assumes that DVC is already installed, as it is frequently
+> used as a global tool like Git. If DVC is not installed, see the
+> [DVC installation guide](https://dvc.org/doc/install) on how to install DVC.
+
+This DVC project comes with a preconfigured DVC
+[remote storage](https://dvc.org/doc/commands-reference/remote) that holds raw
+data (input), intermediate, and final results that are produced. This is a
+read-only HTTP remote.
+
+```console
+$ dvc remote list
+storage https://remote.dvc.org/get-started
+```
+
+You can run [`dvc pull`](https://man.dvc.org/pull) to download the data:
+
+```console
+$ dvc pull
+```
+
+## Running in your environment
+
+Run [`dvc exp run`](https://man.dvc.org/exp/run) to reproduce the
+[pipeline](https://dvc.org/doc/user-guide/pipelines) and create a new
+[experiment](https://dvc.org/doc/user-guide/experiment-management).
+
+```console
+$ dvc exp run
+Ran experiment(s): rapid-cane
+Experiment results have been applied to your workspace.
+```
+
+If you'd like to test commands like [`dvc push`](https://man.dvc.org/push),
+that require write access to the remote storage, the easiest way would be to set
+up a "local remote" on your file system:
+
+> This kind of remote is located in the local file system, but is external to
+> the DVC project.
+
+```console
+$ mkdir -p /tmp/dvc-storage
+$ dvc remote add local /tmp/dvc-storage
+```
+
+You should now be able to run:
+
+```console
+$ dvc push -r local
+```
+
+## Existing stages
+
+This project with the help of the Git tags reflects the sequence of actions that
+are run in the DVC [get started](https://dvc.org/doc/get-started) guide. Feel
+free to checkout one of them and play with the DVC commands having the
+playground ready.
+
+- `0-git-init`: Empty Git repository initialized.
+- `1-dvc-init`: DVC has been initialized. `.dvc/` with the cache directory
+  created.
+- `2-track-data`: Raw data file `data.xml` downloaded and tracked with DVC using
+  [`dvc add`](https://man.dvc.org/add). First `.dvc` file created.
+- `3-config-remote`: Remote HTTP storage initialized. It's a shared read only
+  storage that contains all data artifacts produced during next steps.
+- `4-import-data`: Use `dvc import` to get the same `data.xml` from the DVC data
+  registry.
+- `5-source-code`: Source code downloaded and put into Git.
+- `6-prepare-stage`: Create `dvc.yaml` and the first pipeline stage with
+  [`dvc run`](https://man.dvc.org/run). It transforms XML data into TSV.
+- `7-ml-pipeline`: Feature extraction and train stages created. It takes data in
+  TSV format and produces two `.pkl` files that contain serialized feature
+  matrices. Train runs random forest classifier and creates the `model.pkl` file.
+- `8-evaluation`: Evaluation stage. Runs the model on a test dataset to produce
+  its performance AUC value. The result is dumped into a DVC metric file so that
+  we can compare it with other experiments later.
+- `9-bigrams-model`: Bigrams experiment, code has been modified to extract more
+  features. We run [`dvc repro`](https://man.dvc.org/repro) for the first time
+  to illustrate how DVC can reuse cached files and detect changes along the
+  computational graph, regenerating the model with the updated data.
+- `10-bigrams-experiment`: Reproduce the evaluation stage with the bigrams based
+  model.
+- `11-random-forest-experiments`: Reproduce experiments to tune the random
+  forest classifier parameters and select the best experiment.
+
+There are three additional tags:
+
+- `baseline-experiment`: First end-to-end result that we have performance metric
+  for.
+- `bigrams-experiment`: Second experiment (model trained using bigrams
+  features).
+- `random-forest-experiments`: Best of additional experiments tuning random
+  forest parameters.
+
+These tags can be used to illustrate `-a` or `-T` options across different
+[DVC commands](https://man.dvc.org/).
+
+## Project structure
+
+The data files, DVC files, and results change as stages are created one by one.
+After cloning and using [`dvc pull`](https://man.dvc.org/pull) to download
+data, models, and plots tracked by DVC, the workspace should look like this:
+
+```console
+$ tree
+.
+├── README.md
+├── data                  # <-- Directory with raw and intermediate data
+│   ├── data.xml          # <-- Initial XML StackOverflow dataset (raw data)
+│   ├── data.xml.dvc      # <-- .dvc file - a placeholder/pointer to raw data
+│   ├── features          # <-- Extracted feature matrices
+│   │   ├── test.pkl
+│   │   └── train.pkl
+│   └── prepared          # <-- Processed dataset (split and TSV formatted)
+│       ├── test.tsv
+│       └── train.tsv
+├── dvc.lock
+├── dvc.yaml              # <-- DVC pipeline file
+├── eval
+│   ├── metrics.json      # <-- Binary classifier final metrics (e.g. AUC)
+│   └── plots
+│       ├── images
+│       │   └── importance.png  # <-- Feature importance plot
+│       └── sklearn       # <-- Data points for ROC, confusion matrix
+│           ├── cm
+│           │   ├── test.json
+│           │   └── train.json
+│           ├── prc
+│           │   ├── test.json
+│           │   └── train.json
+│           └── roc
+│               ├── test.json
+│               └── train.json
+├── model.pkl             # <-- Trained model file
+├── params.yaml           # <-- Parameters file
+└── src                   # <-- Source code to run the pipeline stages
+    ├── evaluate.py
+    ├── featurization.py
+    ├── prepare.py
+    ├── requirements.txt  # <-- Python dependencies needed in the project
+    └── train.py
+```
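
Note on the project tree above: the README describes `data.xml.dvc` as a placeholder/pointer to the raw data. The sketch below illustrates that idea in Python, as a small record holding a content hash, a size, and a file name that can be committed to Git in place of the large file. It is a simplified illustration under stated assumptions, not DVC's implementation. The helpers `file_md5` and `make_pointer` are hypothetical, the record is a plain dict rather than the YAML that DVC writes, and the fields of a real `.dvc` file may differ.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_pointer(path: str) -> dict:
    """Build a hypothetical pointer record for one data file.

    The record stands in for the file in version control; the data
    itself would live in a cache or remote keyed by the hash.
    """
    target = Path(path)
    return {
        "md5": file_md5(target),
        "size": target.stat().st_size,
        "path": target.name,
    }
```

Two files with identical bytes produce the same `md5` value, which is what lets a content-addressed store keep a single copy and detect when tracked data has changed.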
