| 1 | +[](https://studio.iterative.ai/team/Iterative/views/example-get-started-zde16i6c4g) |
| 2 | + |
| 3 | +# DVC Get Started |
| 4 | + |
| 5 | +This is an auto-generated repository for use in [DVC](https://dvc.org) |
| 6 | +[Get Started](https://dvc.org/doc/get-started). It is a step-by-step quick |
| 7 | +introduction to basic DVC concepts. |
| 8 | + |
| 9 | + |
| 10 | + |
| 11 | +The project is a natural language processing (NLP) binary classification |
| 12 | +problem: predicting tags for a given StackOverflow question. For example, we |
| 13 | +want a classifier that predicts whether a post is about the R language and, if |
| 14 | +so, tags it `R`. |
| 15 | + |
| 16 | +🐛 Please report any issues found in this project here: |
| 17 | +[example-repos-dev](https://github.com/iterative/example-repos-dev). |
| 18 | + |
| 19 | +## Installation |
| 20 | + |
| 21 | +Python 3.9+ is required to run code from this repo. |
| 22 | + |
| 23 | +```console |
| 24 | +$ git clone https://github.com/iterative/example-get-started |
| 25 | +$ cd example-get-started |
| 26 | +``` |
| 27 | + |
| 28 | +Now let's install the requirements. But before we do that, we **strongly** |
| 29 | +recommend creating a virtual environment with a tool such as |
| 30 | +[virtualenv](https://virtualenv.pypa.io/en/stable/): |
| 31 | + |
| 32 | +```console |
| 33 | +$ virtualenv -p python3 .venv |
| 34 | +$ source .venv/bin/activate |
| 35 | +$ pip install -r src/requirements.txt |
| 36 | +``` |
| 37 | + |
| 38 | +> These instructions assume that DVC is already installed, as it is frequently |
| 39 | +> used as a global tool like Git. If DVC is not installed, see the |
| 40 | +> [DVC installation guide](https://dvc.org/doc/install). |
| 41 | + |
| 42 | +This DVC project comes with a preconfigured DVC |
| 43 | +[remote storage](https://dvc.org/doc/commands-reference/remote) that holds the |
| 44 | +raw input data as well as the intermediate and final results the pipeline |
| 45 | +produces. This is a read-only HTTP remote. |
| 46 | + |
| 47 | +```console |
| 48 | +$ dvc remote list |
| 49 | +storage https://remote.dvc.org/get-started |
| 50 | +``` |
| 51 | + |
| 52 | +You can run [`dvc pull`](https://man.dvc.org/pull) to download the data: |
| 53 | + |
| 54 | +```console |
| 55 | +$ dvc pull |
| 56 | +``` |
| 57 | + |
| 58 | +## Running in your environment |
| 59 | + |
| 60 | +Run [`dvc exp run`](https://man.dvc.org/exp/run) to reproduce the |
| 61 | +[pipeline](https://dvc.org/doc/user-guide/pipelines) and create a new |
| 62 | +[experiment](https://dvc.org/doc/user-guide/experiment-management). |
| 63 | + |
| 64 | +```console |
| 65 | +$ dvc exp run |
| 66 | +Ran experiment(s): rapid-cane |
| 67 | +Experiment results have been applied to your workspace. |
| 68 | +``` |
| 69 | + |
| 70 | +If you'd like to test commands like [`dvc push`](https://man.dvc.org/push) |
| 71 | +that require write access to the remote storage, the easiest way is to set up |
| 72 | +a "local remote" on your file system: |
| 73 | + |
| 74 | +> This kind of remote is located in the local file system, but is external to |
| 75 | +> the DVC project. |
| 76 | + |
| 77 | +```console |
| 78 | +$ mkdir -p /tmp/dvc-storage |
| 79 | +$ dvc remote add local /tmp/dvc-storage |
| 80 | +``` |
| 81 | + |
| 82 | +You should now be able to run: |
| 83 | + |
| 84 | +```console |
| 85 | +$ dvc push -r local |
| 86 | +``` |
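
Note that `dvc remote add` writes to `.dvc/config`, which is tracked by Git. If the remote is only meant for local testing, one possible variation is to add it with `--local`, which stores the setting in the Git-ignored `.dvc/config.local` instead. The sketch below assumes that is what you want. `add_scratch_remote` is a hypothetical helper name, and its default path is simply the one used above.

```shell
# Hypothetical helper: create a throwaway "local remote" whose settings stay
# out of Git (they are stored in .dvc/config.local because of --local).
add_scratch_remote() {
    # Storage directory; defaults to the path used earlier in this README.
    dir="${1:-/tmp/dvc-storage}"
    mkdir -p "$dir" || return 1
    # Register the remote at the local config level, then list that level.
    dvc remote add --local local "$dir" && dvc remote list --local
}

# Example call, followed by the push shown above:
#   add_scratch_remote && dvc push -r local
```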
| 87 | + |
| 88 | +## Existing stages |
| 89 | + |
| 90 | +The Git tags in this project mark the sequence of steps taken in the DVC |
| 91 | +[get started](https://dvc.org/doc/get-started) guide. Feel free to check out |
| 92 | +any of them and try DVC commands against a ready-made playground at that |
| 93 | +step. |
| 94 | + |
| 95 | +- `0-git-init`: Empty Git repository initialized. |
| 96 | +- `1-dvc-init`: DVC has been initialized. `.dvc/` with the cache directory |
| 97 | + created. |
| 98 | +- `2-track-data`: Raw data file `data.xml` downloaded and tracked with DVC using |
| 99 | + [`dvc add`](https://man.dvc.org/add). First `.dvc` file created. |
| 100 | +- `3-config-remote`: Remote HTTP storage initialized. It's a shared, read-only |
| 101 | + storage that contains all data artifacts produced during the next steps. |
| 102 | +- `4-import-data`: Use `dvc import` to get the same `data.xml` from the DVC data |
| 103 | + registry. |
| 104 | +- `5-source-code`: Source code downloaded and put into Git. |
| 105 | +- `6-prepare-stage`: Create `dvc.yaml` and the first pipeline stage with |
| 106 | + [`dvc run`](https://man.dvc.org/run). It transforms XML data into TSV. |
| 107 | +- `7-ml-pipeline`: Feature extraction and train stages created. The former takes |
| 108 | + TSV data and produces two `.pkl` files that contain serialized feature |
| 109 | + matrices. The latter runs a random forest classifier and creates `model.pkl`. |
| 110 | +- `8-evaluation`: Evaluation stage. Runs the model on a test dataset to compute |
| 111 | + its AUC. The result is dumped into a DVC metrics file so that we can compare |
| 112 | + it with other experiments later. |
| 113 | +- `9-bigrams-model`: Bigrams experiment. The code has been modified to extract |
| 114 | + more features. We run [`dvc repro`](https://man.dvc.org/repro) for the first |
| 115 | + time to illustrate how DVC can reuse cached files and detect changes along the |
| 116 | + computational graph, regenerating the model with the updated data. |
| 117 | +- `10-bigrams-experiment`: Reproduce the evaluation stage with the |
| 118 | + bigrams-based model. |
| 119 | +- `11-random-forest-experiments`: Reproduce experiments to tune the random |
| 120 | + forest classifier parameters and select the best experiment. |
| 121 | + |
| 122 | +There are three additional tags: |
| 123 | + |
| 124 | +- `baseline-experiment`: First end-to-end result that we have a performance |
| 125 | + metric for. |
| 126 | +- `bigrams-experiment`: Second experiment (model trained using bigrams |
| 127 | + features). |
| 128 | +- `random-forest-experiments`: Best of additional experiments tuning random |
| 129 | + forest parameters. |
| 130 | + |
| 131 | +These tags can be used to illustrate the `-a` or `-T` options across different |
| 132 | +[DVC commands](https://man.dvc.org/). |
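
As a sketch of how the tags can serve as a playground: checking out a tag with Git moves the code and DVC metafiles, and `dvc checkout` then syncs the DVC-tracked data in the workspace to match. The helper names `goto_stage` and `compare_tags` below are hypothetical. The sketch assumes the data is already in the cache (for example after `dvc pull -T`) and that the tag is `1-dvc-init` or later, since `0-git-init` contains no DVC project.

```shell
# Hypothetical helper: switch the workspace to one of the tagged steps.
goto_stage() {
    tag="$1"
    [ -n "$tag" ] || { echo "usage: goto_stage <tag>" >&2; return 2; }
    # Git restores the code and DVC metafiles; dvc checkout restores the data.
    git checkout "$tag" && dvc checkout
}

# Hypothetical helper: show metrics for every tag that has them (-T option).
compare_tags() {
    dvc metrics show -T
}

# Example calls:
#   goto_stage 8-evaluation
#   compare_tags
```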
| 133 | + |
| 134 | +## Project structure |
| 135 | + |
| 136 | +The data files, DVC files, and results change as stages are created one by one. |
| 137 | +After cloning and using [`dvc pull`](https://man.dvc.org/pull) to download |
| 138 | +data, models, and plots tracked by DVC, the workspace should look like this: |
| 139 | + |
| 140 | +```console |
| 141 | +$ tree |
| 142 | +. |
| 143 | +├── README.md |
| 144 | +├── data                  # <-- Directory with raw and intermediate data |
| 145 | +│   ├── data.xml          # <-- Initial XML StackOverflow dataset (raw data) |
| 146 | +│   ├── data.xml.dvc      # <-- .dvc file - a placeholder/pointer to raw data |
| 147 | +│   ├── features          # <-- Extracted feature matrices |
| 148 | +│   │   ├── test.pkl |
| 149 | +│   │   └── train.pkl |
| 150 | +│   └── prepared          # <-- Processed dataset (split and TSV formatted) |
| 151 | +│       ├── test.tsv |
| 152 | +│       └── train.tsv |
| 153 | +├── dvc.lock |
| 154 | +├── dvc.yaml              # <-- DVC pipeline file |
| 155 | +├── eval |
| 156 | +│   ├── metrics.json      # <-- Binary classifier final metrics (e.g. AUC) |
| 157 | +│   └── plots |
| 158 | +│       ├── images |
| 159 | +│       │   └── importance.png  # <-- Feature importance plot |
| 160 | +│       └── sklearn       # <-- Data points for ROC, confusion matrix |
| 161 | +│           ├── cm |
| 162 | +│           │   ├── test.json |
| 163 | +│           │   └── train.json |
| 164 | +│           ├── prc |
| 165 | +│           │   ├── test.json |
| 166 | +│           │   └── train.json |
| 167 | +│           └── roc |
| 168 | +│               ├── test.json |
| 169 | +│               └── train.json |
| 170 | +├── model.pkl             # <-- Trained model file |
| 171 | +├── params.yaml           # <-- Parameters file |
| 172 | +└── src                   # <-- Source code to run the pipeline stages |
| 173 | +    ├── evaluate.py |
| 174 | +    ├── featurization.py |
| 175 | +    ├── prepare.py |
| 176 | +    ├── requirements.txt  # <-- Python dependencies needed in the project |
| 177 | +    └── train.py |
| 178 | +``` |
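
A quick way to confirm that the workspace matches the listing is to check for a few of the expected files. This is only a sketch: `check_workspace` is a hypothetical helper, and its file list is a sample taken from the tree above, not an exhaustive manifest.

```shell
# Hypothetical helper: report which of a few expected files are missing.
# Run it from the project root; it returns non-zero if anything is absent.
check_workspace() {
    missing=0
    for f in data/data.xml data/prepared/train.tsv data/features/train.pkl \
             model.pkl eval/metrics.json; do
        if [ ! -e "$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Example call:
#   check_workspace || dvc pull
```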