Commit 5d3ccf5

Create ML pipeline stages
1 parent 6c5ba05 commit 5d3ccf5

4 files changed: +75 -0 lines changed

.gitignore (+1)

@@ -1 +1,2 @@
 .venv/
+/model.pkl

data/.gitignore (+1)

@@ -1,2 +1,3 @@
 /data.xml
 /prepared
+/features

dvc.lock (+44)

@@ -21,3 +21,47 @@ stages:
       md5: 153aad06d376b6595932470e459ef42a.dir
       size: 8437363
       nfiles: 2
+  featurize:
+    cmd: python src/featurization.py data/prepared data/features
+    deps:
+    - path: data/prepared
+      hash: md5
+      md5: 153aad06d376b6595932470e459ef42a.dir
+      size: 8437363
+      nfiles: 2
+    - path: src/featurization.py
+      hash: md5
+      md5: e22789fc9581cad11ef7a6fa3aa3f17b
+      size: 4158
+    params:
+      params.yaml:
+        featurize.max_features: 100
+        featurize.ngrams: 1
+    outs:
+    - path: data/features
+      hash: md5
+      md5: f8f5cbc3188008a7542d02d63054d9d2.dir
+      size: 1556290
+      nfiles: 2
+  train:
+    cmd: python src/train.py data/features model.pkl
+    deps:
+    - path: data/features
+      hash: md5
+      md5: f8f5cbc3188008a7542d02d63054d9d2.dir
+      size: 1556290
+      nfiles: 2
+    - path: src/train.py
+      hash: md5
+      md5: 324001573ed724e5ae092226fcf9ca30
+      size: 1666
+    params:
+      params.yaml:
+        train.min_split: 0.01
+        train.n_est: 50
+        train.seed: 20170428
+    outs:
+    - path: model.pkl
+      hash: md5
+      md5: cfa72ff6e2575c44f78f423cada5b783
+      size: 1855075
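
The lock entries above pair each file with an `md5` and a `size`. As a rough illustration of how such a pair could be checked by hand, here is a minimal Python sketch. It assumes that `hash: md5` denotes a plain MD5 over the file's bytes; entries ending in `.dir` describe whole directories and are not covered. The helper names `file_md5` and `matches_lock` are hypothetical and are not part of DVC.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_lock(path, recorded_md5, recorded_size):
    """Compare a single file against an md5/size pair taken from a lock entry.

    Assumption: the recorded md5 is a plain MD5 of the file contents.
    """
    p = Path(path)
    return p.stat().st_size == recorded_size and file_md5(p) == recorded_md5
```

For example, `matches_lock("src/train.py", "324001573ed724e5ae092226fcf9ca30", 1666)` would compare a local copy of the training script with the values recorded for the `train` stage, provided the assumption about the hash holds.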

dvc.yaml (+29)

@@ -3,6 +3,14 @@ artifacts:
     path: data/data.xml
     type: dataset
     desc: Initial XML StackOverflow dataset (raw data)
+  text-classification:
+    path: model.pkl
+    desc: Detect whether the given stackoverflow question should have R language tag
+    type: model
+    labels:
+    - nlp
+    - classification
+    - stackoverflow
 stages:
   prepare:
     cmd: python src/prepare.py data/data.xml
@@ -14,3 +22,24 @@ stages:
     - prepare.split
     outs:
     - data/prepared
+  featurize:
+    cmd: python src/featurization.py data/prepared data/features
+    deps:
+    - data/prepared
+    - src/featurization.py
+    params:
+    - featurize.max_features
+    - featurize.ngrams
+    outs:
+    - data/features
+  train:
+    cmd: python src/train.py data/features model.pkl
+    deps:
+    - data/features
+    - src/train.py
+    params:
+    - train.min_split
+    - train.n_est
+    - train.seed
+    outs:
+    - model.pkl
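
The new stages chain together through their `deps` and `outs`: `featurize` consumes `data/prepared`, and `train` consumes `data/features`. The Python sketch below illustrates how an execution order could be derived from those declarations with a topological sort. It is only an illustration, not DVC's implementation. The `STAGES` dict is hand-copied from the diff, with script dependencies omitted, and `prepare` lists no deps because they are not visible in this diff. The function name `stage_order` is hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Data deps/outs copied by hand from the dvc.yaml diff (script deps omitted).
# prepare's deps are not shown in the diff, so only its output is listed.
STAGES = {
    "prepare": {"deps": [], "outs": ["data/prepared"]},
    "featurize": {"deps": ["data/prepared"], "outs": ["data/features"]},
    "train": {"deps": ["data/features"], "outs": ["model.pkl"]},
}


def stage_order(stages):
    """Return stage names so that every producer precedes its consumers."""
    # Map each output path to the stage that produces it.
    producer = {
        out: name for name, spec in stages.items() for out in spec["outs"]
    }
    # A stage's predecessors are the stages producing any of its deps.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(stage_order(STAGES))  # ['prepare', 'featurize', 'train']
```

Because the three stages form a single chain, the order is unambiguous. With branching pipelines, several valid orders may exist and the sorter returns one of them.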
