Merge pull request #7 from noobpk/dev

bump to 1.6
noobpk · Oct 24, 2023 · 4a93bd7 · 4a93bd7
2 parents 24f44fb + 7b870ea
commit 4a93bd7
Show file tree

Hide file tree

Showing 4 changed files with 158 additions and 10 deletions.
diff --git a/.gitignore b/.gitignore
@@ -160,5 +160,6 @@ cython_debug/
 #.idea/
 py-env
 gemini.keras
+gemini-*.h5
 docker-compose.dev.yml
 gemini_realtime_predict_req_resp.csv
diff --git a/DEEPLEARNING.md b/DEEPLEARNING.md
@@ -0,0 +1,142 @@
+# Web Vulnerability Detection with Deep Learning
+
+This is a detection method that using combine Convolutional Neural Network (CNN) and a family of Recurrent Neural Network (RNN) to analyze features and relationships in requests from users and predict whether they are vulnerability or not.
+
+## Model Architecture
+
+This is a compact architectural model with two channels. For channel A, I using three layer include Conv1D - MaxPooling - GlobalMaxPooling. And for channel B, I using two layer of the RNN family (RNN, LSTM, GRU). With extremely large data sets, the model can scale with multiple channels and multiple layers to be able to respond to the size of the dataset.
+
+## Vulnerabilities Detection
+
+- Cross-Site Scripting
+- SQL Injection
+- Path Traversal (LFI)
+- Command Injection
+- Remote File Inclusion (RFI)
+- Json & XML Injection
+- HTML5 Injection
+- Server Side Includes (SSI) Injection
+
+## Datasets
+
+The training dataset is split 70:30 for training and testing. With 70% of the district training, I use k-fold cross validation with k=5 to train the model.
+
+| Dataset | Sample | Access |
+|---|---|---|
+| CISC2010 | 61065 (SQLi, XSS, CSRF, ...) | [Public](https://www.kaggle.com/datasets/ispangler/csic-2010-web-application-attacks) |
+| HTTPPram | 31066 -> 10852(SQLi) 532(XSS) 89(CMDi) 290(LFI) | [Public](https://github.com/Morzeux/HttpParamsDataset) |
+| Shah's | 44605 -> 13686(XSS) 30919(SQLi) | [Public](https://www.kaggle.com/syedsaqlainhussain/datasets) |
+| Generate Dataset | 592479 -> 332131 (Normal) 260348 (Abnormal) | Private |
+
+<img width="1312" alt="image" src="https://github.com/noobpk/gemini-web-vulnerability-detection/assets/31820707/975bc53a-4f4a-4545-95d3-0a0da7baa847">
+
+## Data Decoder
+
+The decoder was built with multiple decode layers including base64 - URL - Unicode - utf8 - clean data - ....
+
+| Original | Decoded |
+|---|---|
+| ```<object data="data:text/html;base64,PHNjcmlwdD5hbGVydCgxKTwvc2NyaXB0Pg=="></object>``` | ```<objectdata="data:text/html;base64,<script>alert(1)</script>"></object>```|
+
+## Data Processing
+
+Using SentenceTransformers. A Python framework for state-of-the-art sentence, text and image embeddings.
+
+| Original | Encoder |
+|---|---|
+| ```/etc/mixmaster/remailer/pgponly.hlp``` | ```[-2.79157665e-02  7.86799937e-02 -1.95519626e-02 -4.09332477e-02 9.84075591e-02 -8.66753384e-02 -4.61700819e-02 -2.39454824e-02 ...]```|
+
+## Model Summary
+
+```
+Model: "model_3"
+__________________________________________________________________________________________________
+ Layer (type)                Output Shape                 Param #   Connected to
+==================================================================================================
+ input_4 (InputLayer)        [(None, 384)]                0         []
+
+ reshape_3 (Reshape)         (None, 384, 1)               0         ['input_4[0][0]']
+
+ conv1d_15 (Conv1D)          (None, 382, 32)              128       ['reshape_3[0][0]']
+
+ max_pooling1d_15 (MaxPooli  (None, 380, 32)              0         ['conv1d_15[0][0]']
+ ng1D)
+
+ conv1d_16 (Conv1D)          (None, 378, 64)              6208      ['max_pooling1d_15[0][0]']
+
+ max_pooling1d_16 (MaxPooli  (None, 376, 64)              0         ['conv1d_16[0][0]']
+ ng1D)
+
+ conv1d_17 (Conv1D)          (None, 374, 128)             24704     ['max_pooling1d_16[0][0]']
+
+ max_pooling1d_17 (MaxPooli  (None, 372, 128)             0         ['conv1d_17[0][0]']
+ ng1D)
+
+ conv1d_18 (Conv1D)          (None, 370, 256)             98560     ['max_pooling1d_17[0][0]']
+
+ gru_15 (GRU)                (None, 384, 32)              3360      ['reshape_3[0][0]']
+
+ max_pooling1d_18 (MaxPooli  (None, 368, 256)             0         ['conv1d_18[0][0]']
+ ng1D)
+
+ gru_16 (GRU)                (None, 384, 64)              18816     ['gru_15[0][0]']
+
+ conv1d_19 (Conv1D)          (None, 366, 512)             393728    ['max_pooling1d_18[0][0]']
+
+ gru_17 (GRU)                (None, 384, 128)             74496     ['gru_16[0][0]']
+
+ max_pooling1d_19 (MaxPooli  (None, 364, 512)             0         ['conv1d_19[0][0]']
+ ng1D)
+
+ gru_18 (GRU)                (None, 384, 256)             296448    ['gru_17[0][0]']
+
+ global_max_pooling1d_3 (Gl  (None, 512)                  0         ['max_pooling1d_19[0][0]']
+ obalMaxPooling1D)
+
+ gru_19 (GRU)                (None, 512)                  1182720   ['gru_18[0][0]']
+
+ dropout_9 (Dropout)         (None, 512)                  0         ['global_max_pooling1d_3[0][0]
+                                                                    ']
+
+ dropout_10 (Dropout)        (None, 512)                  0         ['gru_19[0][0]']
+
+ multiply_3 (Multiply)       (None, 512)                  0         ['dropout_9[0][0]',
+                                                                     'dropout_10[0][0]']
+
+ dropout_11 (Dropout)        (None, 512)                  0         ['multiply_3[0][0]']
+
+ dense_18 (Dense)            (None, 512)                  262656    ['dropout_11[0][0]']
+
+ dense_19 (Dense)            (None, 256)                  131328    ['dense_18[0][0]']
+
+ dense_20 (Dense)            (None, 128)                  32896     ['dense_19[0][0]']
+
+ dense_21 (Dense)            (None, 64)                   8256      ['dense_20[0][0]']
+
+ dense_22 (Dense)            (None, 32)                   2080      ['dense_21[0][0]']
+
+ dense_23 (Dense)            (None, 1)                    33        ['dense_22[0][0]']
+
+==================================================================================================
+Total params: 2536417 (9.68 MB)
+Trainable params: 2536417 (9.68 MB)
+Non-trainable params: 0 (0.00 Byte)
+__________________________________________________________________________________________________
+```
+
+## Evaluate
+
+```
+1852/1852 [==============================] - 86s 45ms/step - loss: 0.0604 - accuracy: 0.9761
+1852/1852 [==============================] - 80s 42ms/step
+
+Accuracy: 97.61%
+              precision    recall  f1-score   support
+
+           0       0.97      0.98      0.98     33261
+           1       0.98      0.97      0.97     25987
+
+    accuracy                           0.98     59248
+   macro avg       0.98      0.98      0.98     59248
+weighted avg       0.98      0.98      0.98     59248
+```
diff --git a/build-docker-image/Dockerfile b/build-docker-image/Dockerfile
@@ -94,7 +94,7 @@ EXPOSE 443
 
 COPY app.py .
 
-COPY gemini.keras .
+COPY gemini-23-10-23.h5 .
 
 # Run the Flask API using Waitress and Nginx
 CMD service nginx start && python3 app.py
diff --git a/build-docker-image/app.py b/build-docker-image/app.py
@@ -28,7 +28,7 @@
 
 CORS(app)
 
-gemini_model = load_model('gemini.keras')
+gemini_model = load_model('gemini-23-10-23.h5')
 
 def validate_ip(ip):
     try:
@@ -53,11 +53,13 @@ def server_info():
     if str(authorization_header) == str(AUTH_KEY):
         return jsonify({
             "model": "Gemini-Web-Vulnerability-Detection",
+            "sample": "592479",
             "max_input_length": "unlimit",
             "vector_size": "384",
-            "model_build_at": "2023-08-01",
+            "param": "2536417",
+            "model_build_at": "2023-10-23",
             "encoder": "sentence-transformers/all-MiniLM-L6-v2",
-            "docker_image_version": "1.5",
+            "docker_image_version": "1.6",
             "extension": "kafka",
             "author": "noobpk - lethanhphuc"
         }), 200
@@ -79,26 +81,29 @@ def predict():
                 encode_input = encoder.encode(input_string).reshape((1,384))
                 prediction = gemini_model.predict(encode_input)
                 accuracy = prediction * 100
-                accuracy_value = float(accuracy[0][0])
+                score_value = float(accuracy[0][0])
                 hash_encode_input = sha256(encode_input).hexdigest()
+                now = datetime.now()
                 if str(ENABLE_KAFKA_STREAMING) == 'True':
                     input_ip = request.json['ip']
                     validated_ip = validate_ip(input_ip)
-                    now = datetime.now()
                     key_prediction_data = b'prediction_data'
                     payload_prediction_data = {
                         'time': now.strftime('%Y-%m-%d %H:%M:%S'),
                         'ipaddress': validated_ip,
                         'payload': input_string,
-                        'score': accuracy_value,
+                        'score': score_value,
                         'hash': hash_encode_input
                         }
                     kafka_send_message(key_prediction_data, payload_prediction_data)
                 return jsonify({
                         "status": "Success",
-                        "prediction": input_string,
-                        "accuracy": accuracy_value,
-                        "hash": hash_encode_input
+                        "threat_metrix": {
+                            "time": now.strftime('%Y-%m-%d %H:%M:%S'),
+                            "prediction": input_string,
+                            "score": score_value,
+                            "hash": hash_encode_input
+                        }
                     }), 200
             else:
                 return jsonify({