This repository contains code and resources for predicting stock prices using machine learning models. The project explores various models and techniques, with a focus on backtesting and feature engineering. We will explore different machine learning models, their limitations, and various ways of measuring the performance of a trained model.
NB: I know there is no algorithm that can predict the future (and it would not be as simple). This project is for education purposes only. It was inspired by an article (I will update this README.md for the credits).
- Data Source: The data used in this project is historical stock data for Apple Inc. (
AAPL), sourced from Yahoo Finance, but any company can be used. - Objective: To predict whether the stock price will go up the next day based on various features derived from historical data.
- Models: We compare the performance of multiple machine learning models, with and without backtesting, and with varying sets of features.
data/: Contains the stock data in JSON format.src/: Python scripts for data processing, model training, and evaluation.results/: Contains performance comparison plots and metrics.README.md: Project overview and instructions.
Clone the repository and install the required Python packages:
git clone https://github.com/Equone/MarketProject.git
cd MarketProject
pip install -r requirements.txtimport yfinance as yf
import os
import numpy as np
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
import matplotlib.pyplot as plt
from alphavantage_api import get_days_elapsed_since_dividends
from election_utils import get_election_dates, convert_to_month_tuples, get_days_elapsed_for_country
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifierIf you looked at requirements.txt, you should have every packages installed.
election_utils will be useful to web scrape data from Wikipedia, and alphavantage_api to have an access to data concerning companies.
# Load the stock data
symbol = "AAPL"
api_key = 'XXXXXXXXXXXX' # Key for the alphavantage API.
DATA_PATH = f"C:/Users/user_name/Market_Project/{symbol}_data.json"
if os.path.exists(DATA_PATH):
stock = yf.Ticker(symbol)
info = stock.info
country = info.get('country', 'Country not found')
with open(DATA_PATH) as f:
stock_hist = pd.read_json(DATA_PATH)
else:
stock = yf.Ticker(symbol)
stock_hist = stock.history(period="max")
stock_hist.to_json(DATA_PATH)
info = stock.info
country = info.get('country', 'Country not found')The variable symbol can be changed into the one you are interested in. In this code, I have chosen Apple, "AAPL".
We check if we already imported data from yfinance in order to avoid useless requests.
We will need multiple criterias in order to train our different models. Here are the 'basic' ones :
weekly_trend = data["Target"].shift(1).rolling(7).sum()
monthly_trend = data["Target"].shift(1).rolling(30).sum()
annual_trend = data["Target"].shift(1).rolling(365).sum()
data["weekly_trend"] = weekly_trend
data["monthly_trend"] = monthly_trend
data["annual_trend"] = annual_trend
weekly_mean = data["Actual_Close"].rolling(7).mean()
quarterly_mean = data["Actual_Close"].rolling(90).mean()
annual_mean = data["Actual_Close"].rolling(365).mean()
data["weekly_mean"] = weekly_mean / data["Actual_Close"]
data["quarterly_mean"] = quarterly_mean / data["Actual_Close"]
data["annual_mean"] = annual_mean / data["Actual_Close"]
data["annual_weekly_mean"] = data["annual_mean"] / data["weekly_mean"]
data["annual_quarterly_mean"] = data["annual_mean"] / data["quarterly_mean"]
data["open_close_ratio"] = data["Open"] / data["Actual_Close"]
data["high_close_ratio"] = data["High"] / data["Actual_Close"]
data["low_close_ratio"] = data["Low"] / data["Actual_Close"]They can be divided into multiple sub-sections :
weekly_trend = data["Target"].shift(1).rolling(7).sum()
monthly_trend = data["Target"].shift(1).rolling(30).sum()
annual_trend = data["Target"].shift(1).rolling(365).sum()
data["weekly_trend"] = weekly_trend
data["monthly_trend"] = monthly_trend
data["annual_trend"] = annual_trendweekly_mean = data["Actual_Close"].rolling(7).mean()
quarterly_mean = data["Actual_Close"].rolling(90).mean()
annual_mean = data["Actual_Close"].rolling(365).mean()
data["weekly_mean"] = weekly_mean / data["Actual_Close"]
data["quarterly_mean"] = quarterly_mean / data["Actual_Close"]
data["annual_mean"] = annual_mean / data["Actual_Close"]
data["annual_weekly_mean"] = data["annual_mean"] / data["weekly_mean"]
data["annual_quarterly_mean"] = data["annual_mean"] / data["quarterly_mean"]data["open_close_ratio"] = data["Open"] / data["Actual_Close"]
data["high_close_ratio"] = data["High"] / data["Actual_Close"]
data["low_close_ratio"] = data["Low"] / data["Actual_Close"]Nevertheless, many things may impact economy.
In order to improve my models, I chosed two supplementary things : dates where dividends were given and election dates.
It is where the election_utils.py and alphavantage_api.py play their role.
The goal of this file is to extract the date of the different elections from the wikipedia page : https://en.wikipedia.org/wiki/List_of_next_general_elections
This script contains functions that handle election data, such as scraping the next general election dates, processing these dates, and calculating the days elapsed since these elections. The primary objective of this module is to create features that could be used for predictive modeling in stock price prediction tasks.
NB: we calculate the days elapsed since the given election because in the dataframe data we use, days are counted since the introduction of the company into the stock market. Then, we assume elections are every 5 years. We can be more precise if we check, for each countries, precisely the period between elections. However, most of the time, 5 years is accurate.
-
get_election_dates(): Scrapes election data from the Wikipedia page listing the next general elections. The data includes upcoming presidential and legislative elections, as well as the last election dates. -
extract_month(date_str): Extracts the month from a date string, which is useful for understanding when an election took place or is scheduled to occur. This helper function splits a date string and returns the month part, which is later used to calculate time-related features. -
convert_to_month_tuples(election_data): Converts the election data into tuples that contain the country, the month of the next legislative election, and the month of the last presidential election. -
days_since_month(month_str, year_str): Calculates the number of days elapsed since the specified month and year. All of these functions are here for one reason : the next function, and how we manage time in our dataframe. -
get_days_elapsed_for_country(country_name, month_tuples): For a given country, this function calculates the days elapsed since the last presidential or legislative election (that are present in the tuple). Then, when we have this value, we can assume that the election period was at :data.iloc[-days_elapsed - period/2 : -days_elapsed + period/2]where the valueperiodcorrespond to the time where we consider elections having a relevant impact.
Here is how we proceed to integrate those functions :
n = len(data)
years_interval = 5 * 365 # 5 years in days
for days in days_elapsed:
if isinstance(days, int): # Check if 'days' is an integer
while True:
pos_minus = max(-days - 15, 0) # Ensure not to go before the start of the DataFrame
pos_plus = min(-days + 15, n - 1) # Ensure not to exceed the DataFrame bounds
# Set 1s in the 'elections_trend' column within the calculated positions
data.iloc[pos_minus:pos_plus, data.columns.get_loc("elections_trend")] = 1
# Move to the previous 5-year period
days += years_interval
# Stop if the calculated position is beyond the DataFrame's range
if days >= n:
breakElections are now considered in our models. The parameter that can be change is the 15, which is the interval around the election.
This script provides functionality to retrieve and process dividend data for a specific stock symbol using the Alpha Vantage API. The primary function is used to calculate the days elapsed since the last dividend distributions, which is another relevant criteria in share prices.
get_days_elapsed_since_dividends(symbol, api_key): Fetches the weekly adjusted time series data for a given stock symbol and calculates the days elapsed since each non-zero dividend distribution. The reason why we calculate days elapsed is the same as before.
Here is how we proceed to integrate this script :
time_elapsed_div = get_days_elapsed_since_dividends(symbol, api_key)
data["div_periode"] = 0 # 0 everywhere first
div_days = 15 # The interval around the dividend date
for nb_days_elapsed in time_elapsed_div:
if nb_days_elapsed > div_days:
start_pos = -nb_days_elapsed - div_days
end_pos = -nb_days_elapsed + div_days + 1
data.iloc[start_pos:end_pos, data.columns.get_loc('div_periode')] = 1Thus, if the company gives dividends, this criteria will be considered into our training.
Now, we have set up our criterias. We can begin the training of our different models.
This list defines the features that will be used as inputs for the models.
# List of predictors
predictors = ["Actual_Close", "Volume", "Open", "High", "Low", "weekly_mean",
"quarterly_mean", "annual_mean",
"annual_weekly_mean", "annual_quarterly_mean",
"open_close_ratio","high_close_ratio","low_close_ratio",
"weekly_trend","monthly_trend","annual_trend","div_trend",
"elections_trend"]We will see how it impacts the training : their nature, their number..
# Backtesting to improve our model
def back_test(data, model, predictors, start = 1000, step = 400):
# A model is trained every "step" rows
predictions = []
for i in range(1000, data.shape[0], step):
train = data.iloc[0:i].copy()
test = data.iloc[i:(i+step)].copy()
model.fit(train[predictors], train["Target"])
preds = model.predict_proba(test[predictors])[:,1] # Allow us to have values between 0 and 1 instead of 0/1
preds = pd.Series(preds, index=test.index)
preds[preds > .6] = 1 # We put the treshold here at 0.6, so it predicts it will go up with more confidence
preds[preds<=.6] = 0
combined = pd.concat({"Target": test["Target"], "Predictions": preds}, axis=1)
predictions.append(combined)
return pd.concat(predictions)Backtesting simulates how a predictive model or trading strategy would have performed on historical data, providing insight into its potential future performance.
In the context of machine learning, backtesting involves retraining a model multiple times on different subsets of data to evaluate how the model would have performed in a real-world scenario, where data arrives sequentially over time. This approach is especially useful for time series data, where the order of data points matters, and traditional cross-validation might not be appropriate.
# Initialize models
models = {
"Random Forest": RandomForestClassifier(n_estimators=100, min_samples_split=200, random_state=1),
"Logistic Regression": LogisticRegression(),
"SVM": SVC(probability=True),
"KNN": KNeighborsClassifier(),
"Gradient Boosting": GradientBoostingClassifier()
}This is the different models we will be using. Each comes with different pros and cons. We will talk about those in the Results section. We use a dictionnary to simplify the code.
This section trains the models on a static train/test split and evaluates their performance; results will be discussed in the next part.
The following code train all models,test them on the wanted period and print all the metrics :
# 1. Without Backtesting: Evaluate on static split
train = data.iloc[:-100]
test = data.iloc[-100:]
# Define your custom interval
a, b = 0, 100 # Example values, can be adjusted
print(f"\nWithout Backtesting: Performance on Interval [{a}, {b}]")
precision_scores = []
recall_scores = []
f1_scores = []
roc_auc_scores = []
model_names = []
for model_name, model in models.items():
# Train the model on the training data
model.fit(train[predictors], train["Target"])
# Predict on the test data
preds = model.predict(test[predictors])
preds_proba = model.predict_proba(test[predictors])[:, 1]
# Create a DataFrame for easier manipulation and plotting
results = pd.DataFrame({
"Target": test["Target"],
"Predictions": preds,
"Prediction_Probabilities": preds_proba
}, index=test.index)
# Select the custom interval [a, b] from the results
interval_predictions = results.iloc[a:b]
# Calculate performance metrics for the selected interval
precision_interval = precision_score(interval_predictions["Target"], interval_predictions["Predictions"])
recall_interval = recall_score(interval_predictions["Target"], interval_predictions["Predictions"])
f1_interval = f1_score(interval_predictions["Target"], interval_predictions["Predictions"])
roc_auc_interval = roc_auc_score(interval_predictions["Target"], interval_predictions["Prediction_Probabilities"])
print(f"\nPerformance metrics for {model_name} on the interval [{a}, {b}] (Without Backtesting):")
print(f"Precision: {precision_interval:.4f}")
print(f"Recall: {recall_interval:.4f}")
print(f"F1-Score: {f1_interval:.4f}")
print(f"ROC-AUC Score: {roc_auc_interval:.4f}")
# Count the number of trades (predictions)
trades = interval_predictions["Predictions"].value_counts()
print(f"Number of trades for {model_name} on interval [{a}, {b}]:\n{trades}")
# Plot target vs predictions for the interval
plt.figure(figsize=(10, 6))
plt.plot(interval_predictions.index, interval_predictions["Target"], label="Actual Target", color="blue", marker="o")
plt.plot(interval_predictions.index, interval_predictions["Predictions"], label="Predicted", color="red", marker="x")
plt.title(f"{model_name} - Target vs Predictions (Interval [{a}, {b}], Without Backtesting)")
plt.xlabel("Date")
plt.ylabel("Target")
plt.legend()
plt.show()And to see all the metrics with a better look :
# 1. Without Backtesting: Evaluate on static split
train = data.iloc[:-100]
test = data.iloc[-100:]
print("\nWithout Backtesting:")
precision_scores = []
recall_scores = []
f1_scores = []
roc_auc_scores = []
model_names = []
for model_name, model in models.items():
# Train the model on the training data
model.fit(train[predictors], train["Target"])
# Predict on the test data
preds = model.predict(test[predictors])
preds_proba = model.predict_proba(test[predictors])[:, 1]
# Create a DataFrame for easier manipulation and plotting
results = pd.DataFrame({
"Target": test["Target"],
"Predictions": preds,
"Prediction_Probabilities": preds_proba
}, index=test.index)
precision = precision_score(test["Target"], preds)
recall = recall_score(test["Target"], preds)
f1 = f1_score(test["Target"], preds)
roc_auc = roc_auc_score(test["Target"], preds_proba)
precision_scores.append(precision)
recall_scores.append(recall)
f1_scores.append(f1)
roc_auc_scores.append(roc_auc)
model_names.append(model_name)
print(f"\nPerformance metrics for {model_name}:")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"ROC-AUC Score: {roc_auc:.4f}")
# Compute ROC curve and ROC area
fpr, tpr, _ = roc_curve(test["Target"], preds_proba)
plt.plot(fpr, tpr, label=f'{model_name} (ROC-AUC = {roc_auc:.2f})')
# Plot diagonal line for reference (no skill classifier)
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
# Adding labels and title
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve - without Backtesting')
plt.legend(loc="lower right")
# Show the plot
plt.show()
# Plotting the metrics for static split (without backtesting)
plt.figure(figsize=(12, 8))
plt.subplot(2, 2, 1)
plt.barh(model_names, precision_scores, color='blue')
plt.title('Precision (Without Backtesting)')
plt.subplot(2, 2, 2)
plt.barh(model_names, recall_scores, color='green')
plt.title('Recall (Without Backtesting)')
plt.subplot(2, 2, 3)
plt.barh(model_names, f1_scores, color='red')
plt.title('F1-Score (Without Backtesting)')
plt.subplot(2, 2, 4)
plt.barh(model_names, roc_auc_scores, color='purple')
plt.title('ROC-AUC (Without Backtesting)')
plt.tight_layout()
plt.show()To add the Backtesting function, we just need to modify this :
predictions = back_test(data.iloc[365:], model, predictors)and the rest does not change.
In this part, we look at the different results we have from the company Apple (symbol: AAPL), the 25 august 2024.
If you need explanations about the metrics and notions that are used in this part, they are well detailled in this article : https://towardsdatascience.com/a-look-at-precision-recall-and-f1-score-36b5fd0dd3ec
Here are the results I obtained from my tests :
This table sums up the impact of the addition of BackTesting on our models.
| Model \ Metric | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|
| Gradient Boosting | Increase | Increase | Increase | Increase |
| KNN | Decrease | Decrease | Decrease | Decrease |
| SVM | N/A | N/A | N/A | Decrease |
| Logisitic Regression | Increase | Decrease | Decrease | Constant |
| Random Forest | Constant | Increase | Increase | Increase |
Looking at the impact of the introduction of Backtesting on metrics tells us how our models react to more dynamic data. As an example, the KNN model is taking a huge hit, while Random Forest takes a lot of profits from it.
It is relevant to look at the impact of BackTesting, as in some fields - like finances - data are very variable.
Additionnaly, we can look at this graph, which shows the ratio between True Positive Rate and False Positive Rate (without BackTesting here):
As the table already suggested, Random Forest and Gradient Boosting are strong at this task; while Logistic Regression or SVM perform weaker, which may show that they are not well-suited for this time-series prediction task.
Once the models are trained, we apply them on the last 100 days in order to see how good they are, and how many trades we would have made in this period.
__NB: While the code analyzes the last 100 days, you can adjust this timeframe. To focus on specific event dates, locate the appropriate index and exclude that period from model training. Furthermore, every graph can be created with the code, I only took some to write this README.
For every model, we plot the Target and the Prediction by the date, in order to know if we were right or not. Let's take some relevant examples, so we can see some limits of machine learning.
Logistic Regression, without BackTesting :
SVM, without BackTesting :
As we can see, it is quite polarized. For Logistic Regression, we would have made 100 trades for a global precision of .65 - which does not mean profit, as we do not consider the ratio open\close.
Obviously, this would not be a good strategy at all and therefore, this kind of model for this data and without BackTesting, is not viable nor reliable.
Gradient Boosting, without BackTesting :
KNN, without BackTesting :
Respectively, with Gradient Boosting and KNN we would have made 11 and 47 trades. However, metrics are quite different :
| Model \ Metric | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|
| Gradient Boosting | 0.6364 | 0.1077 | 0.1842 | 0.6055 |
| KNN | 0.6981 | 0.5692 | 0.6271 | 0.5622 |
This correspond to two different strategies : a "safe" one, and a more "aggressive" one, depending on how we want to approach our investments and how much risks we are willing to take. However, KNN metrics are still relatively high compared to its rival.
You can still play on different approachs, parameters to see how models react, etc. I just put some examples here as a sample, we can do much more in-depth analysis with this.
The reason why the models above were without BackTesting is : the addition of BackTesting + many many criterias may lead to overfitting.
Here is a clear example :
Random Forest, with BackTesting :
with really (really) high metrics :
Overfitting means the model learned too much the data - not only the underlying patterns in the training data but also the noise and random fluctuations - this can lead to several issues:
- Poor generalization : weak perfomances on new (unseen) data.
- High variance : this is linked to the point before; often we want things to be as stable as possible.
- Loss of robustness : undesired patterns can be reproduced on new data.
- Misleading Metrics : a lack of vigilance on this phenomenon can be enhanced by the fact that metrics are good in this case.
Nevertheless, solutions exist : More data, early stopping, simplification of the model (like removing some criterias),...
If you want to see different behaviours, feel free to play with criterias, BackTesting parameters, etc.
Of course, some criterias may be added or changed. Furthermore, we could add a feature that tells us how much we would have won if trusting a model. Also, removing some parts of the training data could be an idea to reduces thz variance.
Through this project, we saw the process of training different machine learning model on stock prices, the impact of implementing features like BackTesting and some of their limitations (overfitting).
Keys points :
- Importance of criterias, as it plays a crucial role in the performance of the models.
- Role of BackTesting, which is essential for evaluating a model's ability to generalize to unseen data.
- Model comparison
- Understanding Overfitting and its implications
- Using various techniques to get informations / data : Webscrapping, API, specific librairies
Final Thoughts: Predicting stock prices is a complex and uncertain task, as financial markets are influenced by a wide range of factors, many of which are impossible to quantify. While machine learning offers powerful tools for analyzing historical data and making predictions, it is important to approach such tasks with caution and recognize the limitations of these models.
This project is not intended for real-world financial applications; it is here to give me an idea of the process of data preparation, feature engineering, model training, and evaluation, providing a foundation for further exploration and experimentation in the field of financial modeling. And as we say in France, for legal reasons : "Les performances passées ne préjugent pas des performances futures"








