Functions#
Submodules#
functions.process_data module#
- functions.process_data.combine_data(output_file_name='combined_data', input_file_name_v1='bitcoin_price.csv', input_file_name_v2='headlines_sentiment.csv')[source]#
Combines two datasets and writes the combined dataset to a CSV file.
- Parameters
output_file_name (str, optional) – Name of the output file, by default “combined_data”.
input_file_name_v1 (str, optional) – Name of the first input file, by default “bitcoin_price.csv”.
input_file_name_v2 (str, optional) – Name of the second input file, by default “headlines_sentiment.csv”.
- Return type
None
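Example
A minimal usage sketch (assumes the two default input CSVs exist in the working directory):
from functions.process_data import combine_data

# Merge the price data with the headline sentiment scores; the combined
# dataset is written out under the given output file name.
combine_data(output_file_name="combined_data",
             input_file_name_v1="bitcoin_price.csv",
             input_file_name_v2="headlines_sentiment.csv")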
- functions.process_data.double_quotation_remover(input_file_name_v1, delimiter=',')[source]#
Removes double quotation marks from specified columns in a CSV file.
- Parameters
input_file_name_v1 (str) – Path to the input CSV file.
delimiter (str, optional) – Delimiter used in the CSV file, by default “,”.
- Return type
None
- functions.process_data.modify_date_format(input_file_name_v1, date_column_name, date_format='%b %d, %Y', delimiter=',')[source]#
Modifies the date format of a specific column in a CSV file.
- Parameters
input_file_name_v1 (str) – Path to the input CSV file.
date_column_name (str) – Name of the column containing dates to be modified.
date_format (str, optional) – Format of the dates in the input file, by default “%b %d, %Y”.
delimiter (str, optional) – Delimiter used in the CSV file, by default “,”.
- Return type
None
Examples
Format 1: “%Y-%m-%d” (e.g., “2023-05-14”)
Format 2: “%m-%d-%Y” (e.g., “05-14-2023”)
Format 3: “%d-%m-%Y” (e.g., “14-05-2023”)
Format 4: “%Y/%m/%d” (e.g., “2023/05/14”)
Format 5: “%m/%d/%Y” (e.g., “05/14/2023”)
Format 6: “%d/%m/%Y” (e.g., “14/05/2023”)
Format 7: “%Y.%m.%d” (e.g., “2023.05.14”)
Format 8: “%m.%d.%Y” (e.g., “05.14.2023”)
Format 9: “%d.%m.%Y” (e.g., “14.05.2023”)
Format 10: “%Y %m %d” (e.g., “2023 05 14”)
Format 11: “%m %d %Y” (e.g., “05 14 2023”)
Format 12: “%d %m %Y” (e.g., “14 05 2023”)
Format 13: “%b %d, %Y” (e.g., “May 14, 2023”)
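Example
A usage sketch (the file name and column name are illustrative assumptions):
from functions.process_data import modify_date_format

# Rewrite dates such as "May 14, 2023" in the "Date" column, parsing
# them with the default "%b %d, %Y" input format.
modify_date_format("combined_data.csv", date_column_name="Date",
                   date_format="%b %d, %Y", delimiter=",")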
- functions.process_data.process_data(file_path, lags=[1, 2, 3], rolling_windows=[3, 7, 14], target_col='Price', datetime_col='Date', volume_col='Vol.', scaler='MinMaxScaler')[source]#
Preprocesses a dataset by loading the data, handling missing values, creating lagged and rolling-window features, scaling, and so on.
- Parameters
file_path (str) – Path to the CSV file.
lags (list, optional) – List of lags to create lagged features, by default [1, 2, 3].
rolling_windows (list, optional) – List of windows to create rolling window features, by default [3, 7, 14].
target_col (str, optional) – Name of the target column, by default “Price”.
datetime_col (str, optional) – Name of the datetime column, by default “Date”.
volume_col (str, optional) – Name of the volume column, by default “Vol.”.
scaler (str, optional) – Scaler to use for scaling the features. Options are ‘StandardScaler’, ‘RobustScaler’, and ‘MinMaxScaler’. Defaults to ‘MinMaxScaler’.
- Returns
Tuple containing scaled features, scaled target, and the scaler used.
- Return type
tuple
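Example
A minimal sketch using the documented defaults (the file path is an illustrative assumption):
from functions.process_data import process_data

# Load the CSV, build lagged and rolling-window features, and scale the
# data with a MinMaxScaler; the fitted scaler is returned as well.
X_scaled, y_scaled, scaler = process_data(
    "combined_data.csv",
    lags=[1, 2, 3],
    rolling_windows=[3, 7, 14],
    target_col="Price",
    scaler="MinMaxScaler",
)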
- functions.process_data.split_data(X_scaled, y_scaled, test_ratio)[source]#
Splits the scaled features and target into training/validation and test sets.
- Parameters
X_scaled (DataFrame) – Scaled features.
y_scaled (DataFrame) – Scaled target.
test_ratio (float) – Fraction of the data reserved for the test set.
- Returns
Tuple containing training/validation features, training/validation target, test features, and test target.
- Return type
tuple
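Example
Continuing the sketch above (the 0.2 test ratio is an illustrative assumption):
from functions.process_data import split_data

# Reserve a share of the data (here 20%) for the test set.
X_train_val, y_train_val, X_test, y_test = split_data(X_scaled, y_scaled, test_ratio=0.2)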
functions.rolling_window_trainer module#
- class functions.rolling_window_trainer.RollingWindowTrainer(scaler, X_train_val, y_train_val, train_window=100, val_window=20, step_size=5, checkpoint_path='models/save/checkpoints/', early_stopping_values={'cnn': {'min_delta': 0.0003, 'monitor': 'val_loss', 'patience': 7}, 'lstm': {'min_delta': 0.0001, 'monitor': 'val_loss', 'patience': 10}, 'nn': {'min_delta': 0.0002, 'monitor': 'val_loss', 'patience': 5}}, model_list=None)[source]#
Bases: object
This class provides methods for training and evaluating machine learning models using a rolling window approach, a time series forecasting technique in which the model is retrained at each time step using only the most recent data for training. A minimal construction sketch follows the parameter list below.
- Parameters
scaler (class) – The scaler class to scale the predictions.
X_train_val (DataFrame) – The feature dataset for training and validation.
y_train_val (DataFrame) – The target dataset for training and validation.
train_window (int, optional) – The size of the training window, by default 100.
val_window (int, optional) – The size of the validation window, by default 20.
step_size (int, optional) – The step size to move the window for each iteration, by default 5.
early_stopping_values (dict, optional) – A dictionary with keys as the model names and values as a dictionary containing early stopping parameters.
checkpoint_path (str, optional) – Path to save the model checkpoints, by default “models/save/checkpoints/”.
model_list (list, optional) – A list of models to be trained.
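Example
A minimal construction sketch using the documented defaults (scaler, X_train_val, and y_train_val are assumed to come from process_data and split_data above):
from functions.rolling_window_trainer import RollingWindowTrainer

# Train on 100-row windows, validate on the following 20 rows, and
# slide the window forward 5 rows per iteration.
trainer = RollingWindowTrainer(
    scaler,
    X_train_val,
    y_train_val,
    train_window=100,
    val_window=20,
    step_size=5,
)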
- check_overfitting(step, history)[source]#
This method checks for overfitting in the model’s training history.
- Parameters
step (int) – The current step in the rolling window approach.
history (History) – The training history of the model.
- generate_mean_df(predictions_all)[source]#
Generates a mean DataFrame from the predictions of all models.
- Parameters
predictions_all (pd.DataFrame) – DataFrame containing predictions of all models.
- Returns
Mean DataFrame with averaged predictions per model.
- Return type
pd.DataFrame
- get_all_models()[source]#
Returns the list of all models.
- Returns
List of all models.
- Return type
list
- get_all_val_metrics()[source]#
Returns all validation metrics.
- Returns
List of all validation metrics.
- Return type
list
- get_best_model_at_window(window)[source]#
Returns the best model information for the specified window step.
- Parameters
window (int) – Window step for which to retrieve the best model.
- Returns
model_info – Model information dictionary for the best model at the given window step. Returns None if no best model is found for the specified window.
- Return type
dict or None
- get_best_models()[source]#
Returns the list of best models.
- Returns
List of best models.
- Return type
list
- get_checkpoint_callback(checkpoint_dir)[source]#
Returns the checkpoint callback for saving the best model.
- Parameters
checkpoint_dir (str) – Directory path for saving the best model checkpoint.
- Returns
Checkpoint callback for saving the best model.
- Return type
ModelCheckpoint
- get_checkpoint_dir()[source]#
Returns the checkpoint directory for saving the best model.
- Returns
Checkpoint directory for saving the best model.
- Return type
str
- get_histories()[source]#
Returns the histories of neural network, CNN, and LSTM models.
- Returns
Tuple containing the histories of neural network, CNN, and LSTM models.
- Return type
tuple
- get_train_window(start, end)[source]#
Retrieves the training window data.
- Parameters
start (int) – Start index of the training window.
end (int) – End index of the training window.
- Returns
X_train_window (pd.DataFrame or pd.Series) – Input data for the training window.
y_train_window (pd.DataFrame or pd.Series) – Target data for the training window.
- get_val_window(start, end)[source]#
Retrieves the validation window data.
- Parameters
start (int) – Start index of the validation window.
end (int) – End index of the validation window.
- Returns
X_val_window (pd.DataFrame or pd.Series) – Input data for the validation window.
y_val_window (pd.DataFrame or pd.Series) – Target data for the validation window.
- get_window_indices(step)[source]#
Returns the start and end indices for the given window step.
- Parameters
step (int) – The window step.
- Returns
Tuple containing the start and end indices for the window.
- Return type
tuple
- load_and_set_histories(file_name_nn='nn_history', file_name_cnn='cnn_history', file_name_lstm='lstm_history')[source]#
This method loads and sets the training history of the models from the specified files.
- Parameters
file_name_nn (str, optional) – The file name for the history of the Neural Network model.
file_name_cnn (str, optional) – The file name for the history of the Convolutional Neural Network model.
file_name_lstm (str, optional) – The file name for the history of the LSTM model.
- predict_all_models(X)[source]#
Generates predictions from all trained models for the given input data.
- Parameters
X (pd.DataFrame or np.ndarray) – Input data for prediction.
- Returns
Predicted values from all models. If X is a pd.DataFrame, returns a DataFrame of predictions; if X is an np.ndarray, returns a 2D array of predictions.
- Return type
pd.DataFrame or np.ndarray
- predict_best_models(X)[source]#
Generates predictions from the best models for the given input data.
- Parameters
X (pd.DataFrame or np.ndarray) – Input data for prediction.
- Returns
Predicted values from the best models. If X is a pd.DataFrame, returns a DataFrame of predictions; if X is an np.ndarray, returns a Series of predictions. Returns None if X is neither a pd.DataFrame nor an np.ndarray.
- Return type
pd.DataFrame or pd.Series or None
- save_histories(file_name_nn='nn_history', file_name_cnn='cnn_history', file_name_lstm='lstm_history')[source]#
Saves the histories of neural network, CNN, and LSTM models as pickle files.
- Parameters
file_name_nn (str, optional) – Name of the file to save the neural network history, by default “nn_history”.
file_name_cnn (str, optional) – Name of the file to save the CNN history, by default “cnn_history”.
file_name_lstm (str, optional) – Name of the file to save the LSTM history, by default “lstm_history”.
- set_all_models(loaded_models)[source]#
Sets the list of all models.
- Parameters
loaded_models (list) – List of all loaded models.
- set_best_models(loaded_models)[source]#
Sets the list of best models.
- Parameters
loaded_models (list) – List of loaded best models.
- train_nn_model_with_window(X_train, y_train, X_val, y_val, checkpoint_callback)[source]#
This method trains a Neural Network model with a rolling window approach.
- Parameters
X_train (DataFrame) – The input training data.
y_train (DataFrame) – The output training data.
X_val (DataFrame) – The input validation data.
y_val (DataFrame) – The output validation data.
checkpoint_callback (Callback) – A callback for saving the model checkpoints.
- tune_non_nn_model(X_train, X_val, y_train, y_val)[source]#
This method tunes the hyperparameters of a non-Neural Network model.
- Parameters
X_train (DataFrame) – The input training data.
y_train (DataFrame) – The output training data.
X_val (DataFrame) – The input validation data.
y_val (DataFrame) – The output validation data.
- update_time_consumption(start_time)[source]#
This method updates the time consumption of the model.
- Parameters
start_time (float) – The time when the model’s training started.
- vote_predict_best_models(X)[source]#
Predicts using the best models via majority voting on the given input data. (Note: this method does not currently work as expected.)
- Parameters
X (pd.DataFrame or np.ndarray) – Input data for prediction.
- Returns
Voted predictions from the best models. If X is a pd.DataFrame, returns a DataFrame of predictions; if X is an np.ndarray, returns a Series of predictions. Returns None if X is neither a pd.DataFrame nor an np.ndarray.
- Return type
pd.DataFrame or pd.Series or None
- weighted_predict_best_models(X)[source]#
Predicts using the best models via weighted averaging on the given input data. (Note: this method does not currently work as expected.)
- Parameters
X (pd.DataFrame or np.ndarray) – Input data for prediction.
- Returns
Weighted predictions from the best models. If X is a pd.DataFrame, returns a DataFrame of weighted predictions; if X is an np.ndarray, returns a Series of weighted predictions. Returns None if X is neither a pd.DataFrame nor an np.ndarray.
- Return type
pd.DataFrame or pd.Series or None
functions.save_load_model module#
- functions.save_load_model.get_val_metrics_from_log(log_file)[source]#
Extract validation metrics from a given log file.
- Parameters
log_file (str) – Path to the log file.
- Returns
A dictionary where the keys are tuples of (model_name, step) and the values are validation metrics.
- Return type
dict
- functions.save_load_model.load_trained_models(root_dir, log_file)[source]#
Loads trained models from a specified directory. The models are filtered based on the validation metrics provided in a log file.
- Parameters
root_dir (str) – The root directory from where the models will be loaded.
log_file (str) – Path to the log file containing validation metrics.
- Returns
A list of dictionaries, where each dictionary contains details about a trained model and its validation metric.
- Return type
list
- functions.save_load_model.save_trained_models(trained_models, root_dir)[source]#
Saves trained models to a specified directory.
- Parameters
trained_models (list of dicts) – A list of dictionaries, where each dictionary contains details about a trained model.
root_dir (str) – The root directory where the models will be saved.
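Example
A round-trip sketch (the directory and log-file paths, and the trained_models list, are illustrative assumptions):
from functions.save_load_model import load_trained_models, save_trained_models

# Persist the trained models, then reload only those whose validation
# metrics appear in the training log.
save_trained_models(trained_models, "models/save/")
reloaded = load_trained_models("models/save/", "logs/training.log")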
functions.sentiment_analyzer module#
- class functions.sentiment_analyzer.SentimentAnalysis(path, delimiter, seed=42)[source]#
Bases: object
A class for performing sentiment analysis on text data using VADER sentiment analysis and RoBERTa tokenization.
The SentimentAnalysis class provides methods for data preprocessing, sentiment score computation, tokenization using RoBERTa, and aggregation of sentiment scores. It also includes functionality to save the processed data to a CSV file. An end-to-end sketch follows the method list below.
- Parameters
path (str) – Path to the input CSV file containing the text data.
delimiter (str) – Delimiter used in the CSV file.
seed (int, optional) – Random seed used for reproducibility, by default 42.
- Attributes
df (DataFrame) – The input dataset containing text data.
stopwords (set) – Set of stopwords for text cleaning.
sid (SentimentIntensityAnalyzer) – An instance of the SentimentIntensityAnalyzer from the NLTK library.
- aggregate_by_date()[source]#
Aggregates VADER sentiment scores by date.
- Returns
The aggregated dataframe.
- Return type
DataFrame
- compute_vader_scores(df, label)[source]#
Computes VADER sentiment scores (negative, neutral, positive, compound) for each tweet.
- Parameters
df (DataFrame) – The dataframe.
label (str) – The column in the dataframe containing the text to analyze.
- Returns
The dataframe with the added VADER sentiment scores.
- Return type
DataFrame
- preprocess()[source]#
Preprocesses the data by cleaning text and computing VADER sentiment scores.
- Returns
The processed dataframe.
- Return type
DataFrame
- process_inputs(max_len)[source]#
Tokenizes inputs using RoBERTa for training, validation, and testing.
- Parameters
max_len (int) – Maximum length for tokenization.
- save_df(filename)[source]#
Aggregates the dataframe by date and saves it to a CSV file.
- Parameters
filename (str) – The name of the output CSV file.
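Example
An end-to-end sketch (the file names and delimiter are illustrative assumptions; the exact call order may vary):
from functions.sentiment_analyzer import SentimentAnalysis

# Clean the text, compute VADER scores, then aggregate by date and
# write the result to a CSV for later merging with the price data.
sa = SentimentAnalysis("headlines.csv", delimiter=",", seed=42)
sa.preprocess()
sa.save_df("headlines_sentiment.csv")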
functions.utils module#
- functions.utils.calculate_metrics(preds, y_test)[source]#
Calculates Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) for the given predictions and actual values. It also calculates daily MSE and RMSE and stores them in a pandas DataFrame.
- Parameters
preds (np.ndarray or pd.Series) – The predicted values.
y_test (np.ndarray or pd.Series) – The actual values.
- Returns
mse (float) – The Mean Squared Error.
rmse (float) – The Root Mean Squared Error.
metrics_df (pd.DataFrame) – A dataframe containing daily MSE and RMSE.
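Example
A usage sketch (preds and y_test are assumed to be aligned arrays or Series):
from functions.utils import calculate_metrics

# Overall MSE/RMSE, plus a per-day breakdown in metrics_df.
mse, rmse, metrics_df = calculate_metrics(preds, y_test)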
- functions.utils.create_subsets(X, y, num_subsets)[source]#
Divides the features (X) into a specified number of subsets and concatenates each subset with the target (y).
- Parameters
X (pd.DataFrame) – The input features.
y (pd.Series) – The target values.
num_subsets (int) – The number of subsets to divide X into.
- Returns
subsets – A list of dataframes, each consisting of a subset of features and the target.
- Return type
list of pd.DataFrame
- functions.utils.find_best_models(metric_data)[source]#
Finds the best models based on their validation metric.
- Parameters
metric_data (list of dict) – A list of dictionaries, each containing the information for one model.
- Returns
best_models – A list of dictionaries, each containing the information for one best model.
- Return type
list of dict
- functions.utils.format_number(value)[source]#
Formats a number to a string with suffixes (B for billion, M for million, k for thousand).
- Parameters
value (int or float) – The number to format.
- Returns
The formatted number.
- Return type
str
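Example
Illustrative behavior (the exact rounding is an assumption):
from functions.utils import format_number

print(format_number(1_250_000))  # e.g. "1.25M"
print(format_number(3_400))      # e.g. "3.4k"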
- functions.utils.group_by_model(df)[source]#
Groups the data by model and calculates the cumulative sum of time consumption for each model.
- Parameters
df (pd.DataFrame) – The input dataframe containing model information.
- Returns
df_list – A list of dataframes, each containing the information for one model.
- Return type
list of pd.DataFrame
- functions.utils.mean_absolute_percentage_error(y_true, y_pred)[source]#
Calculates the Mean Absolute Percentage Error (MAPE) between the true and predicted values.
- Parameters
y_true (np.ndarray) – The true values.
y_pred (np.ndarray) – The predicted values.
- Returns
The MAPE.
- Return type
float
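For reference, a common MAPE implementation looks like the sketch below; the project’s own version may differ in details such as how zero true values are handled.
import numpy as np

def mape(y_true, y_pred):
    # Mean absolute error relative to the true values, as a percentage.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100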
- functions.utils.plot_heatmaps(subsets)[source]#
Plots and saves correlation heatmaps for each subset of data.
- Parameters
subsets (list of pd.DataFrame) – A list of dataframes for which to plot the correlation heatmaps.
- functions.utils.plot_histories(histories)[source]#
Plots the loss histories for the different training runs.
- Parameters
histories (list of keras.callbacks.History) – The list of History objects, one per training run.
- functions.utils.plot_price_prediction(X_test, y_test, predictions, title)[source]#
Plots the predicted and actual values for the test data.
- Parameters
X_test (pd.DataFrame) – The test features.
y_test (pd.Series) – The actual test values.
predictions (pd.Series) – The predicted values.
title (str) – The title of the plot.
- Returns
The figure object of the plot.
- Return type
plotly.graph_objects._figure.Figure
- functions.utils.read_model_metrics(log_filename)[source]#
Reads a log file and extracts model metrics into a pandas DataFrame.
- Parameters
log_filename (str) – The path to the log file.
- Returns
df_sorted – A dataframe containing the extracted model metrics, sorted by model and step.
- Return type
pd.DataFrame
- functions.utils.resize_and_remove_background(image_path, output_path, size=(800, 800))[source]#
Resizes an image to the specified size, removes its background, and saves it in the specified output path.
- Parameters
image_path (str) – The path to the input image.
output_path (str) – The path to save the output image.
size (tuple of int, optional) – The desired output size. Default is (800, 800).
- functions.utils.reverse_values(predictions, X_scaled, y_scaled, scaler)[source]#
Reverses the effect of scaling on the predictions and the scaled features and target.
- Parameters
predictions (pd.DataFrame) – The predicted values.
X_scaled (pd.DataFrame) – The scaled features.
y_scaled (pd.Series) – The scaled target.
scaler (sklearn.preprocessing.StandardScaler) – The scaler used to scale the data.
- Returns
reverse_predictions_df (pd.DataFrame) – The unscaled predictions.
reverse_x_df (pd.DataFrame) – The unscaled features.
reverse_y_df (pd.Series) – The unscaled target.
- functions.utils.root_mean_squared_log_error(y_true, y_pred)[source]#
Calculates the Root Mean Squared Logarithmic Error (RMSLE) between the true and predicted values.
- Parameters
y_true (np.ndarray) – The true values.
y_pred (np.ndarray) – The predicted values.
- Returns
The RMSLE.
- Return type
float
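For reference, a standard RMSLE computation is sketched below (log1p keeps zero values well-defined); the project’s implementation may differ.
import numpy as np

def rmsle(y_true, y_pred):
    # Root mean squared error computed on log(1 + x) of both series.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))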
- functions.utils.save_dataframe(df_list, image_dir='model_images/all_models/')[source]#
Saves the dataframes as images in the specified directory.
- Parameters
df_list (list of pd.DataFrame) – A list of dataframes to be saved as images.
image_dir (str, optional) – The directory to save the images. Default is ‘model_images/all_models/’.
- functions.utils.save_keras_models(best_models)[source]#
Saves diagrams of the best Keras models in PNG format.
- Parameters
best_models (list of dict) – A list of dictionaries, each containing a Keras model and other related information.
- functions.utils.save_metrics(df, image_dir='model_images/metrics/')[source]#
Saves each model’s metrics as a separate PNG image in the specified directory.
- Parameters
df (pd.DataFrame) – The input dataframe containing model information.
image_dir (str, optional) – The directory to save the images. Default is ‘model_images/metrics/’.
functions.web_scrapper module#
- class functions.web_scrapper.StoppableThread(*args, **kwargs)[source]#
Bases: threading.Thread
A thread that can be stopped by setting its internal stop flag via the stop() method.
- class functions.web_scrapper.WebScraper(url='https://www.coindesk.com/search?s=bitcoin&sort=1', starting_page_number=1, ending_page_number=3096, output_file_name='headlines')[source]#
Bases: object
A web scraper for collecting headlines from CoinDesk.
The WebScraper class provides methods for starting, stopping, pausing, and resuming the scraping process. It uses the Selenium library to navigate through web pages, find required elements, and write scraped data to a CSV file.
- start_scraping(url, starting_page_number, ending_page_number, output_file_name)[source]#
Start the scraping process.
- Parameters
url (str) – URL to scrape from.
starting_page_number (int) – Page number to start scraping from.
ending_page_number (int) – Page number to end scraping at.
output_file_name (str) – Name of the file to output scraped headlines.
- webScrapper(url, starting_page_number, ending_page_number, output_file_name)[source]#
The main method for scraping the website. It navigates through the pages of the given URL, finds the required elements, and writes the data to a CSV file. It also handles situations such as a missing output file and a cookie-consent button, and catches exceptions such as TimeoutException. The scraping process can be paused, resumed, and stopped.
- Parameters
url (str) – The base url from where the scraping starts.
starting_page_number (int) – The starting page number for scraping.
ending_page_number (int) – The ending page number for scraping.
output_file_name (str) – The name of the output file where the scraped data is stored.
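Example
A minimal scraping sketch (the small page range is an illustrative assumption; the constructor defaults shown above can also be used as-is):
from functions.web_scrapper import WebScraper

# Scrape the first three result pages of CoinDesk's bitcoin search and
# write the headlines to the "headlines" output file.
scraper = WebScraper()
scraper.start_scraping(
    url="https://www.coindesk.com/search?s=bitcoin&sort=1",
    starting_page_number=1,
    ending_page_number=3,
    output_file_name="headlines",
)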