Functions#

Submodules#

functions.process_data module#

functions.process_data.combine_data(output_file_name='combined_data', input_file_name_v1='bitcoin_price.csv', input_file_name_v2='headlines_sentiment.csv')[source]#

Combines two datasets and writes the combined dataset to a CSV file.

Parameters
  • output_file_name (str, optional) – Name of the output file, by default “combined_data”.

  • input_file_name_v1 (str, optional) – Name of the first input file, by default “bitcoin_price.csv”.

  • input_file_name_v2 (str, optional) – Name of the second input file, by default “headlines_sentiment.csv”.

Return type

None
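
A minimal usage sketch with the documented defaults; it assumes both input CSV files already exist in the working directory:

    >>> from functions.process_data import combine_data
    >>> # Merge the price and sentiment CSVs into the combined output file
    >>> combine_data(output_file_name='combined_data',
    ...              input_file_name_v1='bitcoin_price.csv',
    ...              input_file_name_v2='headlines_sentiment.csv')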

functions.process_data.double_quotation_remover(input_file_name_v1, delimiter=',')[source]#

Removes double quotation marks from specified columns in a CSV file.

Parameters
  • input_file_name_v1 (str) – Path to the input CSV file.

  • delimiter (str, optional) – Delimiter used in the CSV file, by default “,”.

Return type

None

functions.process_data.modify_date_format(input_file_name_v1, date_column_name, date_format='%b %d, %Y', delimiter=',')[source]#

Modifies the date format of a specific column in a CSV file.

Parameters
  • input_file_name_v1 (str) – Path to the input CSV file.

  • date_column_name (str) – Name of the column containing dates to be modified.

  • date_format (str, optional) – Format of the dates in the input file, by default “%b %d, %Y”.

  • delimiter (str, optional) – Delimiter used in the CSV file, by default “,”.

Return type

None

Examples

  • Format 1: “%Y-%m-%d” (e.g., “2023-05-14”)

  • Format 2: “%m-%d-%Y” (e.g., “05-14-2023”)

  • Format 3: “%d-%m-%Y” (e.g., “14-05-2023”)

  • Format 4: “%Y/%m/%d” (e.g., “2023/05/14”)

  • Format 5: “%m/%d/%Y” (e.g., “05/14/2023”)

  • Format 6: “%d/%m/%Y” (e.g., “14/05/2023”)

  • Format 7: “%Y.%m.%d” (e.g., “2023.05.14”)

  • Format 8: “%m.%d.%Y” (e.g., “05.14.2023”)

  • Format 9: “%d.%m.%Y” (e.g., “14.05.2023”)

  • Format 10: “%Y %m %d” (e.g., “2023 05 14”)

  • Format 11: “%m %d %Y” (e.g., “05 14 2023”)

  • Format 12: “%d %m %Y” (e.g., “14 05 2023”)

  • Format 13: “%b %d, %Y” (e.g., “May 14, 2023”)
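
For example, to reparse a Date column stored in the default pattern "%b %d, %Y" (Format 13), a minimal sketch assuming bitcoin_price.csv is the input file:

    >>> from functions.process_data import modify_date_format
    >>> modify_date_format('bitcoin_price.csv', date_column_name='Date',
    ...                    date_format='%b %d, %Y')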

functions.process_data.process_data(file_path, lags=[1, 2, 3], rolling_windows=[3, 7, 14], target_col='Price', datetime_col='Date', volume_col='Vol.', scaler='MinMaxScaler')[source]#

Preprocesses a dataset: loads the data, handles missing values, creates lagged and rolling-window features, and scales the features and target.

Parameters
  • file_path (str) – Path to the CSV file.

  • lags (list, optional) – List of lags to create lagged features, by default [1, 2, 3].

  • rolling_windows (list, optional) – List of windows to create rolling window features, by default [3, 7, 14].

  • target_col (str, optional) – Name of the target column, by default “Price”.

  • datetime_col (str, optional) – Name of the datetime column, by default “Date”.

  • volume_col (str, optional) – Name of the volume column, by default “Vol.”.

  • scaler (str, optional) – Scaler to use for scaling the features. Options are ‘StandardScaler’, ‘RobustScaler’, and ‘MinMaxScaler’. Defaults to ‘MinMaxScaler’.

Returns

Tuple containing scaled features, scaled target, and the scaler used.

Return type

tuple
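
A usage sketch; combined_data.csv is an assumed input path, and the unpacking order follows the documented return tuple:

    >>> from functions.process_data import process_data
    >>> X_scaled, y_scaled, scaler = process_data(
    ...     'combined_data.csv', lags=[1, 2, 3],
    ...     rolling_windows=[3, 7, 14], scaler='MinMaxScaler')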

functions.process_data.split_data(X_scaled, y_scaled, test_ratio)[source]#

Splits the scaled features and target into training/validation and test sets.

Parameters
  • X_scaled (DataFrame) – Scaled features.

  • y_scaled (DataFrame) – Scaled target.

  • test_ratio (float) – Fraction of the data to reserve for the test set.

Returns

Tuple containing training/validation features, training/validation target, test features, and test target.

Return type

tuple
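
A usage sketch continuing from process_data; the unpacking order follows the documented return tuple:

    >>> from functions.process_data import split_data
    >>> X_train_val, y_train_val, X_test, y_test = split_data(
    ...     X_scaled, y_scaled, test_ratio=0.2)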

functions.rolling_window_trainer module#

class functions.rolling_window_trainer.RollingWindowTrainer(scaler, X_train_val, y_train_val, train_window=100, val_window=20, step_size=5, checkpoint_path='models/save/checkpoints/', early_stopping_values={'cnn': {'min_delta': 0.0003, 'monitor': 'val_loss', 'patience': 7}, 'lstm': {'min_delta': 0.0001, 'monitor': 'val_loss', 'patience': 10}, 'nn': {'min_delta': 0.0002, 'monitor': 'val_loss', 'patience': 5}}, model_list=None)[source]#

Bases: object

This class provides methods for training and evaluating machine learning models using a rolling window approach. The rolling window approach is a time series forecasting technique where the model is retrained at each time step, using only the most recent data as training data.

Parameters
  • scaler (class) – The scaler class to scale the predictions.

  • X_train_val (DataFrame) – The feature dataset for training and validation.

  • y_train_val (DataFrame) – The target dataset for training and validation.

  • train_window (int, optional) – The size of the training window, by default 100.

  • val_window (int, optional) – The size of the validation window, by default 20.

  • step_size (int, optional) – The step size to move the window for each iteration, by default 5.

  • early_stopping_values (dict, optional) – A dictionary with keys as the model names and values as a dictionary containing early stopping parameters.

  • checkpoint_path (str, optional) – Path to save the model checkpoints, by default “models/save/checkpoints/”.

  • model_list (list, optional) – A list of models to be trained.
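
A minimal construction-and-training sketch, assuming scaler, X_train_val, and y_train_val come from process_data and split_data above; the accepted entries for model_list are not specified here, so the default is kept:

    >>> from functions.rolling_window_trainer import RollingWindowTrainer
    >>> trainer = RollingWindowTrainer(scaler, X_train_val, y_train_val,
    ...                                train_window=100, val_window=20,
    ...                                step_size=5)
    >>> trainer.start_training()
    >>> preds = trainer.predict_best_models(X_test)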

check_overfitting(step, history)[source]#

This method checks for overfitting in the model’s training history.

Parameters
  • step (int) – The current step in the rolling window approach.

  • history (History) – The training history of the model.

generate_mean_df(predictions_all)[source]#

Generates a mean DataFrame from the predictions of all models.

Parameters

predictions_all (pd.DataFrame) – DataFrame containing predictions of all models.

Returns

Mean DataFrame with averaged predictions per model.

Return type

pd.DataFrame
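
Continuing the sketch above, the per-model predictions can then be averaged (trainer and X_test are carried over from the earlier example):

    >>> predictions_all = trainer.predict_all_models(X_test)
    >>> mean_df = trainer.generate_mean_df(predictions_all)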

get_all_models()[source]#

Returns the list of all models.

Returns

List of all models.

Return type

list

get_all_val_metrics()[source]#

Returns all validation metrics.

Returns

List of all validation metrics.

Return type

list

get_best_model_at_window(window)[source]#

Returns the best model information for the specified window step.

Parameters

window (int) – Window step for which to retrieve the best model.

Returns

model_info – Model information dictionary for the best model at the given window step. Returns None if no best model is found for the specified window.

Return type

dict or None

get_best_models()[source]#

Returns the list of best models.

Returns

List of best models.

Return type

list

get_checkpoint_callback(checkpoint_dir)[source]#

Returns the checkpoint callback for saving the best model.

Parameters

checkpoint_dir (str) – Directory path for saving the best model checkpoint.

Returns

Checkpoint callback for saving the best model.

Return type

ModelCheckpoint

get_checkpoint_dir()[source]#

Returns the checkpoint directory for saving the best model.

Returns

Checkpoint directory for saving the best model.

Return type

str

get_histories()[source]#

Returns the histories of neural network, CNN, and LSTM models.

Returns

Tuple containing the histories of neural network, CNN, and LSTM models.

Return type

tuple

get_train_window(start, end)[source]#

Retrieves the training window data.

Parameters
  • start (int) – Start index of the training window.

  • end (int) – End index of the training window.

Returns

  • X_train_window (pd.DataFrame, pd.Series) – Input data for the training window.

  • y_train_window (pd.DataFrame, pd.Series) – Target data for the training window.

get_val_window(start, end)[source]#

Retrieves the validation window data.

Parameters
  • start (int) – Start index of the validation window.

  • end (int) – End index of the validation window.

Returns

  • X_val_window (pd.DataFrame, pd.Series) – Input data for the validation window.

  • y_val_window (pd.DataFrame, pd.Series) – Target data for the validation window.

get_window_indices(step)[source]#

Returns the start and end indices for the given window step.

Parameters

step (int) – The window step.

Returns

Tuple containing the start and end indices for the window.

Return type

tuple
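
The exact index arithmetic is internal to the class, but a typical rolling-window layout consistent with train_window, val_window, and step_size would look like the following hypothetical sketch (not the verified implementation):

    # Hypothetical window arithmetic for a given step (illustration only)
    train_start = step * step_size
    train_end = train_start + train_window   # training slice: [train_start, train_end)
    val_end = train_end + val_window         # validation slice: [train_end, val_end)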

load_and_set_histories(file_name_nn='nn_history', file_name_cnn='cnn_history', file_name_lstm='lstm_history')[source]#

This method loads and sets the training history of the models from the specified files.

Parameters
  • file_name_nn (str, optional) – The file name for the history of the Neural Network model.

  • file_name_cnn (str, optional) – The file name for the history of the Convolutional Neural Network model.

  • file_name_lstm (str, optional) – The file name for the history of the LSTM model.

predict_all_models(X)[source]#

Generates predictions from all models for the given input data.

Parameters

X (pd.DataFrame or np.ndarray) – Input data for prediction.

Returns

Predicted values from all models. If X is a pd.DataFrame, returns a DataFrame of predictions; if X is a np.ndarray, returns a 2D array of predictions.

Return type

pd.DataFrame or np.ndarray

predict_best_models(X)[source]#

Generates predictions from the best models for the given input data.

Parameters

X (pd.DataFrame or np.ndarray) – Input data for prediction.

Returns

Predicted values from the best models. If X is a pd.DataFrame, returns a DataFrame of predictions; if X is a np.ndarray, returns a Series of predictions. Returns None if X is neither.

Return type

pd.DataFrame or pd.Series or None

print_model_info()[source]#

Prints information about the models.

save_histories(file_name_nn='nn_history', file_name_cnn='cnn_history', file_name_lstm='lstm_history')[source]#

Saves the histories of neural network, CNN, and LSTM models as pickle files.

Parameters
  • file_name_nn (str, optional) – Name of the file to save the neural network history, by default “nn_history”.

  • file_name_cnn (str, optional) – Name of the file to save the CNN history, by default “cnn_history”.

  • file_name_lstm (str, optional) – Name of the file to save the LSTM history, by default “lstm_history”.

set_all_models(loaded_models)[source]#

Sets the list of all models.

Parameters

loaded_models (list) – The list of models to set as the complete model list.

set_best_models(loaded_models)[source]#

Sets the list of best models.

Parameters

loaded_models (list) – List of loaded best models.

start_training()[source]#

Runs the rolling window training loop, sliding the training and validation windows across the data and training the models at each step.

train_nn_model_with_window(X_train, y_train, X_val, y_val, checkpoint_callback)[source]#

This method trains a Neural Network model with a rolling window approach.

Parameters
  • X_train (DataFrame) – The input training data.

  • y_train (DataFrame) – The output training data.

  • X_val (DataFrame) – The input validation data.

  • y_val (DataFrame) – The output validation data.

  • checkpoint_callback (Callback) – A callback for saving the model checkpoints.

tune_non_nn_model(X_train, X_val, y_train, y_val)[source]#

This method tunes the hyperparameters of a non-Neural Network model.

Parameters
  • X_train (DataFrame) – The input training data.

  • y_train (DataFrame) – The output training data.

  • X_val (DataFrame) – The input validation data.

  • y_val (DataFrame) – The output validation data.

update_time_consumption(start_time)[source]#

This method updates the time consumption of the model.

Parameters

start_time (float) – The time when the model’s training started.

vote_predict_best_models(X)[source]#

Generates predictions from the best models using a voting scheme on the given input data. Note: this method does not currently work as expected.

Parameters

X (pd.DataFrame or np.ndarray) – Input data for prediction.

Returns

Predicted values from the best models using voting. If X is a pd.DataFrame, returns a DataFrame of predictions; if X is a np.ndarray, returns a Series of predictions. Returns None if X is neither.

Return type

pd.DataFrame or pd.Series or None

weighted_predict_best_models(X)[source]#

Generates predictions from the best models using weighted averaging on the given input data. Note: this method does not currently work as expected.

Parameters

X (pd.DataFrame or np.ndarray) – Input data for prediction.

Returns

Weighted predictions from the best models. If X is a pd.DataFrame, returns a DataFrame of weighted predictions; if X is a np.ndarray, returns a Series of weighted predictions. Returns None if X is neither.

Return type

pd.DataFrame or pd.Series or None

functions.save_load_model module#

functions.save_load_model.get_val_metrics_from_log(log_file)[source]#

Extract validation metrics from a given log file.

Parameters

log_file (str) – Path to the log file.

Returns

A dictionary where the keys are tuples of (model_name, step) and the values are validation metrics.

Return type

dict

functions.save_load_model.load_trained_models(root_dir, log_file)[source]#

Loads trained models from a specified directory. The models are filtered based on the validation metrics provided in a log file.

Parameters
  • root_dir (str) – The root directory from where the models will be loaded.

  • log_file (str) – Path to the log file containing validation metrics.

Returns

A list of dictionaries, where each dictionary contains details about a trained model and its validation metric.

Return type

list

functions.save_load_model.save_trained_models(trained_models, root_dir)[source]#

Saves trained models to a specified directory.

Parameters
  • trained_models (list of dicts) – A list of dictionaries, where each dictionary contains details about a trained model.

  • root_dir (str) – The root directory where the models will be saved.
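
A round-trip sketch; the directory and log file names are assumptions:

    >>> from functions.save_load_model import save_trained_models, load_trained_models
    >>> save_trained_models(trained_models, 'models/save/')
    >>> loaded = load_trained_models('models/save/', 'training.log')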

functions.sentiment_analyzer module#

class functions.sentiment_analyzer.SentimentAnalysis(path, delimiter, seed=42)[source]#

Bases: object

A class for performing sentiment analysis on text data using VADER sentiment analysis and RoBERTa tokenization.

The SentimentAnalysis class provides methods for data preprocessing, sentiment score computation, tokenization using RoBERTa, and aggregation of sentiment scores. It also includes functionality to save the processed data to a CSV file.

Parameters
  • path (str) – Path to the input dataset containing text data.

  • delimiter (str) – Delimiter used in the input file.

  • seed (int, optional) – Random seed used for reproducibility, by default 42.

Attributes
  • df (DataFrame) – The loaded dataset containing text data.

  • stopwords (set) – Set of stopwords for text cleaning.

  • sid (SentimentIntensityAnalyzer) – An instance of the SentimentIntensityAnalyzer from the NLTK library.
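
A minimal end-to-end sketch; the input path and output file name are assumptions:

    >>> from functions.sentiment_analyzer import SentimentAnalysis
    >>> sa = SentimentAnalysis('headlines.csv', delimiter=',', seed=42)
    >>> df = sa.preprocess()                   # clean text and compute VADER scores
    >>> sa.save_df('headlines_sentiment.csv')  # aggregate by date and write CSV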

aggregate_by_date()[source]#

Aggregates VADER sentiment scores by date.

Returns

The aggregated dataframe.

Return type

DataFrame

compute_vader_scores(df, label)[source]#

Computes VADER sentiment scores (negative, neutral, positive, compound) for each tweet.

Parameters
  • df (DataFrame) – The dataframe.

  • label (str) – The column in the dataframe containing the text to analyze.

Returns

The dataframe with the added VADER sentiment scores.

Return type

DataFrame

preprocess()[source]#

Preprocesses the data by cleaning text and computing VADER sentiment scores.

Returns

The processed dataframe.

Return type

DataFrame

process_inputs(max_len)[source]#

Tokenizes inputs using RoBERTa for training, validation, and testing.

Parameters

max_len (int) – Maximum length for tokenization.

save_df(filename)[source]#

Aggregates the dataframe by date and saves it to a CSV file.

Parameters

filename (str) – The name of the output CSV file.

tweet_to_words(tweet)[source]#

Cleans a tweet by removing non-alphanumeric characters and stopwords, and applies stemming.

Parameters

tweet (str) – A tweet.

Returns

A list of cleaned words from the tweet.

Return type

list

unlist(list)[source]#

Joins a list of words into a string.

Parameters

list (list) – A list of words.

Returns

A string with words separated by spaces.

Return type

str

functions.utils module#

functions.utils.calculate_metrics(preds, y_test)[source]#

Calculates Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) for the given predictions and actual values. It also calculates daily MSE and RMSE and stores them in a pandas DataFrame.

Parameters
  • preds (np.ndarray or pd.Series) – The predicted values.

  • y_test (np.ndarray or pd.Series) – The actual values.

Returns

  • mse (float) – The Mean Squared Error.

  • rmse (float) – The Root Mean Squared Error.

  • metrics_df (pd.DataFrame) – A dataframe containing daily MSE and RMSE.
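
A usage sketch, assuming preds and y_test are aligned predictions and actuals:

    >>> from functions.utils import calculate_metrics
    >>> mse, rmse, metrics_df = calculate_metrics(preds, y_test)
    >>> print(f"MSE={mse:.4f}, RMSE={rmse:.4f}")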

functions.utils.create_subsets(X, y, num_subsets)[source]#

Divides the features (X) into a specified number of subsets and concatenates each subset with the target (y).

Parameters
  • X (pd.DataFrame) – The input features.

  • y (pd.Series) – The target values.

  • num_subsets (int) – The number of subsets to divide X into.

Returns

subsets – A list of dataframes, each consisting of a subset of features and the target.

Return type

list of pd.DataFrame
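
A usage sketch pairing create_subsets with plot_heatmaps (documented below); X and y are assumed to be the features and target:

    >>> from functions.utils import create_subsets, plot_heatmaps
    >>> subsets = create_subsets(X, y, num_subsets=4)
    >>> plot_heatmaps(subsets)   # one correlation heatmap per subset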

functions.utils.find_best_models(metric_data)[source]#

Finds the best models based on their validation metric.

Parameters

metric_data (list of dict) – A list of dictionaries, each containing the information for one model.

Returns

best_models – A list of dictionaries, each containing the information for one best model.

Return type

list of dict

functions.utils.format_number(value)[source]#

Formats a number to a string with suffixes (B for billion, M for million, k for thousand).

Parameters

value (int or float) – The number to format.

Returns

The formatted number.

Return type

str
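
An illustrative sketch of the suffix logic; the thresholds and rounding shown are assumptions, and the library's exact formatting may differ:

    def format_number(value):
        # Sketch only: the actual thresholds and rounding are assumptions
        if abs(value) >= 1e9:
            return f"{value / 1e9:.1f}B"
        if abs(value) >= 1e6:
            return f"{value / 1e6:.1f}M"
        if abs(value) >= 1e3:
            return f"{value / 1e3:.1f}k"
        return str(value)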

functions.utils.group_by_model(df)[source]#

Groups the data by model and calculates the cumulative sum of time consumption for each model.

Parameters

df (pd.DataFrame) – The input dataframe containing model information.

Returns

df_list – A list of dataframes, each containing the information for one model.

Return type

list of pd.DataFrame

functions.utils.mean_absolute_percentage_error(y_true, y_pred)[source]#

Calculates the Mean Absolute Percentage Error (MAPE) between the true and predicted values.

Parameters
  • y_true (np.ndarray) – The true values.

  • y_pred (np.ndarray) – The predicted values.

Returns

The MAPE.

Return type

float
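
The standard MAPE formula, which this function presumably implements; a sketch under that assumption, not the verified source:

    import numpy as np

    def mape(y_true, y_pred):
        # Assumes y_true contains no zeros (the division would fail otherwise)
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)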

functions.utils.plot_heatmaps(subsets)[source]#

Plots and saves correlation heatmaps for each subset of data.

Parameters

subsets (list of pd.DataFrame) – A list of dataframes for which to plot the correlation heatmaps.

functions.utils.plot_histories(histories)[source]#

Plots the loss histories for each training run.

Parameters

histories (list of keras.callbacks.History) – The history objects collected during training.

functions.utils.plot_price_prediction(X_test, y_test, predictions, title)[source]#

Plots the predicted and actual values for the test data.

Parameters
  • X_test (pd.DataFrame) – The test features.

  • y_test (pd.Series) – The actual values.

  • predictions (pd.Series) – The predicted values.

  • title (str) – The title of the plot.

Returns

The figure object of the plot.

Return type

plotly.graph_objects._figure.Figure

functions.utils.read_model_metrics(log_filename)[source]#

Reads a log file and extracts model metrics into a pandas DataFrame.

Parameters

log_filename (str) – The path to the log file.

Returns

df_sorted – A dataframe containing the extracted model metrics, sorted by model and step.

Return type

pd.DataFrame

functions.utils.resize_and_remove_background(image_path, output_path, size=(800, 800))[source]#

Resizes an image to the specified size, removes its background, and saves it in the specified output path.

Parameters
  • image_path (str) – The path to the input image.

  • output_path (str) – The path to save the output image.

  • size (tuple of int, optional) – The desired output size. Default is (800, 800).

functions.utils.reverse_values(predictions, X_scaled, y_scaled, scaler)[source]#

Reverses the effect of scaling on the predictions and the scaled features and target.

Parameters
  • predictions (pd.DataFrame) – The predicted values.

  • X_scaled (pd.DataFrame) – The scaled features.

  • y_scaled (pd.Series) – The scaled target.

  • scaler (sklearn.preprocessing scaler, e.g., StandardScaler) – The scaler used to scale the data.

Returns

  • reverse_predictions_df (pd.DataFrame) – The unscaled predictions.

  • reverse_x_df (pd.DataFrame) – The unscaled features.

  • reverse_y_df (pd.Series) – The unscaled target.
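
A usage sketch, assuming predictions, X_scaled, y_scaled, and scaler come from the earlier pipeline; the unpacking order follows the documented returns:

    >>> from functions.utils import reverse_values
    >>> preds_df, X_df, y_df = reverse_values(predictions, X_scaled, y_scaled, scaler)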

functions.utils.root_mean_squared_log_error(y_true, y_pred)[source]#

Calculates the Root Mean Squared Logarithmic Error (RMSLE) between the true and predicted values.

Parameters
  • y_true (np.ndarray) – The true values.

  • y_pred (np.ndarray) – The predicted values.

Returns

The RMSLE.

Return type

float
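
The standard RMSLE formula, which this function presumably implements; a sketch under that assumption, not the verified source:

    import numpy as np

    def rmsle(y_true, y_pred):
        # Assumes non-negative inputs; log1p guards against log(0)
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))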

functions.utils.save_dataframe(df_list, image_dir='model_images/all_models/')[source]#

Saves the dataframes as images in the specified directory.

Parameters
  • df_list (list of pd.DataFrame) – A list of dataframes to be saved as images.

  • image_dir (str, optional) – The directory to save the images. Default is ‘model_images/all_models/’.

functions.utils.save_keras_models(best_models)[source]#

Saves the diagrams of best Keras models in PNG format.

Parameters

best_models (list of dict) – A list of dictionaries, each containing a Keras model and other related information.

functions.utils.save_metrics(df, image_dir='model_images/metrics/')[source]#

Saves each model’s metrics as a separate PNG image in the specified directory.

Parameters
  • df (pd.DataFrame) – The input dataframe containing model information.

  • image_dir (str, optional) – The directory to save the images. Default is ‘model_images/metrics/’.

functions.web_scrapper module#

class functions.web_scrapper.StoppableThread(*args, **kwargs)[source]#

Bases: threading.Thread

A thread that can be stopped by setting its internal stop flag via the stop() method.

stop()[source]#

Set the stop flag to stop the thread.

stopped()[source]#

Check if the thread is stopped.

Returns

True if the thread is stopped, False otherwise.

Return type

bool

class functions.web_scrapper.WebScraper(url='https://www.coindesk.com/search?s=bitcoin&sort=1', starting_page_number=1, ending_page_number=3096, output_file_name='headlines')[source]#

Bases: object

A web scraper for collecting headlines from CoinDesk.

The WebScraper class provides methods for starting, stopping, pausing, and resuming the scraping process. It uses the Selenium library to navigate through web pages, find required elements, and write scraped data to a CSV file.
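
A control-flow sketch; whether construction launches the scraping thread is not documented here, so this only illustrates the control methods with a small assumed page range:

    >>> from functions.web_scrapper import WebScraper
    >>> scraper = WebScraper(starting_page_number=1, ending_page_number=10,
    ...                      output_file_name='headlines')
    >>> scraper.pause_scraping()    # temporarily halt between pages
    >>> scraper.resume_scraping()
    >>> scraper.stop_scraping()     # stop the scraping thread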

pause_scraping()[source]#

Pause the scraping process by setting the pause flag to True.

resume_scraping()[source]#

Resume the scraping process by setting the pause flag to False.

start_scraping(url, starting_page_number, ending_page_number, output_file_name)[source]#

Start the scraping process.

Parameters
  • url (str) – URL to scrape from.

  • starting_page_number (int) – Page number to start scraping from.

  • ending_page_number (int) – Page number to end scraping at.

  • output_file_name (str) – Name of the file to output scraped headlines.

stop_scraping()[source]#

Stop the scraping process by stopping the thread in which it runs.

webScrapper(url, starting_page_number, ending_page_number, output_file_name)[source]#

The main method for scraping the website. It navigates through the pages of the given URL, finds the required elements, and writes the data to a CSV file. It also handles cases such as a missing output file and the cookie-consent prompt, and catches exceptions such as TimeoutException. The scraping process can be paused, resumed, and stopped.

Parameters
  • url (str) – The base url from where the scraping starts.

  • starting_page_number (int) – The starting page number for scraping.

  • ending_page_number (int) – The ending page number for scraping.

  • output_file_name (str) – The name of the output file where the scraped data is stored.

Module contents#