Functions#

Submodules#

functions.process_data module#

functions.process_data.combine_data(output_file_name='combined_data', input_file_name_v1='bitcoin_price.csv', input_file_name_v2='headlines_sentiment.csv')[source]#

Combines two datasets and writes the combined dataset to a CSV file.

Parameters
  • output_file_name (str, optional) – Name of the output file, by default “combined_data”.

  • input_file_name_v1 (str, optional) – Name of the first input file, by default “bitcoin_price.csv”.

  • input_file_name_v2 (str, optional) – Name of the second input file, by default “headlines_sentiment.csv”.

Return type

None
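
A minimal usage sketch with the documented defaults; it assumes both input CSV files already exist in the working directory:

    >>> from functions.process_data import combine_data
    >>> # Merge the price and sentiment CSVs into the combined output file
    >>> combine_data(output_file_name='combined_data',
    ...              input_file_name_v1='bitcoin_price.csv',
    ...              input_file_name_v2='headlines_sentiment.csv')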

functions.process_data.double_quotation_remover(input_file_name_v1, delimiter=',')[source]#

Removes double quotation marks from specified columns in a CSV file.

Parameters
  • input_file_name_v1 (str) – Path to the input CSV file.

  • delimiter (str, optional) – Delimiter used in the CSV file, by default “,”.

Return type

None

functions.process_data.modify_date_format(input_file_name_v1, date_column_name, date_format='%b %d, %Y', delimiter=',')[source]#

Modifies the date format of a specific column in a CSV file.

Parameters
  • input_file_name_v1 (str) – Path to the input CSV file.

  • date_column_name (str) – Name of the column containing dates to be modified.

  • date_format (str, optional) – Format of the dates in the input file, by default “%b %d, %Y”.

  • delimiter (str, optional) – Delimiter used in the CSV file, by default “,”.

Return type

None

Examples

  • Format 1: “%Y-%m-%d” (e.g., “2023-05-14”)

  • Format 2: “%m-%d-%Y” (e.g., “05-14-2023”)

  • Format 3: “%d-%m-%Y” (e.g., “14-05-2023”)

  • Format 4: “%Y/%m/%d” (e.g., “2023/05/14”)

  • Format 5: “%m/%d/%Y” (e.g., “05/14/2023”)

  • Format 6: “%d/%m/%Y” (e.g., “14/05/2023”)

  • Format 7: “%Y.%m.%d” (e.g., “2023.05.14”)

  • Format 8: “%m.%d.%Y” (e.g., “05.14.2023”)

  • Format 9: “%d.%m.%Y” (e.g., “14.05.2023”)

  • Format 10: “%Y %m %d” (e.g., “2023 05 14”)

  • Format 11: “%m %d %Y” (e.g., “05 14 2023”)

  • Format 12: “%d %m %Y” (e.g., “14 05 2023”)

  • Format 13: “%b %d, %Y” (e.g., “May 14, 2023”)
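
For example, to reparse a Date column stored in the default pattern "%b %d, %Y" (Format 13), a minimal sketch assuming bitcoin_price.csv is the input file:

    >>> from functions.process_data import modify_date_format
    >>> modify_date_format('bitcoin_price.csv', date_column_name='Date',
    ...                    date_format='%b %d, %Y')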

functions.process_data.process_data(file_path, lags=[1, 2, 3], rolling_windows=[3, 7, 14], target_col='Price', datetime_col='Date', volume_col='Vol.', scaler='MinMaxScaler')[source]#

Preprocesses a dataset: loads the data, handles missing values, creates lagged and rolling-window features, and scales the features and target.

Parameters
  • file_path (str) – Path to the CSV file.

  • lags (list, optional) – List of lags to create lagged features, by default [1, 2, 3].

  • rolling_windows (list, optional) – List of windows to create rolling window features, by default [3, 7, 14].

  • target_col (str, optional) – Name of the target column, by default “Price”.

  • datetime_col (str, optional) – Name of the datetime column, by default “Date”.

  • volume_col (str, optional) – Name of the volume column, by default “Vol.”.

  • scaler (str, optional) – Scaler to use for scaling the features. Options are ‘StandardScaler’, ‘RobustScaler’, and ‘MinMaxScaler’. Defaults to ‘MinMaxScaler’.

Returns

Tuple containing scaled features, scaled target, and the scaler used.

Return type

tuple
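
A usage sketch; combined_data.csv is an assumed input path, and the unpacking order follows the documented return tuple:

    >>> from functions.process_data import process_data
    >>> X_scaled, y_scaled, scaler = process_data(
    ...     'combined_data.csv', lags=[1, 2, 3],
    ...     rolling_windows=[3, 7, 14], scaler='MinMaxScaler')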

functions.process_data.split_data(X_scaled, y_scaled, test_ratio)[source]#

Splits the scaled features and target into training/validation and test sets.

Parameters
  • X_scaled (DataFrame) – Scaled features.

  • y_scaled (DataFrame) – Scaled target.

  • test_ratio (float) – Fraction of the data to reserve for the test set.

Returns

Tuple containing training/validation features, training/validation target, test features, and test target.

Return type

tuple
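
A usage sketch continuing from process_data; the unpacking order follows the documented return tuple:

    >>> from functions.process_data import split_data
    >>> X_train_val, y_train_val, X_test, y_test = split_data(
    ...     X_scaled, y_scaled, test_ratio=0.2)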

functions.rolling_window_trainer module#

class functions.rolling_window_trainer.RollingWindowTrainer(scaler, X_train_val, y_train_val, train_window=100, val_window=20, step_size=5, checkpoint_path='models/save/checkpoints/', early_stopping_values={'cnn': {'min_delta': 0.0003, 'monitor': 'val_loss', 'patience': 7}, 'lstm': {'min_delta': 0.0001, 'monitor': 'val_loss', 'patience': 10}, 'nn': {'min_delta': 0.0002, 'monitor': 'val_loss', 'patience': 5}}, model_list=None)[source]#

Bases: object

This class provides methods for training and evaluating machine learning models using a rolling window approach. The rolling window approach is a time series forecasting technique where the model is retrained at each time step, using only the most recent data as training data.

Parameters
  • scaler (class) – The scaler class to scale the predictions.

  • X_train_val (DataFrame) – The feature dataset for training and validation.

  • y_train_val (DataFrame) – The target dataset for training and validation.

  • train_window (int, optional) – The size of the training window, by default 100.

  • val_window (int, optional) – The size of the validation window, by default 20.

  • step_size (int, optional) – The step size to move the window for each iteration, by default 5.

  • early_stopping_values (dict, optional) – A dictionary with keys as the model names and values as a dictionary containing early stopping parameters.

  • checkpoint_path (str, optional) – Path to save the model checkpoints, by default “models/save/checkpoints/”.

  • model_list (list, optional) – A list of models to be trained.
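
A minimal construction-and-training sketch, assuming scaler, X_train_val, and y_train_val come from process_data and split_data above; the accepted entries for model_list are not specified here, so the default is kept:

    >>> from functions.rolling_window_trainer import RollingWindowTrainer
    >>> trainer = RollingWindowTrainer(scaler, X_train_val, y_train_val,
    ...                                train_window=100, val_window=20,
    ...                                step_size=5)
    >>> trainer.start_training()
    >>> preds = trainer.predict_best_models(X_test)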

check_overfitting(step, history)[source]#

This method checks for overfitting in the model’s training history.

Parameters
  • step (int) – The current step in the rolling window approach.

  • history (History) – The training history of the model.

generate_mean_df(predictions_all)[source]#

Generates a mean DataFrame from the predictions of all models.

Parameters

predictions_all (pd.DataFrame) – DataFrame containing predictions of all models.

Returns

Mean DataFrame with averaged predictions per model.

Return type

pd.DataFrame
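
Continuing the sketch above, the per-model predictions can then be averaged (trainer and X_test are carried over from the earlier example):

    >>> predictions_all = trainer.predict_all_models(X_test)
    >>> mean_df = trainer.generate_mean_df(predictions_all)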

get_all_models()[source]#

Returns the list of all models.

Returns

List of all models.

Return type

list

get_all_val_metrics()[source]#

Returns all validation metrics.

Returns

List of all validation metrics.

Return type

list

get_best_model_at_window(window)[source]#

Returns the best model information for the specified window step.

Parameters

window (int) – Window step for which to retrieve the best model.

Returns

model_info – Model information dictionary for the best model at the given window step. Returns None if no best model is found for the specified window.

Return type

dict or None

get_best_models()[source]#

Returns the list of best models.

Returns

List of best models.

Return type

list

get_checkpoint_callback(checkpoint_dir)[source]#

Returns the checkpoint callback for saving the best model.

Parameters

checkpoint_dir (str) – Directory path for saving the best model checkpoint.

Returns

Checkpoint callback for saving the best model.

Return type

ModelCheckpoint

get_checkpoint_dir()[source]#

Returns the checkpoint directory for saving the best model.

Returns

Checkpoint directory for saving the best model.

Return type

str

get_histories()[source]#

Returns the histories of neural network, CNN, and LSTM models.

Returns

Tuple containing the histories of neural network, CNN, and LSTM models.

Return type

tuple

get_train_window(start, end)[source]#

Retrieves the training window data.

Parameters
  • start (int) – Start index of the training window.

  • end (int) – End index of the training window.

Returns

  • X_train_window (pd.DataFrame, pd.Series) – Input data for the training window.

  • y_train_window (pd.DataFrame, pd.Series) – Target data for the training window.

get_val_window(start, end)[source]#

Retrieves the validation window data.

Parameters
  • start (int) – Start index of the validation window.

  • end (int) – End index of the validation window.

Returns

  • X_val_window (pd.DataFrame, pd.Series) – Input data for the validation window.

  • y_val_window (pd.DataFrame, pd.Series) – Target data for the validation window.

get_window_indices(step)[source]#

Returns the start and end indices for the given window step.

Parameters

step (int) – The window step.

Returns

Tuple containing the start and end indices for the window.

Return type

tuple
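
The exact index arithmetic is internal to the class, but a typical rolling-window layout consistent with train_window, val_window, and step_size would look like the following hypothetical sketch (not the verified implementation):

    # Hypothetical window arithmetic for a given step (illustration only)
    train_start = step * step_size
    train_end = train_start + train_window   # training slice: [train_start, train_end)
    val_end = train_end + val_window         # validation slice: [train_end, val_end)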

load_and_set_histories(file_name_nn='nn_history', file_name_cnn='cnn_history', file_name_lstm='lstm_history')[source]#

This method loads and sets the training history of the models from the specified files.

Parameters
  • file_name_nn (str, optional) – The file name for the history of the Neural Network model.

  • file_name_cnn (str, optional) – The file name for the history of the Convolutional Neural Network model.

  • file_name_lstm (str, optional) – The file name for the history of the LSTM model.

predict_all_models(X)[source]#

Generates predictions from all models for the given input data.

Parameters

X (pd.DataFrame or np.ndarray) – Input data for prediction.

Returns

Predicted values from all models. If X is a pd.DataFrame, returns a DataFrame of predictions; if X is a np.ndarray, returns a 2D array of predictions.

Return type

pd.DataFrame or np.ndarray

predict_best_models(X)[source]#

Generates predictions from the best models for the given input data.

Parameters

X (pd.DataFrame or np.ndarray) – Input data for prediction.

Returns

Predicted values from the best models. If X is a pd.DataFrame, returns a DataFrame of predictions; if X is a np.ndarray, returns a Series of predictions. Returns None if X is neither.

Return type

pd.DataFrame or pd.Series or None

print_model_info()[source]#

Prints information about the models.

save_histories(file_name_nn='nn_history', file_name_cnn='cnn_history', file_name_lstm='lstm_history')[source]#

Saves the histories of neural network, CNN, and LSTM models as pickle files.

Parameters
  • file_name_nn (str, optional) – Name of the file to save the neural network history, by default “nn_history”.

  • file_name_cnn (str, optional) – Name of the file to save the CNN history, by default “cnn_history”.

  • file_name_lstm (str, optional) – Name of the file to save the LSTM history, by default “lstm_history”.

set_all_models(loaded_models)[source]#

Sets the list of all models.

Parameters

loaded_models (list) – The list of models to set as the complete model list.

set_best_models(loaded_models)[source]#

Sets the list of best models.

Parameters

loaded_models (list) – List of loaded best models.

start_training()[source]#

Runs the rolling window training loop, sliding the training and validation windows across the data and training the models at each step.

train_nn_model_with_window(X_train, y_train, X_val, y_val, checkpoint_callback)[source]#

This method trains a Neural Network model with a rolling window approach.

Parameters
  • X_train (DataFrame) – The input training data.

  • y_train (DataFrame) – The output training data.

  • X_val (DataFrame) – The input validation data.

  • y_val (DataFrame) – The output validation data.

  • checkpoint_callback (Callback) – A callback for saving the model checkpoints.

tune_non_nn_model(X_train, X_val, y_train, y_val)[source]#

This method tunes the hyperparameters of a non-Neural Network model.

Parameters
  • X_train (DataFrame) – The input training data.

  • y_train (DataFrame) – The output training data.

  • X_val (DataFrame) – The input validation data.

  • y_val (DataFrame) – The output validation data.

update_time_consumption(start_time)[source]#

This method updates the time consumption of the model.

Parameters

start_time (float) – The time when the model’s training started.

vote_predict_best_models(X)[source]#

Generates predictions from the best models using a voting scheme on the given input data. Note: this method does not currently work as expected.

Parameters

X (pd.DataFrame or np.ndarray) – Input data for prediction.

Returns

Predicted values from the best models using voting. If X is a pd.DataFrame, returns a DataFrame of predictions; if X is a np.ndarray, returns a Series of predictions. Returns None if X is neither.

Return type

pd.DataFrame or pd.Series or None

weighted_predict_best_models(X)[source]#

Generates predictions from the best models using weighted averaging on the given input data. Note: this method does not currently work as expected.

Parameters

X (pd.DataFrame or np.ndarray) – Input data for prediction.

Returns

Weighted predictions from the best models. If X is a pd.DataFrame, returns a DataFrame of weighted predictions; if X is a np.ndarray, returns a Series of weighted predictions. Returns None if X is neither.

Return type

pd.DataFrame or pd.Series or None

functions.save_load_model module#

functions.save_load_model.get_val_metrics_from_log(log_file)[source]#

Extract validation metrics from a given log file.

Parameters

log_file (str) – Path to the log file.

Returns

A dictionary where the keys are tuples of (model_name, step) and the values are validation metrics.

Return type

dict

functions.save_load_model.load_trained_models(root_dir, log_file)[source]#

Loads trained models from a specified directory. The models are filtered based on the validation metrics provided in a log file.

Parameters
  • root_dir (str) – The root directory from where the models will be loaded.

  • log_file (str) – Path to the log file containing validation metrics.

Returns

A list of dictionaries, where each dictionary contains details about a trained model and its validation metric.

Return type

list

functions.save_load_model.save_trained_models(trained_models, root_dir)[source]#

Saves trained models to a specified directory.

Parameters
  • trained_models (list of dicts) – A list of dictionaries, where each dictionary contains details about a trained model.

  • root_dir (str) – The root directory where the models will be saved.
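
A round-trip sketch; the directory and log file names are assumptions:

    >>> from functions.save_load_model import save_trained_models, load_trained_models
    >>> save_trained_models(trained_models, 'models/save/')
    >>> loaded = load_trained_models('models/save/', 'training.log')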

functions.sentiment_analyzer module#

class functions.sentiment_analyzer.SentimentAnalysis(path, delimiter, seed=42)[source]#

Bases: object

A class for performing sentiment analysis on text data using VADER sentiment analysis and RoBERTa tokenization.

The SentimentAnalysis class provides methods for data preprocessing, sentiment score computation, tokenization using RoBERTa, and aggregation of sentiment scores. It also includes functionality to save the processed data to a CSV file.

Parameters
  • path (str) – Path to the input dataset containing text data.

  • delimiter (str) – Delimiter used in the input file.

  • seed (int, optional) – Random seed used for reproducibility, by default 42.

Attributes
  • df (DataFrame) – The loaded dataset containing text data.

  • stopwords (set) – Set of stopwords for text cleaning.

  • sid (SentimentIntensityAnalyzer) – An instance of the SentimentIntensityAnalyzer from the NLTK library.
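
A minimal end-to-end sketch; the input path and output file name are assumptions:

    >>> from functions.sentiment_analyzer import SentimentAnalysis
    >>> sa = SentimentAnalysis('headlines.csv', delimiter=',', seed=42)
    >>> df = sa.preprocess()                   # clean text and compute VADER scores
    >>> sa.save_df('headlines_sentiment.csv')  # aggregate by date and write CSV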

aggregate_by_date()[source]#

Aggregates VADER sentiment scores by date.

Returns

The aggregated dataframe.

Return type

DataFrame

compute_vader_scores(df, label)[source]#

Computes VADER sentiment scores (negative, neutral, positive, compound) for each tweet.

Parameters
  • df (DataFrame) – The dataframe.

  • label (str) – The column in the dataframe containing the text to analyze.

Returns

The dataframe with the added VADER sentiment scores.

Return type

DataFrame

preprocess()[source]#

Preprocesses the data by cleaning text and computing VADER sentiment scores.

Returns

The processed dataframe.

Return type

DataFrame

process_inputs(max_len)[source]#

Tokenizes inputs using RoBERTa for training, validation, and testing.

Parameters

max_len (int) – Maximum length for tokenization.

save_df(filename)[source]#

Aggregates the dataframe by date and saves it to a CSV file.

Parameters

filename (str) – The name of the output CSV file.

tweet_to_words(tweet)[source]#

Cleans a tweet by removing non-alphanumeric characters and stopwords, and applies stemming.

Parameters

tweet (str) – A tweet.

Returns

A list of cleaned words from the tweet.

Return type

list

unlist(list)[source]#

Joins a list of words into a string.

Parameters

list (list) – A list of words.

Returns

A string with words separated by spaces.

Return type

str

functions.utils module#

functions.utils.calculate_metrics(preds, y_test)[source]#

Calculates Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) for the given predictions and actual values. It also calculates daily MSE and RMSE and stores them in a pandas DataFrame.

Parameters
  • preds (np.ndarray or pd.Series) – The predicted values.

  • y_test (np.ndarray or pd.Series) – The actual values.

Returns

  • mse (float) – The Mean Squared Error.

  • rmse (float) – The Root Mean Squared Error.

  • metrics_df (pd.DataFrame) – A dataframe containing daily MSE and RMSE.
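
A usage sketch, assuming preds and y_test are aligned predictions and actuals:

    >>> from functions.utils import calculate_metrics
    >>> mse, rmse, metrics_df = calculate_metrics(preds, y_test)
    >>> print(f"MSE={mse:.4f}, RMSE={rmse:.4f}")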

functions.utils.create_subsets(X, y, num_subsets)[source]#

Divides the features (X) into a specified number of subsets and concatenates each subset with the target (y).

Parameters
  • X (pd.DataFrame) – The input features.

  • y (pd.Series) – The target values.

  • num_subsets (int) – The number of subsets to divide X into.

Returns

subsets – A list of dataframes, each consisting of a subset of features and the target.

Return type

list of pd.DataFrame
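
A usage sketch pairing create_subsets with plot_heatmaps (documented below); X and y are assumed to be the features and target:

    >>> from functions.utils import create_subsets, plot_heatmaps
    >>> subsets = create_subsets(X, y, num_subsets=4)
    >>> plot_heatmaps(subsets)   # one correlation heatmap per subset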

functions.utils.find_best_models(metric_data)[source]#

Finds the best models based on their validation metric.

Parameters

metric_data (list of dict) – A list of dictionaries, each containing the information for one model.

Returns

best_models – A list of dictionaries, each containing the information for one best model.

Return type

list of dict

functions.utils.format_number(value)[source]#

Formats a number to a string with suffixes (B for billion, M for million, k for thousand).

Parameters

value (int or float) – The number to format.

Returns

The formatted number.

Return type

str
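
An illustrative sketch of the suffix logic; the thresholds and rounding shown are assumptions, and the library's exact formatting may differ:

    def format_number(value):
        # Sketch only: the actual thresholds and rounding are assumptions
        if abs(value) >= 1e9:
            return f"{value / 1e9:.1f}B"
        if abs(value) >= 1e6:
            return f"{value / 1e6:.1f}M"
        if abs(value) >= 1e3:
            return f"{value / 1e3:.1f}k"
        return str(value)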

functions.utils.group_by_model(df)[source]#

Groups the data by model and calculates the cumulative sum of time consumption for each model.

Parameters

df (pd.DataFrame) – The input dataframe containing model information.

Returns

df_list – A list of dataframes, each containing the information for one model.

Return type

list of pd.DataFrame

functions.utils.mean_absolute_percentage_error(y_true, y_pred)[source]#

Calculates the Mean Absolute Percentage Error (MAPE) between the true and predicted values.

Parameters
  • y_true (np.ndarray) – The true values.

  • y_pred (np.ndarray) – The predicted values.

Returns

The MAPE.

Return type

float
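
The standard MAPE formula, which this function presumably implements; a sketch under that assumption, not the verified source:

    import numpy as np

    def mape(y_true, y_pred):
        # Assumes y_true contains no zeros (the division would fail otherwise)
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)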

functions.utils.plot_heatmaps(subsets)[source]#

Plots and saves correlation heatmaps for each subset of data.

Parameters

subsets (list of pd.DataFrame) – A list of dataframes for which to plot the correlation heatmaps.

functions.utils.plot_histories(histories)[source]#

Plots the loss histories for each training run.

Parameters

histories (list of keras.callbacks.History) – The history objects collected during training.

functions.utils.plot_price_prediction(X_test, y_test, predictions, title)[source]#

Plots the predicted and actual values for the test data.

Parameters
  • X_test (pd.DataFrame) – The test features.

  • y_test (pd.Series) – The actual values.

  • predictions (pd.Series) – The predicted values.

  • title (str) – The title of the plot.

Returns

The figure object of the plot.

Return type

plotly.graph_objects._figure.Figure

functions.utils.read_model_metrics(log_filename)[source]#

Reads a log file and extracts model metrics into a pandas DataFrame.

Parameters

log_filename (str) – The path to the log file.

Returns

df_sorted – A dataframe containing the extracted model metrics, sorted by model and step.

Return type

pd.DataFrame

functions.utils.resize_and_remove_background(image_path, output_path, size=(800, 800))[source]#

Resizes an image to the specified size, removes its background, and saves it in the specified output path.

Parameters
  • image_path (str) – The path to the input image.

  • output_path (str) – The path to save the output image.

  • size (tuple of int, optional) – The desired output size. Default is (800, 800).

functions.utils.reverse_values(predictions, X_scaled, y_scaled, scaler)[source]#

Reverses the effect of scaling on the predictions and the scaled features and target.

Parameters
  • predictions (pd.DataFrame) – The predicted values.

  • X_scaled (pd.DataFrame) – The scaled features.

  • y_scaled (pd.Series) – The scaled target.

  • scaler (sklearn.preprocessing scaler, e.g., StandardScaler) – The scaler used to scale the data.

Returns

  • reverse_predictions_df (pd.DataFrame) – The unscaled predictions.

  • reverse_x_df (pd.DataFrame) – The unscaled features.

  • reverse_y_df (pd.Series) – The unscaled target.
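
A usage sketch, assuming predictions, X_scaled, y_scaled, and scaler come from the earlier pipeline; the unpacking order follows the documented returns:

    >>> from functions.utils import reverse_values
    >>> preds_df, X_df, y_df = reverse_values(predictions, X_scaled, y_scaled, scaler)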

functions.utils.root_mean_squared_log_error(y_true, y_pred)[source]#

Calculates the Root Mean Squared Logarithmic Error (RMSLE) between the true and predicted values.

Parameters
  • y_true (np.ndarray) – The true values.

  • y_pred (np.ndarray) – The predicted values.

Returns

The RMSLE.

Return type

float
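
The standard RMSLE formula, which this function presumably implements; a sketch under that assumption, not the verified source:

    import numpy as np

    def rmsle(y_true, y_pred):
        # Assumes non-negative inputs; log1p guards against log(0)
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))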

functions.utils.save_dataframe(df_list, image_dir='model_images/all_models/')[source]#

Saves the dataframes as images in the specified directory.

Parameters
  • df_list (list of pd.DataFrame) – A list of dataframes to be saved as images.

  • image_dir (str, optional) – The directory to save the images. Default is ‘model_images/all_models/’.

functions.utils.save_keras_models(best_models)[source]#

Saves the diagrams of best Keras models in PNG format.

Parameters

best_models (list of dict) – A list of dictionaries, each containing a Keras model and other related information.

functions.utils.save_metrics(df, image_dir='model_images/metrics/')[source]#

Saves each model’s metrics as a separate PNG image in the specified directory.

Parameters
  • df (pd.DataFrame) – The input dataframe containing model information.

  • image_dir (str, optional) – The directory to save the images. Default is ‘model_images/metrics/’.

functions.web_scrapper module#

class functions.web_scrapper.StoppableThread(*args, **kwargs)[source]#

Bases: threading.Thread

A thread that can be stopped by setting its internal stop flag via the stop() method.

stop()[source]#

Set the stop flag to stop the thread.

stopped()[source]#

Check if the thread is stopped.

Returns

True if the thread is stopped, False otherwise.

Return type

bool

class functions.web_scrapper.WebScraper(url='https://www.coindesk.com/search?s=bitcoin&sort=1', starting_page_number=1, ending_page_number=3096, output_file_name='headlines')[source]#

Bases: object

A web scraper for collecting headlines from CoinDesk.

The WebScraper class provides methods for starting, stopping, pausing, and resuming the scraping process. It uses the Selenium library to navigate through web pages, find required elements, and write scraped data to a CSV file.
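
A control-flow sketch; whether construction launches the scraping thread is not documented here, so this only illustrates the control methods with a small assumed page range:

    >>> from functions.web_scrapper import WebScraper
    >>> scraper = WebScraper(starting_page_number=1, ending_page_number=10,
    ...                      output_file_name='headlines')
    >>> scraper.pause_scraping()    # temporarily halt between pages
    >>> scraper.resume_scraping()
    >>> scraper.stop_scraping()     # stop the scraping thread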

pause_scraping()[source]#

Pause the scraping process by setting the pause flag to True.

resume_scraping()[source]#

Resume the scraping process by setting the pause flag to False.

start_scraping(url, starting_page_number, ending_page_number, output_file_name)[source]#

Start the scraping process.

Parameters
  • url (str) – URL to scrape from.

  • starting_page_number (int) – Page number to start scraping from.

  • ending_page_number (int) – Page number to end scraping at.

  • output_file_name (str) – Name of the file to output scraped headlines.

stop_scraping()[source]#

Stop the scraping process by stopping the thread in which it runs.

webScrapper(url, starting_page_number, ending_page_number, output_file_name)[source]#

The main method for scraping the website. It navigates through the pages of the given URL, finds the required elements, and writes the data to a CSV file. It also handles cases such as a missing output file and the cookie-consent prompt, and catches exceptions such as TimeoutException. The scraping process can be paused, resumed, and stopped.

Parameters
  • url (str) – The base url from where the scraping starts.

  • starting_page_number (int) – The starting page number for scraping.

  • ending_page_number (int) – The ending page number for scraping.

  • output_file_name (str) – The name of the output file where the scraped data is stored.

Module contents#