Progress K-Means Module

mod_kmeans.py

mod_kmeans.py provides functionality for k-means clustering of solar generation data. It contains the KMeans_Pipeline class, which:

  • Preprocesses the solar and clear-sky irradiance data.

  • Allows evaluation of different cluster sizes using metrics like SSE and silhouette scores.

  • Performs cluster probability calculations to facilitate random day selection in the MCS.

KMeans_Pipeline Package

This package provides the KMeans_Pipeline class to perform KMeans clustering on solar generation data. The package includes functionality for preprocessing, data transformation, clustering, and evaluation.

Class:

KMeans_Pipeline: A pipeline class for running KMeans clustering on solar generation data.

progress.mod_kmeans.directory

Path to the directory containing the solar data files.

Type:

str

progress.mod_kmeans.site_data

Path to the CSV file containing site information.

Type:

str

Example Usage:

This example demonstrates how to use the KMeans_Pipeline class to perform KMeans clustering on solar generation data. It performs the following steps:

  1. Initializes the pipeline with the solar_data.xlsx file and a list of selected sites.

  2. Calculates the cluster probabilities using the calculate_cluster_probability method.

  3. Splits and clusters the data based on the generated labels using the split_and_cluster_data method.

  4. Outputs metrics of the clustering to a text file using the test_metrics method.

Note: Ensure the data file solar_data.xlsx exists in the same directory as the script, or provide an absolute path to it.

if __name__ == "__main__":

    # Define the data directory and the path to the site information CSV
    directory = r'/Users/abera/Documents/My_Projects/QuESt_Reliability/QuESt_Reliability_App/Data/Solar'
    site_data = directory + '/solar_sites.csv'

    # Initialize the KMeans_Pipeline class
    pipeline = KMeans_Pipeline(directory, site_data)

    # Generate and save the clustering metrics to a text file
    pipeline.test_metrics(clust_eval = 4)

    # Uncomment the following lines to run additional steps
    # pipeline.run(n_clusters = 5)
    # pipeline.calculate_cluster_probability()
    # pipeline.split_and_cluster_data()
class progress.mod_kmeans.KMeans_Pipeline(directory, site_data, **kwargs)

Bases: object

A pipeline class for running KMeans clustering on solar generation data. It performs preprocessing, data transformation, and finally clusters the data.

excel_file_path

Path to the Excel file containing the Solar Generation, Site Information, and CSI data.

Type:

str

solar_gen_df

DataFrame containing Solar Generation data.

Type:

pd.DataFrame

site_info_df

DataFrame containing Site Information.

Type:

pd.DataFrame

csi_df

DataFrame containing Clear Sky Index (CSI) data.

Type:

pd.DataFrame

selected_sites

List of site names to be included in the analysis.

Type:

list

first_light

DataFrame containing first light timings.

Type:

pd.DataFrame

last_light

DataFrame containing last light timings.

Type:

pd.DataFrame

sg_mean_am

AM Solar generation mean values.

Type:

pd.DataFrame

sg_mean_pm

PM Solar generation mean values.

Type:

pd.DataFrame

csi_sd_am

Standard deviation of the AM Clear Sky Index (CSI) values.

Type:

pd.DataFrame

csi_sd_pm

Standard deviation of the PM Clear Sky Index (CSI) values.

Type:

pd.DataFrame

kmeans_df

DataFrame prepared for KMeans clustering.

Type:

pd.DataFrame

predicted_labels

Cluster labels for each data point.

Type:

array

silhouette

Silhouette score of the clustering.

Type:

float

calculate_cluster_probability()

Calculates the monthly probabilities for each cluster label.

This method generates a pivot DataFrame that shows the monthly probabilities for each cluster. It operates on the class attributes predicted_labels and kmeans_df.

The function goes through the following steps:

  1. Resets the index of kmeans_df.

  2. Extracts the month from the date column in kmeans_df.

  3. Groups the data by cluster and month, calculating the count of data points in each group.

  4. Calculates the total count for each month.

  5. Merges both counts to calculate probabilities.

  6. Saves the calculated probabilities in a pivot DataFrame.

The resulting DataFrame is saved to a CSV file named ‘percentage_probability.csv’.

Returns:

The function saves the output to a CSV file and modifies class attributes.

Return type:

None
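The counting and merging steps above can be sketched with plain pandas. This is a minimal illustration, not the module's implementation: the toy data, the column names `date` and `cluster`, the cluster-by-month orientation of the pivot, and the use of fractions rather than percentages are all assumptions.

```python
import pandas as pd

# Hypothetical stand-in for kmeans_df with predicted labels already attached.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-01-01", "2020-01-02", "2020-01-03", "2020-02-01", "2020-02-02"]
        ),
        "cluster": [0, 0, 1, 1, 1],
    }
)

# Extract the month, then count days per (cluster, month) and per month.
df["month"] = df["date"].dt.month
cluster_counts = df.groupby(["cluster", "month"]).size().reset_index(name="count")
month_totals = df.groupby("month").size().reset_index(name="total")

# Merge both counts and compute the share of each cluster within a month.
merged = cluster_counts.merge(month_totals, on="month")
merged["probability"] = merged["count"] / merged["total"]

# Pivot so that rows are clusters and columns are months (assumed layout).
probabilities = merged.pivot(
    index="cluster", columns="month", values="probability"
).fillna(0.0)
print(probabilities)
```

Each column of the pivot sums to 1, so a month's column can be used directly as the sampling weights when drawing a random cluster for that month.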

create_kmeans_df(sg_mean_am, sg_mean_pm, csi_sd_am, csi_sd_pm, first_light, last_light)

Combines various DataFrames to form a DataFrame ready for K-means clustering.

This function performs several operations:

  • Sets the index to ‘date’ for all input DataFrames.

  • Adds appropriate suffixes to each DataFrame to maintain data context.

  • Concatenates all DataFrames along the columns.

  • Sorts the resulting DataFrame by the column names.

Parameters:
  • sg_mean_am (DataFrame) – DataFrame containing mean AM solar generation data.

  • sg_mean_pm (DataFrame) – DataFrame containing mean PM solar generation data.

  • csi_sd_am (DataFrame) – DataFrame containing the standard deviation of AM Clear Sky Index (CSI) data.

  • csi_sd_pm (DataFrame) – DataFrame containing the standard deviation of PM Clear Sky Index (CSI) data.

  • first_light (DataFrame) – DataFrame containing the first light hours for each day.

  • last_light (DataFrame) – DataFrame containing the last light hours for each day.

Returns:

A DataFrame that combines all input DataFrames, ready for K-means clustering.

Return type:

DataFrame
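A rough sketch of the index, suffix, concatenate, and sort sequence described above is shown below. The suffix strings, the `make_frame` helper, and the sample values are hypothetical; only four of the six inputs are used to keep the example short.

```python
import pandas as pd

dates = pd.to_datetime(["2020-01-01", "2020-01-02"])


def make_frame(values):
    # Hypothetical helper: one site column plus a 'date' column.
    return pd.DataFrame({"date": dates, "site1": values})


# Suffixes are illustrative; the module may use different ones.
frames = {
    "_sg_mean_am": make_frame([0.2, 0.3]),
    "_sg_mean_pm": make_frame([0.4, 0.5]),
    "_csi_sd_am": make_frame([0.1, 0.2]),
    "_csi_sd_pm": make_frame([0.3, 0.1]),
}

# Index by date and tag each column with its source before concatenating.
tagged = [df.set_index("date").add_suffix(suffix) for suffix, df in frames.items()]
kmeans_df = pd.concat(tagged, axis=1)

# Sort by column name so the layout is deterministic.
kmeans_df = kmeans_df.sort_index(axis=1)
print(kmeans_df.columns.tolist())
```

Sorting the columns keeps the features of one site adjacent to each other, which makes the combined frame easier to inspect.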

find_elbow(kmeans_df, clust_eval)

Finds the optimal number of clusters for KMeans clustering using the elbow method.

This method goes through the following steps:

  1. Resets the DataFrame index.

  2. Converts ‘date’ to ‘month’ and creates cyclical features for it.

  3. Scales features and applies PCA via a preprocessing pipeline.

  4. Loops through a predefined number of clusters (1 to 11).

  5. Performs KMeans clustering with the specified number of clusters.

  6. Calculates the sum of squared errors (SSE) for each KMeans run.

  7. Uses the KneeLocator to identify the “elbow point” in the SSE curve, which indicates the optimal number of clusters.

Parameters:

kmeans_df (DataFrame) – The DataFrame to cluster.

Returns:

Contains the following elements:
  • The optimal number of clusters, as determined by the elbow method.

  • The sum of squared errors (SSE) for each number of clusters from 1 to 11.

  • Silhouette scores for each number of clusters from 2 to 11.

Return type:

tuple
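The SSE and silhouette sweep at the core of this method might look like the sketch below, assuming scikit-learn is installed. The synthetic blobs, the upper bound `max_clusters`, and the variable names are assumptions; the KneeLocator step is left out, and the `sse` values collected here are the curve such a step would consume.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the preprocessed feature matrix: 3 well-separated groups.
X, _ = make_blobs(
    n_samples=150,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

max_clusters = 6  # hypothetical upper bound for the sweep
sse = {}
silhouettes = {}
for k in range(1, max_clusters + 1):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances to the nearest centroid (SSE).
    sse[k] = model.inertia_
    # The silhouette score is undefined for a single cluster.
    if k >= 2:
        silhouettes[k] = silhouette_score(X, model.labels_)

best_by_silhouette = max(silhouettes, key=silhouettes.get)
print(best_by_silhouette, sse)
```

The silhouette score starts at two clusters because it compares each point's own cluster with its nearest neighboring cluster, which matches the 2 to 11 range in the return description.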

process_csi_data(csi_info, selected_sites)

Processes Clear Sky Index (CSI) data to get the standard deviation for AM and PM periods.

This function performs several operations:

  • Filters the data for selected sites.

  • Scales the data using standardization.

  • Calculates the standard deviation for AM and PM slots.

Parameters:
  • csi_info (DataFrame) – DataFrame containing the Clear Sky Index (CSI) data. The DataFrame should have a ‘datetime’ column and a column for each site’s CSI values.

  • selected_sites (list) – List of site names to include in the processing.

Returns:

Two DataFrames representing the standard deviation for AM and PM values, respectively.

Return type:

tuple
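As an illustration of the standardize-then-split idea, the sketch below computes per-day AM and PM standard deviations for one site. The noon boundary between AM and PM, the random sample data, and the column name `site1` are assumptions rather than details confirmed by the module.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly CSI values for one site over two days.
hours = pd.Timestamp("2020-01-01") + pd.to_timedelta(np.arange(48), unit="h")
rng = np.random.default_rng(0)
csi = pd.DataFrame({"datetime": hours, "site1": rng.uniform(0.0, 1.0, size=48)})

# Standardize the site column (zero mean, unit variance).
csi["site1"] = (csi["site1"] - csi["site1"].mean()) / csi["site1"].std(ddof=0)

# Split each day at noon (assumed boundary) and take the standard
# deviation of each half-day.
csi["date"] = csi["datetime"].dt.date
is_am = csi["datetime"].dt.hour < 12
csi_sd_am = csi[is_am].groupby("date")["site1"].std().reset_index()
csi_sd_pm = csi[~is_am].groupby("date")["site1"].std().reset_index()
print(csi_sd_am)
print(csi_sd_pm)
```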

process_flh_and_llh(solar_gen_df, selected_sites)

Processes first and last light hours from solar generation data.

This function performs the following:

  • Identifies the first and last light hours for each day based on solar generation data.

  • Encodes these hours as cyclic features.

Parameters:
  • solar_gen_df (DataFrame) – DataFrame containing the solar generation data. The DataFrame should have a ‘datetime’ column and columns for each site’s solar generation.

  • selected_sites (list) – List of site columns to include from solar_gen_df.

Returns:

Two DataFrames representing first and last light hours, respectively, with cyclic features encoded.

Return type:

tuple

Internal Functions:
  • get_first_non_zero(series): Returns the first non-zero hour in a time series.

  • get_last_non_zero(series): Returns the last non-zero hour in a time series.

  • encode_cyclic_features(df): Encodes specified features as cyclic.
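The helpers listed above are internal, so their exact behavior is not documented here. The sketch below shows one plausible way to find the first and last non-zero hours of a day and to encode an hour on the unit circle; the function names, the 24-hour period, and the sample day are all hypothetical.

```python
import numpy as np
import pandas as pd


def first_non_zero_hour(series):
    # Hour of the first non-zero value; NaN if the whole day is zero.
    nonzero = series[series != 0]
    return nonzero.index[0].hour if len(nonzero) else np.nan


def last_non_zero_hour(series):
    # Hour of the last non-zero value; NaN if the whole day is zero.
    nonzero = series[series != 0]
    return nonzero.index[-1].hour if len(nonzero) else np.nan


def encode_cyclic(hour, period=24):
    # Map an hour onto the unit circle so that 23 and 0 are neighbors.
    angle = 2 * np.pi * hour / period
    return np.sin(angle), np.cos(angle)


# Hypothetical day: generation only between 07:00 and 18:00.
index = pd.Timestamp("2020-06-01") + pd.to_timedelta(np.arange(24), unit="h")
generation = pd.Series(0.0, index=index)
generation[(index.hour >= 7) & (index.hour <= 18)] = 5.0

first_hour = first_non_zero_hour(generation)
last_hour = last_non_zero_hour(generation)
sin_first, cos_first = encode_cyclic(first_hour)
print(first_hour, last_hour, sin_first, cos_first)
```

Encoding the hour as a sine and cosine pair avoids the artificial jump between hour 23 and hour 0 that a raw integer would introduce into distance-based clustering.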

process_solar_data(solar_gen_df, site_info_df, selected_sites)

Processes solar generation data to obtain mean AM and PM values.

This function performs several operations:

  • Filters the data for selected sites.

  • Normalizes solar generation data by the wattage limit for each site.

  • Calculates the mean solar generation for AM and PM slots.

Parameters:
  • solar_gen_df (DataFrame) – DataFrame containing the solar generation data. The DataFrame should have a ‘datetime’ column and columns for each site’s solar generation.

  • site_info_df (DataFrame) – DataFrame containing information about the sites, including ‘site_name’ and ‘MW’ (megawatt capacity).

  • selected_sites (list) – List of site names to include in the processing.

Returns:

Two DataFrames representing mean AM and mean PM values, respectively.

Return type:

tuple
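A minimal sketch of the normalization and AM/PM averaging follows. It uses the documented ‘site_name’ and ‘MW’ columns, but the sample values and the noon split between AM and PM are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical day: a constant 50 MW output between 06:00 and 18:00.
hours = pd.Timestamp("2020-06-01") + pd.to_timedelta(np.arange(24), unit="h")
daylight = (hours.hour >= 6) & (hours.hour < 18)
solar_gen = pd.DataFrame({"datetime": hours, "site1": np.where(daylight, 50.0, 0.0)})
site_info = pd.DataFrame({"site_name": ["site1"], "MW": [100.0]})
selected_sites = ["site1"]

# Normalize each selected site by its capacity so values fall in [0, 1].
capacity = site_info.set_index("site_name")["MW"]
normalized = solar_gen[["datetime"] + selected_sites].copy()
normalized[selected_sites] = normalized[selected_sites] / capacity[selected_sites]

# Average each half of the day (noon is an assumed AM/PM boundary).
normalized["date"] = normalized["datetime"].dt.date
is_am = normalized["datetime"].dt.hour < 12
sg_mean_am = normalized[is_am].groupby("date")[selected_sites].mean().reset_index()
sg_mean_pm = normalized[~is_am].groupby("date")[selected_sites].mean().reset_index()
print(sg_mean_am)
print(sg_mean_pm)
```

Dividing by capacity puts sites of different sizes on the same scale before they are compared.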

run(**kwargs)
run_kmeans_pipeline(kmeans_df, **kwargs)

Executes the K-means clustering pipeline on the input DataFrame.

The function performs the following steps:

  1. Resets the DataFrame index.

  2. Converts ‘date’ to ‘month’ and creates cyclical features for it.

  3. Scales features and applies PCA via a preprocessing pipeline.

  4. Executes K-means clustering.

  5. Combines preprocessing and clustering into one pipeline.

  6. Fits the pipeline to the data.

  7. Computes the silhouette score to evaluate clustering quality.

Parameters:
  • kmeans_df (DataFrame) – The DataFrame to cluster.

  • **kwargs – Additional keyword arguments to configure K-means clustering, such as ‘n_clusters’ to specify the number of clusters (default is 11).

Returns:

Contains the following elements:
  • Clustered DataFrame with a new ‘cluster’ column.

  • Predicted cluster labels.

  • The fitted pipeline object.

  • Silhouette score for the clustering.

Return type:

tuple
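The steps above map naturally onto a nested scikit-learn Pipeline. The following is a sketch under stated assumptions, not the module's code: the step names, the number of PCA components, the cluster count of 3, and the random features are all illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical daily feature frame indexed by date.
rng = np.random.default_rng(0)
dates = pd.Timestamp("2020-01-01") + pd.to_timedelta(np.arange(120), unit="D")
features = pd.DataFrame(
    rng.normal(size=(120, 4)), columns=["f1", "f2", "f3", "f4"], index=dates
)
features.index.name = "date"

# Reset the index and replace the date with cyclical month features.
df = features.reset_index()
month = df["date"].dt.month
df["month_sin"] = np.sin(2 * np.pi * month / 12)
df["month_cos"] = np.cos(2 * np.pi * month / 12)
X = df.drop(columns="date")

# Preprocessing (scaling + PCA) and clustering combined into one pipeline.
preprocessor = Pipeline(
    [("scaler", StandardScaler()), ("pca", PCA(n_components=2, random_state=0))]
)
pipeline = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("clusterer", KMeans(n_clusters=3, n_init=10, random_state=0)),
    ]
)
pipeline.fit(X)

# Score the clustering in the transformed space the clusterer actually saw.
labels = pipeline["clusterer"].labels_
transformed = pipeline["preprocessor"].transform(X)
score = silhouette_score(transformed, labels)
df["cluster"] = labels
print(score)
```

Keeping preprocessing and clustering in one pipeline object means new data passes through the same fitted scaler and PCA before being assigned to a cluster.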

split_and_cluster_data()

Splits the solar generation data into clusters and saves each cluster’s data into separate CSV files.

This method first normalizes the solar generation data by site wattage and then transposes the data into 24-hour segments. Each segment is labeled with a cluster identifier based on previously determined labels. The data for each cluster is then saved into a dedicated directory and CSV file.

The method organizes the output by creating a directory for each cluster under a main ‘Clusters’ directory, where each directory contains CSV files for each data column, segmented by day.

Side Effects:
  • Creates a directory structure under ‘Clusters’ in the specified directory.

  • Writes multiple CSV files containing the clustered data.

Returns:

This method performs file I/O operations and does not return any values.

Return type:

None

Example Directory Structure:

    Clusters/
        cluster_1/
            site1.csv
            site2.csv
        cluster_2/
            site1.csv
            site2.csv
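The reshaping into 24-hour segments and the per-cluster output can be sketched as follows. This is an illustration only: the sample data, the labels, the 1-based folder numbering, and writing to a temporary directory are assumptions made so the example runs anywhere.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical normalized hourly data for two sites over three days.
hours = pd.Timestamp("2020-01-01") + pd.to_timedelta(np.arange(72), unit="h")
normalized = pd.DataFrame(
    {"site1": np.linspace(0, 1, 72), "site2": np.linspace(1, 0, 72)}, index=hours
)
labels = np.array([0, 1, 0])  # one hypothetical cluster label per day

output_root = os.path.join(tempfile.mkdtemp(), "Clusters")
days = normalized.index.normalize().unique()

for site in normalized.columns:
    # Reshape the hourly series into one row per day and one column per hour.
    values = normalized[site].to_numpy().reshape(-1, 24)
    daily = pd.DataFrame(values, index=days, columns=range(24))
    daily["cluster"] = labels

    # Write each cluster's days for this site into its own directory.
    for cluster_id, group in daily.groupby("cluster"):
        cluster_dir = os.path.join(output_root, f"cluster_{cluster_id + 1}")
        os.makedirs(cluster_dir, exist_ok=True)
        group.drop(columns="cluster").to_csv(os.path.join(cluster_dir, f"{site}.csv"))

print(sorted(os.listdir(output_root)))
```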

test_metrics(clust_eval)

Calculates and writes K-means clustering metrics to a text file.

This method evaluates the clustering performance using the elbow method and silhouette scores, and writes the detailed results to a file named ‘clustering_results.txt’.

The method outputs a detailed explanation of the elbow method and silhouette scores, and it identifies the optimal number of clusters based on the provided dataset. It also captures the sum of squared errors (SSE) for different numbers of clusters, along with their respective silhouette scores.

Parameters:

clust_eval (int) – The maximum number of clusters to evaluate, which determines the range of clusters to consider for finding the elbow point.

Side Effects:
  • Writes to a file ‘clustering_results.txt’ in the specified directory.

  • Temporarily redirects stdout to this file to capture print statements.

Returns:

This method does not return a value but writes output to a file.

Return type:

None

Example of File Output:

    The optimal number of clusters in a dataset is the number that…
    Optimal Number of Clusters: X
    SSE for Y Clusters: Z
    Silhouette Score for Y Clusters: W
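One common way to capture print statements in a file is `contextlib.redirect_stdout`; whether the module uses it is not stated, so treat the sketch below as an assumption. The metric values are placeholders, and the file is written to a temporary directory rather than the pipeline's data directory.

```python
import contextlib
import os
import tempfile

# Hypothetical metric values standing in for real SSE and silhouette results.
optimal_k = 3
sse = {2: 120.5, 3: 60.2, 4: 55.1}
silhouettes = {2: 0.41, 3: 0.58, 4: 0.37}

results_path = os.path.join(tempfile.mkdtemp(), "clustering_results.txt")

# Everything printed inside the block goes to the file; stdout is
# restored automatically when the block exits.
with open(results_path, "w") as handle, contextlib.redirect_stdout(handle):
    print(f"Optimal Number of Clusters: {optimal_k}")
    for k in sorted(sse):
        print(f"SSE for {k} Clusters: {sse[k]}")
        print(f"Silhouette Score for {k} Clusters: {silhouettes[k]}")

with open(results_path) as handle:
    lines = handle.read().splitlines()
print(lines[0])
```

Using a context manager guarantees that stdout is restored even if an exception is raised while the metrics are being computed.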

update_progress(process, progress)