kmeans_functions.py

class lib.kmeans_functions.OptimumPoint(init_x, init_y)

This class is used in the Elbow method to identify the maximum distance between the end point and the start point of the inertia curve, where the inertia is plotted as a function of the number of clusters.
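The description above does not spell out the geometry. One common reading of this Elbow criterion is that the optimum is the point of the curve that lies farthest from the straight line joining its first and last points. The sketch below illustrates that reading only; the function name `elbow_by_max_distance` and its arguments are hypothetical and are not part of `OptimumPoint`.

```python
import math


def elbow_by_max_distance(ks, inertias):
    """Return the k whose (k, inertia) point lies farthest from the chord
    joining the first and last points of the curve (assumed criterion)."""
    if len(ks) < 2 or len(ks) != len(inertias):
        raise ValueError("need at least two points and equal-length inputs")
    x0, y0 = ks[0], inertias[0]
    x1, y1 = ks[-1], inertias[-1]
    chord = math.hypot(x1 - x0, y1 - y0)
    best_k, best_dist = ks[0], 0.0
    for x, y in zip(ks, inertias):
        # Perpendicular distance from (x, y) to the line through the end points
        dist = abs((x1 - x0) * (y0 - y) - (x0 - x) * (y1 - y0)) / chord
        if dist > best_dist:
            best_k, best_dist = x, dist
    return best_k
```

Because the two axes are not normalized here, the result depends on the scale of the inertia values; rescaling both axes to [0, 1] first is a frequent variation.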

lib.kmeans_functions.calculate_stats_for_non_empty_rasters(paths, param)

This function calculates statistics for all non-empty subrasters. These statistics include the number of rows and columns, the size (number of valid points), the standard deviation, the relative size (relative to the maximum), the relative standard deviation, the product of the latter two, and the values of four mapping functions that take the relative size and relative standard deviation as inputs.

The product of the relative size and relative standard deviation is used to identify the reference part. The four mapping functions are used to identify the four parts that lie in the corners of the cloud of points, where each point represents a part and is plotted on a graph with the relative size on one axis and the relative standard deviation on the other.

Parameters
  • paths (dict) – Dictionary containing the paths to the folder of inputs, to the CSV input_stats, to the folder of sub_rasters, and to the output CSV non_empty_rasters.

  • param (dict) – Dictionary of parameters containing the raster_names and the minimum valid value in the rasters, minimum_valid.

Returns

The results are directly saved in the desired CSV file non_empty_rasters, and the CSV file input_stats is also updated.

Return type

None
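As a rough illustration of the relative statistics described for calculate_stats_for_non_empty_rasters, the sketch below derives the relative size, the relative standard deviation, and their product from per-part values. It assumes that the standard deviation is also normalized by its maximum, which the description states explicitly only for the size. The function name `relative_stats` and the dictionary keys are hypothetical, and the four mapping functions are not reproduced because their definitions are not given here.

```python
def relative_stats(parts):
    """parts maps a part identifier to a (size, std) tuple.

    Returns, for every part, the size and standard deviation relative to
    the maximum over all parts, and the product of the two.
    """
    max_size = max(size for size, _ in parts.values())
    max_std = max(std for _, std in parts.values())
    stats = {}
    for part, (size, std) in parts.items():
        rel_size = size / max_size
        rel_std = std / max_std  # assumption: normalized by the maximum
        stats[part] = {
            "rel_size": rel_size,
            "rel_std": rel_std,
            "prod_size_std": rel_size * rel_std,
        }
    return stats
```

The actual function reads the subrasters from disk and writes its results to the CSV file non_empty_rasters; the sketch only covers the arithmetic.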

lib.kmeans_functions.choose_ref_part(paths)

This function chooses the reference part for the function identify_max_number_of_clusters_in_ref_part. The reference part is chosen based on the product of the relative size and the relative standard deviation: the part with the largest product across all input files is chosen.

Parameters

paths (dict) – The paths to the CSV files non_empty_rasters and input_stats.

Returns

The CSV file input_stats is updated.

Return type

None
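The selection rule of choose_ref_part amounts to taking the maximum of the products over all parts. A minimal sketch, assuming the products are already available in a dictionary (the function name `choose_reference` and the key format are hypothetical):

```python
def choose_reference(products):
    """products maps a part identifier to the product of its relative size
    and relative standard deviation; returns the identifier with the
    largest product."""
    return max(products, key=products.get)
```

If several parts share the largest product, `max` returns the first one encountered; how the actual function breaks ties is not stated here.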

lib.kmeans_functions.identify_max_number_of_clusters_in_ref_part(paths, param)

This function identifies the maximum number of clusters for the reference part using the Elbow method. The number of clusters is varied between min and max in increments of step, and for each value the inertia (distances to the cluster centers) is calculated. Once the slope of the inertia curve falls below a certain threshold, the iteration is interrupted and the maximum number of clusters for the reference part is determined.

Parameters
  • paths (dict) – Dictionary containing the paths to the folder of inputs, to the CSV input_stats, to the folder of sub_rasters, and to the output CSV kmeans_stats.

  • param (dict) – Dictionary of parameters containing the raster_names and their weights, the minimum valid value in the rasters, minimum_valid, kmeans-related parameters for the iteration of the Elbow method, and the number of processes n_job.

Returns

The results are directly saved in the desired CSV file kmeans_stats, and the CSV file input_stats is also updated.

Return type

None
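The iteration described for identify_max_number_of_clusters_in_ref_part can be sketched as a loop that stops as soon as the inertia curve flattens. The sketch makes several assumptions: the slope is taken as the absolute decrease in inertia per additional cluster, the value of k at which the slope first falls below the threshold is returned, and the inertia is supplied by a caller-provided function instead of an actual k-means run. All names (`find_max_clusters`, `inertia_for_k`, `slope_threshold`) are hypothetical.

```python
def find_max_clusters(inertia_for_k, k_min, k_max, k_step, slope_threshold):
    """Increase k from k_min to k_max in steps of k_step and return the
    first k at which the inertia decreases by less than slope_threshold
    per added cluster. Returns the last k tried if that never happens."""
    prev_k, prev_inertia = None, None
    for k in range(k_min, k_max + 1, k_step):
        inertia = inertia_for_k(k)
        if prev_k is not None:
            # Decrease in inertia per additional cluster since the last step
            slope = (prev_inertia - inertia) / (k - prev_k)
            if slope < slope_threshold:
                return k
        prev_k, prev_inertia = k, inertia
    return prev_k
```

In practice `inertia_for_k` would fit k-means on the reference part and return the resulting inertia. Passing it in as a callable keeps the stopping rule separate from the clustering itself.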

lib.kmeans_functions.identify_opt_number_of_clusters(paths, param, part, size_of_raster, std_of_raster)

This function identifies the optimal number of clusters which will be chosen for k-means in each part.

If a reference part is used, the optimal number is a function of the number of clusters in the reference part and of the relative size and relative standard deviation, which are weighted according to ratio_size_to_std.

If the maximum number for the whole map is used, the optimal number in each part is a function of the total number, of the relative size and relative standard deviation, and of the weights in ratio_size_to_std.

Parameters
  • paths (dict) – Dictionary of paths pointing to the location of the input CSV file non_empty_rasters and to input_stats.

  • param (dict) – Dictionary of parameters including the ratio between the relative size and the relative standard deviation ratio_size_to_std and the method for setting the number of clusters.

  • part (integer) – Counter for the raster parts.

  • size_of_raster (integer) – Number of valid data points in the raster part.

  • std_of_raster (float) – Standard deviation of the data in the raster part.

Returns

optimum_no_of_clusters_for_raster – Optimum number of clusters for the raster part according to the chosen method.

Return type

integer
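The exact formula used by identify_opt_number_of_clusters is not given here. As one possible form for the reference-part case, the sketch below scales the number of clusters of the reference part by a weighted average of the relative size and the relative standard deviation. It assumes that ratio_size_to_std is a single number (the weight of the size relative to a weight of 1 for the standard deviation), that the result is rounded up, and that every part gets at least one cluster. The function name `optimum_clusters` and the argument `n_ref` are hypothetical.

```python
import math


def optimum_clusters(n_ref, rel_size, rel_std, ratio_size_to_std):
    """Scale the reference number of clusters n_ref by a weighted average
    of rel_size and rel_std (assumed formula, not the library's own)."""
    weighted = (ratio_size_to_std * rel_size + rel_std) / (ratio_size_to_std + 1)
    # Round up and never return fewer than one cluster
    return max(1, math.ceil(n_ref * weighted))
```

With this form, the reference part itself (relative size and relative standard deviation both equal to 1) keeps its full number of clusters, and smaller or more homogeneous parts receive proportionally fewer.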

lib.kmeans_functions.k_means_clustering(paths, param)

This function does the k-means clustering for every part.

Parameters
  • paths (dict) – Dictionary containing the paths to the folder of inputs, to the CSV input_stats and non_empty_rasters, to the folder of sub_rasters, and to the output folder kmeans.

  • param (dict) – Dictionary of parameters containing the raster_names and their weights and aggregation methods agg, the minimum valid value in the rasters, minimum_valid, the method for finding the number of kmeans clusters, and the number of processes n_job.

Returns

The results are directly saved in the desired CSV file kmeans_stats, and the CSV file input_stats is also updated.

Return type

None

lib.kmeans_functions.polygonize_after_k_means(paths, param)

This function converts the rasters created after k-means clustering into shapefiles of (multi)polygons which are used in the max-p algorithm.

Parameters
  • paths (dict) – Dictionary containing the paths to the folder of kmeans for retrieving inputs, to the CSV non_empty_rasters, and to the folder polygons for saving outputs.

  • param (dict) – Dictionary of parameters containing the minimum valid value in the rasters, minimum_valid, and the CRS of the shapefiles.

Returns

The results are directly saved in the desired paths for each part (folder polygons) and for the whole map (file polygonized_clusters).

Return type

None