RegressionErrorAnalysisReport

class olliepy.RegressionErrorAnalysisReport.RegressionErrorAnalysisReport(**kwargs)

RegressionErrorAnalysisReport creates a report that analyzes the error in regression problems.

title: str

the title of the report

output_directory: str

the directory where the report folder will be created

train_df: pd.DataFrame

the training pandas dataframe of the regression problem which should include the target feature

test_df: pd.DataFrame

the testing pandas dataframe of the regression problem which should include the target feature and the error column in order to calculate the error class

target_feature_name: str

the name of the regression target feature

error_column_name: str

the name of the column that holds the calculated error, defined as prediction minus target (see the example on GitHub for more information)

error_classes: Dict[str, Tuple]

a dictionary defining the error classes that will be created. Each key is an error class name and each value is a (minimum, maximum) tuple, with the minimum inclusive and the maximum exclusive, used to assign an error class to each test observation.

For example: error_classes = {

‘EXTREME_UNDER_ESTIMATION’: (-8.0, -4.0),
returns ‘EXTREME_UNDER_ESTIMATION’ if -8.0 <= error < -4.0

‘HIGH_UNDER_ESTIMATION’: (-4.0, -3.0),
returns ‘HIGH_UNDER_ESTIMATION’ if -4.0 <= error < -3.0

‘MEDIUM_UNDER_ESTIMATION’: (-3.0, -1.0),
returns ‘MEDIUM_UNDER_ESTIMATION’ if -3.0 <= error < -1.0

‘LOW_UNDER_ESTIMATION’: (-1.0, -0.5),
returns ‘LOW_UNDER_ESTIMATION’ if -1.0 <= error < -0.5

‘ACCEPTABLE’: (-0.5, 0.5),
returns ‘ACCEPTABLE’ if -0.5 <= error < 0.5

‘OVER_ESTIMATING’: (0.5, 3.0) }
returns ‘OVER_ESTIMATING’ if 0.5 <= error < 3.0
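The range rule above (minimum inclusive, maximum exclusive) can be sketched in plain Python. This is only an illustration of the documented rule, not OlliePy's internal code: the helper name classify_error is hypothetical, and returning None for an error outside every range is an assumption, since the documentation does not say how such values are handled.

```python
from typing import Dict, Optional, Tuple

# Error class definitions copied from the documentation example.
error_classes: Dict[str, Tuple[float, float]] = {
    'EXTREME_UNDER_ESTIMATION': (-8.0, -4.0),
    'HIGH_UNDER_ESTIMATION': (-4.0, -3.0),
    'MEDIUM_UNDER_ESTIMATION': (-3.0, -1.0),
    'LOW_UNDER_ESTIMATION': (-1.0, -0.5),
    'ACCEPTABLE': (-0.5, 0.5),
    'OVER_ESTIMATING': (0.5, 3.0),
}


def classify_error(error: float,
                   classes: Dict[str, Tuple[float, float]]) -> Optional[str]:
    """Hypothetical helper: return the class whose range contains the error.

    The minimum is inclusive and the maximum is exclusive.
    Assumption: errors outside every range map to None.
    """
    for name, (minimum, maximum) in classes.items():
        if minimum <= error < maximum:
            return name
    return None


# The error column is documented as prediction minus target.
prediction, target = 10.0, 12.5
error = prediction - target  # -2.5, an under-estimation
error_class = classify_error(error, error_classes)
```

Because the maximum is exclusive, a boundary value such as 0.5 falls into the next class ('OVER_ESTIMATING') rather than 'ACCEPTABLE'.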

acceptable_error_class: str

the name of the acceptable error class that was defined in error_classes

numerical_features: List[str], default=None

a list of the numerical features to be included in the report

categorical_features: List[str], default=None

a list of the categorical features to be included in the report

subtitle: str, default=None

an optional subtitle to describe your report

report_folder_name: str, default=None

the name of the folder that will contain all the generated report files. If not set, the title of the report will be used.

encryption_secret: str, default=None

the 16-character secret that will be used to encrypt the generated report data. If it is not set, the generated data will not be encrypted.

generate_encryption_secret: bool, default=False

if True, the encryption_secret is generated and its value is returned as output. You can also view encryption_secret to get the generated secret.

create_report(): creates the error analysis report

create_report(enable_patterns_report: bool = True, patterns_report_group_by_categorical_features: Union[str, List[str]] = 'all', patterns_report_group_by_numerical_features: Union[str, List[str]] = 'all', patterns_report_number_of_bins: Union[int, List[int]] = 10, enable_parallel_coordinates_plot: bool = True, cosine_similarity_threshold: float = 0.8, parallel_coordinates_q1_threshold: float = 0.25, parallel_coordinates_q2_threshold: float = 0.75, parallel_coordinates_features: Union[str, List[str]] = 'auto') → None

Creates a report using the user defined data and the data calculated based on the error.

Parameters

enable_patterns_report – enables the patterns report. default: True
patterns_report_group_by_categorical_features – categorical features to use in the patterns report. default: ‘all’
patterns_report_group_by_numerical_features – numerical features to use in the patterns report. default: ‘all’
patterns_report_number_of_bins – number of bins to use for each provided numerical feature or one number of bins to use for all provided numerical features. default: 10
enable_parallel_coordinates_plot – enables the parallel coordinates plot. default: True
cosine_similarity_threshold – the cosine similarity threshold used to decide whether the categorical distributions of the primary and secondary datasets are similar. default: 0.8
parallel_coordinates_q1_threshold – the first quantile threshold to be used if parallel_coordinates_features == ‘auto’. default: 0.25
parallel_coordinates_q2_threshold – the second quantile threshold to be used if parallel_coordinates_features == ‘auto’. default: 0.75
parallel_coordinates_features – The list of features to display on the parallel coordinates plot. default: ‘auto’

If parallel_coordinates_features is set to ‘auto’, OlliePy selects the features that show a distribution shift, based on three thresholds:
- cosine_similarity_threshold: a categorical feature is selected if its cosine similarity is lower than this threshold.
- parallel_coordinates_q1_threshold and parallel_coordinates_q2_threshold: two quantile values applied to each numerical feature.
  
  If primary_quantile_1 >= secondary_quantile_2 or secondary_quantile_1 >= primary_quantile_2,
  the numerical feature is selected and added to the plot.
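The two selection rules for ‘auto’ can be sketched as follows. This is a minimal illustration under stated assumptions, not OlliePy's implementation: all function names are hypothetical, the quantiles use simple linear interpolation, and the categorical distributions are compared as vectors of relative frequencies, which the documentation does not spell out.

```python
import math
from collections import Counter


def quantile(values, q):
    """Linearly interpolated quantile of a non-empty sequence, 0 <= q <= 1."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower, upper = math.floor(position), math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def numerical_feature_shifted(primary, secondary, q1=0.25, q2=0.75):
    """Documented rule: select the feature when the two quantile ranges
    [q1, q2] of the primary and secondary datasets do not overlap."""
    primary_q1, primary_q2 = quantile(primary, q1), quantile(primary, q2)
    secondary_q1, secondary_q2 = quantile(secondary, q1), quantile(secondary, q2)
    return primary_q1 >= secondary_q2 or secondary_q1 >= primary_q2


def cosine_similarity(primary, secondary):
    """Cosine similarity between relative category frequencies (assumption)."""
    p_counts, s_counts = Counter(primary), Counter(secondary)
    categories = sorted(set(p_counts) | set(s_counts))
    p_vec = [p_counts[c] / len(primary) for c in categories]
    s_vec = [s_counts[c] / len(secondary) for c in categories]
    dot = sum(p * s for p, s in zip(p_vec, s_vec))
    norm = math.sqrt(sum(p * p for p in p_vec)) * math.sqrt(sum(s * s for s in s_vec))
    return dot / norm


def categorical_feature_shifted(primary, secondary, threshold=0.8):
    """Documented rule: select the feature when similarity is below the threshold."""
    return cosine_similarity(primary, secondary) < threshold


# Made-up sample values, used only to exercise the rules.
shifted_numeric = numerical_feature_shifted([1, 2, 3, 4], [10, 11, 12, 13])
overlapping_numeric = numerical_feature_shifted([1, 2, 3, 4], [2, 3, 4, 5])
shifted_category = categorical_feature_shifted(['a', 'a', 'a'], ['b', 'b', 'b'])
```

With the default quantiles of 0.25 and 0.75, the numerical rule amounts to checking that the interquartile ranges of the two datasets are disjoint.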

Returns: None

save_report(zip_report: bool = False) → None

Creates the report directory, copies the web application based on the template name, and saves the report data.

Parameters: zip_report – set to True to zip the report directory for downloading. default: False
Returns: None

serve_report_from_local_server(mode: str = 'server', port: int = None) → None

Serve the report to the user using a web server. Available modes:

‘server’: will open a new tab in the default browser using the webbrowser package

‘js’: will open a new tab in the default browser using IPython

‘jupyter’: will open the report in a jupyter notebook

Parameters

mode – the selected web server mode. default: ‘server’
port – the server port. default: None, in which case a random port between 1024 and 49151 is generated

Returns

None
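Putting the constructor and the three methods together, a typical workflow might look like the sketch below. It assumes OlliePy and pandas are installed and imports the class from the module path shown at the top of this page. The function name, the report title, and the feature and column names ('price', 'error', 'area', 'rooms', 'city') are hypothetical placeholders, and test_df is assumed to already contain the error column.

```python
# Error classes taken from the documentation example.
ERROR_CLASSES = {
    'EXTREME_UNDER_ESTIMATION': (-8.0, -4.0),
    'HIGH_UNDER_ESTIMATION': (-4.0, -3.0),
    'MEDIUM_UNDER_ESTIMATION': (-3.0, -1.0),
    'LOW_UNDER_ESTIMATION': (-1.0, -0.5),
    'ACCEPTABLE': (-0.5, 0.5),
    'OVER_ESTIMATING': (0.5, 3.0),
}


def build_and_serve_report(train_df, test_df, output_directory):
    """Hypothetical wrapper: build, save, and serve a report.

    train_df and test_df are pandas DataFrames; test_df must already hold
    the error column (prediction minus target).
    """
    # Imported lazily so this module can be loaded without OlliePy installed.
    from olliepy.RegressionErrorAnalysisReport import RegressionErrorAnalysisReport

    report = RegressionErrorAnalysisReport(
        title='Price error analysis',           # placeholder title
        output_directory=output_directory,
        train_df=train_df,
        test_df=test_df,
        target_feature_name='price',            # placeholder column name
        error_column_name='error',              # placeholder column name
        error_classes=ERROR_CLASSES,
        acceptable_error_class='ACCEPTABLE',
        numerical_features=['area', 'rooms'],   # placeholder features
        categorical_features=['city'],          # placeholder feature
    )

    # Compute the report data with the documented defaults.
    report.create_report()
    # Write the report folder; zip_report=True would also zip it.
    report.save_report(zip_report=False)
    # Open the report in a new browser tab.
    report.serve_report_from_local_server(mode='server')
```

Inside a notebook, mode='jupyter' would display the report inline instead of opening a browser tab.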