Quality Management

Two types of classes can be defined in your pipeline to ensure standardized data meets quality requirements:

QualityChecker

Each Quality Checker performs a specific quality control (QC) test and can be applied to one or more variables in your dataset. A checker tests a single data variable at a time and returns a logical mask in which flagged values are marked as ‘True’.

QualityHandler

Each Quality Handler can be specified to run if a particular QC test fails. Quality handlers take the QC Checker’s logical mask and use it to apply any QC or custom method to the data variable in question. For instance, a handler can remove flagged data altogether or correct flagged values, such as by interpolating to fill gaps in the data.

Custom QC Checkers and QC Handlers are typically stored in ingest/<ingest_name>/pipeline/qc.py. Once written, they must be specified in the pipeline_config_<ingest_name>.yml file as shown:

quality_management:

  manage_missing_coordinates: # Tsdat built-in function
    checker:
      classname: tsdat.qc.checkers.CheckMissing
    handlers:
      - classname: tsdat.qc.handlers.FailPipeline
    variables:
      - time  # Coordinates to check

  despiking:  # Custom QC name
    checker:
      classname: ingest.wave.pipeline.qc.GoringNikora2002  # Custom QC checker function
      parameters:
        n_points: 1000  # parameters accessed in custom function via `self.params["<param_name>"]`
    handlers:
      - classname: ingest.wave.pipeline.qc.CubicSplineInterp  # Custom QC handler function
      - classname: tsdat.qc.handlers.RecordQualityResults  # Built-in tsdat error logging
        parameters:
          bit: 4
          assessment: Bad
          meaning: "Spike"
    variables:
      - DATA_VARS       # Catch-all for all variables
    exclude: [foo, bar] # Variables to exclude from test
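
To illustrate how the `variables` and `exclude` keys in the config above might combine, here is a minimal sketch in plain Python. The `resolve_variables` helper and the example variable names are hypothetical and are not part of tsdat; the sketch only assumes that `DATA_VARS` expands to every data variable before the exclusions are applied.

```python
from typing import List


def resolve_variables(
    requested: List[str],
    exclude: List[str],
    data_vars: List[str],
) -> List[str]:
    """Expand the DATA_VARS keyword, then drop excluded names (order preserved)."""
    resolved: List[str] = []
    for name in requested:
        # Assumed expansion rule: DATA_VARS stands for every data variable.
        candidates = data_vars if name == "DATA_VARS" else [name]
        for candidate in candidates:
            if candidate not in exclude and candidate not in resolved:
                resolved.append(candidate)
    return resolved


# Hypothetical dataset contents, mirroring the `exclude: [foo, bar]` entry.
dataset_variables = ["foo", "bar", "wave_height", "wave_period"]
selected = resolve_variables(["DATA_VARS"], ["foo", "bar"], dataset_variables)
print(selected)
```

With these inputs the despiking test would be applied to every data variable except `foo` and `bar`.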

Quality Checkers

Quality Checkers are classes that are used to perform a QC test on a specific variable. Each Quality Checker should extend the QualityChecker base class, which defines a run method that performs the check. Each QualityChecker defined in the pipeline config file will be automatically initialized by the pipeline and invoked on the specified variables.

Tsdat built-in quality checkers:

QualityChecker

Class containing the code to perform a single Quality Check on a Dataset variable.

CheckMissing

Checks if any values are assigned to _FillValue or ‘NaN’ (for non-time variables) or checks if values are assigned to ‘NaT’ (for time variables).

CheckMonotonic

Checks that all values for the specified variable are either strictly increasing or strictly decreasing.

CheckValidDelta

Check that the difference between any two consecutive values is not greater than the threshold set by the ‘valid_delta’ attribute.

CheckValidMin

Check that no values for the specified variable are less than the minimum value set by the ‘valid_range’ attribute.

CheckValidMax

Check that no values for the specified variable are greater than the maximum value set by the ‘valid_range’ attribute.

CheckFailMin

Check that no values for the specified variable are less than the minimum value set by the ‘fail_range’ attribute.

CheckFailMax

Check that no values for the specified variable are greater than the maximum value set by the ‘fail_range’ attribute.

CheckWarnMin

Check that no values for the specified variable are less than the minimum value set by the ‘warn_range’ attribute.

CheckWarnMax

Check that no values for the specified variable are greater than the maximum value set by the ‘warn_range’ attribute.

Quality Handlers

Quality Handlers are classes that are used to correct variable data when a specific quality test fails. An example is interpolating missing values to fill gaps. Each Quality Handler should extend the QualityHandler base class, which defines a run method that performs the correction. Each QualityHandler defined in the pipeline config file will be automatically initialized by the pipeline and invoked on the specified variables.

Tsdat built-in quality handlers:

QualityHandler

Class containing code to be executed if a particular quality check fails.

RecordQualityResults

Record the results of the quality check in an ancillary qc variable.

RemoveFailedValues

Replace all the failed values with _FillValue.

SortDatasetByCoordinate

Sort coordinate data using xr.Dataset.sortby().

SendEmailAWS

Send an email to the recipients using AWS services.

FailPipeline

Throw an exception, halting the pipeline and indicating a critical error.

class tsdat.qc.checkers.CheckFailMax(ds: xarray.core.dataset.Dataset, previous_data: xarray.core.dataset.Dataset, definition: tsdat.config.quality_manager_definition.QualityManagerDefinition, parameters)[source]

Bases: tsdat.qc.checkers.CheckMax

Check that no values for the specified variable are greater than the maximum value set by the ‘fail_range’ attribute. If the variable in question does not possess the ‘fail_range’ attribute, this check will be skipped.

class tsdat.qc.checkers.CheckFailMin(ds: xarray.core.dataset.Dataset, previous_data: xarray.core.dataset.Dataset, definition: tsdat.config.quality_manager_definition.QualityManagerDefinition, parameters)[source]

Bases: tsdat.qc.checkers.CheckMin

Check that no values for the specified variable are less than the minimum value set by the ‘fail_range’ attribute. If the variable in question does not possess the ‘fail_range’ attribute, this check will be skipped.

class tsdat.qc.checkers.CheckMax(ds: xarray.core.dataset.Dataset, previous_data: xarray.core.dataset.Dataset, definition: tsdat.config.quality_manager_definition.QualityManagerDefinition, parameters: Optional[Dict] = None)[source]

Bases: tsdat.qc.checkers.QualityChecker

Check that no values for the specified variable are greater than a specified maximum threshold. The threshold value is an attribute set on the variable in question. The attribute name is specified in the quality checker definition in the pipeline config file by setting a param called ‘key: ATTRIBUTE_NAME’.

If the key parameter is not set or the variable does not possess the specified attribute, this check will be skipped.

run(variable_name: str) → Optional[numpy.ndarray][source]

Check a dataset’s variable to see if it passes a quality check. These checks can be performed on the entire variable at one time by using xarray vectorized numerical operators.

Parameters

variable_name (str) – The name of the variable to check

Returns

If the check was performed, return an ndarray of the same shape as the variable. Each value in the array will be either True or False, depending upon the results of the check. True means the check failed. False means it succeeded.

Note that we are using an np.ndarray instead of an xr.DataArray because the DataArray contains coordinate indexes which can sometimes get out of sync when performing np arithmetic vector operations. So it’s easier to just use numpy arrays.

If the check was skipped for some reason (i.e., it was not relevant given the current attributes defined for this dataset), then the run method should return None.

Return type

Optional[np.ndarray]
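
The skip-or-check behaviour described for CheckMax can be sketched with NumPy as follows. This is an illustration rather than the tsdat source: the `check_max` function, the plain `attrs` dict standing in for variable attributes, and the convention that a two-element range stores its maximum last are all assumptions made for the example.

```python
from typing import Any, Dict, Optional

import numpy as np


def check_max(
    values: np.ndarray, attrs: Dict[str, Any], key: Optional[str]
) -> Optional[np.ndarray]:
    """Return True where values exceed the threshold stored under attrs[key]."""
    if key is None or key not in attrs:
        return None  # check skipped: no key parameter or no such attribute
    threshold = attrs[key]
    if isinstance(threshold, (list, tuple)):
        threshold = threshold[-1]  # assumed layout: [min, max]
    return np.asarray(values) > threshold


temperature = np.array([12.0, 18.5, 41.0, 25.0])
mask = check_max(temperature, {"valid_range": [-5.0, 40.0]}, key="valid_range")
skipped = check_max(temperature, {}, key="valid_range")
```

Here `mask` flags only the third sample, while `skipped` is `None` because the attribute is absent.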

class tsdat.qc.checkers.CheckMin(ds: xarray.core.dataset.Dataset, previous_data: xarray.core.dataset.Dataset, definition: tsdat.config.quality_manager_definition.QualityManagerDefinition, parameters: Optional[Dict] = None)[source]

Bases: tsdat.qc.checkers.QualityChecker

Check that no values for the specified variable are less than a specified minimum threshold. The threshold value is an attribute set on the variable in question. The attribute name is specified in the quality checker definition in the pipeline config file by setting a param called ‘key: ATTRIBUTE_NAME’.

If the key parameter is not set or the variable does not possess the specified attribute, this check will be skipped.

run(variable_name: str) → Optional[numpy.ndarray][source]

Check a dataset’s variable to see if it passes a quality check. These checks can be performed on the entire variable at one time by using xarray vectorized numerical operators.

Parameters

variable_name (str) – The name of the variable to check

Returns

If the check was performed, return an ndarray of the same shape as the variable. Each value in the array will be either True or False, depending upon the results of the check. True means the check failed. False means it succeeded.

Note that we are using an np.ndarray instead of an xr.DataArray because the DataArray contains coordinate indexes which can sometimes get out of sync when performing np arithmetic vector operations. So it’s easier to just use numpy arrays.

If the check was skipped for some reason (i.e., it was not relevant given the current attributes defined for this dataset), then the run method should return None.

Return type

Optional[np.ndarray]

class tsdat.qc.checkers.CheckMissing(ds: xarray.core.dataset.Dataset, previous_data: xarray.core.dataset.Dataset, definition: tsdat.config.quality_manager_definition.QualityManagerDefinition, parameters: Optional[Dict] = None)[source]

Bases: tsdat.qc.checkers.QualityChecker

Checks if any values are assigned to _FillValue or ‘NaN’ (for non-time variables) or checks if values are assigned to ‘NaT’ (for time variables). For non-time variables, it also checks if values fall above or below valid_range, as such values are considered missing as well.

run(variable_name: str) → Optional[numpy.ndarray][source]

Check a dataset’s variable to see if it passes a quality check. These checks can be performed on the entire variable at one time by using xarray vectorized numerical operators.

Parameters

variable_name (str) – The name of the variable to check

Returns

If the check was performed, return an ndarray of the same shape as the variable. Each value in the array will be either True or False, depending upon the results of the check. True means the check failed. False means it succeeded.

Note that we are using an np.ndarray instead of an xr.DataArray because the DataArray contains coordinate indexes which can sometimes get out of sync when performing np arithmetic vector operations. So it’s easier to just use numpy arrays.

If the check was skipped for some reason (i.e., it was not relevant given the current attributes defined for this dataset), then the run method should return None.

Return type

Optional[np.ndarray]
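
As a rough illustration of the missing-value test described for CheckMissing, the sketch below builds the mask with NumPy. The `missing_mask` function and the `-9999.0` fill value are assumptions for the example, and the additional valid_range comparison mentioned above is left out for brevity.

```python
import numpy as np


def missing_mask(values: np.ndarray, fill_value: float = -9999.0) -> np.ndarray:
    """Flag NaT for datetime data, or NaN / fill_value for numeric data."""
    values = np.asarray(values)
    if np.issubdtype(values.dtype, np.datetime64):
        return np.isnat(values)  # time variables: missing means NaT
    return np.isnan(values) | (values == fill_value)


speeds = np.array([1.2, np.nan, -9999.0, 3.4])
times = np.array(["2021-01-01T00:00", "NaT"], dtype="datetime64[s]")
speed_mask = missing_mask(speeds)
time_mask = missing_mask(times)
```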

class tsdat.qc.checkers.CheckMonotonic(ds: xarray.core.dataset.Dataset, previous_data: xarray.core.dataset.Dataset, definition: tsdat.config.quality_manager_definition.QualityManagerDefinition, parameters: Optional[Dict] = None)[source]

Bases: tsdat.qc.checkers.QualityChecker

Checks that all values for the specified variable are either strictly increasing or strictly decreasing.

run(variable_name: str) → Optional[numpy.ndarray][source]

Check a dataset’s variable to see if it passes a quality check. These checks can be performed on the entire variable at one time by using xarray vectorized numerical operators.

Parameters

variable_name (str) – The name of the variable to check

Returns

If the check was performed, return an ndarray of the same shape as the variable. Each value in the array will be either True or False, depending upon the results of the check. True means the check failed. False means it succeeded.

Note that we are using an np.ndarray instead of an xr.DataArray because the DataArray contains coordinate indexes which can sometimes get out of sync when performing np arithmetic vector operations. So it’s easier to just use numpy arrays.

If the check was skipped for some reason (i.e., it was not relevant given the current attributes defined for this dataset), then the run method should return None.

Return type

Optional[np.ndarray]
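
A strict monotonicity test like the one CheckMonotonic describes can be sketched with `np.diff`. The `monotonic_mask` function is hypothetical, it handles numeric 1-D input only, and flagging every element when the test fails is one possible choice made for this example, not a statement about how tsdat reports the failure.

```python
import numpy as np


def monotonic_mask(values) -> np.ndarray:
    """Return an all-True mask unless values are strictly increasing or decreasing."""
    values = np.asarray(values)
    steps = np.diff(values)
    is_monotonic = bool(np.all(steps > 0) or np.all(steps < 0))
    # Assumed convention: a failed test flags the whole variable.
    return np.full(values.shape, not is_monotonic)


increasing = monotonic_mask([1, 2, 3])
decreasing = monotonic_mask([3.0, 2.0, 1.0])
repeated = monotonic_mask([1, 2, 2])  # not strict, so the test fails
```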

class tsdat.qc.checkers.CheckValidDelta(ds: xarray.core.dataset.Dataset, previous_data: xarray.core.dataset.Dataset, definition: tsdat.config.quality_manager_definition.QualityManagerDefinition, parameters: Optional[Dict] = None)[source]

Bases: tsdat.qc.checkers.QualityChecker

Check that the difference between any two consecutive values is not greater than the threshold set by the ‘valid_delta’ attribute. If the variable in question does not possess the ‘valid_delta’ attribute, this check will be skipped.

run(variable_name: str) → Optional[numpy.ndarray][source]

Check a dataset’s variable to see if it passes a quality check. These checks can be performed on the entire variable at one time by using xarray vectorized numerical operators.

Parameters

variable_name (str) – The name of the variable to check

Returns

If the check was performed, return an ndarray of the same shape as the variable. Each value in the array will be either True or False, depending upon the results of the check. True means the check failed. False means it succeeded.

Note that we are using an np.ndarray instead of an xr.DataArray because the DataArray contains coordinate indexes which can sometimes get out of sync when performing np arithmetic vector operations. So it’s easier to just use numpy arrays.

If the check was skipped for some reason (i.e., it was not relevant given the current attributes defined for this dataset), then the run method should return None.

Return type

Optional[np.ndarray]
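
The consecutive-difference test described for CheckValidDelta might look like the sketch below. The `delta_mask` function and its `previous_value` argument are assumptions for illustration; `previous_value` shows one way the last sample of a previous file could seed the first comparison, and the input is assumed to be a non-empty 1-D array.

```python
from typing import Optional

import numpy as np


def delta_mask(
    values, valid_delta: float, previous_value: Optional[float] = None
) -> np.ndarray:
    """Flag samples that differ from their predecessor by more than valid_delta."""
    values = np.asarray(values, dtype=float)
    # Without a previous sample, the first value is compared with itself (delta 0).
    reference = values[0] if previous_value is None else previous_value
    deltas = np.abs(np.diff(values, prepend=reference))
    return deltas > valid_delta


depth = [1.0, 1.5, 5.0, 5.2]
within_file = delta_mask(depth, valid_delta=2.0)
across_files = delta_mask(depth, valid_delta=2.0, previous_value=10.0)
```

Using `prepend` keeps the mask the same shape as the variable, which matches the return contract described above.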

class tsdat.qc.checkers.CheckValidMax(ds: xarray.core.dataset.Dataset, previous_data: xarray.core.dataset.Dataset, definition: tsdat.config.quality_manager_definition.QualityManagerDefinition, parameters)[source]

Bases: tsdat.qc.checkers.CheckMax

Check that no values for the specified variable are greater than the maximum value set by the ‘valid_range’ attribute. If the variable in question does not possess the ‘valid_range’ attribute, this check will be skipped.

class tsdat.qc.checkers.CheckValidMin(ds: xarray.core.dataset.Dataset, previous_data: xarray.core.dataset.Dataset, definition: tsdat.config.quality_manager_definition.QualityManagerDefinition, parameters)[source]

Bases: tsdat.qc.checkers.CheckMin

Check that no values for the specified variable are less than the minimum value set by the ‘valid_range’ attribute. If the variable in question does not possess the ‘valid_range’ attribute, this check will be skipped.

class tsdat.qc.checkers.CheckWarnMax(ds: xarray.core.dataset.Dataset, previous_data: xarray.core.dataset.Dataset, definition: tsdat.config.quality_manager_definition.QualityManagerDefinition, parameters)[source]

Bases: tsdat.qc.checkers.CheckMax

Check that no values for the specified variable are greater than the maximum value set by the ‘warn_range’ attribute. If the variable in question does not possess the ‘warn_range’ attribute, this check will be skipped.

class tsdat.qc.checkers.CheckWarnMin(ds: xarray.core.dataset.Dataset, previous_data: xarray.core.dataset.Dataset, definition: tsdat.config.quality_manager_definition.QualityManagerDefinition, parameters)[source]

Bases: tsdat.qc.checkers.CheckMin

Check that no values for the specified variable are less than the minimum value set by the ‘warn_range’ attribute. If the variable in question does not possess the ‘warn_range’ attribute, this check will be skipped.

class tsdat.qc.checkers.QualityChecker(ds: xarray.core.dataset.Dataset, previous_data: xarray.core.dataset.Dataset, definition: tsdat.config.quality_manager_definition.QualityManagerDefinition, parameters: Optional[Dict] = None)[source]

Bases: abc.ABC

Class containing the code to perform a single Quality Check on a Dataset variable.

Parameters
  • ds (xr.Dataset) – The dataset the checker will be applied to

  • previous_data (xr.Dataset) – A dataset from the previous processing interval (i.e., file). This is used to check for consistency between files, such as for monotonic or delta checks when we need to check the previous value.

  • definition (QualityManagerDefinition) – The quality manager definition as specified in the pipeline config file

  • parameters (dict, optional) – A dictionary of checker-specific parameters specified in the pipeline config file. Defaults to {}

abstract run(variable_name: str) → Optional[numpy.ndarray][source]

Check a dataset’s variable to see if it passes a quality check. These checks can be performed on the entire variable at one time by using xarray vectorized numerical operators.

Parameters

variable_name (str) – The name of the variable to check

Returns

If the check was performed, return an ndarray of the same shape as the variable. Each value in the array will be either True or False, depending upon the results of the check. True means the check failed. False means it succeeded.

Note that we are using an np.ndarray instead of an xr.DataArray because the DataArray contains coordinate indexes which can sometimes get out of sync when performing np arithmetic vector operations. So it’s easier to just use numpy arrays.

If the check was skipped for some reason (i.e., it was not relevant given the current attributes defined for this dataset), then the run method should return None.

Return type

Optional[np.ndarray]
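
A custom checker following the QualityChecker contract above could be sketched as shown here. To keep the example runnable without tsdat installed, the `QualityChecker` class below is a stand-in that only mirrors the documented constructor arguments; the `CheckStuckSensor` class, its `tolerance` parameter, and the plain dict used in place of an xr.Dataset are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional

import numpy as np


class QualityChecker(ABC):
    """Stand-in for tsdat.qc.checkers.QualityChecker so the sketch runs alone."""

    def __init__(self, ds, previous_data, definition, parameters: Optional[Dict] = None):
        self.ds = ds
        self.previous_data = previous_data
        self.definition = definition
        self.params = parameters if parameters is not None else {}

    @abstractmethod
    def run(self, variable_name: str) -> Optional[np.ndarray]:
        ...


class CheckStuckSensor(QualityChecker):
    """Hypothetical checker: flag values that repeat the previous sample."""

    def run(self, variable_name: str) -> Optional[np.ndarray]:
        if variable_name not in self.ds:
            return None  # nothing to check, so the test is skipped
        values = np.asarray(self.ds[variable_name], dtype=float)
        tolerance = self.params.get("tolerance", 0.0)  # set via `parameters` in the config
        mask = np.zeros(values.shape, dtype=bool)
        mask[1:] = np.abs(np.diff(values)) <= tolerance
        return mask


dataset = {"pressure": [10.1, 10.1, 10.4, 10.4, 10.9]}  # dict stands in for xr.Dataset
checker = CheckStuckSensor(dataset, None, None, {"tolerance": 0.0})
mask = checker.run("pressure")
```

In a real pipeline the class would extend the tsdat base class instead of the stand-in, and the pipeline would construct it from the config file.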

class tsdat.qc.handlers.FailPipeline(ds: xarray.core.dataset.Dataset, previous_data: xarray.core.dataset.Dataset, quality_manager: tsdat.config.quality_manager_definition.QualityManagerDefinition, parameters: Optional[Dict] = None)[source]

Bases: tsdat.qc.handlers.QualityHandler

Throw an exception, halting the pipeline & indicating a critical error

run(variable_name: str, results_array: numpy.ndarray)[source]

Perform a follow-on action if a quality check fails. This can be used to correct data if needed (such as replacing a bad value with a missing value), to email a contact person, or to raise an exception if the failure constitutes a critical error.

Parameters
  • variable_name (str) – Name of the variable that failed

  • results_array (np.ndarray) – An array of True/False values for each data value of the variable. True means the check failed.
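
The halting behaviour described for FailPipeline can be illustrated with a small function that raises when any value is flagged. The `QCError` exception and the `fail_pipeline` function are hypothetical names for this sketch; the exception type tsdat actually raises is not specified here.

```python
import numpy as np


class QCError(Exception):
    """Hypothetical exception type used only in this sketch."""


def fail_pipeline(variable_name: str, results_array: np.ndarray) -> None:
    """Raise if any element of the checker's mask is True."""
    n_failed = int(np.count_nonzero(results_array))
    if n_failed > 0:
        raise QCError(f"{n_failed} value(s) of '{variable_name}' failed quality checks")


fail_pipeline("time", np.array([False, False]))  # passes silently
try:
    fail_pipeline("time", np.array([False, True, True]))
    message = ""
except QCError as error:
    message = str(error)
```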

class tsdat.qc.handlers.QCParamKeys[source]

Bases: object

Symbolic constants used for referencing QC-related fields in the pipeline config file

ASSESSMENT = 'assessment'[source]
CORRECTION = 'correction'[source]
QC_BIT = 'bit'[source]
TEST_MEANING = 'meaning'[source]

class tsdat.qc.handlers.QualityHandler(ds: xarray.core.dataset.Dataset, previous_data: xarray.core.dataset.Dataset, quality_manager: tsdat.config.quality_manager_definition.QualityManagerDefinition, parameters: Optional[Dict] = None)[source]

Bases: abc.ABC

Class containing code to be executed if a particular quality check fails.

Parameters
  • ds (xr.Dataset) – The dataset the handler will be applied to

  • previous_data (xr.Dataset) – A dataset from the previous processing interval (i.e., file). This is used to check for consistency between files, such as for monotonic or delta checks when we need to check the previous value.

  • quality_manager (QualityManagerDefinition) – The quality_manager definition as specified in the pipeline config file

  • parameters (dict, optional) – A dictionary of handler-specific parameters specified in the pipeline config file. Defaults to {}

record_correction(variable_name: str)[source]

If a correction was made to variable data to fix invalid values as detected by a quality check, this method will record the fix to the appropriate variable attribute. The correction description will come from the handler params which get set in the pipeline config file.

Parameters

variable_name (str) – Name of the variable that was corrected

abstract run(variable_name: str, results_array: numpy.ndarray)[source]

Perform a follow-on action if a quality check fails. This can be used to correct data if needed (such as replacing a bad value with a missing value), to email a contact person, or to raise an exception if the failure constitutes a critical error.

Parameters
  • variable_name (str) – Name of the variable that failed

  • results_array (np.ndarray) – An array of True/False values for each data value of the variable. True means the check failed.
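
A custom handler following the QualityHandler contract above could be sketched as below. The `QualityHandler` class here is a stand-in that mirrors only the documented constructor arguments so the example runs without tsdat; `LinearInterpolate` is a hypothetical handler that fills flagged samples by linear interpolation with `np.interp`, and a plain dict takes the place of an xr.Dataset.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional

import numpy as np


class QualityHandler(ABC):
    """Stand-in for tsdat.qc.handlers.QualityHandler so the sketch runs alone."""

    def __init__(self, ds, previous_data, quality_manager, parameters: Optional[Dict] = None):
        self.ds = ds
        self.previous_data = previous_data
        self.quality_manager = quality_manager
        self.params = parameters if parameters is not None else {}

    @abstractmethod
    def run(self, variable_name: str, results_array: np.ndarray) -> None:
        ...


class LinearInterpolate(QualityHandler):
    """Hypothetical handler: replace flagged samples by linear interpolation."""

    def run(self, variable_name: str, results_array: np.ndarray) -> None:
        values = np.asarray(self.ds[variable_name], dtype=float)
        failed = np.asarray(results_array, dtype=bool)
        if not failed.any() or failed.all():
            return  # nothing to fix, or no good samples to interpolate from
        index = np.arange(values.size)
        filled = values.copy()
        filled[failed] = np.interp(index[failed], index[~failed], values[~failed])
        self.ds[variable_name] = filled  # plain dict assignment in this sketch


dataset = {"height": np.array([1.0, 2.0, 99.0, 4.0])}  # dict stands in for xr.Dataset
handler = LinearInterpolate(dataset, None, None)
handler.run("height", np.array([False, False, True, False]))
```

The flagged sample at index 2 is replaced by the value interpolated from its two neighbours.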

class tsdat.qc.handlers.RecordQualityResults(ds: xarray.core.dataset.Dataset, previous_data: xarray.core.dataset.Dataset, quality_manager: tsdat.config.quality_manager_definition.QualityManagerDefinition, parameters: Optional[Dict] = None)[source]

Bases: tsdat.qc.handlers.QualityHandler

Record the results of the quality check in an ancillary qc variable.

run(variable_name: str, results_array: numpy.ndarray)[source]

Perform a follow-on action if a quality check fails. This can be used to correct data if needed (such as replacing a bad value with a missing value), to email a contact person, or to raise an exception if the failure constitutes a critical error.

Parameters
  • variable_name (str) – Name of the variable that failed

  • results_array (np.ndarray) – An array of True/False values for each data value of the variable. True means the check failed.
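
One common way to record several test results in a single ancillary qc variable is to give each test its own bit in an integer flag. The sketch below shows that idea with NumPy; the `record_bit` function is hypothetical, and the 1-based bit numbering (bit 1 has value 1, bit 4 has value 8) is an assumption for the example rather than a documented tsdat guarantee.

```python
import numpy as np


def record_bit(qc_flags: np.ndarray, results_array: np.ndarray, bit: int) -> np.ndarray:
    """Set the given 1-based bit in qc_flags wherever results_array is True."""
    bit_value = 1 << (bit - 1)  # assumed 1-based numbering: bit 4 -> value 8
    failed = np.asarray(results_array, dtype=bool)
    return np.where(failed, qc_flags | bit_value, qc_flags)


qc = np.zeros(4, dtype=np.int32)
qc = record_bit(qc, np.array([True, False, False, True]), bit=1)
qc = record_bit(qc, np.array([False, False, True, True]), bit=4)
```

Because each test owns a distinct bit, a value that fails both tests keeps both results (here the last element holds 1 + 8).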

class tsdat.qc.handlers.RemoveFailedValues(ds: xarray.core.dataset.Dataset, previous_data: xarray.core.dataset.Dataset, quality_manager: tsdat.config.quality_manager_definition.QualityManagerDefinition, parameters: Optional[Dict] = None)[source]

Bases: tsdat.qc.handlers.QualityHandler

Replace all the failed values with _FillValue.

run(variable_name: str, results_array: numpy.ndarray)[source]

Perform a follow-on action if a quality check fails. This can be used to correct data if needed (such as replacing a bad value with a missing value), to email a contact person, or to raise an exception if the failure constitutes a critical error.

Parameters
  • variable_name (str) – Name of the variable that failed

  • results_array (np.ndarray) – An array of True/False values for each data value of the variable. True means the check failed.
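
Replacing failed values with a fill value, as RemoveFailedValues is described as doing, reduces to a masked assignment. In this sketch the `remove_failed_values` function and the `-9999.0` fill value are assumptions; in a real dataset the fill value would come from the variable's _FillValue attribute.

```python
import numpy as np


def remove_failed_values(
    values: np.ndarray, results_array: np.ndarray, fill_value: float = -9999.0
) -> np.ndarray:
    """Return a copy of values with flagged entries replaced by fill_value."""
    failed = np.asarray(results_array, dtype=bool)
    return np.where(failed, fill_value, values)


salinity = np.array([0.5, 120.0, 0.7])
cleaned = remove_failed_values(salinity, np.array([False, True, False]))
```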

class tsdat.qc.handlers.SendEmailAWS(ds: xarray.core.dataset.Dataset, previous_data: xarray.core.dataset.Dataset, quality_manager: tsdat.config.quality_manager_definition.QualityManagerDefinition, parameters: Optional[Dict] = None)[source]

Bases: tsdat.qc.handlers.QualityHandler

Send an email to the recipients using AWS services.

run(variable_name: str, results_array: numpy.ndarray)[source]

Perform a follow-on action if a quality check fails. This can be used to correct data if needed (such as replacing a bad value with a missing value), to email a contact person, or to raise an exception if the failure constitutes a critical error.

Parameters
  • variable_name (str) – Name of the variable that failed

  • results_array (np.ndarray) – An array of True/False values for each data value of the variable. True means the check failed.

class tsdat.qc.handlers.SortDatasetByCoordinate(ds: xarray.core.dataset.Dataset, previous_data: xarray.core.dataset.Dataset, quality_manager: tsdat.config.quality_manager_definition.QualityManagerDefinition, parameters: Optional[Dict] = None)[source]

Bases: tsdat.qc.handlers.QualityHandler

Sort coordinate data using xr.Dataset.sortby(). Accepts the following parameters:

parameters:
  # Whether or not to sort in ascending order. Defaults to True.
  ascending: True

run(variable_name: str, results_array: numpy.ndarray)[source]

Perform a follow-on action if a quality check fails. This can be used to correct data if needed (such as replacing a bad value with a missing value), to email a contact person, or to raise an exception if the failure constitutes a critical error.

Parameters
  • variable_name (str) – Name of the variable that failed

  • results_array (np.ndarray) – An array of True/False values for each data value of the variable. True means the check failed.
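
The effect of sorting a dataset by a coordinate, including the `ascending` parameter, can be approximated with NumPy alone. The sketch below is an analogue of the behaviour described for SortDatasetByCoordinate, not its implementation: `sort_by_coordinate` is a hypothetical helper, a dict of 1-D arrays stands in for the dataset, and every variable is assumed to share the coordinate's dimension.

```python
from typing import Dict

import numpy as np


def sort_by_coordinate(
    variables: Dict[str, np.ndarray], coordinate: str, ascending: bool = True
) -> Dict[str, np.ndarray]:
    """Reorder every variable by the sorted order of one coordinate."""
    order = np.argsort(variables[coordinate], kind="stable")
    if not ascending:
        order = order[::-1]
    return {name: np.asarray(data)[order] for name, data in variables.items()}


raw = {"time": np.array([3, 1, 2]), "speed": np.array([30.0, 10.0, 20.0])}
sorted_up = sort_by_coordinate(raw, "time")
sorted_down = sort_by_coordinate(raw, "time", ascending=False)
```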