Quality Management

Two types of classes can be defined in your pipeline to ensure standardized data meets quality requirements:

QualityChecker

Each Quality Checker performs a specific quality control (QC) test and can be applied to one or more variables in your dataset. A checker tests a single data variable at a time and returns a boolean mask in which flagged values are marked as ‘True’.

QualityHandler

Each Quality Handler can be configured to run when a particular QC test fails. A handler takes the Quality Checker’s boolean mask and uses it to apply any QC or custom method to the data variable in question. For instance, a handler can remove flagged data altogether or correct flagged values, such as by interpolating to fill gaps in the data.
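The division of labor between the two classes can be illustrated without tsdat. The following is a minimal NumPy-only sketch; the function names check_negative and remove_flagged are hypothetical and are not part of tsdat.

```python
import numpy as np


def check_negative(values: np.ndarray) -> np.ndarray:
    """Checker role: return a boolean mask where True marks a flagged value."""
    return values < 0


def remove_flagged(values: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Handler role: drop every value that the checker flagged."""
    return values[~mask]


data = np.array([1.0, 2.0, -99.0, 4.0])
mask = check_negative(data)           # the checker only inspects the data
cleaned = remove_flagged(data, mask)  # the handler acts on the mask
```

Keeping the test and the response separate means the same mask could be passed to a different handler, for example one that records the failure instead of removing the value.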

Custom QC Checkers and QC Handlers are typically stored in pipelines/<pipeline_module>/qc.py. Once written, they must be specified in the config/quality.yaml file as shown:

managers:
  - name: Require Valid Coordinate Variables
    checker:
      classname: tsdat.qc.checkers.CheckMissing
    handlers:
      - classname: tsdat.qc.handlers.FailPipeline
    apply_to: [COORDS]

  - name: The name of this quality check
    checker:
      classname: pipelines.example_pipeline.qc.CustomQualityChecker
      parameters: {}
    handlers:
      - classname: pipelines.example_pipeline.qc.CustomQualityHandler
        parameters: {}
    apply_to: [COORDS, DATA_VARS]

Quality Checkers

Quality Checkers are classes that are used to perform a QC test on a specific variable. Each Quality Checker should extend the QualityChecker base class, and implement the abstract run method as shown below. Each QualityChecker defined in the pipeline config file will be automatically initialized by the pipeline and invoked on the specified variables.

@abstractmethod
def run(self, dataset: xr.Dataset, variable_name: str) -> NDArray[np.bool8]:
    """-----------------------------------------------------------------------------
    Checks the quality of a specific variable in the dataset and returns the results
    of the check as a boolean array where True values represent quality problems and
    False values represent data that passes the quality check.

    QualityCheckers should not modify dataset variables; changes to the dataset
    should be made by QualityHandler(s), which receive the results of a
    QualityChecker as input.

    Args:
        dataset (xr.Dataset): The dataset containing the variable to check.
        variable_name (str): The name of the variable to check.

    Returns:
        NDArray[np.bool8]: The results of the quality check, where True values
        indicate a quality problem.

    -----------------------------------------------------------------------------"""
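As an illustration of the run contract above, here is a sketch of a checker that flags values outside a fixed range. To stay runnable without tsdat or xarray installed, the hypothetical CheckOutsideRange class does not extend QualityChecker, and a plain dict of NumPy arrays stands in for the xr.Dataset. In a real pipeline the class would extend tsdat's QualityChecker and receive an actual dataset; the lower and upper constructor arguments are likewise assumptions made for this example.

```python
from typing import Dict

import numpy as np


class CheckOutsideRange:
    """Hypothetical checker: flags values outside [lower, upper]."""

    def __init__(self, lower: float, upper: float) -> None:
        self.lower = lower
        self.upper = upper

    def run(self, dataset: Dict[str, np.ndarray], variable_name: str) -> np.ndarray:
        values = np.asarray(dataset[variable_name])
        # True marks a quality problem; the dataset itself is not modified
        return (values < self.lower) | (values > self.upper)


# A dict of arrays stands in for xr.Dataset in this sketch
dataset = {"temperature": np.array([12.5, -80.0, 15.1, 99.0])}
failures = CheckOutsideRange(lower=-40.0, upper=60.0).run(dataset, "temperature")
```

The returned array has the same shape as the variable, so a handler can use it directly as an index into the data.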

Tsdat built-in quality checkers:

QualityChecker

Base class for code that checks the dataset / data variable quality.

CheckMissing

Checks if any data are missing.

CheckMonotonic

Checks if any values are not ordered strictly monotonically (i.e. values must all be increasing or all decreasing).

CheckValidDelta

Checks for deltas between consecutive values larger than 'valid_delta'.

CheckValidMin

Checks for values less than 'valid_min'.

CheckValidMax

Checks for values greater than 'valid_max'.

CheckFailMin

Checks for values less than 'fail_min'.

CheckFailMax

Checks for values greater than 'fail_max'.

CheckWarnMin

Checks for values less than 'warn_min'.

CheckWarnMax

Checks for values greater than 'warn_max'.

Quality Handlers

Quality Handlers are classes that are used to correct variable data when a specific quality test fails. An example is interpolating missing values to fill gaps. Each Quality Handler should extend the QualityHandler base class, and implement the abstract run method that performs the correction, as shown below. Each QualityHandler defined in the pipeline config file will be automatically initialized by the pipeline and invoked on the specified variables.

@abstractmethod
def run(
    self, dataset: xr.Dataset, variable_name: str, failures: NDArray[np.bool8]
) -> xr.Dataset:
    """-----------------------------------------------------------------------------
    Handles the quality of a variable in the dataset and returns the dataset after
    any corrections have been applied.

    Args:
        dataset (xr.Dataset): The dataset containing the variable to handle.
        variable_name (str): The name of the variable whose quality should be
            handled.
        failures (NDArray[np.bool8]): The results of the QualityChecker for the
            provided variable, where True values indicate a quality problem.

    Returns:
        xr.Dataset: The dataset after the QualityHandler has been run.

    -----------------------------------------------------------------------------"""
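The gap-filling case mentioned above could look roughly like the following sketch. As an assumption made to keep the example self-contained, the hypothetical InterpolateFailedValues class does not extend tsdat's QualityHandler, and a dict of NumPy arrays stands in for the xr.Dataset.

```python
from typing import Dict

import numpy as np


class InterpolateFailedValues:
    """Hypothetical handler: fills flagged values by linear interpolation."""

    def run(
        self,
        dataset: Dict[str, np.ndarray],
        variable_name: str,
        failures: np.ndarray,
    ) -> Dict[str, np.ndarray]:
        # Nothing to do without failures; nothing to interpolate from if all failed
        if failures.any() and not failures.all():
            values = np.asarray(dataset[variable_name], dtype=float)
            index = np.arange(values.size)
            filled = values.copy()
            # Evaluate at flagged positions using only the unflagged points
            filled[failures] = np.interp(
                index[failures], index[~failures], values[~failures]
            )
            dataset[variable_name] = filled
        return dataset


dataset = {"wind_speed": np.array([3.0, np.nan, 5.0, 6.0])}
failures = np.isnan(dataset["wind_speed"])
dataset = InterpolateFailedValues().run(dataset, "wind_speed", failures)
```

The handler returns the dataset in every branch, which matches the run signature above: callers always get a dataset back, whether or not a correction was applied.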

Tsdat built-in quality handlers:

QualityHandler

Base class for code that handles the dataset / data variable quality.

RecordQualityResults

Records the results of the quality check in an ancillary qc variable.

ReplaceFailedValues

Replaces all failed values with the variable's _FillValue.

SortDatasetByCoordinate

Sorts the dataset by the failed variable, if there are any failures.

FailPipeline

Raises a DataQualityError, halting the pipeline, if the data quality is sufficiently bad.

class tsdat.qc.checkers.CheckFailDelta(*, parameters: tsdat.qc.checkers._CheckDelta.Parameters = Parameters(dim='time'), allow_equal: bool = True, attribute_name: str = 'fail_delta')

Bases: tsdat.qc.checkers._CheckDelta

Checks for deltas between consecutive values larger than ‘fail_delta’.

attribute_name: str

The attribute on the data variable that should be used to get the threshold.

class tsdat.qc.checkers.CheckFailMax(*, parameters: Any = {}, allow_equal: bool = True, attribute_name: str = 'fail_max')

Bases: tsdat.qc.checkers._CheckMax

Checks for values greater than ‘fail_max’.

attribute_name: str

The attribute on the data variable that should be used to get the threshold.

class tsdat.qc.checkers.CheckFailMin(*, parameters: Any = {}, allow_equal: bool = True, attribute_name: str = 'fail_min')

Bases: tsdat.qc.checkers._CheckMin

Checks for values less than ‘fail_min’.

attribute_name: str

The attribute on the data variable that should be used to get the threshold.
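The threshold checks can be pictured with a small NumPy sketch. The reading of allow_equal used here, that a value exactly equal to the threshold passes when allow_equal is True, is an assumption and is not stated in this reference. The check_min and check_max functions and the attrs dict are hypothetical stand-ins for the library classes and for a data variable's attributes.

```python
import numpy as np


def check_min(values: np.ndarray, threshold: float, allow_equal: bool = True) -> np.ndarray:
    """Flag values below the threshold (and at it, when allow_equal is False)."""
    return values < threshold if allow_equal else values <= threshold


def check_max(values: np.ndarray, threshold: float, allow_equal: bool = True) -> np.ndarray:
    """Flag values above the threshold (and at it, when allow_equal is False)."""
    return values > threshold if allow_equal else values >= threshold


# Stand-in for the attributes stored on a data variable
attrs = {"fail_min": 0.0, "fail_max": 10.0}
values = np.array([-1.0, 0.0, 5.0, 10.0, 11.0])

below = check_min(values, attrs["fail_min"])
above = check_max(values, attrs["fail_max"])
strict_below = check_min(values, attrs["fail_min"], allow_equal=False)
```

Reading the threshold from an attribute rather than hard-coding it lets one checker class serve many variables, each with its own limits.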

class tsdat.qc.checkers.CheckFailRangeMax(*, parameters: Any = {}, allow_equal: bool = True, attribute_name: str = 'fail_range')

Bases: tsdat.qc.checkers._CheckMax

Checks for values greater than ‘fail_range’.

attribute_name: str

The attribute on the data variable that should be used to get the threshold.

class tsdat.qc.checkers.CheckFailRangeMin(*, parameters: Any = {}, allow_equal: bool = True, attribute_name: str = 'fail_range')

Bases: tsdat.qc.checkers._CheckMin

Checks for values less than ‘fail_range’.

attribute_name: str

The attribute on the data variable that should be used to get the threshold.

class tsdat.qc.checkers.CheckMissing(*, parameters: Any = {})

Bases: tsdat.qc.base.QualityChecker

Checks if any data are missing. A variable’s data are considered missing if they are set to the variable’s _FillValue (if it has a _FillValue) or NaN (NaT for datetime-like variables).

parameters: Any
run(dataset: xarray.core.dataset.Dataset, variable_name: str) -> numpy.ndarray[Any, numpy.dtype[numpy.bool_]]

Identifies and flags quality problems with the data.

Checks the quality of a specific variable in the dataset and returns the results of the check as a boolean array where True values represent quality problems and False values represent data that passes the quality check.

QualityCheckers should not modify dataset variables; changes to the dataset should be made by QualityHandler(s), which receive the results of a QualityChecker as input.

Parameters
  • dataset (xr.Dataset) – The dataset containing the variable to check.

  • variable_name (str) – The name of the variable to check.

Returns

NDArray[np.bool8] – The results of the quality check, where True values indicate a quality problem.
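The definition of "missing" given for CheckMissing can be sketched in plain NumPy as follows. This is an illustration of the rule as described, not the library's implementation; find_missing and its fill_value argument are hypothetical names.

```python
from typing import Optional

import numpy as np


def find_missing(values: np.ndarray, fill_value: Optional[float] = None) -> np.ndarray:
    """Flag values that are NaN/NaT or equal to the fill value, if one is given."""
    if np.issubdtype(values.dtype, np.datetime64):
        missing = np.isnat(values)  # NaT marks missing datetime-like values
    else:
        missing = np.isnan(values)  # NaN marks missing numeric values
    if fill_value is not None:
        missing = missing | (values == fill_value)
    return missing


floats = np.array([1.0, np.nan, -9999.0, 4.0])
times = np.array(["2022-01-01", "NaT"], dtype="datetime64[ns]")

float_mask = find_missing(floats, fill_value=-9999.0)
time_mask = find_missing(times)
```

The fill value is optional in the sketch because, as noted above, a variable may not define a _FillValue at all.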

class tsdat.qc.checkers.CheckMonotonic(*, parameters: tsdat.qc.checkers.CheckMonotonic.Parameters = Parameters(require_decreasing=False, require_increasing=False, dim=None))

Bases: tsdat.qc.base.QualityChecker

Checks if any values are not ordered strictly monotonically (i.e. values must all be increasing or all decreasing). The check marks all values as failed if any data values are not ordered monotonically.

class Parameters(*, require_decreasing: bool = False, require_increasing: bool = False, dim: str = None)

Bases: pydantic.main.BaseModel

classmethod check_monotonic_not_increasing_and_decreasing(inc: bool, values: Dict[str, Any]) -> bool
dim: Optional[str]
require_decreasing: bool
require_increasing: bool
get_axis(variable: xarray.core.dataarray.DataArray) -> int
parameters: tsdat.qc.checkers.CheckMonotonic.Parameters
run(dataset: xarray.core.dataset.Dataset, variable_name: str) -> numpy.ndarray[Any, numpy.dtype[numpy.bool_]]

Identifies and flags quality problems with the data.

Checks the quality of a specific variable in the dataset and returns the results of the check as a boolean array where True values represent quality problems and False values represent data that passes the quality check.

QualityCheckers should not modify dataset variables; changes to the dataset should be made by QualityHandler(s), which receive the results of a QualityChecker as input.

Parameters
  • dataset (xr.Dataset) – The dataset containing the variable to check.

  • variable_name (str) – The name of the variable to check.

Returns

NDArray[np.bool8] – The results of the quality check, where True values indicate a quality problem.
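The all-or-nothing behavior described for CheckMonotonic can be sketched for a one-dimensional array as follows. The hypothetical check_monotonic function ignores the require_increasing, require_decreasing, and dim parameters, so it is a simplification rather than a reproduction of the library class.

```python
import numpy as np


def check_monotonic(values: np.ndarray) -> np.ndarray:
    """Flag every value if the 1-D array is not strictly increasing or decreasing."""
    diff = np.diff(values)
    is_monotonic = bool(np.all(diff > 0) or np.all(diff < 0))
    # All-or-nothing result: one out-of-order value fails the whole variable
    return np.full(values.shape, not is_monotonic)


ordered = check_monotonic(np.array([1, 2, 3, 4]))
shuffled = check_monotonic(np.array([1, 3, 2, 4]))
repeated = check_monotonic(np.array([1, 2, 2, 3]))  # ties break strictness
```

Because the whole variable is flagged, this kind of check pairs naturally with handlers that act on the dataset as a whole, such as sorting or failing the pipeline.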

class tsdat.qc.checkers.CheckValidDelta(*, parameters: tsdat.qc.checkers._CheckDelta.Parameters = Parameters(dim='time'), allow_equal: bool = True, attribute_name: str = 'valid_delta')

Bases: tsdat.qc.checkers._CheckDelta

Checks for deltas between consecutive values larger than ‘valid_delta’.

attribute_name: str

The attribute on the data variable that should be used to get the threshold.
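A delta check of this kind can be sketched with numpy.diff. Several details below are assumptions rather than documented behavior: the sketch compares absolute differences, never flags the first element (it has no predecessor), and treats a delta equal to the threshold as passing. The check_delta function and its max_delta argument are hypothetical.

```python
import numpy as np


def check_delta(values: np.ndarray, max_delta: float) -> np.ndarray:
    """Flag values whose absolute change from the previous value exceeds max_delta."""
    # Prepending the first value keeps the output the same length as the input;
    # the first element therefore has a delta of 0 and is never flagged
    deltas = np.abs(np.diff(values, prepend=values[0]))
    return deltas > max_delta


values = np.array([10.0, 10.5, 18.0, 18.2])
failures = check_delta(values, max_delta=2.0)
```

Note that the flag lands on the value after the jump (18.0 here), not on the value before it.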

class tsdat.qc.checkers.CheckValidMax(*, parameters: Any = {}, allow_equal: bool = True, attribute_name: str = 'valid_max')

Bases: tsdat.qc.checkers._CheckMax

Checks for values greater than ‘valid_max’.

attribute_name: str

The attribute on the data variable that should be used to get the threshold.

class tsdat.qc.checkers.CheckValidMin(*, parameters: Any = {}, allow_equal: bool = True, attribute_name: str = 'valid_min')

Bases: tsdat.qc.checkers._CheckMin

Checks for values less than ‘valid_min’.

attribute_name: str

The attribute on the data variable that should be used to get the threshold.

class tsdat.qc.checkers.CheckValidRangeMax(*, parameters: Any = {}, allow_equal: bool = True, attribute_name: str = 'valid_range')

Bases: tsdat.qc.checkers._CheckMax

Checks for values greater than ‘valid_range’.

attribute_name: str

The attribute on the data variable that should be used to get the threshold.

class tsdat.qc.checkers.CheckValidRangeMin(*, parameters: Any = {}, allow_equal: bool = True, attribute_name: str = 'valid_range')

Bases: tsdat.qc.checkers._CheckMin

Checks for values less than ‘valid_range’.

attribute_name: str

The attribute on the data variable that should be used to get the threshold.

class tsdat.qc.checkers.CheckWarnDelta(*, parameters: tsdat.qc.checkers._CheckDelta.Parameters = Parameters(dim='time'), allow_equal: bool = True, attribute_name: str = 'warn_delta')

Bases: tsdat.qc.checkers._CheckDelta

Checks for deltas between consecutive values larger than ‘warn_delta’.

attribute_name: str

The attribute on the data variable that should be used to get the threshold.

class tsdat.qc.checkers.CheckWarnMax(*, parameters: Any = {}, allow_equal: bool = True, attribute_name: str = 'warn_max')

Bases: tsdat.qc.checkers._CheckMax

Checks for values greater than ‘warn_max’.

attribute_name: str

The attribute on the data variable that should be used to get the threshold.

class tsdat.qc.checkers.CheckWarnMin(*, parameters: Any = {}, allow_equal: bool = True, attribute_name: str = 'warn_min')

Bases: tsdat.qc.checkers._CheckMin

Checks for values less than ‘warn_min’.

attribute_name: str

The attribute on the data variable that should be used to get the threshold.

class tsdat.qc.checkers.CheckWarnRangeMax(*, parameters: Any = {}, allow_equal: bool = True, attribute_name: str = 'warn_range')

Bases: tsdat.qc.checkers._CheckMax

Checks for values greater than ‘warn_range’.

attribute_name: str

The attribute on the data variable that should be used to get the threshold.

class tsdat.qc.checkers.CheckWarnRangeMin(*, parameters: Any = {}, allow_equal: bool = True, attribute_name: str = 'warn_range')

Bases: tsdat.qc.checkers._CheckMin

Checks for values less than ‘warn_range’.

attribute_name: str

The attribute on the data variable that should be used to get the threshold.

exception tsdat.qc.handlers.DataQualityError

Bases: ValueError

Raised when the quality of a variable indicates a fatal error has occurred. Manual review of the data in question is often recommended in this case.

class tsdat.qc.handlers.FailPipeline(*, parameters: tsdat.qc.handlers.FailPipeline.Parameters = Parameters(tolerance=0, context=''))

Bases: tsdat.qc.base.QualityHandler

Raises a DataQualityError, halting the pipeline, if the data quality is sufficiently bad. This usually indicates that a manual inspection of the data is recommended.

Raises

DataQualityError – If the ratio of failed values to checked values exceeds the configured tolerance.

class Parameters(*, tolerance: float = 0, context: str = '', **extra_data: Any)

Bases: pydantic.main.BaseModel

context: str

Additional context set by users that ends up in the traceback message.

tolerance: float

Tolerance for the number of allowable failures as the ratio of allowable failures to the total number of values checked. Defaults to 0, meaning that any failed checks will result in a DataQualityError being raised.

parameters: tsdat.qc.handlers.FailPipeline.Parameters
run(dataset: xarray.core.dataset.Dataset, variable_name: str, failures: numpy.ndarray[Any, numpy.dtype[numpy.bool_]])

Takes some action on data that has had quality issues identified.

Handles the quality of a variable in the dataset and returns the dataset after any corrections have been applied.

Parameters
  • dataset (xr.Dataset) – The dataset containing the variable to handle.

  • variable_name (str) – The name of the variable whose quality should be handled.

  • failures (NDArray[np.bool8]) – The results of the QualityChecker for the provided variable, where True values indicate a quality problem.

Returns

xr.Dataset – The dataset after the QualityHandler has been run.
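The tolerance rule for FailPipeline can be sketched as a ratio test. This is a simplified illustration and not the library's code: the DataQualityError class below is a local stand-in for tsdat.qc.handlers.DataQualityError, the fail_if_too_many function is hypothetical, and the strict "greater than" comparison is an assumption that matches the stated default (a tolerance of 0 raises on any failure).

```python
import numpy as np


class DataQualityError(ValueError):
    """Local stand-in for tsdat.qc.handlers.DataQualityError."""


def fail_if_too_many(failures: np.ndarray, tolerance: float = 0.0, context: str = "") -> None:
    """Raise if the ratio of failed values to checked values exceeds tolerance."""
    ratio = np.count_nonzero(failures) / failures.size
    if ratio > tolerance:
        raise DataQualityError(
            f"{ratio:.0%} of values failed (tolerance {tolerance:.0%}). {context}"
        )


failures = np.array([False, True, False, False])  # one of four values failed

fail_if_too_many(failures, tolerance=0.5)  # within tolerance, so no error

try:
    fail_if_too_many(failures, context="Check the sensor log.")
    raised = False
except DataQualityError:
    raised = True  # the default tolerance of 0 rejects any failure
```

Passing a context string is useful because the error halts the pipeline; the message is often the only hint an operator gets about where to look.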

class tsdat.qc.handlers.RecordQualityResults(*, parameters: tsdat.qc.handlers.RecordQualityResults.Parameters)

Bases: tsdat.qc.base.QualityHandler

Records the results of the quality check in an ancillary qc variable. Creates the ancillary qc variable if one does not already exist.

class Parameters(*, bit: tsdat.qc.handlers.ConstrainedIntValue, assessment: Literal['bad', 'indeterminate'], meaning: str)

Bases: pydantic.main.BaseModel

assessment: Literal['bad', 'indeterminate']

Indicates the quality of the data if the test results indicate a failure.

bit: int

The bit number (e.g., 1, 2, 3, …) used to indicate if the check passed. The quality results are bitpacked into an integer array to preserve space. For example, if ‘check #0’ uses bit 0 and fails, and ‘check #1’ uses bit 1 and fails then the resulting value on the qc variable would be 2^(0) + 2^(1) = 3. If we had a third check it would be 2^(0) + 2^(1) + 2^(2) = 7.

meaning: str

A string that describes the test applied.

parameters: tsdat.qc.handlers.RecordQualityResults.Parameters
run(dataset: xarray.core.dataset.Dataset, variable_name: str, failures: numpy.ndarray[Any, numpy.dtype[numpy.bool_]]) -> xarray.core.dataset.Dataset

Takes some action on data that has had quality issues identified.

Handles the quality of a variable in the dataset and returns the dataset after any corrections have been applied.

Parameters
  • dataset (xr.Dataset) – The dataset containing the variable to handle.

  • variable_name (str) – The name of the variable whose quality should be handled.

  • failures (NDArray[np.bool8]) – The results of the QualityChecker for the provided variable, where True values indicate a quality problem.

Returns

xr.Dataset – The dataset after the QualityHandler has been run.
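The bitpacking arithmetic described for the bit parameter can be worked through in NumPy. The sketch follows the worked example in the text, where a check on bit 0 contributes 2^0 and a check on bit 1 contributes 2^1; whether the library itself numbers bits from 0 or from 1 should be confirmed against tsdat. The record_failures function is hypothetical.

```python
import numpy as np


def record_failures(qc: np.ndarray, failures: np.ndarray, bit: int) -> np.ndarray:
    """Set the given bit in the integer QC array wherever a check failed."""
    return qc | (failures.astype(qc.dtype) << bit)


qc = np.zeros(4, dtype=np.int32)
qc = record_failures(qc, np.array([True, False, True, False]), bit=0)
qc = record_failures(qc, np.array([True, True, False, False]), bit=1)

# Unpack a single check's results again by shifting and masking
failed_bit_1 = ((qc >> 1) & 1).astype(bool)
```

The first element failed both checks and ends up as 2^0 + 2^1 = 3, as in the example above, while the last element passed both and stays 0. Because each check owns one bit, any combination of failures can be recovered from the single integer.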

class tsdat.qc.handlers.ReplaceFailedValues(*, parameters: Any = {})

Bases: tsdat.qc.base.QualityHandler

Replaces all failed values with the variable’s _FillValue. If the variable does not have a _FillValue attribute, then NaN is used instead.

parameters: Any
run(dataset: xarray.core.dataset.Dataset, variable_name: str, failures: numpy.ndarray[Any, numpy.dtype[numpy.bool_]]) -> xarray.core.dataset.Dataset

Takes some action on data that has had quality issues identified.

Handles the quality of a variable in the dataset and returns the dataset after any corrections have been applied.

Parameters
  • dataset (xr.Dataset) – The dataset containing the variable to handle.

  • variable_name (str) – The name of the variable whose quality should be handled.

  • failures (NDArray[np.bool8]) – The results of the QualityChecker for the provided variable, where True values indicate a quality problem.

Returns

xr.Dataset – The dataset after the QualityHandler has been run.
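The replacement rule for ReplaceFailedValues can be sketched with numpy.where. The replace_failed function is a hypothetical stand-in operating on a bare array, with the fill value passed in explicitly rather than read from a variable's attributes.

```python
from typing import Optional

import numpy as np


def replace_failed(
    values: np.ndarray, failures: np.ndarray, fill_value: Optional[float] = None
) -> np.ndarray:
    """Return a copy with flagged values replaced by fill_value, or NaN if None."""
    replacement = np.nan if fill_value is None else fill_value
    return np.where(failures, replacement, values)


values = np.array([1.0, 250.0, 3.0])
failures = np.array([False, True, False])

with_fill = replace_failed(values, failures, fill_value=-9999.0)
with_nan = replace_failed(values, failures)  # no fill value, so NaN is used
```

Unflagged values pass through unchanged, so the handler only touches what the checker marked.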

class tsdat.qc.handlers.SortDatasetByCoordinate(*, parameters: tsdat.qc.handlers.SortDatasetByCoordinate.Parameters = Parameters(ascending=True))

Bases: tsdat.qc.base.QualityHandler

Sorts the dataset by the failed variable, if there are any failures.

class Parameters(*, ascending: bool = True)

Bases: pydantic.main.BaseModel

ascending: bool

Whether to sort the dataset in ascending order. Defaults to True.

parameters: tsdat.qc.handlers.SortDatasetByCoordinate.Parameters
run(dataset: xarray.core.dataset.Dataset, variable_name: str, failures: numpy.ndarray[Any, numpy.dtype[numpy.bool_]]) -> xarray.core.dataset.Dataset

Takes some action on data that has had quality issues identified.

Handles the quality of a variable in the dataset and returns the dataset after any corrections have been applied.

Parameters
  • dataset (xr.Dataset) – The dataset containing the variable to handle.

  • variable_name (str) – The name of the variable whose quality should be handled.

  • failures (NDArray[np.bool8]) – The results of the QualityChecker for the provided variable, where True values indicate a quality problem.

Returns

xr.Dataset – The dataset after the QualityHandler has been run.
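The conditional sort performed by SortDatasetByCoordinate can be sketched as follows. As assumptions made for this example, a dict of equal-length one-dimensional NumPy arrays stands in for the xr.Dataset, and the sort_by_coordinate function is hypothetical.

```python
from typing import Dict

import numpy as np


def sort_by_coordinate(
    dataset: Dict[str, np.ndarray],
    coord_name: str,
    failures: np.ndarray,
    ascending: bool = True,
) -> Dict[str, np.ndarray]:
    """Reorder every array by the coordinate's values, but only if a check failed."""
    if not failures.any():
        return dataset  # nothing failed, so the order is left alone
    order = np.argsort(dataset[coord_name], kind="stable")
    if not ascending:
        order = order[::-1]
    # Apply the same ordering to every variable so rows stay aligned
    return {name: values[order] for name, values in dataset.items()}


dataset = {
    "time": np.array([3, 1, 2]),
    "temperature": np.array([30.0, 10.0, 20.0]),
}
# A monotonicity check flags every value when the coordinate is out of order
failures = np.full(3, True)
result = sort_by_coordinate(dataset, "time", failures)
```

Paired with a monotonicity check on a coordinate, a handler like this repairs out-of-order data instead of rejecting it.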