Quality Management#
Two types of classes can be defined in your pipeline to ensure standardized data meets quality requirements:
QualityChecker
-
Each Quality Checker performs a specific quality control (QC) test on one or more variables in your dataset. Quality checkers test a single data variable at a time and return a boolean mask, where flagged values are marked as
True
. QualityHandler
-
Each Quality Handler can be specified to run if a particular QC test fails. Quality handlers take the QC Checker's boolean mask and use it to apply any QC or custom method to the data variable of question. For instance, it can be used to remove flagged data altogether or correct flagged values, such as interpolating to fill gaps in data.
Custom Quality Checkers and Handlers should be stored in the folder where they will be used (i.e.,
pipelines/<pipeline_module>/qc.py
in most cases). In order to be used, they must be registered in the
quality config file like so (for an example lidar
pipeline module):
managers:
- name: Require Valid Coordinate Variables
checker:
classname: tsdat.CheckMissing
handlers:
- classname: tsdat.FailPipeline
apply_to: [COORDS]
- name: The name of this quality check
checker:
classname: pipelines.example_pipeline.qc.CustomQualityChecker
parameters: {}
handlers:
- classname: pipelines.example_pipeline.qc.CustomQualityHandler
parameters: {}
apply_to: [COORDS, DATA_VARS]
Quality Checkers#
Quality Checkers are classes that are used to perform a QC test on a specific variable. Each QualityChecker
should
extend the tsdat.QualityChecker
base class, and implement the run
method as shown below. Each QualityChecker
registered in the pipeline config file will be automatically initialized by the pipeline and invoked on the
specified variables.
def run(self, dataset: xr.Dataset, variable_name: str) -> NDArray[np.bool_] | None:
"""Checks the quality of a specific variable in the dataset and returns the results
of the check as a boolean array where True values represent quality problems and
False values represent data that passes the quality check.
QualityCheckers should not modify dataset variables; changes to the dataset
should be made by QualityHandler(s), which receive the results of a
QualityChecker as input.
Args:
dataset (xr.Dataset): The dataset containing the variable to check.
variable_name (str): The name of the variable to check.
Returns:
NDArray[np.bool_] | None: The results of the quality check, where True values
indicate a quality problem. May return None to indicate that no QualityHandlers
should be run on the results of this check.
"""
return None
Quality Handlers#
Quality Handlers are classes that are used to take some action following the result of a quality check. For example, if
you use the built-in tsdat.CheckMissing
QualityChecker
class to check for missing values in one or more of your data
variables, you could pair that with the tsdat.RecordQualityResults
QualityHandler
to add a new qc
variable that
provides metadata to indicate to users which data points are missing.
All QualityHandler
classes should extend the tsdat.QualityHandler
base class, and implement the run()
method as
shown below:
def run(
self, dataset: xr.Dataset, variable_name: str, failures: NDArray[np.bool_]
) -> xr.Dataset:
"""Handles the quality of a variable in the dataset and returns the dataset after
any corrections have been applied.
Args:
dataset (xr.Dataset): The dataset containing the variable to handle.
variable_name (str): The name of the variable whose quality should be
handled.
failures (NDArray[np.bool_]): The results of the QualityChecker for the
provided variable, where True values indicate a quality problem.
Returns:
xr.Dataset: The dataset after the QualityHandler has been run.
"""
return dataset
Additional Notes#
Using the default template, all variables are given a corollary quality control (QC) variable if QC tests for
data variables are run in a given pipeline. What this means that is data variables will not get a corollary QC variable
if the respective QC blocks for DATA_VARS
are commented out. The basic quality.yaml
file that is given in the
template is shown below:
managers:
- name: Fail if missing coordinates
checker:
classname: tsdat.CheckMissing
handlers:
- classname: tsdat.FailPipeline
parameters:
context: Coordinate variables cannot be missing.
apply_to: [COORDS]
- name: Fail if monotonic coordinates
checker:
classname: tsdat.CheckMonotonic
parameters:
require_increasing: true
handlers:
- classname: tsdat.FailPipeline
parameters:
context: Coordinate variables must be strictly increasing.
apply_to: [COORDS]
#---------------------------------------------------------------
- name: Remove missing data
checker:
classname: tsdat.CheckMissing
handlers:
- classname: tsdat.RemoveFailedValues
- classname: tsdat.RecordQualityResults
parameters:
assessment: bad
meaning: Value is equal to _FillValue or NaN
apply_to: [DATA_VARS]
exclude:
- altitude
- latitude
- longitude
- name: Flag data below minimum valid threshold
checker:
classname: tsdat.CheckValidMin
handlers:
- classname: tsdat.RecordQualityResults
parameters:
assessment: bad
meaning: Value is less than the valid_min.
apply_to: [DATA_VARS]
exclude:
- altitude
- latitude
- longitude
- name: Flag data above maximum valid threshold
checker:
classname: tsdat.CheckValidMax
handlers:
- classname: tsdat.RecordQualityResults
parameters:
assessment: bad
meaning: Value is greater than the valid_max.
apply_to: [DATA_VARS]
exclude:
- altitude
- latitude
- longitude
A QC block consists of
- the keyword
name
, simply the block's description - the keyword
checker
, and an associatedclassname
one line below it: the QC test to use. - the keyword
handler
, and an associated list of classname(s) (hence the extra hyphen in front ofclassname
) - the keyword
apply_to
: this can beCOORDS
,DATA_VARS
, or a list of variable names - the keyword
exclude
: this is a list of variable names that the check should ignore
These QC blocks can take additional parameters for more sophisticated QC algorithms. Customized qc.py
files that
require editable parameters can be set using the parameters keyword. In the QC block below, the custom QC class
CheckCorrelation
exists in the shared/qc.py
file. The classname is therefore set as shared.qc.CheckCorrelation
.
The exclude
keyword can be used to exclude certain variables from a QC test, and is typically used for variables that
you don't want to QC (e.g., location variables or other status variables).
Also know that whitespace is not critical for yaml
files, and it is good to be consistent with however yours is set.
- name: Flag data below correlation threshold
checker:
classname: shared.qc.CheckCorrelation
parameters:
correlation_threshold: 30
handlers:
- classname: tsdat.RemoveFailedValues
- classname: tsdat.RecordQualityResults
parameters:
assessment: bad
meaning: Value is less than correlation threshold
apply_to: [vel, corr, amp]
exclude: [vel_bt]
Finally, it's important to go over the parameters required for tsdat.RecordQualityResults
, which is the built-in
function that all QC blocks should use to record the QC test results.
It takes parameters: assessment
and meaning
. These parameters are turned into variable attributes in the pipeline
output dataset on each qc
variable: flag_mask
, flag_assessment
, and flag_meaning
respectively.
A technique called bit-packing is used to create the flag masks attribute. Each bit/flag represents one test that was
applied (e.g., each time tsdat.RecordQualityResults
is called). If a flag has the value 13, then that means it failed
the tests associated with flag masks 1, 4, and 8 (1 + 4 + 8 = 13), which are QC bits 1, 3, and 4 (2^0, 2^2, 2^3). This
scheme works because any addition of the flag masks can only come from a unique set of QC bits.
assessment
is one of two terms: bad or indeterminate. This simply flags if the test that failed did so because the
data point is of bad quality or if it may be cause for concern.
meaning
is the description of the failure. This is a short statement of which test failed, and flag_meaning
is
listed in the same order as flag_masks
.