tsdat.pipeline.ingest_pipeline

Module Contents

Classes

IngestPipeline


class tsdat.pipeline.ingest_pipeline.IngestPipeline(pipeline_config: Union[str, tsdat.config.Config], storage_config: Union[str, tsdat.io.DatastreamStorage])

Bases: tsdat.pipeline.pipeline.Pipeline

The IngestPipeline class is designed to read in raw, non-standardized data and convert it to a standardized format by embedding metadata, applying quality checks and quality controls, and saving the processed data in a standard file format.

run(self, filepath: Union[str, List[str]]) → None

Runs the IngestPipeline from start to finish.

Parameters

filepath (Union[str, List[str]]) – The path or list of paths to the file(s) to run the pipeline on.
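
The hook descriptions in this class imply a fixed calling order inside run. The sketch below is a stand-in written only to make that order visible. SketchPipeline, its placeholder dictionaries, and the step labels "standardize_dataset", "quality_management", and "save" are assumptions for illustration, not tsdat's implementation.

```python
from typing import Dict, List, Union


class SketchPipeline:
    """Hypothetical stand-in that records the order in which steps run."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    # Each hook only logs that it ran and passes its input through unchanged.
    def hook_customize_raw_datasets(self, raw_dataset_mapping: Dict[str, dict]) -> Dict[str, dict]:
        self.calls.append("hook_customize_raw_datasets")
        return raw_dataset_mapping

    def hook_customize_dataset(self, dataset: dict, raw_mapping: Dict[str, dict]) -> dict:
        self.calls.append("hook_customize_dataset")
        return dataset

    def hook_finalize_dataset(self, dataset: dict) -> dict:
        self.calls.append("hook_finalize_dataset")
        return dataset

    def hook_generate_and_persist_plots(self, dataset: dict) -> None:
        self.calls.append("hook_generate_and_persist_plots")

    def run(self, filepath: Union[str, List[str]]) -> None:
        # Accept either a single path or a list of paths.
        paths = [filepath] if isinstance(filepath, str) else list(filepath)
        # One placeholder "raw dataset" per input file.
        raw_mapping: Dict[str, dict] = {path: {} for path in paths}
        raw_mapping = self.hook_customize_raw_datasets(raw_mapping)
        self.calls.append("standardize_dataset")  # placeholder for merge + standardize
        dataset: dict = {}
        dataset = self.hook_customize_dataset(dataset, raw_mapping)
        self.calls.append("quality_management")  # placeholder for QC checks
        dataset = self.hook_finalize_dataset(dataset)
        self.hook_generate_and_persist_plots(dataset)
        self.calls.append("save")  # placeholder for writing the output file


pipeline = SketchPipeline()
pipeline.run("buoy.csv")
print(pipeline.calls)
```

The printed list shows the raw-data hook first, the dataset hook between standardization and quality management, and the finalize and plotting hooks just before the save step, matching the descriptions of the individual hooks.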

hook_customize_dataset(self, dataset: xarray.Dataset, raw_mapping: Dict[str, xarray.Dataset]) → xarray.Dataset

Hook to allow for user customizations to the standardized dataset such as inserting a derived variable based on other variables in the dataset. This method is called immediately after the standardize_dataset method and before QualityManagement has been run.

Parameters
  • dataset (xr.Dataset) – The dataset to customize.

  • raw_mapping (Dict[str, xr.Dataset]) – The raw dataset mapping.

Returns

The customized dataset.

Return type

xr.Dataset
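
As a minimal sketch of the "derived variable" use case, the function below shows what the body of a hook_customize_dataset override might look like. The variable names wind_u, wind_v, and wind_speed are hypothetical, and the function is written as a free function over a small in-memory dataset so it can run without a configured pipeline.

```python
from typing import Dict

import numpy as np
import xarray as xr


def add_wind_speed(dataset: xr.Dataset, raw_mapping: Dict[str, xr.Dataset]) -> xr.Dataset:
    """Possible hook body: derive wind speed from two existing variables."""
    # Combine the two (hypothetical) velocity components into a magnitude.
    speed = np.sqrt(dataset["wind_u"] ** 2 + dataset["wind_v"] ** 2)
    # Arithmetic drops attributes by default, so set metadata explicitly.
    speed.attrs = {"units": "m/s", "long_name": "Wind speed"}
    dataset["wind_speed"] = speed
    return dataset


time = np.array(["2021-01-01T00:00", "2021-01-01T00:10"], dtype="datetime64[ns]")
ds = xr.Dataset(
    {"wind_u": ("time", [3.0, 6.0]), "wind_v": ("time", [4.0, 8.0])},
    coords={"time": time},
)
ds = add_wind_speed(ds, raw_mapping={})
print(ds["wind_speed"])
```

Because this hook runs before QualityManagement, a variable added here can be covered by the same quality checks as the variables read from file, provided the pipeline configuration describes it.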

hook_customize_raw_datasets(self, raw_dataset_mapping: Dict[str, xarray.Dataset]) → Dict[str, xarray.Dataset]

Hook to allow for user customizations to one or more raw xarray Datasets before they are merged and used to create the standardized dataset. The raw_dataset_mapping will contain one entry for each file being used as input to the pipeline. The keys are the standardized raw file names, and the values are the datasets.

This method would typically only be used if the user is combining multiple files into a single dataset. In this case, this method may be used to correct coordinates if they don’t match across all the files, or to rename variables (columns) if two files use the same name for two distinct variables.

This method can also be used to check for unique conditions in the raw data that should cause a pipeline failure if they are not met.

This method is called before the inputs are merged and converted to standard format as specified by the config file.

Parameters

raw_dataset_mapping (Dict[str, xr.Dataset]) – The raw datasets to customize.

Returns

The customized raw datasets.

Return type

Dict[str, xr.Dataset]
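
The sketch below illustrates both uses described above: failing early when a required condition is not met, and renaming a variable that two input files share by name but not by meaning. The file keys "met.raw.csv" and "buoy.raw.csv", the variable names, and the free-function form are assumptions for illustration. Real keys follow whatever standardized raw file names the pipeline produces.

```python
from typing import Dict

import xarray as xr


def customize_raw_datasets(raw_dataset_mapping: Dict[str, xr.Dataset]) -> Dict[str, xr.Dataset]:
    """Possible hook body: validate inputs and resolve a variable name clash."""
    customized: Dict[str, xr.Dataset] = {}
    for filename, raw in raw_dataset_mapping.items():
        # Fail the run early if an input lacks the variable we expect.
        if "temperature" not in raw.data_vars:
            raise ValueError(f"{filename} has no 'temperature' variable")
        # Both files call their variable "temperature", but one measures air
        # and the other water, so give each a distinct name before merging.
        new_name = "air_temperature" if "met" in filename else "water_temperature"
        customized[filename] = raw.rename({"temperature": new_name})
    return customized


coords = {"time": [0, 1, 2]}
raw_mapping = {
    "met.raw.csv": xr.Dataset({"temperature": ("time", [11.0, 12.0, 13.0])}, coords=coords),
    "buoy.raw.csv": xr.Dataset({"temperature": ("time", [8.0, 8.5, 9.0])}, coords=coords),
}
customized = customize_raw_datasets(raw_mapping)

# After renaming, the two inputs can be merged without a name conflict.
merged = xr.merge(list(customized.values()))
print(sorted(merged.data_vars))
```

Building a new dictionary instead of editing the mapping in place keeps the original raw datasets untouched, which can help when debugging a failed run.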

hook_finalize_dataset(self, dataset: xarray.Dataset) → xarray.Dataset

Hook to apply any final customizations to the dataset before it is saved. This hook is called after QualityManagement has been run and immediately before the dataset is saved to file.

Parameters

dataset (xr.Dataset) – The dataset to finalize.

Returns

The finalized dataset to save.

Return type

xr.Dataset

hook_generate_and_persist_plots(self, dataset: xarray.Dataset) → None

Hook to allow users to create plots from the xarray dataset after the dataset has been finalized and just before the dataset is saved to disk.

To save on filesystem space (which is limited when running on the cloud via a lambda function), this method should only write one plot to local storage at a time. An example of how this could be done is below:

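# Assumes `import matplotlib.pyplot as plt` and that tsdat's DSUtil helper is in scope.
filename = DSUtil.get_plot_filename(dataset, "sea_level", "png")
with self.storage._tmp.get_temp_filepath(filename) as tmp_path:
    # Plot the time series, write it to the temporary path, then hand it to storage.
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(dataset["time"].data, dataset["sea_level"].data)
    fig.savefig(tmp_path)
    plt.close(fig)  # release the figure before creating the next one
    self.storage.save(tmp_path)

filename = DSUtil.get_plot_filename(dataset, "qc_sea_level", "png")
with self.storage._tmp.get_temp_filepath(filename) as tmp_path:
    fig, ax = plt.subplots(figsize=(10, 5))
    DSUtil.plot_qc(dataset, "sea_level", tmp_path)
    plt.close(fig)
    self.storage.save(tmp_path)

Parameters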

dataset (xr.Dataset) – The xarray dataset with customizations and QualityManagement applied.

read_and_persist_raw_files(self, file_paths: List[str]) → List[str]

Renames the provided raw files according to ME Data Standards file naming conventions for raw data files, and returns a list of the paths to the renamed files.

Parameters

file_paths (List[str]) – A list of paths to the original raw files.

Returns

A list of paths to the renamed files.

Return type

List[str]