tsdat

Framework for developing time-series data pipelines that are configurable through yaml configuration files and custom code hooks and components. It was developed with the atmospheric, oceanographic, and renewable energy domains in mind, but is generally applicable to other domains as well.
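
For orientation, a minimal usage sketch is shown below. It assumes a pipeline configuration file and raw input data exist at the (hypothetical) paths shown, and uses the from_yaml constructor provided by the yaml-backed configuration classes:

from pathlib import Path

from tsdat import PipelineConfig

# Parse and validate the pipeline configuration file (path is illustrative).
config = PipelineConfig.from_yaml(Path("pipeline.yaml"))

# Build the configured Pipeline subclass (e.g., an IngestPipeline) along with
# its retriever, dataset, quality, and storage components.
pipeline = config.instantiate_pipeline()

# Run the pipeline on one or more input keys; for an IngestPipeline this
# returns the processed xarray Dataset.
dataset = pipeline.run(["data/raw/example.20230101.000000.csv"])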

Subpackages

Submodules

Classes

CSVHandler

DataHandler specifically tailored to reading and writing files of a specific type.

CSVReader

Uses pandas and xarray functions to read a csv file and extract its contents into an xarray Dataset object.

CSVWriter

Converts a xr.Dataset object to a pandas DataFrame and saves the result to a csv file using pd.DataFrame.to_csv().

CheckFailDelta

Checks for deltas between consecutive values larger than 'fail_delta'.

CheckFailMax

Checks for values greater than 'fail_max'.

CheckFailMin

Checks for values less than 'fail_min'.

CheckFailRangeMax

Checks for values greater than 'fail_range'.

CheckFailRangeMin

Checks for values less than 'fail_range'.

CheckMissing

Checks if any data are missing. A variable's data are considered missing if they are set to the variable's _FillValue (if it has a _FillValue) or NaN (NaT for datetime-like variables).

CheckMonotonic

Checks if any values are not ordered strictly monotonically (i.e. values must all be increasing or all decreasing).

CheckValidDelta

Checks for deltas between consecutive values larger than 'valid_delta'.

CheckValidMax

Checks for values greater than 'valid_max'.

CheckValidMin

Checks for values less than 'valid_min'.

CheckValidRangeMax

Checks for values greater than 'valid_range'.

CheckValidRangeMin

Checks for values less than 'valid_range'.

CheckWarnDelta

Checks for deltas between consecutive values larger than 'warn_delta'.

CheckWarnMax

Checks for values greater than 'warn_max'.

CheckWarnMin

Checks for values less than 'warn_min'.

CheckWarnRangeMax

Checks for values greater than 'warn_range'.

CheckWarnRangeMin

Checks for values less than 'warn_range'.

DataConverter

Base class for running data conversions on retrieved raw data.

DataHandler

Groups a DataReader subclass and a DataWriter subclass together.

DataReader

Base class for reading data from an input source.

DataWriter

Base class for writing data to storage area(s).

DatasetConfig

Defines the structure and metadata of the dataset produced by a tsdat pipeline.

DefaultRetriever

Default API for retrieving data from one or more input sources.

FailPipeline

Raises a DataQualityError, halting the pipeline, if the data quality is sufficiently bad.

FileHandler

DataHandler specifically tailored to reading and writing files of a specific type.

FileSystem

Handles data storage and retrieval for file-based data formats.

FileSystemS3

Handles data storage and retrieval for file-based data in an AWS S3 bucket.

FileWriter

Base class for file-based DataWriters.

IngestPipeline

Pipeline class designed to read in raw, unstandardized time series data and enhance its quality and usability.

NearestNeighbor

Maps data onto the specified coordinate grid using nearest-neighbor.

NetCDFHandler

DataHandler specifically tailored to reading and writing files of a specific type.

NetCDFReader

Thin wrapper around xarray's open_dataset() function, with optional parameters used as keyword arguments in the function call.

NetCDFWriter

Thin wrapper around xarray's Dataset.to_netcdf() function for saving a dataset to a netCDF file.

Overrideable

Abstract base class for generic types.

ParameterizedClass

Base class for any class that accepts 'parameters' as an argument.

ParameterizedConfigClass

ParquetHandler

DataHandler specifically tailored to reading and writing files of a specific type.

ParquetReader

Uses pandas and xarray functions to read a parquet file and extract its contents into an xarray Dataset object.

ParquetWriter

Writes the dataset to a parquet file.

Pipeline

Base class for tsdat data pipelines.

PipelineConfig

Contains configuration parameters for tsdat pipelines.

QualityChecker

Base class for code that checks the dataset / data variable quality.

QualityConfig

Contains quality configuration parameters for tsdat pipelines.

QualityHandler

Base class for code that handles the dataset / data variable quality.

QualityManagement

Main class for orchestrating the dispatch of QualityCheckers and QualityHandlers.

QualityManager

Groups a QualityChecker and one or more QualityHandlers together.

RecordQualityResults

Records the results of the quality check in an ancillary qc variable. Creates the ancillary qc variable if one does not already exist.

RemoveFailedValues

Replaces all failed values with the variable's _FillValue. If the variable does not have a _FillValue attribute then NaN is used instead.

RetrievalRuleSelections

Maps variable names to the rules and conversions that should be applied.

RetrievedDataset

Maps variable names to the input DataArray the data are retrieved from.

Retriever

Base class for retrieving data used as input to tsdat pipelines.

RetrieverConfig

Contains configuration parameters for the tsdat retriever class.

SortDatasetByCoordinate

Sorts the dataset by the failed variable, if there are any failures.

SplitNetCDFHandler

DataHandler specifically tailored to reading and writing files of a specific type.

SplitNetCDFWriter

Wrapper around xarray's Dataset.to_netcdf() function for saving a dataset to a netCDF file based on a particular time interval.

Storage

Abstract base class for the tsdat Storage API. Subclasses of Storage are used in pipelines to save and fetch data.

StorageConfig

Contains configuration parameters for the data storage API used in tsdat pipelines.

StorageRetriever

Retriever API for pulling input data from the storage area.

StorageRetrieverInput

Returns an object representation of an input storage key.

StringToDatetime

Converts date strings into datetime64 data.

TransformationPipeline

Pipeline class designed to read in standardized time series data and enhance its quality and usability.

UnitsConverter

Converts the units of a retrieved variable to specified output units.

YamlModel

ZarrHandler

DataHandler specifically tailored to reading and writing files of a specific type.

ZarrLocalStorage

Handles data storage and retrieval for zarr archives on a local filesystem.

ZarrReader

Uses xarray's Zarr capabilities to read a Zarr archive and extract its contents into an xarray Dataset object.

ZarrWriter

Writes the dataset to a basic zarr archive.

ZipReader

DataReader for reading from a zipped archive. Writing to this format is not supported.

Functions

assert_close

Thin wrapper around xarray.assert_allclose.

assign_data

Assigns the data to the specified variable in the dataset.

decode_cf

Wrapper around xarray.decode_cf() which handles additional edge cases.

generate_schema

get_code_version

get_datastream

get_fields_from_datastream

Extracts fields from the datastream.

get_filename

Returns the standardized filename for the provided dataset.

get_start_date_and_time_str

Gets the start date and start time strings from a Dataset.

get_start_time

Gets the earliest 'time' value and returns it as a pandas Timestamp.

get_version

read_yaml

record_corrections_applied

Records the message on the 'corrections_applied' attribute.

recursive_instantiate

Instantiates all ParameterizedConfigClass components and subcomponents of a given model.

Function Descriptions

Attributes

DATASTREAM_TEMPLATE

FILENAME_TEMPLATE

exception tsdat.ConfigError[source]

Bases: Exception

Common base class for all non-exit exceptions.

exception tsdat.DataQualityError[source]

Bases: ValueError

Raised when the quality of a variable indicates a fatal error has occurred. Manual review of the data in question is often recommended in this case.

class tsdat.CSVHandler[source]

Bases: tsdat.io.base.FileHandler

DataHandler specifically tailored to reading and writing files of a specific type.

Parameters:
  • extension (str) – The specific file extension used for data files, e.g., “.nc”.

  • reader (DataReader) – The DataReader subclass responsible for reading input data.

  • writer (FileWriter) – The FileWriter subclass responsible for writing output data.

extension: str = 'csv'
reader: tsdat.io.readers.CSVReader
writer: tsdat.io.writers.CSVWriter
class tsdat.CSVReader[source]

Bases: tsdat.io.base.DataReader

Uses pandas and xarray functions to read a csv file and extract its contents into an xarray Dataset object. Two parameters are supported: read_csv_kwargs and from_dataframe_kwargs, whose contents are passed as keyword arguments to pandas.read_csv() and xarray.Dataset.from_dataframe() respectively.

class Parameters

Bases: pydantic.BaseModel

from_dataframe_kwargs: Dict[str, Any]
read_csv_kwargs: Dict[str, Any]
parameters: CSVReader.Parameters

Class Methods

read

Reads data given an input key.

Method Descriptions

read(input_key: str) xarray.Dataset[source]

Reads data given an input key.

Uses the input key to open a resource and load data as a xr.Dataset object or as a mapping of strings to xr.Dataset objects.

In most cases DataReaders will only need to return a single xr.Dataset, but occasionally some types of inputs necessitate that the data loaded from the input_key be returned as a mapping. For example, if the input_key is a path to a zip file containing multiple disparate datasets, then returning a mapping is appropriate.

Parameters:

input_key (str) – An input key matching the DataReader’s regex pattern that should be used to load data.

Returns:

Union[xr.Dataset, Dict[str, xr.Dataset]] – The raw data extracted from the provided input key.
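
A reader can also be exercised outside of a pipeline. A minimal sketch, where the file path and read_csv_kwargs values are purely illustrative:

from tsdat import CSVReader

# Parameters are forwarded to pandas.read_csv() and
# xarray.Dataset.from_dataframe() respectively.
reader = CSVReader(parameters={"read_csv_kwargs": {"sep": "\t", "index_col": 0}})
dataset = reader.read("data/raw/example.tsv")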

class tsdat.CSVWriter[source]

Bases: tsdat.io.base.FileWriter

Converts a xr.Dataset object to a pandas DataFrame and saves the result to a csv file using pd.DataFrame.to_csv(). Properties under the to_csv_kwargs parameter are passed to pd.DataFrame.to_csv() as keyword arguments.

class Parameters

Bases: pydantic.BaseModel

dim_order: List[str] | None
to_csv_kwargs: Dict[str, Any]
file_extension: str = 'csv'
parameters: CSVWriter.Parameters

Class Methods

write

Writes the dataset to the provided filepath.

Method Descriptions

write(dataset: xarray.Dataset, filepath: pathlib.Path | None = None, **kwargs: Any) None[source]

Writes the dataset to the provided filepath.

This method is typically called by the tsdat storage API, which will be responsible for providing the filepath, including the file extension.

Parameters:
  • dataset (xr.Dataset) – The dataset to save.

  • filepath (Optional[Path]) – The path to the file to save.

class tsdat.CheckFailDelta[source]

Bases: _CheckDelta

Checks for deltas between consecutive values larger than ‘fail_delta’.

attribute_name: str = 'fail_delta'
class tsdat.CheckFailMax[source]

Bases: _CheckMax

Checks for values greater than ‘fail_max’.

attribute_name: str = 'fail_max'
class tsdat.CheckFailMin[source]

Bases: _CheckMin

Checks for values less than ‘fail_min’.

attribute_name: str = 'fail_min'
class tsdat.CheckFailRangeMax[source]

Bases: _CheckMax

Checks for values greater than ‘fail_range’.

attribute_name: str = 'fail_range'
class tsdat.CheckFailRangeMin[source]

Bases: _CheckMin

Checks for values less than ‘fail_range’.

attribute_name: str = 'fail_range'
class tsdat.CheckMissing[source]

Bases: tsdat.qc.base.QualityChecker

Checks if any data are missing. A variable's data are considered missing if they are set to the variable's _FillValue (if it has a _FillValue) or NaN (NaT for datetime-like variables).

Class Methods

run

Identifies and flags quality problems with the data.

Method Descriptions

run(dataset: xarray.Dataset, variable_name: str) numpy.typing.NDArray[numpy.bool_][source]

Identifies and flags quality problems with the data.

Checks the quality of a specific variable in the dataset and returns the results of the check as a boolean array where True values represent quality problems and False values represent data that passes the quality check.

QualityCheckers should not modify dataset variables; changes to the dataset should be made by QualityHandler(s), which receive the results of a QualityChecker as input.

Parameters:
  • dataset (xr.Dataset) – The dataset containing the variable to check.

  • variable_name (str) – The name of the variable to check.

Returns:

NDArray[np.bool_] – The results of the quality check, where True values indicate a quality problem.

class tsdat.CheckMonotonic[source]

Bases: tsdat.qc.base.QualityChecker

Checks if any values are not ordered strictly monotonically (i.e. values must all be increasing or all decreasing). The check marks values as failed if they break from a monotonic order.

class Parameters

Bases: pydantic.BaseModel

dim: str | None
require_decreasing: bool = False
require_increasing: bool = False

Class Methods

check_monotonic_not_increasing_and_decreasing

Method Descriptions

classmethod check_monotonic_not_increasing_and_decreasing(inc: bool, values: Dict[str, Any]) bool
parameters: CheckMonotonic.Parameters

Class Methods

get_axis

run

Identifies and flags quality problems with the data.

Method Descriptions

get_axis(variable: xarray.DataArray) int[source]
run(dataset: xarray.Dataset, variable_name: str) numpy.typing.NDArray[numpy.bool_] | None[source]

Identifies and flags quality problems with the data.

Checks the quality of a specific variable in the dataset and returns the results of the check as a boolean array where True values represent quality problems and False values represent data that passes the quality check.

QualityCheckers should not modify dataset variables; changes to the dataset should be made by QualityHandler(s), which receive the results of a QualityChecker as input.

Parameters:
  • dataset (xr.Dataset) – The dataset containing the variable to check.

  • variable_name (str) – The name of the variable to check.

Returns:

NDArray[np.bool_] – The results of the quality check, where True values indicate a quality problem.

class tsdat.CheckValidDelta[source]

Bases: _CheckDelta

Checks for deltas between consecutive values larger than ‘valid_delta’.

attribute_name: str = 'valid_delta'
class tsdat.CheckValidMax[source]

Bases: _CheckMax

Checks for values greater than ‘valid_max’.

attribute_name: str = 'valid_max'
class tsdat.CheckValidMin[source]

Bases: _CheckMin

Checks for values less than ‘valid_min’.

attribute_name: str = 'valid_min'
class tsdat.CheckValidRangeMax[source]

Bases: _CheckMax

Checks for values greater than ‘valid_range’.

attribute_name: str = 'valid_range'
class tsdat.CheckValidRangeMin[source]

Bases: _CheckMin

Checks for values less than ‘valid_range’.

attribute_name: str = 'valid_range'
class tsdat.CheckWarnDelta[source]

Bases: _CheckDelta

Checks for deltas between consecutive values larger than ‘warn_delta’.

attribute_name: str = 'warn_delta'
class tsdat.CheckWarnMax[source]

Bases: _CheckMax

Checks for values greater than ‘warn_max’.

attribute_name: str = 'warn_max'
class tsdat.CheckWarnMin[source]

Bases: _CheckMin

Checks for values less than ‘warn_min’.

attribute_name: str = 'warn_min'
class tsdat.CheckWarnRangeMax[source]

Bases: _CheckMax

Checks for values greater than ‘warn_range’.

attribute_name: str = 'warn_range'
class tsdat.CheckWarnRangeMin[source]

Bases: _CheckMin

Checks for values less than ‘warn_range’.

attribute_name: str = 'warn_range'
class tsdat.DataConverter[source]

Bases: tsdat.utils.ParameterizedClass, abc.ABC

Base class for running data conversions on retrieved raw data.

Class Methods

convert

Runs the data converter on the retrieved data.

Method Descriptions

abstract convert(data: xarray.DataArray, variable_name: str, dataset_config: tsdat.config.dataset.DatasetConfig, retrieved_dataset: RetrievedDataset, **kwargs: Any) xarray.DataArray | None[source]

Runs the data converter on the retrieved data.

Parameters:
  • data (xr.DataArray) – The retrieved DataArray to convert.

  • retrieved_dataset (RetrievedDataset) – The retrieved dataset containing data to convert.

  • dataset_config (DatasetConfig) – The output dataset configuration.

  • variable_name (str) – The name of the variable to convert.

Returns:

Optional[xr.DataArray] – The converted DataArray for the specified variable, or None if the conversion was done in-place.
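
To illustrate the interface, a hypothetical converter (ScaleConverter, not part of tsdat) that multiplies retrieved data by a constant factor might look like:

from typing import Any, Optional

import xarray as xr

from tsdat import DataConverter, DatasetConfig, RetrievedDataset


class ScaleConverter(DataConverter):
    """Hypothetical converter that scales the retrieved data by 'factor'."""

    factor: float = 1.0

    def convert(
        self,
        data: xr.DataArray,
        variable_name: str,
        dataset_config: DatasetConfig,
        retrieved_dataset: RetrievedDataset,
        **kwargs: Any,
    ) -> Optional[xr.DataArray]:
        # Return a new DataArray; returning None would indicate that the
        # conversion was done in-place.
        return data * self.factor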

class tsdat.DataHandler[source]

Bases: tsdat.utils.ParameterizedClass

Groups a DataReader subclass and a DataWriter subclass together.

This provides a unified approach to data I/O. DataHandlers are typically expected to be able to round-trip the data, i.e. the following pseudocode is generally true:

handler.read(handler.write(dataset)) == dataset

Parameters:
  • reader (DataReader) – The DataReader subclass responsible for reading input data.

  • writer (FileWriter) – The FileWriter subclass responsible for writing output data.

parameters: Dict[str, Any]
reader: DataReader
writer: DataWriter

Class Methods

patch_parameters

Method Descriptions

patch_parameters(v: DataReader, values: Dict[str, Any], field: pydantic.fields.ModelField)[source]
class tsdat.DataReader[source]

Bases: tsdat.utils.ParameterizedClass, abc.ABC

Base class for reading data from an input source.

Parameters:
regex (Pattern[str]) – The regex pattern associated with the DataReader. If calling the DataReader from a tsdat pipeline, this pattern will be checked against each possible input key before the read() method is called.

Class Methods

read

Reads data given an input key.

Method Descriptions

abstract read(input_key: str) xarray.Dataset | Dict[str, xarray.Dataset][source]

Reads data given an input key.

Uses the input key to open a resource and load data as a xr.Dataset object or as a mapping of strings to xr.Dataset objects.

In most cases DataReaders will only need to return a single xr.Dataset, but occasionally some types of inputs necessitate that the data loaded from the input_key be returned as a mapping. For example, if the input_key is a path to a zip file containing multiple disparate datasets, then returning a mapping is appropriate.

Parameters:

input_key (str) – An input key matching the DataReader’s regex pattern that should be used to load data.

Returns:

Union[xr.Dataset, Dict[str, xr.Dataset]] – The raw data extracted from the provided input key.
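
For illustration only (the built-in NetCDFReader already covers this case), a hypothetical reader implementing this interface could be as simple as:

from typing import Dict, Union

import xarray as xr

from tsdat import DataReader


class SimpleNetCDFReader(DataReader):
    """Hypothetical reader that loads a netCDF file with xarray."""

    def read(self, input_key: str) -> Union[xr.Dataset, Dict[str, xr.Dataset]]:
        # input_key is expected to be a local file path matching the reader's
        # regex pattern.
        return xr.open_dataset(input_key)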

class tsdat.DataWriter[source]

Bases: tsdat.utils.ParameterizedClass, abc.ABC

Base class for writing data to storage area(s).

Class Methods

write

Writes the dataset to the storage area.

Method Descriptions

abstract write(dataset: xarray.Dataset, **kwargs: Any) None[source]

Writes the dataset to the storage area.

This method is typically called by the tsdat storage API, which will be responsible for providing any additional parameters required by subclasses of the tsdat.io.base.DataWriter class.

Parameters:

dataset (xr.Dataset) – The dataset to save.

class tsdat.DatasetConfig[source]

Bases: tsdat.config.utils.YamlModel

Defines the structure and metadata of the dataset produced by a tsdat pipeline.

Also provides methods to support yaml parsing and validation, including generation of json schema.

Parameters:
  • attrs (GlobalAttributes) – Attributes that pertain to the dataset as a whole.

  • coords (Dict[str, Coordinate]) – The dataset’s coordinate variables.

  • data_vars (Dict[str, Variable]) – The dataset’s data variables.

attrs: tsdat.config.attributes.GlobalAttributes
coords: Dict[str, tsdat.config.variables.Coordinate]
data_vars: Dict[str, tsdat.config.variables.Variable]

Class Methods

__contains__

__getitem__

set_variable_name_property

time_in_coords

validate_variable_name_uniqueness

variable_names_are_legal

Method Descriptions

__contains__(__o: object) bool[source]
__getitem__(name: str) tsdat.config.variables.Variable | tsdat.config.variables.Coordinate[source]
classmethod set_variable_name_property(vars: Dict[str, Dict[str, Any]]) Dict[str, Dict[str, Any]][source]
classmethod time_in_coords(coords: Dict[str, tsdat.config.variables.Coordinate]) Dict[str, tsdat.config.variables.Coordinate][source]
classmethod validate_variable_name_uniqueness(values: Any) Any[source]
class tsdat.DefaultRetriever[source]

Bases: tsdat.io.base.Retriever

Default API for retrieving data from one or more input sources.

Reads data from one or more inputs, renames coordinates and data variables according to retrieval and dataset configurations, and applies registered DataConverters to retrieved data.

Parameters:
  • readers (Dict[Pattern[str], DataReader]) – A mapping of patterns to DataReaders that the retriever uses to determine which DataReader to use for reading any given input key.

  • coords (Dict[str, Dict[Pattern[str], VariableRetriever]]) – A dictionary mapping output coordinate variable names to rules for how they should be retrieved.

  • data_vars (Dict[str, Dict[Pattern[str], VariableRetriever]]) – A dictionary mapping output data variable names to rules for how they should be retrieved.

class Parameters

Bases: pydantic.BaseModel

merge_kwargs: Dict[str, Any]

Keyword arguments passed to xr.merge(). This is only relevant if multiple input keys are provided simultaneously, or if any registered DataReader objects could return a dataset mapping instead of a single dataset.

parameters: DefaultRetriever.Parameters
readers: Dict[Pattern, tsdat.io.base.DataReader]

A dictionary of DataReaders that should be used to read data provided an input key.

Class Methods

retrieve

Prepares the raw dataset mapping for use in downstream pipeline processes.

Method Descriptions

retrieve(input_keys: List[str], dataset_config: tsdat.config.dataset.DatasetConfig, **kwargs: Any) xarray.Dataset[source]

Prepares the raw dataset mapping for use in downstream pipeline processes.

This is done by consolidating the data into a single xr.Dataset object. The retrieved dataset may contain additional coords and data_vars that are not defined in the output dataset. Input data converters are applied as part of the preparation process.

Parameters:
  • input_keys (List[str]) – The input keys the registered DataReaders should read from.

  • dataset_config (DatasetConfig) – The specification of the output dataset.

Returns:

xr.Dataset – The retrieved dataset.

class tsdat.FailPipeline[source]

Bases: tsdat.qc.base.QualityHandler

Raises a DataQualityError, halting the pipeline, if the data quality is sufficiently bad. This usually indicates that a manual inspection of the data is recommended.

Raises:

DataQualityError – Raised if the fraction of failed values exceeds the configured tolerance.

class Parameters

Bases: pydantic.BaseModel

context: str = ''

Additional context set by users that ends up in the traceback message.

display_limit: int = 5
tolerance: float = 0

Tolerance for the number of allowable failures as the ratio of allowable failures to the total number of values checked. Defaults to 0, meaning that any failed checks will result in a DataQualityError being raised.

parameters: FailPipeline.Parameters

Class Methods

run

Takes some action on data that has had quality issues identified.

Method Descriptions

run(dataset: xarray.Dataset, variable_name: str, failures: numpy.typing.NDArray[numpy.bool_])[source]

Takes some action on data that has had quality issues identified.

Handles the quality of a variable in the dataset and returns the dataset after any corrections have been applied.

Parameters:
  • dataset (xr.Dataset) – The dataset containing the variable to handle.

  • variable_name (str) – The name of the variable whose quality should be handled.

  • failures (NDArray[np.bool_]) – The results of the QualityChecker for the provided variable, where True values indicate a quality problem.

Returns:

xr.Dataset – The dataset after the QualityHandler has been run.

class tsdat.FileHandler[source]

Bases: DataHandler

DataHandler specifically tailored to reading and writing files of a specific type.

Parameters:
  • extension (str) – The specific file extension used for data files, e.g., “.nc”.

  • reader (DataReader) – The DataReader subclass responsible for reading input data.

  • writer (FileWriter) – The FileWriter subclass responsible for writing output data.

extension: str
reader: DataReader
writer: FileWriter

Class Methods

no_leading_dot

Method Descriptions

no_leading_dot(v: str, values: Dict[str, Any]) str[source]
class tsdat.FileSystem[source]

Bases: tsdat.io.base.Storage

Handles data storage and retrieval for file-based data formats.

Formats that write to directories (such as zarr) are not supported by the FileSystem storage class.

Parameters:
  • parameters (Parameters) – File-system specific parameters, such as the root path to where files should be saved, or additional keyword arguments to specific functions used by the storage API. See the FileSystemStorage.Parameters class for more details.

  • handler (FileHandler) – The FileHandler class that should be used to handle data I/O within the storage API.

class Parameters

Bases: tsdat.io.base.Storage.Parameters

data_filename_template: str = '{datastream}.{date_time}.{extension}'

Template string to use for data filenames.

Allows substitution of the following parameters using curly braces ‘{}’:

  • ext: the file extension from the storage data handler

  • datastream: the datastream from the dataset's global attributes

  • location_id: the location_id from the dataset's global attributes

  • data_level: the data_level from the dataset's global attributes

  • date_time: the first timestamp in the file, formatted as "YYYYMMDD.hhmmss"

  • Any other global attribute that has a string or integer data type.

At a minimum the template must include {date_time}.

data_storage_path: pathlib.Path

The directory structure under storage_root where data files are saved.

Allows substitution of the following parameters using curly braces ‘{}’:

  • storage_root: the value from the storage_root parameter.

  • datastream: the datastream as defined in the dataset config file.

  • location_id: the location_id as defined in the dataset config file.

  • data_level: the data_level as defined in the dataset config file.

  • year: the year of the first timestamp in the file.

  • month: the month of the first timestamp in the file.

  • day: the day of the first timestamp in the file.

  • extension: the file extension used by the output file writer.

Defaults to data/{location_id}/{datastream}.

merge_fetched_data_kwargs: Dict[str, Any]

Keyword arguments passed to xr.merge.

Note that this will only be called if the DataReader returns a dictionary of xr.Datasets for a single input key.

handler: tsdat.io.handlers.FileHandler
parameters: FileSystem.Parameters

Class Methods

fetch_data

Fetches data for a given datastream between a specified time range.

save_ancillary_file

Saves an ancillary filepath to the datastream's ancillary storage area.

save_data

Saves a dataset to the storage area.

Method Descriptions

fetch_data(start: datetime.datetime, end: datetime.datetime, datastream: str, metadata_kwargs: Dict[str, str] | None = None, **kwargs: Any) xarray.Dataset[source]

Fetches data for a given datastream between a specified time range.

Parameters:
  • start (datetime) – The minimum datetime to fetch.

  • end (datetime) – The maximum datetime to fetch.

  • datastream (str) – The datastream id to search for.

  • metadata_kwargs (dict[str, str], optional) – Metadata substitutions to help resolve the data storage path. This is only required if the template data storage path includes any properties other than datastream or fields contained in the datastream. Defaults to None.

Returns:

xr.Dataset – A dataset containing all the data in the storage area that spans the specified datetimes.

save_ancillary_file(filepath: pathlib.Path, target_path: pathlib.Path | None = None)[source]

Saves an ancillary filepath to the datastream’s ancillary storage area.

NOTE: In most cases this function should not be used directly. Instead, prefer using the self.uploadable_dir(*args, **kwargs) method.

Parameters:
  • filepath (Path) – The path to the ancillary file. This is expected to have a standardized filename and should be saved under the ancillary storage path.

  • target_path (str) – The path to where the data should be saved.

save_data(dataset: xarray.Dataset, **kwargs: Any)[source]

Saves a dataset to the storage area.

At a minimum, the dataset must have a ‘datastream’ global attribute and must have a ‘time’ variable with a np.datetime64-like data type.

Parameters:

dataset (xr.Dataset) – The dataset to save.

class tsdat.FileSystemS3[source]

Bases: FileSystem

Handles data storage and retrieval for file-based data in an AWS S3 bucket.

Parameters:
  • parameters (Parameters) – File-system and AWS-specific parameters, such as the path to where files should be saved or additional keyword arguments to specific functions used by the storage API. See the FileSystemS3.Parameters class for more details.

  • handler (FileHandler) – The FileHandler class that should be used to handle data I/O within the storage API.

class Parameters

Bases: FileSystem.Parameters

Additional parameters for S3 storage.

Note that all settings and parameters from FileSystem.Parameters are also supported by FileSystemS3.Parameters.

bucket: str

The name of the S3 bucket that the storage class should use.

Note

This parameter can also be set via the TSDAT_S3_BUCKET_NAME environment variable.

region: str

The AWS region of the storage bucket.

Note

This parameter can also be set via the AWS_DEFAULT_REGION environment variable.

Defaults to us-west-2.

parameters: FileSystemS3.Parameters

Class Methods

last_modified

Returns the datetime of the last modification to the datastream's storage area.

modified_since

Returns the data times of all files modified after the specified datetime.

save_ancillary_file

Saves an ancillary filepath to the datastream's ancillary storage area.

save_data

Saves a dataset to the storage area.

Method Descriptions

last_modified(datastream: str) datetime.datetime | None[source]

Returns the datetime of the last modification to the datastream’s storage area.

modified_since(datastream: str, last_modified: datetime.datetime) List[datetime.datetime][source]

Returns the data times of all files modified after the specified datetime.

save_ancillary_file(filepath: pathlib.Path, target_path: pathlib.Path | None = None)[source]

Saves an ancillary filepath to the datastream’s ancillary storage area.

NOTE: In most cases this function should not be used directly. Instead, prefer using the self.uploadable_dir(*args, **kwargs) method.

Parameters:
  • filepath (Path) – The path to the ancillary file. This is expected to have a standardized filename and should be saved under the ancillary storage path.

  • target_path (str) – The path to where the data should be saved.

save_data(dataset: xarray.Dataset, **kwargs: Any)[source]

Saves a dataset to the storage area.

At a minimum, the dataset must have a ‘datastream’ global attribute and must have a ‘time’ variable with a np.datetime64-like data type.

Parameters:

dataset (xr.Dataset) – The dataset to save.

class tsdat.FileWriter[source]

Bases: DataWriter, abc.ABC

Base class for file-based DataWriters.

Parameters:

file_extension (str) – The file extension that the FileHandler should be used for, e.g., “.nc”, “.csv”, …

file_extension: str

Class Methods

no_leading_dot

write

Writes the dataset to the provided filepath.

Method Descriptions

classmethod no_leading_dot(v: str) str[source]
abstract write(dataset: xarray.Dataset, filepath: pathlib.Path | None = None, **kwargs: Any) None[source]

Writes the dataset to the provided filepath.

This method is typically called by the tsdat storage API, which will be responsible for providing the filepath, including the file extension.

Parameters:
  • dataset (xr.Dataset) – The dataset to save.

  • filepath (Optional[Path]) – The path to the file to save.
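
As a sketch of the interface (JsonWriter is hypothetical and not part of tsdat):

import json
from pathlib import Path
from typing import Any, Optional

import xarray as xr

from tsdat import FileWriter


class JsonWriter(FileWriter):
    """Hypothetical writer that serializes the dataset to a JSON file."""

    file_extension: str = "json"

    def write(
        self, dataset: xr.Dataset, filepath: Optional[Path] = None, **kwargs: Any
    ) -> None:
        # The storage API is responsible for providing the filepath.
        assert filepath is not None
        filepath.write_text(json.dumps(dataset.to_dict(), default=str))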

class tsdat.IngestPipeline[source]

Bases: tsdat.pipeline.base.Pipeline

Pipeline class designed to read in raw, unstandardized time series data and enhance its quality and usability by converting it into a standard format, embedding metadata, applying quality checks and controls, generating reference plots, and saving the data in an accessible format so it can be used later in scientific analyses or in higher-level tsdat Pipelines.

Class Methods

get_ancillary_filepath

Returns the path to where an ancillary file should be saved so that it can be

hook_customize_dataset

Code hook to customize the retrieved dataset prior to qc being applied.

hook_finalize_dataset

Code hook to finalize the dataset after qc is applied but before it is saved.

hook_plot_dataset

Code hook to create plots for the data which runs after the dataset has been saved.

run

Runs the data pipeline on the provided inputs.

Method Descriptions

get_ancillary_filepath(title: str, extension: str = 'png', **kwargs: Any) pathlib.Path[source]

Returns the path to where an ancillary file should be saved so that it can be synced to the storage area automatically.

Parameters:
  • title (str) – The title to use for the plot filename. Should only contain alphanumeric and ‘_’ characters.

  • extension (str, optional) – The file extension. Defaults to “png”.

Returns:

Path – The ancillary filepath.

hook_customize_dataset(dataset: xarray.Dataset) xarray.Dataset[source]

Code hook to customize the retrieved dataset prior to qc being applied.

Parameters:

dataset (xr.Dataset) – The output dataset structure returned by the retriever API.

Returns:

xr.Dataset – The customized dataset.

hook_finalize_dataset(dataset: xarray.Dataset) xarray.Dataset[source]

Code hook to finalize the dataset after qc is applied but before it is saved.

Parameters:

dataset (xr.Dataset) – The output dataset returned by the retriever API and modified by the hook_customize_dataset user code hook.

Returns:

xr.Dataset – The finalized dataset, ready to be saved.

hook_plot_dataset(dataset: xarray.Dataset)[source]

Code hook to create plots for the data which runs after the dataset has been saved.

Parameters:

dataset (xr.Dataset) – The dataset to plot.

run(inputs: List[str], **kwargs: Any) xarray.Dataset[source]

Runs the data pipeline on the provided inputs.

Parameters:
inputs (List[str]) – A list of input keys that the pipeline's Retriever class can use to load data into the pipeline.

Returns:

xr.Dataset – The processed dataset.
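
A user pipeline typically subclasses IngestPipeline and overrides only the hooks it needs; a minimal hypothetical sketch:

import xarray as xr

from tsdat import IngestPipeline


class ExampleIngest(IngestPipeline):
    """Hypothetical ingest pipeline overriding two optional code hooks."""

    def hook_customize_dataset(self, dataset: xr.Dataset) -> xr.Dataset:
        # Runs before quality management; e.g., derive or rename variables here.
        return dataset

    def hook_plot_dataset(self, dataset: xr.Dataset):
        # Runs after the dataset is saved; create plots and save them to the
        # path returned by self.get_ancillary_filepath("example_plot").
        pass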

class tsdat.NearestNeighbor[source]

Bases: tsdat.io.base.DataConverter

Maps data onto the specified coordinate grid using nearest-neighbor.

coord: str = 'time'

The coordinate axis this converter should be applied on. Defaults to ‘time’.

Class Methods

convert

Runs the data converter on the retrieved data.

Method Descriptions

convert(data: xarray.DataArray, variable_name: str, dataset_config: tsdat.config.dataset.DatasetConfig, retrieved_dataset: tsdat.io.base.RetrievedDataset, **kwargs: Any) xarray.DataArray | None[source]

Runs the data converter on the retrieved data.

Parameters:
  • data (xr.DataArray) – The retrieved DataArray to convert.

  • retrieved_dataset (RetrievedDataset) – The retrieved dataset containing data to convert.

  • dataset_config (DatasetConfig) – The output dataset configuration.

  • variable_name (str) – The name of the variable to convert.

Returns:

Optional[xr.DataArray] – The converted DataArray for the specified variable, or None if the conversion was done in-place.

class tsdat.NetCDFHandler[source]

Bases: tsdat.io.base.FileHandler

DataHandler specifically tailored to reading and writing files of a specific type.

Parameters:
  • extension (str) – The specific file extension used for data files, e.g., “.nc”.

  • reader (DataReader) – The DataReader subclass responsible for reading input data.

  • writer (FileWriter) – The FileWriter subclass responsible for writing output data.

extension: str = 'nc'
reader: tsdat.io.readers.NetCDFReader
writer: tsdat.io.writers.NetCDFWriter
class tsdat.NetCDFReader[source]

Bases: tsdat.io.base.DataReader

Thin wrapper around xarray’s open_dataset() function, with optional parameters used as keyword arguments in the function call.

parameters: Dict[str, Any]

Class Methods

read

Reads data given an input key.

Method Descriptions

read(input_key: str) xarray.Dataset[source]

Reads data given an input key.

Uses the input key to open a resource and load data as a xr.Dataset object or as a mapping of strings to xr.Dataset objects.

In most cases DataReaders will only need to return a single xr.Dataset, but occasionally some types of inputs necessitate that the data loaded from the input_key be returned as a mapping. For example, if the input_key is a path to a zip file containing multiple disparate datasets, then returning a mapping is appropriate.

Parameters:

input_key (str) – An input key matching the DataReader’s regex pattern that should be used to load data.

Returns:

Union[xr.Dataset, Dict[str, xr.Dataset]] – The raw data extracted from the provided input key.

class tsdat.NetCDFWriter[source]

Bases: tsdat.io.base.FileWriter

Thin wrapper around xarray’s Dataset.to_netcdf() function for saving a dataset to a netCDF file. Properties under the to_netcdf_kwargs parameter will be passed to Dataset.to_netcdf() as keyword arguments.

File compression is used by default to save disk space. To disable compression set the compression_level parameter to 0.

class Parameters

Bases: pydantic.BaseModel

compression_engine: str = 'zlib'

The compression engine to use.

compression_level: int = 1

The level of compression to use (0-9). Set to 0 to not use compression.

to_netcdf_kwargs: Dict[str, Any]

Keyword arguments passed directly to xr.Dataset.to_netcdf().

file_extension: str = 'nc'
parameters: NetCDFWriter.Parameters

Class Methods

write

Writes the dataset to the provided filepath.

Method Descriptions

write(dataset: xarray.Dataset, filepath: pathlib.Path | None = None, **kwargs: Any) None[source]

Writes the dataset to the provided filepath.

This method is typically called by the tsdat storage API, which will be responsible for providing the filepath, including the file extension.

Parameters:
  • dataset (xr.Dataset) – The dataset to save.

  • filepath (Optional[Path]) – The path to the file to save.

class tsdat.Overrideable[source]

Bases: YamlModel, pydantic.generics.GenericModel, Generic[Config]

Abstract base class for generic types.

A generic type is typically declared by inheriting from this class parameterized with one or more type variables. For example, a generic mapping type might be defined as:

class Mapping(Generic[KT, VT]):
    def __getitem__(self, key: KT) -> VT:
        ...
    # Etc.

This class can then be used as follows:

def lookup_name(mapping: Mapping[KT, VT], key: KT, default: VT) -> VT:
    try:
        return mapping[key]
    except KeyError:
        return default
overrides: Dict[str, Any]
path: pydantic.FilePath
class tsdat.ParameterizedClass[source]

Bases: pydantic.BaseModel

Base class for any class that accepts ‘parameters’ as an argument.

Sets the default 'parameters' to {}. Subclasses of ParameterizedClass should override the 'parameters' property to support custom required or optional arguments from configuration files.

parameters: Any
class tsdat.ParameterizedConfigClass[source]

Bases: pydantic.BaseModel

classname: pydantic.StrictStr
parameters: Dict[str, Any]

Class Methods

classname_looks_like_a_module

instantiate

Instantiates and returns the class specified by the 'classname' parameter.

Method Descriptions

classmethod classname_looks_like_a_module(v: pydantic.StrictStr) pydantic.StrictStr[source]
instantiate() Any[source]

Instantiates and returns the class specified by the ‘classname’ parameter.

Returns:

Any – An instance of the specified class.
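
A minimal sketch of how a configuration entry becomes a live object; the classname and parameters shown are illustrative:

from tsdat import ParameterizedConfigClass

config = ParameterizedConfigClass(
    classname="tsdat.io.readers.CSVReader",
    parameters={"read_csv_kwargs": {"sep": ","}},
)
reader = config.instantiate()  # an instance of CSVReader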

class tsdat.ParquetHandler[source]

Bases: tsdat.io.base.FileHandler

DataHandler specifically tailored to reading and writing files of a specific type.

Parameters:
  • extension (str) – The specific file extension used for data files, e.g., “.nc”.

  • reader (DataReader) – The DataReader subclass responsible for reading input data.

  • writer (FileWriter) – The FileWriter subclass responsible for writing output data.

extension: str = 'parquet'
reader: tsdat.io.readers.ParquetReader
writer: tsdat.io.writers.ParquetWriter
class tsdat.ParquetReader[source]

Bases: tsdat.io.base.DataReader

Uses pandas and xarray functions to read a parquet file and extract its contents into an xarray Dataset object. Two parameters are supported: read_parquet_kwargs and from_dataframe_kwargs, whose contents are passed as keyword arguments to pandas.read_parquet() and xarray.Dataset.from_dataframe() respectively.

class Parameters

Bases: pydantic.BaseModel

from_dataframe_kwargs: Dict[str, Any]
read_parquet_kwargs: Dict[str, Any]
parameters: ParquetReader.Parameters

Class Methods

read

Reads data given an input key.

Method Descriptions

read(input_key: str) xarray.Dataset[source]

Reads data given an input key.

Uses the input key to open a resource and load data as a xr.Dataset object or as a mapping of strings to xr.Dataset objects.

In most cases DataReaders will only need to return a single xr.Dataset, but occasionally some types of inputs necessitate that the data loaded from the input_key be returned as a mapping. For example, if the input_key is a path to a zip file containing multiple disparate datasets, then returning a mapping is appropriate.

Parameters:

input_key (str) – An input key matching the DataReader’s regex pattern that should be used to load data.

Returns:

Union[xr.Dataset, Dict[str, xr.Dataset]] – The raw data extracted from the provided input key.

class tsdat.ParquetWriter[source]

Bases: tsdat.io.base.FileWriter

Writes the dataset to a parquet file.

Converts a xr.Dataset object to a pandas DataFrame and saves the result to a parquet file using pd.DataFrame.to_parquet(). Properties under the to_parquet_kwargs parameter are passed to pd.DataFrame.to_parquet() as keyword arguments.

class Parameters

Bases: pydantic.BaseModel

dim_order: List[str] | None
to_parquet_kwargs: Dict[str, Any]
file_extension: str = 'parquet'
parameters: ParquetWriter.Parameters

Class Methods

write

Writes the dataset to the provided filepath.

Method Descriptions

write(dataset: xarray.Dataset, filepath: pathlib.Path | None = None, **kwargs: Any) None[source]

Writes the dataset to the provided filepath.

This method is typically called by the tsdat storage API, which will be responsible for providing the filepath, including the file extension.

Parameters:
  • dataset (xr.Dataset) – The dataset to save.

  • filepath (Optional[Path]) – The path to the file to save.

class tsdat.Pipeline[source]

Bases: tsdat.utils.ParameterizedClass, abc.ABC

Base class for tsdat data pipelines.

dataset_config: tsdat.config.dataset.DatasetConfig

Describes the structure and metadata of the output dataset.

quality: tsdat.qc.base.QualityManagement

Manages the dataset quality through checks and corrections.

retriever: tsdat.io.base.Retriever

Retrieves data from input keys.

settings: Any
storage: tsdat.io.base.Storage

Stores the dataset so it can be retrieved later.

triggers: List[Pattern] = []

Regex patterns matching input keys to determine when the pipeline should run.

Class Methods

prepare_retrieved_dataset

Modifies the retrieved dataset by dropping variables not declared in the

run

Runs the data pipeline on the provided inputs.

Method Descriptions

prepare_retrieved_dataset(dataset: xarray.Dataset) xarray.Dataset[source]

Modifies the retrieved dataset by dropping variables not declared in the DatasetConfig, adding static variables, initializing non-retrieved variables, and importing global and variable-level attributes from the DatasetConfig.

Parameters:

dataset (xr.Dataset) – The retrieved dataset.

Returns:

xr.Dataset – The dataset with structure and metadata matching the DatasetConfig.

abstract run(inputs: List[str], **kwargs: Any) Any[source]

Runs the data pipeline on the provided inputs.

Parameters:
inputs (List[str]) – A list of input keys that the pipeline's Retriever class can use to load data into the pipeline.

Returns:

xr.Dataset – The processed dataset.

class tsdat.PipelineConfig[source]

Bases: tsdat.config.utils.ParameterizedConfigClass, tsdat.config.utils.YamlModel

Contains configuration parameters for tsdat pipelines.

This class is ultimately converted into a tsdat.pipeline.base.Pipeline subclass that will be used to process data.

Provides methods to support yaml parsing and validation, including the generation of json schema for immediate validation. This class also provides a method to instantiate a tsdat.pipeline.base.Pipeline subclass from a parsed configuration file.

Parameters:
  • classname (str) – The dotted module path to the pipeline that the specified configurations should apply to. To use the built-in IngestPipeline, for example, you would set ‘tsdat.pipeline.pipelines.IngestPipeline’ as the classname.

  • triggers (List[Pattern[str]]) – A list of regex patterns that should trigger this pipeline when matched with an input key.

  • retriever (Union[Overrideable[RetrieverConfig], RetrieverConfig]) – Either the path to the retriever configuration yaml file and any overrides that should be applied, or the retriever configurations themselves.

  • dataset (Union[Overrideable[DatasetConfig], DatasetConfig]) – Either the path to the dataset configuration yaml file and any overrides that should be applied, or the dataset configurations themselves.

  • quality (Union[Overrideable[QualityConfig], QualityConfig]) – Either the path to the quality configuration yaml file and any overrides that should be applied, or the quality configurations themselves.

  • storage (Union[Overrideable[StorageConfig], StorageConfig]) – Either the path to the storage configuration yaml file and any overrides that should be applied, or the storage configurations themselves.

dataset: tsdat.config.utils.Overrideable[tsdat.config.dataset.DatasetConfig] | tsdat.config.dataset.DatasetConfig
quality: tsdat.config.utils.Overrideable[tsdat.config.quality.QualityConfig] | tsdat.config.quality.QualityConfig
retriever: tsdat.config.utils.Overrideable[tsdat.config.retriever.RetrieverConfig] | tsdat.config.retriever.RetrieverConfig
storage: tsdat.config.utils.Overrideable[tsdat.config.storage.StorageConfig] | tsdat.config.storage.StorageConfig
triggers: List[Pattern]

Class Methods

instantiate_pipeline

Loads the tsdat.pipeline.base.Pipeline subclass specified by the classname property.

merge_overrideable_yaml

Method Descriptions

instantiate_pipeline() tsdat.pipeline.base.Pipeline[source]

Loads the tsdat.pipeline.base.Pipeline subclass specified by the classname property.

Properties and sub-properties of the PipelineConfig class that are subclasses of tsdat.config.utils.ParameterizedConfigClass (e.g., classes that define a 'classname' and optional 'parameters' properties) will also be instantiated in similar fashion. See tsdat.config.utils.recursive_instantiate for implementation details.

Returns:

Pipeline – An instance of a tsdat.pipeline.base.Pipeline subclass.

classmethod merge_overrideable_yaml(v: Dict[str, Any], values: Dict[str, Any], field: pydantic.fields.ModelField)[source]
class tsdat.QualityChecker[source]

Bases: tsdat.utils.ParameterizedClass, abc.ABC

Base class for code that checks the dataset / data variable quality.

Class Methods

run

Identifies and flags quality problems with the data.

Method Descriptions

abstract run(dataset: xarray.Dataset, variable_name: str) numpy.typing.NDArray[numpy.bool_] | None[source]

Identifies and flags quality problems with the data.

Checks the quality of a specific variable in the dataset and returns the results of the check as a boolean array where True values represent quality problems and False values represent data that passes the quality check.

QualityCheckers should not modify dataset variables; changes to the dataset should be made by QualityHandler(s), which receive the results of a QualityChecker as input.

Parameters:
  • dataset (xr.Dataset) – The dataset containing the variable to check.

  • variable_name (str) – The name of the variable to check.

Returns:

NDArray[np.bool_] – The results of the quality check, where True values indicate a quality problem.
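
A hypothetical checker implementing this interface (CheckNegative is not part of tsdat):

import numpy as np
import xarray as xr
from numpy.typing import NDArray

from tsdat import QualityChecker


class CheckNegative(QualityChecker):
    """Hypothetical checker that flags negative values as quality problems."""

    def run(self, dataset: xr.Dataset, variable_name: str) -> NDArray[np.bool_]:
        # True marks a problem; the checker does not modify the dataset.
        return dataset[variable_name].values < 0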

class tsdat.QualityConfig[source]

Bases: tsdat.config.utils.YamlModel

Contains quality configuration parameters for tsdat pipelines.

This class will ultimately be converted into a tsdat.qc.base.QualityManagement class for use in downstream tsdat pipeline code.

Provides methods to support yaml parsing and validation, including the generation of json schema for immediate validation.

Parameters:

managers (List[ManagerConfig]) – A list of quality checks and controls that should be applied.

managers: List[ManagerConfig]

Class Methods

validate_manager_names_are_unique

Method Descriptions

classmethod validate_manager_names_are_unique(v: List[ManagerConfig]) List[ManagerConfig][source]
class tsdat.QualityHandler[source]

Bases: tsdat.utils.ParameterizedClass, abc.ABC

Base class for code that handles the dataset / data variable quality.

Class Methods

run

Takes some action on data that has had quality issues identified.

Method Descriptions

abstract run(dataset: xarray.Dataset, variable_name: str, failures: numpy.typing.NDArray[numpy.bool_]) xarray.Dataset[source]

Takes some action on data that has had quality issues identified.

Handles the quality of a variable in the dataset and returns the dataset after any corrections have been applied.

Parameters:
  • dataset (xr.Dataset) – The dataset containing the variable to handle.

  • variable_name (str) – The name of the variable whose quality should be handled.

  • failures (NDArray[np.bool_]) – The results of the QualityChecker for the provided variable, where True values indicate a quality problem.

Returns:

xr.Dataset – The dataset after the QualityHandler has been run.
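
A hypothetical handler implementing this interface (ClampToZero is not part of tsdat):

import numpy as np
import xarray as xr
from numpy.typing import NDArray

from tsdat import QualityHandler


class ClampToZero(QualityHandler):
    """Hypothetical handler that replaces failed values with zero."""

    def run(
        self,
        dataset: xr.Dataset,
        variable_name: str,
        failures: NDArray[np.bool_],
    ) -> xr.Dataset:
        if failures.any():
            # Keep values where the check passed; replace failures with 0.
            dataset[variable_name] = dataset[variable_name].where(~failures, 0)
        return dataset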

class tsdat.QualityManagement[source]

Bases: pydantic.BaseModel

Main class for orchestrating the dispatch of QualityCheckers and QualityHandlers.

Parameters:

managers (List[QualityManager]) – The list of QualityManagers that should be run.

managers: List[QualityManager]

Class Methods

manage

Runs the registered QualityManagers on the dataset.

Method Descriptions

manage(dataset: xarray.Dataset) xarray.Dataset[source]

Runs the registered QualityManagers on the dataset.

Parameters:

dataset (xr.Dataset) – The dataset to apply quality checks and controls to.

Returns:

xr.Dataset – The quality-checked dataset.

class tsdat.QualityManager[source]

Bases: pydantic.BaseModel

Groups a QualityChecker and one or more QualityHandlers together.

Parameters:
  • name (str) – The name of the quality manager.

  • checker (QualityChecker) – The quality check that should be run.

  • handlers (QualityHandler) – One or more QualityHandlers that should be run given the results of the checker.

  • apply_to (List[str]) – A list of variables that the check should run for. Accepts keywords of ‘COORDS’ or ‘DATA_VARS’, or any number of specific variables that should be run.

  • exclude (List[str]) – A list of variables that the check should exclude. Accepts the same keywords as apply_to.

apply_to: List[str]
checker: QualityChecker
exclude: List[str] = []
handlers: List[QualityHandler]
name: str

Class Methods

run

Runs the quality manager on the dataset.

Method Descriptions

run(dataset: xarray.Dataset) xarray.Dataset[source]

Runs the quality manager on the dataset.

Parameters:

dataset (xr.Dataset) – The dataset to apply quality checks / controls to.

Returns:

xr.Dataset – The dataset after the quality check and controls have been applied.

class tsdat.RecordQualityResults[source]

Bases: tsdat.qc.base.QualityHandler

Records the results of the quality check in an ancillary qc variable. Creates the ancillary qc variable if one does not already exist.

class Parameters

Bases: pydantic.BaseModel

assessment: Literal[bad, indeterminate]

Indicates the quality of the data if the test results indicate a failure.

bit: int | None

DEPRECATED

The bit number (e.g., 1, 2, 3, …) used to indicate if the check passed.

The quality results are bitpacked into an integer array to preserve space. For example, if ‘check #0’ uses bit 0 and fails, and ‘check #1’ uses bit 1 and fails then the resulting value on the qc variable would be 2^(0) + 2^(1) = 3. If we had a third check it would be 2^(0) + 2^(1) + 2^(2) = 7.

meaning: str

A string that describes the test applied.

Class Methods

deprecate_bit_parameter

to_lower

Method Descriptions

deprecate_bit_parameter(values: Dict[str, Any]) Dict[str, Any]
to_lower(assessment: Any) str
parameters: RecordQualityResults.Parameters

Class Methods

get_next_bit_number

run

Takes some action on data that has had quality issues identified.

Method Descriptions

get_next_bit_number(dataset: xarray.Dataset, variable_name: str) int[source]
run(dataset: xarray.Dataset, variable_name: str, failures: numpy.typing.NDArray[numpy.bool_]) xarray.Dataset[source]

Takes some action on data that has had quality issues identified.

Handles the quality of a variable in the dataset and returns the dataset after any corrections have been applied.

Parameters:
  • dataset (xr.Dataset) – The dataset containing the variable to handle.

  • variable_name (str) – The name of the variable whose quality should be handled.

  • failures (NDArray[np.bool_]) – The results of the QualityChecker for the provided variable, where True values indicate a quality problem.

Returns:

xr.Dataset – The dataset after the QualityHandler has been run.

class tsdat.RemoveFailedValues[source]

Bases: tsdat.qc.base.QualityHandler

Replaces all failed values with the variable's _FillValue. If the variable does not have a _FillValue attribute then NaN is used instead.

Class Methods

run

Takes some action on data that has had quality issues identified.

Method Descriptions

run(dataset: xarray.Dataset, variable_name: str, failures: numpy.typing.NDArray[numpy.bool_]) xarray.Dataset[source]

Takes some action on data that has had quality issues identified.

Handles the quality of a variable in the dataset and returns the dataset after any corrections have been applied.

Parameters:
  • dataset (xr.Dataset) – The dataset containing the variable to handle.

  • variable_name (str) – The name of the variable whose quality should be handled.

  • failures (NDArray[np.bool_]) – The results of the QualityChecker for the provided variable, where True values indicate a quality problem.

Returns:

xr.Dataset – The dataset after the QualityHandler has been run.

class tsdat.RetrievalRuleSelections[source]

Bases: NamedTuple

Maps variable names to the rules and conversions that should be applied.

coords: Dict[VarName, RetrievedVariable]
data_vars: Dict[VarName, RetrievedVariable]
class tsdat.RetrievedDataset[source]

Bases: NamedTuple

Maps variable names to the input DataArray the data are retrieved from.

coords: Dict[VarName, xarray.DataArray]
data_vars: Dict[VarName, xarray.DataArray]

Class Methods

from_xr_dataset

Method Descriptions

classmethod from_xr_dataset(dataset: xarray.Dataset)[source]
class tsdat.Retriever[source]

Bases: tsdat.utils.ParameterizedClass, abc.ABC

Base class for retrieving data used as input to tsdat pipelines.

Parameters:

readers (Dict[str, DataReader]) – The mapping of readers that should be used to retrieve data given input_keys and optional keyword arguments provided by subclasses of Retriever.

coords: Dict[str, Dict[Pattern, RetrievedVariable]]

A dictionary mapping output coordinate names to the retrieval rules and preprocessing actions (e.g., DataConverters) that should be applied to each retrieved coordinate variable.

data_vars: Dict[str, Dict[Pattern, RetrievedVariable]]

A dictionary mapping output data variable names to the retrieval rules and preprocessing actions (e.g., DataConverters) that should be applied to each retrieved data variable.

readers: Dict[Pattern, Any] | None

Mapping of readers that should be used to read data given input keys.

Class Methods

retrieve

Prepares the raw dataset mapping for use in downstream pipeline processes.

Method Descriptions

abstract retrieve(input_keys: List[str], dataset_config: tsdat.config.dataset.DatasetConfig, **kwargs: Any) xarray.Dataset[source]

Prepares the raw dataset mapping for use in downstream pipeline processes.

This is done by consolidating the data into a single xr.Dataset object. The retrieved dataset may contain additional coords and data_vars that are not defined in the output dataset. Input data converters are applied as part of the preparation process.

Parameters:
  • input_keys (List[str]) – The input keys the registered DataReaders should read from.

  • dataset_config (DatasetConfig) – The specification of the output dataset.

Returns:

xr.Dataset – The retrieved dataset.

class tsdat.RetrieverConfig[source]

Bases: tsdat.config.utils.ParameterizedConfigClass, tsdat.config.utils.YamlModel

Contains configuration parameters for the tsdat retriever class.

This class will ultimately be converted into a tsdat.io.base.Retriever subclass for use in tsdat pipelines.

Provides methods to support yaml parsing and validation, including the generation of json schema for immediate validation. This class also provides a method to instantiate a tsdat.io.base.Retriever subclass from a parsed configuration file.

Parameters:
  • classname (str) – The dotted module path to the pipeline that the specified configurations should apply to. To use the built-in IngestPipeline, for example, you would set ‘tsdat.pipeline.pipelines.IngestPipeline’ as the classname.

  • readers (Dict[str, DataReaderConfig]) – The DataReaders to use for reading input data.

coords: Dict[str, Dict[Pattern, RetrievedVariableConfig] | RetrievedVariableConfig]
data_vars: Dict[str, Dict[Pattern, RetrievedVariableConfig] | RetrievedVariableConfig]
readers: Dict[Pattern, DataReaderConfig] | None

Class Methods

coerce_to_patterned_retriever

Method Descriptions

classmethod coerce_to_patterned_retriever(var_dict: Dict[str, Dict[Pattern, RetrievedVariableConfig] | RetrievedVariableConfig]) Dict[str, Dict[Pattern[str], RetrievedVariableConfig]][source]
class tsdat.SortDatasetByCoordinate[source]

Bases: tsdat.qc.base.QualityHandler

Sorts the dataset by the failed variable, if there are any failures.
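
Example (conceptual sketch of the handler’s effect only; the handler itself is normally configured in a quality config file alongside a checker such as CheckMonotonic, and the data below are made up):

import numpy as np
import xarray as xr

# A toy dataset whose time coordinate is out of order.
ds = xr.Dataset(
    {"temperature": ("time", np.array([3.0, 1.0, 2.0]))},
    coords={"time": np.array([2, 0, 1])},
)

# Sorting along the failed coordinate is effectively what this handler does.
sorted_ds = ds.sortby("time", ascending=True)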

class Parameters

Bases: pydantic.BaseModel

ascending: bool = True

Whether to sort the dataset in ascending order. Defaults to True.

correction: str = 'Coordinate data was sorted in order to ensure monotonicity.'
parameters: SortDatasetByCoordinate.Parameters

Class Methods

run

Takes some action on data that has had quality issues identified.

Method Descriptions

run(dataset: xarray.Dataset, variable_name: str, failures: numpy.typing.NDArray[numpy.bool_]) xarray.Dataset[source]

Takes some action on data that has had quality issues identified.

Handles the quality of a variable in the dataset and returns the dataset after any corrections have been applied.

Parameters:
  • dataset (xr.Dataset) – The dataset containing the variable to handle.

  • variable_name (str) – The name of the variable whose quality should be handled.

  • failures (NDArray[np.bool_]) – The results of the QualityChecker for the provided variable, where True values indicate a quality problem.

Returns:

xr.Dataset – The dataset after the QualityHandler has been run.

class tsdat.SplitNetCDFHandler[source]

Bases: tsdat.io.base.FileHandler

DataHandler specifically tailored to reading and writing files of a specific type.

Parameters:
  • extension (str) – The specific file extension used for data files, e.g., “.nc”.

  • reader (DataReader) – The DataReader subclass responsible for reading input data.

  • writer (FileWriter) – The FileWriter subclass responsible for writing output data.

extension: str = 'nc'
reader: tsdat.io.readers.NetCDFReader
writer: tsdat.io.writers.SplitNetCDFWriter
class tsdat.SplitNetCDFWriter[source]

Bases: NetCDFWriter

Wrapper around xarray’s Dataset.to_netcdf() function for saving a dataset to netCDF files split by a particular time interval; an extension of the NetCDFWriter. Files are split (sliced) via a time interval specified in two parts: time_interval, a literal value, and a time_unit character (year: “Y”, month: “M”, day: “D”, hour: “h”, minute: “m”, second: “s”).

Properties under the to_netcdf_kwargs parameter will be passed to Dataset.to_netcdf() as keyword arguments. File compression is used by default to save disk space. To disable compression set the compression_level parameter to 0.
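
Example (illustrative sketch of the splitting behavior only, assuming the default parameters time_interval=1 and time_unit="D"; this is not the writer’s actual implementation, and the data and output filenames are made up):

import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset(
    {"temperature": ("time", np.random.rand(48))},
    coords={"time": pd.date_range("2022-04-13", periods=48, freq="h")},
)

interval = np.timedelta64(1, "D")  # time_interval=1, time_unit="D"
start = ds.time.values[0]
index = 0
while start <= ds.time.values[-1]:
    stop = start + interval
    # Select one interval's worth of data and write it to its own file.
    chunk = ds.sel(time=slice(start, stop - np.timedelta64(1, "ns")))
    chunk.to_netcdf(f"example.split.{index}.nc")
    start = stop
    index += 1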

class Parameters

Bases: NetCDFWriter.Parameters

time_interval: int = 1

Time interval value.

time_unit: str = 'D'

Time interval unit.

file_extension: str = 'nc'
parameters: SplitNetCDFWriter.Parameters

Class Methods

write

Writes the dataset to the provided filepath.

Method Descriptions

write(dataset: xarray.Dataset, filepath: pathlib.Path | None = None, **kwargs: Any) None[source]

Writes the dataset to the provided filepath.

This method is typically called by the tsdat storage API, which will be responsible for providing the filepath, including the file extension.

Parameters:
  • dataset (xr.Dataset) – The dataset to save.

  • filepath (Optional[Path]) – The path to the file to save.

class tsdat.Storage[source]

Bases: tsdat.utils.ParameterizedClass, abc.ABC

Abstract base class for the tsdat Storage API. Subclasses of Storage are used in pipelines to persist data and ancillary files (e.g., plots).

Parameters:
  • parameters (Any) – Configuration parameters for the Storage API. The specific parameters that are allowed will be defined by subclasses of this base class.

  • handler (DataHandler) – The DataHandler responsible for handling both read and write operations needed by the storage API.

class Parameters

Bases: pydantic.BaseSettings

ancillary_filename_template: str = '{datastream}.{date_time}.{title}.{extension}'

Template string to use for ancillary filenames.

Allows substitution of the following parameters using curly braces ‘{}’:

  • title: a provided label for the ancillary file or plot.

  • extension: the file extension (e.g., ‘png’, ‘gif’).

  • datastream from the related xr.Dataset object’s global attributes.

  • location_id from the related xr.Dataset object’s global attributes.

  • data_level from the related xr.Dataset object’s global attributes.

  • year, month, day, hour, minute, second of the first timestamp in the data.

  • date_time: the first timestamp in the file formatted as “YYYYMMDD.hhmmss”.

  • The names of any other global attributes of the related xr.Dataset object.

At a minimum the template must include {date_time}.
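
For illustration, the substitution behaves like standard Python curly-brace formatting (the values below are made up):

template = "{datastream}.{date_time}.{title}.{extension}"
print(template.format(
    datastream="sgp.met.b0",
    date_time="20220424.165314",
    title="wind_speed",
    extension="png",
))
# sgp.met.b0.20220424.165314.wind_speed.png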

ancillary_storage_path: str = 'ancillary/{location_id}/{datastream}'

The directory structure under storage_root where ancillary files are saved.

Allows substitution of the following parameters using curly braces ‘{}’:

  • extension: the file extension (e.g., ‘png’, ‘gif’).

  • datastream from the related xr.Dataset object’s global attributes.

  • location_id from the related xr.Dataset object’s global attributes.

  • data_level from the related xr.Dataset object’s global attributes.

  • year, month, day, hour, minute, second of the first timestamp in the data.

  • date_time: the first timestamp in the file formatted as “YYYYMMDD.hhmmss”.

  • The names of any other global attributes of the related xr.Dataset object.

Defaults to ancillary/{location_id}/{datastream}.

storage_root: pathlib.Path

The path on disk where, at a minimum, ancillary files will be saved. For file-based storage classes this is also the root path for data files. Defaults to the storage/root folder in the active working directory.

NOTE: This parameter can also be set via the TSDAT_STORAGE_ROOT environment variable.

handler: DataHandler

Defines methods for reading and writing datasets from the storage area.

parameters: Storage.Parameters

Parameters used by the storage API that can be set through configuration files, environment variables, or directly.

Class Methods

fetch_data

Fetches a dataset from the storage area.

get_ancillary_filepath

Returns the filepath for the given datastream and title of an ancillary file

last_modified

Find the last modified time for any data in that datastream.

modified_since

Find the list of data dates that have been modified since the passed

save_ancillary_file

Saves an ancillary filepath to the datastream's ancillary storage area.

save_data

Saves the dataset to the storage area.

uploadable_dir

Context manager that can be used to upload many ancillary files at once.

Method Descriptions

abstract fetch_data(start: datetime.datetime, end: datetime.datetime, datastream: str, metadata_kwargs: Dict[str, str] | None = None, **kwargs: Any) xarray.Dataset[source]

Fetches a dataset from the storage area.

The timespan of the returned dataset is between the specified start and end times.

Parameters:
  • start (datetime) – The start time bound.

  • end (datetime) – The end time bound.

  • datastream (str) – The name of the datastream to fetch.

  • metadata_kwargs (dict[str, str], optional) – Metadata substitutions to help resolve the data storage path. This is only required if the template data storage path includes any properties other than datastream or fields contained in the datastream. Defaults to None.

Returns:

xr.Dataset – The fetched dataset.

get_ancillary_filepath(title: str, extension: str = 'png', dataset: xarray.Dataset | None = None, datastream: str | None = None, start: datetime.datetime | None = None, root_dir: pathlib.Path | None = None, mkdirs: bool = True, **kwargs: str) pathlib.Path[source]

Returns the filepath for the given datastream and title of an ancillary file to be created.

This method is typically used in the plotting hook of pipelines to get the path to where the plot file should be saved. In this case, it is recommended to use this in conjunction with the uploadable_dir() context manager (i.e., with self.storage.uploadable_dir() as tmp_dir) and to pass root_dir=tmp_dir as an argument to this function.

Example:

# in ``hook_plot_dataset(self, dataset: xr.Dataset)``
with self.storage.uploadable_dir() as tmp_dir:
    fig, ax = plt.subplots()

    # plotting code ...

    plot_file = self.storage.get_ancillary_filepath(
        title="wind_speed",
        extension="png",
        root_dir=tmp_dir,
        dataset=dataset,
    )
    fig.savefig(plot_file)
    plt.close(fig)
Parameters:
  • title (str) – The title of the ancillary file or plot. Should be lowercase and use _ instead of spaces.

  • extension (str) – The file extension to be used. Defaults to “png”.

  • dataset (xr.Dataset | None, optional) – The dataset relating to the ancillary file. If provided, this is used to populate defaults for the datastream, start datetime, and other substitutions used to fill out the storage path template. Values from these other fields, if present, will take precedence.

  • datastream (str | None, optional) – The datastream relating to the ancillary file to be saved. Defaults to dataset.attrs["datastream"].

  • start (datetime | None, optional) – The datetime relating to the ancillary file to be saved. Defaults to dataset.time[0].

  • root_dir (Path | None, optional) – The root directory. If using a temporary (uploadable) directory, it is recommended to use that as the root_dir. Defaults to None.

  • mkdirs (bool, optional) – True if directories should be created, False otherwise. Defaults to True.

  • **kwargs (str) – Extra kwargs to use as substitutions for the ancillary storage path or filename templates, which may require more parameters than those already specified as arguments here. Defaults to **dataset.attrs.

Returns:

Path – The path to the ancillary file.

last_modified(datastream: str) datetime.datetime | None[source]

Find the last modified time for any data in that datastream.

Parameters:

datastream (str) – The datastream.

Returns:

datetime – The datetime of the last modification.

modified_since(datastream: str, last_modified: datetime.datetime) List[datetime.datetime][source]

Find the list of data dates that have been modified since the passed last modified date.

Parameters:
  • datastream (str) – The datastream to check for modified data.

  • last_modified (datetime) – The last time the data were changed (typically the previous run date).

Returns:

List[datetime] – The data dates of files that were changed since the last modified date.

abstract save_ancillary_file(filepath: pathlib.Path, target_path: pathlib.Path | None = None)[source]

Saves an ancillary filepath to the datastream’s ancillary storage area.

NOTE: In most cases this function should not be used directly. Instead, prefer using the self.uploadable_dir(*args, **kwargs) method.

Parameters:
  • filepath (Path) – The path to the ancillary file. This is expected to have a standardized filename and should be saved under the ancillary storage path.

  • target_path (Path, optional) – The path where the file should be saved. Defaults to None.

abstract save_data(dataset: xarray.Dataset, **kwargs: Any)[source]

Saves the dataset to the storage area.

Parameters:

dataset (xr.Dataset) – The dataset to save.

uploadable_dir(**kwargs: Any) Generator[pathlib.Path, None, None][source]

Context manager that can be used to upload many ancillary files at once.

This method yields the path to a temporary directory whose contents will be saved to the storage area using the save_ancillary_file method upon exiting the context manager.

Example:

# in ``hook_plot_dataset(self, dataset: xr.Dataset)``
with self.storage.uploadable_dir() as tmp_dir:
    fig, ax = plt.subplots()

    # plotting code ...

    plot_file = self.storage.get_ancillary_filepath(
        title="wind_speed",
        extension="png",
        root_dir=tmp_dir,
        dataset=dataset,
    )
    fig.savefig(plot_file)
    plt.close(fig)
Parameters:

kwargs (Any) – Unused. Included for backwards compatibility.

Yields:

Path – A temporary directory where files can be saved.

class tsdat.StorageConfig[source]

Bases: tsdat.config.utils.ParameterizedConfigClass, tsdat.config.utils.YamlModel

Contains configuration parameters for the data storage API used in tsdat pipelines.

This class will ultimately be converted into a tsdat.io.base.Storage subclass for use in tsdat pipelines.

Provides methods to support yaml parsing and validation, including the generation of json schema for immediate validation. This class also provides a method to instantiate a tsdat.io.base.Storage subclass from a parsed configuration file.

Parameters:
  • classname (str) – The dotted module path to the storage class that the specified configurations should apply to. To use the built-in FileSystem storage class, for example, you would set ‘tsdat.io.storage.FileSystem’ as the classname.

  • handler (DataHandlerConfig) – Config class that should be used for data I/O within the storage area.

handler: DataHandlerConfig
class tsdat.StorageRetriever[source]

Bases: tsdat.io.base.Retriever

Retriever API for pulling input data from the storage area.

class TransParameters

Bases: pydantic.BaseModel

trans_params: GlobalARMTransformParams | None
parameters: StorageRetriever.TransParameters | None

Class Methods

retrieve

Retrieves input data from the storage area.

Method Descriptions

retrieve(input_keys: List[str], dataset_config: tsdat.config.dataset.DatasetConfig, storage: tsdat.io.base.Storage | None = None, input_data_hook: Callable[[Dict[str, xarray.Dataset]], Dict[str, xarray.Dataset]] | None = None, **kwargs: Any) xarray.Dataset[source]

Retrieves input data from the storage area.

Note that each input_key is expected to use the following format:

"--key1 value1 --key2 value2"

e.g.,

"--datastream sgp.met.b0 --start 20230801 --end 20230901"
"--datastream sgp.met.b0 --start 20230801 --end 20230901 --location_id sgp --data_level b0"

This format allows the retriever to pull datastream data from the Storage API for the desired dates for each desired input source.

Parameters:
  • input_keys (List[str]) – A list of input keys formatted as described above.

  • dataset_config (DatasetConfig) – The output dataset configuration.

  • storage (Storage) – Instance of a Storage class used to fetch saved data.

Returns:

xr.Dataset – The retrieved dataset

class tsdat.StorageRetrieverInput(input_key: str)[source]

Returns an object representation of an input storage key.

Input storage keys should be formatted like:

"--datastream sgp.met.b0 --start 20230801 --end 20230901"
"--datastream sgp.met.b0 --start 20230801 --end 20230901 --location_id sgp --data_level b0"
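
Example (hedged usage sketch; the attributes parsed out of the key, e.g. the datastream and start/end dates, are assumptions based on the key format above):

from tsdat import StorageRetrieverInput

key = StorageRetrieverInput(
    "--datastream sgp.met.b0 --start 20230801 --end 20230901"
)
print(repr(key))  # shows the parsed representation of the input key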

Class Methods

__repr__

Return repr(self).

Method Descriptions

__repr__() str[source]

Return repr(self).

class tsdat.StringToDatetime[source]

Bases: tsdat.io.base.DataConverter

Converts date strings into datetime64 data.

Allows parameters to specify the string format of the input data, as well as the timezone the input data are reported in. If the input timezone is not UTC, the data are converted to UTC time.

Parameters:
  • format (Optional[str]) – The format of the string data. See strftime.org for more information on what components can be used. If None (the default), then pandas will try to interpret the format and convert it automatically. This can be unsafe but is not explicitly prohibited, so a warning is issued if format is not set explicitly.

  • timezone (Optional[str]) – The timezone of the input data. If not specified it is assumed to be UTC.

  • to_datetime_kwargs (Dict[str, Any]) – A set of keyword arguments passed to the pandas.to_datetime() function as keyword arguments. Note that ‘format’ is already included as a keyword argument. Defaults to {}.

format: str | None

The date format the string is using (e.g., ‘%Y-%m-%d %H:%M:%S’ for date strings such as ‘2022-04-13 23:59:00’), or None (the default) to have pandas guess the format automatically.

timezone: str | None

The timezone of the data to convert. If provided, this converter will apply the appropriate offset to convert data from the specified timezone to UTC. The timezone of the output data is assumed to always be UTC.

to_datetime_kwargs: Dict[str, Any]

Any parameters set here will be passed to pd.to_datetime as keyword arguments.
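
Example (conceptual sketch of the conversion using pandas directly rather than the converter itself; the format and timezone values are made up):

import pandas as pd

raw = ["2022-04-13 23:59:00", "2022-04-14 00:59:00"]

# Parse with an explicit format (equivalent to setting the 'format' parameter).
times = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S")

# If the input timezone is not UTC, shift the values to UTC and drop the tz so
# the result is plain datetime64 data reported in UTC.
times = times.tz_localize("US/Pacific").tz_convert("UTC").tz_localize(None)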

Class Methods

convert

Runs the data converter on the retrieved data.

warn_if_no_format_set

Method Descriptions

convert(data: xarray.DataArray, variable_name: str, dataset_config: tsdat.config.dataset.DatasetConfig, retrieved_dataset: tsdat.io.base.RetrievedDataset, **kwargs: Any) xarray.DataArray | None[source]

Runs the data converter on the retrieved data.

Parameters:
  • data (xr.DataArray) – The retrieved DataArray to convert.

  • retrieved_dataset (RetrievedDataset) – The retrieved dataset containing data to convert.

  • dataset_config (DatasetConfig) – The output dataset configuration.

  • variable_name (str) – The name of the variable to convert.

Returns:

Optional[xr.DataArray] – The converted DataArray for the specified variable, or None if the conversion was done in-place.

classmethod warn_if_no_format_set(format: str | None) str | None[source]
class tsdat.TransformationPipeline[source]

Bases: IngestPipeline

Pipeline class designed to read in standardized time series data and enhance its quality and usability by combining multiple sources of data, using higher-level processing techniques, etc.

class Parameters

Bases: pydantic.BaseModel

datastreams: List[str]

A list of datastreams that the pipeline should be configured to run for. Datastreams should include the location and data level information.

parameters: TransformationPipeline.Parameters
retriever: tsdat.io.retrievers.StorageRetriever

Class Methods

hook_customize_input_datasets

Code hook to customize any input datasets prior to datastreams being combined

run

Runs the data pipeline on the provided inputs.

Method Descriptions

hook_customize_input_datasets(input_datasets: Dict[str, xarray.Dataset], **kwargs: Any) Dict[str, xarray.Dataset][source]

Code hook to customize any input datasets prior to datastreams being combined and data converters being run.

Parameters:

input_datasets (Dict[str, xr.Dataset]) – The dictionary of input key (str) to input dataset. Note that for transformation pipelines, input keys != input filename, rather each input key is a combination of the datastream and date range used to pull the input data from the storage retriever.

Returns:

Dict[str, xr.Dataset] – The customized input datasets.

run(inputs: List[str], **kwargs: Any) xarray.Dataset[source]

Runs the data pipeline on the provided inputs.

Parameters:

inputs (List[str]) – A 2-element list of start-date, end-date that the pipeline should process.

Returns:

xr.Dataset – The processed dataset.

class tsdat.UnitsConverter[source]

Bases: tsdat.io.base.DataConverter

Converts the units of a retrieved variable to specified output units.

If the ‘input_units’ property is set then that string is used to determine the input units, otherwise the converter will attempt to look up and use the ‘units’ attribute on the specified variable in the dataset provided to the convert method. If the input units cannot be determined then a warning is issued and the original dataset is returned. The output units are specified by the output dataset configuration.

Parameters:

input_units (Optional[str]) – The units that the retrieved data comes in.

input_units: str | None

The units of the input data.
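
Example (conceptual sketch only; pint is used here purely for illustration and is an assumption, not necessarily the mechanism tsdat uses internally):

import numpy as np
import pint

ureg = pint.UnitRegistry()

# Retrieved data reported with input_units="degF", output units "degC".
data_degF = np.array([32.0, 68.0, 212.0])
converted = ureg.Quantity(data_degF, "degF").to("degC").magnitude
print(converted)  # approximately [0., 20., 100.]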

Class Methods

convert

Runs the data converter on the retrieved data.

Method Descriptions

convert(data: xarray.DataArray, variable_name: str, dataset_config: tsdat.config.dataset.DatasetConfig, retrieved_dataset: tsdat.io.base.RetrievedDataset, **kwargs: Any) xarray.DataArray | None[source]

Runs the data converter on the retrieved data.

Parameters:
  • data (xr.DataArray) – The retrieved DataArray to convert.

  • retrieved_dataset (RetrievedDataset) – The retrieved dataset containing data to convert.

  • dataset_config (DatasetConfig) – The output dataset configuration.

  • variable_name (str) – The name of the variable to convert.

Returns:

Optional[xr.DataArray] – The converted DataArray for the specified variable, or None if the conversion was done in-place.

class tsdat.YamlModel[source]

Bases: pydantic.BaseModel

Class Methods

from_yaml

Creates a python configuration object from a yaml file.

generate_schema

Generates JSON schema from the model fields and type annotations.

Method Descriptions

classmethod from_yaml(filepath: pathlib.Path, overrides: Dict[str, Any] | None = None)[source]

Creates a python configuration object from a yaml file.

Parameters:
  • filepath (Path) – The path to the yaml file

  • overrides (Optional[Dict[str, Any]], optional) – Overrides to apply to the yaml before instantiating the YamlModel object. Defaults to None.

Returns:

YamlModel – A YamlModel subclass
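
Example (hedged usage sketch; DatasetConfig is shown here as one of the YamlModel subclasses provided by tsdat, and the config file path is hypothetical):

from pathlib import Path
from tsdat import DatasetConfig

dataset_config = DatasetConfig.from_yaml(Path("config/dataset.yaml"))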

classmethod generate_schema(output_file: pathlib.Path)[source]

Generates JSON schema from the model fields and type annotations.

Parameters:

output_file (Path) – The path to store the JSON schema.

class tsdat.ZarrHandler[source]

Bases: tsdat.io.base.FileHandler

DataHandler specifically tailored to reading and writing files of a specific type.

Parameters:
  • extension (str) – The specific file extension used for data files, e.g., “.nc”.

  • reader (DataReader) – The DataReader subclass responsible for reading input data.

  • writer (FileWriter) – The FileWriter subclass responsible for writing output data.

extension: str = 'zarr'
reader: tsdat.io.readers.ZarrReader
writer: tsdat.io.writers.ZarrWriter
class tsdat.ZarrLocalStorage[source]

Bases: FileSystem

Handles data storage and retrieval for zarr archives on a local filesystem.

Zarr is a special format that writes chunked data to a number of files underneath a given directory. This distribution of data into chunks and distinct files makes zarr an extremely well-suited format for quickly storing and retrieving large quantities of data.

Parameters:
  • parameters (Parameters) – File-system specific parameters, such as the root path to where the Zarr archives should be saved, or additional keyword arguments to specific functions used by the storage API. See the Parameters class for more details.

  • handler (ZarrHandler) – The ZarrHandler class that should be used to handle data I/O within the storage API.

class Parameters

Bases: FileSystem.Parameters

data_filename_template: str = '{datastream}.{extension}'

Template string to use for data filenames.

Allows substitution of the following parameters using curly braces ‘{}’:

  • extension: the file extension from the storage data handler

  • datastream from the dataset’s global attributes

  • location_id from the dataset’s global attributes

  • data_level from the dataset’s global attributes

  • Any other global attribute that has a string or integer data type.

data_storage_path: pathlib.Path

The directory structure under storage_root where data files are saved.

Allows substitution of the following parameters using curly braces ‘{}’:

  • storage_root: the value from the storage_root parameter.

  • datastream: the datastream as defined in the dataset config file.

  • location_id: the location_id as defined in the dataset config file.

  • data_level: the data_level as defined in the dataset config file.

  • year: the year of the first timestamp in the file.

  • month: the month of the first timestamp in the file.

  • day: the day of the first timestamp in the file.

  • extension: the file extension used by the output file writer.

handler: tsdat.io.handlers.ZarrHandler
parameters: ZarrLocalStorage.Parameters
class tsdat.ZarrReader[source]

Bases: tsdat.io.base.DataReader

Uses xarray’s Zarr capabilities to read a Zarr archive and extract its contents into an xarray Dataset object.

class Parameters

Bases: pydantic.BaseModel

open_zarr_kwargs: Dict[str, Any]
parameters: ZarrReader.Parameters

Class Methods

read

Reads data given an input key.

Method Descriptions

read(input_key: str) xarray.Dataset[source]

Reads data given an input key.

Uses the input key to open a resource and load data as a xr.Dataset object or as a mapping of strings to xr.Dataset objects.

In most cases DataReaders will only need to return a single xr.Dataset, but occasionally some types of inputs necessitate that the data loaded from the input_key be returned as a mapping. For example, if the input_key is a path to a zip file containing multiple disparate datasets, then returning a mapping is appropriate.

Parameters:

input_key (str) – An input key matching the DataReader’s regex pattern that should be used to load data.

Returns:

Union[xr.Dataset, Dict[str, xr.Dataset]] – The raw data extracted from the provided input key.
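
Example (hedged sketch; the archive path is hypothetical, and the call below shows the underlying xarray functionality that ZarrReader wraps rather than the reader class itself):

import xarray as xr

ds = xr.open_zarr("storage/root/data/sgp.met.b0.zarr")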

class tsdat.ZarrWriter[source]

Bases: tsdat.io.base.FileWriter

Writes the dataset to a basic zarr archive.

Advanced features such as specifying the chunk size or writing the zarr archive in AWS S3 will be implemented later.

class Parameters

Bases: pydantic.BaseModel

to_zarr_kwargs: Dict[str, Any]
file_extension: str = 'zarr'
parameters: ZarrWriter.Parameters

Class Methods

write

Writes the dataset to the provided filepath.

Method Descriptions

write(dataset: xarray.Dataset, filepath: pathlib.Path | None = None, **kwargs: Any) None[source]

Writes the dataset to the provided filepath.

This method is typically called by the tsdat storage API, which will be responsible for providing the filepath, including the file extension.

Parameters:
  • dataset (xr.Dataset) – The dataset to save.

  • filepath (Optional[Path]) – The path to the file to save.

class tsdat.ZipReader(parameters: Dict = None)[source]

Bases: tsdat.io.base.ArchiveReader

DataReader for reading from a zipped archive. Writing to this format is not supported.

This class requires that a readers section be specified in the parameters section of the storage configuration file. The structure of this nested readers section should mirror the structure of its parent readers section. To illustrate, consider the following configuration block:

readers:
  .*:
    zip:
      file_pattern: '.*\.zip'
      classname: "tsdat.io.readers.ZipReader"
      parameters:
        # Parameters to specify how the ZipReader should read/unpack the archive.
        # Parameters here are passed to the Python open() method as kwargs. The
        # default value is shown below.
        open_zip_kwargs:
          mode: "rb"

        # Parameters here are passed to zipfile.ZipFile.open() as kwargs. Useful
        # for specifying the system encoding or compression algorithm to use for
        # unpacking the archive. These are optional.
        read_zip_kwargs:
          mode: "r"


        # The readers section tells the ZipReaders which DataReaders should be
        # used to read the unpacked files.
        readers:
          r".*\.csv":
            classname: tsdat.io.readers.CSVReader
            parameters:  # Parameters specific to tsdat.io.readers.CSVReader
              read_csv_kwargs:
                sep: '\t'

        # Pattern(s) used to exclude certain files in the archive from being handled.
        # This parameter is optional, and the default value is shown below:
        exclude: ['.*\_\_MACOSX/.*', '.*\.DS_Store']
class Parameters

Bases: pydantic.BaseModel

exclude: List[str] = []
open_zip_kwargs: Dict[str, Any]
read_zip_kwargs: Dict[str, Any]
readers: Dict[str, Any]
parameters: ZipReader.Parameters

Class Methods

read

Extracts the file into memory and uses registered DataReaders to read each relevant

Method Descriptions

read(input_key: str) Dict[str, xarray.Dataset][source]

Extracts the file into memory and uses registered DataReaders to read each relevant extracted file into its own xarray Dataset object. Returns a mapping like {filename: xr.Dataset}.

Parameters:
  • input_key (Union[str, BytesIO]) – The file to read in. Can be provided as a filepath or a bytes-like object. It is used to open the zip file.

  • name (str, optional) – A label used to help trace the origin of the data read-in. It is used in the key in the returned dictionary. Must be provided if the file argument is not string-like. If file is a string and name is not specified then the label will be set by file. Defaults to None.

Returns:

Dict[str, xr.Dataset] – A mapping of {label: xr.Dataset}.

tsdat.assert_close(a: xarray.Dataset, b: xarray.Dataset, check_attrs: bool = True, check_fill_value: bool = True, **kwargs: Any) None[source]

Thin wrapper around xarray.assert_allclose.

Also checks dataset and variable attrs. Removes global attributes that are allowed to be different, which are currently just the ‘history’ attribute and the ‘code_version’ attribute. Also handles some obscure edge cases for variable attributes.

Parameters:
  • a (xr.Dataset) – The first dataset to compare.

  • b (xr.Dataset) – The second dataset to compare.

  • check_attrs (bool) – Check global and variable attributes in addition to the data. Defaults to True.

  • check_fill_value (bool) – Check the _FillValue attribute. This is a special case because xarray moves the _FillValue from a variable’s attributes to its encoding upon saving the dataset. Defaults to True.

tsdat.assign_data(dataset: xarray.Dataset, data: numpy.typing.NDArray[Any], variable_name: str) xarray.Dataset[source]

Assigns the data to the specified variable in the dataset.

If the variable exists and it is a data variable, then the DataArray for the specified variable in the dataset will simply have its data replaced with the new numpy array. If the variable exists and it is a coordinate variable, then the data will replace the coordinate data. If the variable does not exist in the dataset then a KeyError will be raised.

Parameters:
  • dataset (xr.Dataset) – The dataset where the data should be assigned.

  • data (NDArray[Any]) – The data to assign.

  • variable_name (str) – The name of the variable in the dataset to assign data to.

Raises:

KeyError – Raises a KeyError if the specified variable is not in the dataset’s coords or data_vars dictionary.

Returns:

xr.Dataset – The dataset with data assigned to it.
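
Example (minimal sketch based on the documented behavior; the variable name and values are made up):

import numpy as np
import xarray as xr
from tsdat import assign_data

ds = xr.Dataset(
    {"temperature": ("time", np.array([10.0, 11.0, 12.0]))},
    coords={"time": np.array([0, 1, 2])},
)

# Replace the variable's data with a corrected copy (e.g., after QC handling).
corrected = np.array([10.0, -9999.0, 12.0])
ds = assign_data(ds, corrected, "temperature")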

tsdat.decode_cf(dataset: xarray.Dataset) xarray.Dataset[source]

Wrapper around xarray.decode_cf() which handles additional edge cases.

This helps ensure that the dataset is formatted and encoded correctly after it has been constructed or modified. Handles edge cases for units and data type encodings on datetime variables.

Parameters:

dataset (xr.Dataset) – The dataset to decode.

Returns:

xr.Dataset – The decoded dataset.

tsdat.generate_schema(dir: pathlib.Path = typer.Option(Path('.vscode/schema/'), file_okay=False, dir_okay=True), schema_type: SchemaType = typer.Option(SchemaType.all))[source]
tsdat.get_code_version() str[source]
tsdat.get_datastream(**global_attrs: str) str[source]
tsdat.get_fields_from_datastream(datastream: str) Dict[str, str][source]

Extracts fields from the datastream.

WARNING: this only works for the default datastream template.

tsdat.get_filename(dataset: xarray.Dataset, extension: str, title: str | None = None) str[source]

Returns the standardized filename for the provided dataset.

Returns a key consisting of the dataset’s datastream, starting date/time, the extension, and an optional title. For file-based storage systems this method may be used to generate the basename of the output data file by providing extension as ‘.nc’, ‘.csv’, or some other file ending type. For ancillary plot files this can be used in the same way by specifying extension as ‘.png’, ‘.jpeg’, etc., and by specifying the title, resulting in files named like ‘<datastream>.20220424.165314.plot_title.png’.

Parameters:
  • dataset (xr.Dataset) – The dataset (used to extract the datastream and starting / ending times).

  • extension (str) – The file extension that should be used.

  • title (Optional[str]) – An optional title that will be placed between the start time and the extension in the generated filename.

Returns:

str – The filename constructed from provided parameters.
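
Example (hedged sketch; the dataset attributes are made up, and the expected output shown in the comment follows the naming convention described above):

import pandas as pd
import xarray as xr
from tsdat import get_filename

ds = xr.Dataset(
    coords={"time": pd.date_range("2022-04-24 16:53:14", periods=3, freq="min")},
    attrs={"datastream": "sgp.met.b0"},
)

print(get_filename(ds, extension=".png", title="wind_speed"))
# expected: sgp.met.b0.20220424.165314.wind_speed.png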

tsdat.get_start_date_and_time_str(dataset: xarray.Dataset) Tuple[str, str][source]

Gets the start date and start time strings from a Dataset.

The strings are formatted using strftime and the following formats:
  • date: “%Y%m%d”

  • time: “%H%M%S”

Parameters:

dataset (xr.Dataset) – The dataset whose start date and time should be retrieved.

Returns:

Tuple[str, str] – The start date and time as strings like “YYYYmmdd”, “HHMMSS”.

tsdat.get_start_time(dataset: xarray.Dataset) pandas.Timestamp[source]

Gets the earliest ‘time’ value and returns it as a pandas Timestamp.

Parameters:

dataset (xr.Dataset) – The dataset whose start time should be retrieved.

Returns:

pd.Timestamp – The timestamp of the earliest time value in the dataset.

tsdat.get_version() str[source]
tsdat.read_yaml(filepath: pathlib.Path) Dict[Any, Any][source]
tsdat.record_corrections_applied(dataset: xarray.Dataset, variable_name: str, message: str) None[source]

Records the message on the ‘corrections_applied’ attribute.

Parameters:
  • dataset (xr.Dataset) – The corrected dataset.

  • variable_name (str) – The name of the variable in the dataset.

  • message (str) – The message to record.

tsdat.recursive_instantiate(model: Any) Any[source]

Instantiates all ParameterizedClass components and subcomponents of a given model.

Recursively calls model.instantiate() on all ParameterizedConfigClass instances under the model, resulting in a new model which follows the same general structure as the given model, but possibly containing totally different properties and methods.

Note that this method does a depth-first traversal of the model tree to instantiate leaf nodes first. Traversing breadth-first would result in new pydantic models attempting to call the __init__ method of child models, which is not valid because the child models are ParameterizedConfigClass instances. Traversing depth-first allows us to first transform child models into the appropriate type using the classname of the ParameterizedConfigClass.

This method is primarily used to instantiate a Pipeline subclass and all of its properties from a yaml pipeline config file, but it can be applied to any other pydantic model.

Parameters:

model (Any) – The object to recursively instantiate.

Returns:

Any – The recursively-instantiated object.
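
Example (hedged usage sketch; PipelineConfig is assumed to be the yaml-backed configuration model for pipelines, and the config path and pipeline inputs are hypothetical):

from pathlib import Path
from tsdat import PipelineConfig, recursive_instantiate

config = PipelineConfig.from_yaml(Path("config/pipeline.yaml"))
pipeline = recursive_instantiate(config)  # builds the Pipeline and its components
dataset = pipeline.run(["data/input_file.csv"])  # inputs depend on the pipeline type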

tsdat.DATASTREAM_TEMPLATE[source]
tsdat.FILENAME_TEMPLATE[source]