Extra Tools#
This repository contains helpful scripts and notes for several tsdat-related tools.
Some tools are available as Jupyter notebooks, and others are available as command-line utilities.
To get access to the command-line utilities, install the package with pip (assuming it is published on PyPI as `tsdat-tools`):
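```
pip install tsdat-tools  # assumed package name; install from the GitHub repository if this fails
```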
To use all the other tools, we recommend cloning this repository.
Data to Yaml#
The goal of this tool is to reduce the tediousness of writing tsdat configuration files for data that you can already read and convert into an `xr.Dataset` object in tsdat. It generates two output files: `dataset.yaml` and `retriever.yaml`, which are used by tsdat to define metadata and how the input variables should be mapped to output variables.
If your file is in one of the following formats, this tool can already do this for you. Formats supported out-of-box:

- netCDF: Files ending with `.nc` or `.cdf` will use the `tsdat.NetCDFReader` class
- csv: Files ending with `.csv` will use the `tsdat.CSVReader` class
- parquet: Files ending with `.parquet`, `.pq`, or `.pqt` will use the `tsdat.ParquetReader` class
- zarr: Files/folders ending with `.zarr` will use the `tsdat.ZarrReader` class
Usage#
After installing the command-line utilities, you can run the tool by passing it the path to your input data file:
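```
tsdat-tools data2yaml path/to/your/data.nc  # illustrative input path
```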
Full usage instructions can be obtained using the `--help` flag:
>>> tsdat-tools data2yaml --help
Usage: tsdat-tools data2yaml [OPTIONS] DATAPATH
╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────╮
│ * datapath PATH Path to the input data file that should be used to generate tsdat configurations. │
│ [default: None] │
│ [required] │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────╮
│ --outdir DIRECTORY The path to the directory where │
│ the 'dataset.yaml' and │
│ 'retriever.yaml' files should be │
│ written. │
│ [default: .] │
│ --input-config PATH Path to a dataset.yaml file to be │
│ used in addition to │
│ configurations derived from the │
│ input data file. Configurations │
│ defined here take priority over │
│ auto-detected properties in the │
│ input file. │
│ [default: None] │
│ --help Show this message and exit. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯
This tool is designed to be run in the following workflow:

1. Generate a new ingest/pipeline from the cookiecutter template (e.g., via the `make cookies` command)
2. Put an example data file for your pipeline in the `test/data/input` folder
3. Clean up the autogenerated `dataset.yaml` file:
    - Add metadata and remove any unused variables
    - Don't add additional variables yet; just make sure that the info in the current file is accurate
4. Commit your changes in `git` or back up your changes so you can compare before & after the script runs
5. Run this script, passing it the path to your input data file and using the `--input-config` option to tell it where your cleaned `dataset.yaml` file is. By default this will generate a new `dataset.yaml` file in the current working directory (the location of `pwd` on the command line), but you can also use the `--outdir` option to specify the path where it should write to (see the example after this list)
6. Review the changes the script made to each file. Note that it is not capable of standardizing units or other metadata, so you will still need to clean those up manually
7. Continue with the rest of the ingest/pipeline development steps
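For example, the command in step 5 might look like the following, where the file paths are purely illustrative:

```
tsdat-tools data2yaml test/data/input/example_data.csv \
    --input-config pipelines/example_pipeline/config/dataset.yaml \
    --outdir pipelines/example_pipeline/config
```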
Excel to Yaml#
Excel to Yaml Instructions#
The goal of this tool is to simplify the process of creating dataset configuration files for tsdat. There are two parts to this tool: an input Excel file and a Jupyter notebook to run the conversion code. The Excel file contains the user-provided dataset metadata in a simplified format, and the notebook takes the information provided in the Excel file and reorganizes it into YAML-format tsdat configuration files. These configuration files can then be added to a pipeline generated by the pipeline-template.
Excel File Description#
The provided Excel file template is titled "excel to yaml template.xlsx" and contains 3 sheets labeled "Metadata", "Independent Variables", and "Dependent Variables". Each sheet contains a table listing information for different parts of what will become the finalized dataset.
- The "Metadata" sheet includes the global "attributes" (another word for metadata) for the dataset, typically describing the what-when-where-how" of the collected input data.
- The other two sheets break down the input data into independent and dependent variables.
- Variables listed on the "Independent Variables" sheet will become the "dimensions" or "coordinates" of the dataset, like time, latitude, or longitude.
- Variables listed under "Dependent Variables" are the primary measurements contained in the input datafile.
Information in the "Metadata" sheet exists in key-item pairs, where elements in the top row are the keys and the bottom row are the corresponding items. Additional key-item pairs can be appended to additional columns in this sheet.
Information in the other two sheets is organized by variable. Each variable is assigned a name, units, datatype, dimensions, and additional metadata. These inputs are described in the tables below:
Independent Variable Sheet Details#
Key | Definition |
---|---|
New Name | New variable name |
Original Name | Name of variable in raw input file |
Standardized Unit | New variable unit |
Original Unit | Variable unit in raw input file |
Timezone | Timezone identifier, specifically for "time" variables, from TZ database |
Datatype | Variable datatype (e.g. "float32") |
Long Name | "Human readable" version of variable name |
Standard Name | Name of variable from CF conventions lookup table |
Note: A list of timezones is available on Wikipedia, under the column labeled "TZ database name".
Dependent Variable Sheet Details#
Key | Definition |
---|---|
New Name | New variable name |
Original Name | Name of variable in raw input file |
Standardized Unit | New variable unit |
Original Unit | Variable unit in raw input file |
Datatype | Variable datatype (e.g. "float32") |
Dimensions | Independent variable(s) corresponding to the variable, comma-separated if there are multiple |
Long Name | "Human readable" version of variable name |
Standard Name | Name of variable from CF conventions lookup table |
Description | Optional additional information for the variable; good to use if a standard name isn't available |
Valid Minimum Value | Lower range limit expected for a variable, for quality control |
Valid Maximum Value | Upper range limit expected for a variable, for quality control |
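To illustrate the end result, one row of the "Dependent Variables" sheet roughly corresponds to one variable entry in the generated dataset.yaml. The sketch below uses illustrative values, and the notebook's exact attribute mapping may differ:

```yaml
data_vars:
  temperature:                        # New Name
    dims: [time]                      # Dimensions
    dtype: float32                    # Datatype
    attrs:
      units: degC                     # Standardized Unit
      long_name: Air Temperature      # Long Name
      standard_name: air_temperature  # Standard Name
      valid_min: -40.0                # Valid Minimum Value
      valid_max: 60.0                 # Valid Maximum Value
```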
NetCDF to CSV#
The goal of the `netcdf2csv` tool is to convert a NetCDF dataset to CSV format. The tool includes a Python script that takes a NetCDF dataset as input and generates CSV files for 1D and 2D variables.
File Structure#
The CSV files generated by netcdf2csv are organized based on the dimensions of the variables in the NetCDF dataset:

- Header data: The global attributes of the NetCDF dataset are saved in a separate file with a ".hdr.csv" extension.
- Variable metadata: Metadata for each variable, including attributes and name, is saved in a file with a ".attrs.csv" extension.
- 1D variables: Variables with one dimension are saved in a file with a ".time.1d.csv" extension.
- 2D variables: Variables with two dimensions are saved in separate files, each named after the coordinating dimension, with a ".{coord}.2d.csv" extension.
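For example, converting an input file named `data.nc` whose 2D variables are dimensioned by `time` and `height` would produce output along these lines (file names are illustrative):

```
data.hdr.csv        # global attributes
data.attrs.csv      # per-variable metadata
data.time.1d.csv    # 1D variables
data.height.2d.csv  # 2D variables, named after the "height" coordinate
```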
Parameters#
Key | Definition |
---|---|
dataset | An xarray Dataset containing the NetCDF data to be converted. |
filepath | (Optional) Path to the directory where the CSV files will be saved. |
parameters | (Optional) A dictionary containing additional parameters for customizing the conversion process. |
The optional `parameters` dictionary accepts the following keys:

Key | Definition |
---|---|
dim_order | Specifies the order of dimensions for multi-dimensional variables. |
to_csv_kwargs | Dictionary of keyword arguments passed to pandas.DataFrame.to_csv (e.g., for specifying delimiters and line terminators). |
Usage Example#
```python
import xarray as xr
from pathlib import Path

from netcdf2csv import write

# Open the input NetCDF file (path is illustrative)
filepath = "path/to/your/data.nc"
dataset = xr.open_dataset(filepath)

# Optional parameters
parameters = dict()
parameters["dim_order"] = ["time", "height"]
parameters["to_csv_kwargs"] = {"sep": "\t", "lineterminator": "\n"}

# Write CSVs using the input filename (minus its extension) as the output prefix
write(dataset, Path(filepath).with_suffix(""), parameters)
```
Note#
The tool issues a warning for variables with more than 2 dimensions, as the CSV format does not support such variables.
See https://github.com/tsdat/tools/tree/main/netcdf2csv for more details.