Extra Tools#
This repository contains helpful scripts and notes for several tsdat-related tools.
Some tools are available as Jupyter notebooks, and others are available as command-line utilities.
To get access to the command-line utilities, install the package with pip (assuming it is published on PyPI as `tsdat-tools`):
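```
pip install tsdat-tools  # assumed package name; install from the GitHub repository if this fails
```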
To use all the other tools, we recommend cloning this repository.
Data to Yaml#
The goal of this tool is to reduce the tediousness of writing tsdat configuration files for data that you can already read and convert into an `xr.Dataset` object in tsdat. It generates two output files: `dataset.yaml` and `retriever.yaml`, which are used by tsdat to define metadata and how the input variables should be mapped to output variables.
If your file is in one of the following formats, this tool can already do this for you. Formats supported out-of-box:

- netCDF: Files ending with `.nc` or `.cdf` will use the `tsdat.NetCDFReader` class
- csv: Files ending with `.csv` will use the `tsdat.CSVReader` class
- parquet: Files ending with `.parquet`, `.pq`, or `.pqt` will use the `tsdat.ParquetReader` class
- zarr: Files/folders ending with `.zarr` will use the `tsdat.ZarrReader` class
Usage#
After installing the command-line utilities, you can run the tool by passing it the path to your input data file:
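```
tsdat-tools data2yaml path/to/your/data.nc  # illustrative input path
```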
Full usage instructions can be obtained using the `--help` flag:
>>> tsdat-tools data2yaml --help
Usage: tsdat-tools data2yaml [OPTIONS] DATAPATH
╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────╮
│ * datapath PATH Path to the input data file that should be used to generate tsdat configurations. │
│ [default: None] │
│ [required] │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────╮
│ --outdir DIRECTORY The path to the directory where │
│ the 'dataset.yaml' and │
│ 'retriever.yaml' files should be │
│ written. │
│ [default: .] │
│ --input-config PATH Path to a dataset.yaml file to be │
│ used in addition to │
│ configurations derived from the │
│ input data file. Configurations │
│ defined here take priority over │
│ auto-detected properties in the │
│ input file. │
│ [default: None] │
│ --help Show this message and exit. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯
This tool is designed to be run in the following workflow:

1. Generate a new ingest/pipeline from the cookiecutter template (e.g., via the `make cookies` command)
2. Put an example data file for your pipeline in the `test/data/input` folder
3. Clean up the autogenerated `dataset.yaml` file:
    - Add metadata and remove any unused variables
    - Don't add additional variables yet; just make sure that the info in the current file is accurate
4. Commit your changes in `git` or back up your changes so you can compare before & after the script runs
5. Run this script, passing it the path to your input data file and using the `--input-config` option to tell it where your cleaned `dataset.yaml` file is. By default this will generate a new `dataset.yaml` file in the current working directory (the location of `pwd` on the command line), but you can also use the `--outdir` option to specify the path where it should write to (see the example after this list)
6. Review the changes the script made to each file. Note that it is not capable of standardizing units or other metadata, so you will still need to clean those up manually
7. Continue with the rest of the ingest/pipeline development steps
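For example, the command in step 5 might look like the following, where the file paths are purely illustrative:

```
tsdat-tools data2yaml test/data/input/example_data.csv \
    --input-config pipelines/example_pipeline/config/dataset.yaml \
    --outdir pipelines/example_pipeline/config
```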
Excel to Yaml#
Excel to Yaml Instructions#
The goal of this tool is to simplify the process of creating dataset configuration files for tsdat. There are two parts to this tool: an input Excel file and a Jupyter notebook to run the conversion code. The Excel file contains the user-provided dataset metadata in a simplified format, and the notebook takes the information provided in the Excel file and reorganizes it into YAML-format tsdat configuration files. These configuration files can then be added to a pipeline generated by the pipeline-template.
Excel File Description#
The provided Excel file template is titled "excel to yaml template.xlsx" and contains 3 sheets labeled "Metadata", "Independent Variables", and "Dependent Variables". Each sheet contains a table listing information for different parts of what will become the finalized dataset.
- The "Metadata" sheet includes the global "attributes" (another word for metadata) for the dataset, typically describing the what-when-where-how" of the collected input data.
- The other two sheets break down the input data into independent and dependent variables.
- Variables listed on the "Independent Variables" sheet will become the "dimensions" or "coordinates" of the dataset, like time, latitude, or longitude.
- Variables listed under "Dependent Variables" are the primary measurements contained in the input datafile.
Information in the "Metadata" sheet exists in key-item pairs, where elements in the top row are the keys and the bottom row are the corresponding items. Additional key-item pairs can be appended to additional columns in this sheet.
Information in the other two sheets is organized by variable. Each variable is assigned a name, units, datatype, dimensions, and additional metadata. These inputs are described in the tables below:
Independent Variable Sheet Details#
Key | Definition |
---|---|
New Name | New variable name |
Original Name | Name of variable in raw input file |
Standardized Unit | New variable unit |
Original Unit | Variable unit in raw input file |
Timezone | Timezone identifier, specifically for "time" variables, from TZ database |
Datatype | Variable datatype (e.g. "float32") |
Long Name | "Human readable" version of variable name |
Standard Name | Name of variable from CF conventions lookup table |
Note: A list of timezones is available on Wikipedia, under the column labeled "TZ database name".
Dependent Variable Sheet Details#
Key | Definition |
---|---|
New Name | New variable name |
Original Name | Name of variable in raw input file |
Standardized Unit | New variable unit |
Original Unit | Variable unit in raw input file |
Datatype | Variable datatype (e.g. "float32") |
Dimensions | Independent variable(s) corresponding to the variable, comma-separated if there are multiple |
Long Name | "Human readable" version of variable name |
Standard Name | Name of variable from CF conventions lookup table |
Description | Optional additional information for the variable; good to use if a standard name isn't available |
Valid Minimum Value | Lower range limit expected for a variable, for quality control |
Valid Maximum Value | Upper range limit expected for a variable, for quality control |
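To illustrate the end result, one row of the "Dependent Variables" sheet roughly corresponds to one variable entry in the generated dataset.yaml. The sketch below uses illustrative values, and the notebook's exact attribute mapping may differ:

```yaml
data_vars:
  temperature:                        # New Name
    dims: [time]                      # Dimensions
    dtype: float32                    # Datatype
    attrs:
      units: degC                     # Standardized Unit
      long_name: Air Temperature      # Long Name
      standard_name: air_temperature  # Standard Name
      valid_min: -40.0                # Valid Minimum Value
      valid_max: 60.0                 # Valid Maximum Value
```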
NetCDF to CSV#
The goal of the `netcdf2csv` tool is to convert a NetCDF dataset to CSV format. The tool includes a Python script that takes a NetCDF dataset as input and generates CSV files for 1D and 2D variables.
File Structure#
The CSV files generated by netcdf2csv are organized based on the dimensions of the variables in the NetCDF dataset:

- Header data: The global attributes of the NetCDF dataset are saved in a separate file with a ".hdr.csv" extension.
- Variable metadata: Metadata for each variable, including attributes and name, is saved in a file with a ".attrs.csv" extension.
- 1D variables: Variables with one dimension are saved in a file with a ".time.1d.csv" extension.
- 2D variables: Variables with two dimensions are saved in separate files, each named after the coordinating dimension, with a ".{coord}.2d.csv" extension.
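For example, converting an input file named `data.nc` whose 2D variables are dimensioned by `time` and `height` would produce output along these lines (file names are illustrative):

```
data.hdr.csv        # global attributes
data.attrs.csv      # per-variable metadata
data.time.1d.csv    # 1D variables
data.height.2d.csv  # 2D variables, named after the "height" coordinate
```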
Parameters#
Key | Definition |
---|---|
dataset | An xarray Dataset containing the NetCDF data to be converted. |
filepath | (Optional) Path to the directory where the CSV files will be saved. |
parameters | (Optional) A dictionary containing additional parameters for customizing the conversion process. |
The optional `parameters` dictionary accepts the following keys:

Key | Definition |
---|---|
dim_order | Specifies the order of dimensions for multi-dimensional variables. |
to_csv_kwargs | Dictionary of keyword arguments passed to pandas.DataFrame.to_csv (e.g., for specifying delimiters and line terminators). |
Usage Example#
```python
import xarray as xr
from pathlib import Path

from netcdf2csv import write

# Open the input NetCDF file (path is illustrative)
filepath = "path/to/your/data.nc"
dataset = xr.open_dataset(filepath)

# Optional parameters
parameters = dict()
parameters["dim_order"] = ["time", "height"]
parameters["to_csv_kwargs"] = {"sep": "\t", "lineterminator": "\n"}

# Write CSVs using the input filename (minus its extension) as the output prefix
write(dataset, Path(filepath).with_suffix(""), parameters)
```
Note#
The tool issues a warning for variables with more than 2 dimensions, as the CSV format does not support such variables.
See https://github.com/tsdat/tools/tree/main/netcdf2csv for more details.