tsdat.io
The tsdat.io package provides the classes that the data pipeline uses to manage I/O for the pipeline. Specifically, it includes:
The FileHandler infrastructure used to read/write to/from specific file formats, and
The Storage infrastructure used to store/access processed data files
We warmly welcome community contributions to increase the list of supported FileHandlers and Storage destinations.
Classes
AbstractFileHandler – Abstract class to define methods required by all FileHandlers.
AwsStorage – DatastreamStorage subclass for an AWS S3-based filesystem.
CsvHandler – FileHandler to read from and write to CSV files.
DatastreamStorage – Base class for providing access to processed data files in a persistent archive.
DisposableLocalTempFile – Context manager wrapper class for a temp file on the local filesystem.
DisposableLocalTempFileList – Context manager wrapper class for a list of temp files on the local filesystem.
DisposableStorageTempFileList – Context manager wrapper class for a list of temp files on the storage filesystem.
FileHandler – Class to provide methods to read and write files with a variety of extensions.
FilesystemStorage – DatastreamStorage subclass for a local Linux-based filesystem.
NetCdfHandler – FileHandler to read from and write to netCDF files.
S3Path – Wraps a 'special' path string that includes the S3 bucket name and region.
SplitNetCdfHandler – FileHandler that writes netCDF output split across files by time interval.
TemporaryStorage – Provides efficient handling of temporary files used during pipeline processing.
Functions
register_filehandler – Python decorator to register an AbstractFileHandler in the FileHandler object.
Function Descriptions
-
class tsdat.io.AbstractFileHandler(parameters: Union[Dict, None] = None)
Abstract class to define methods required by all FileHandlers. Classes derived from AbstractFileHandler should implement one or more of the following methods (a minimal subclass sketch is shown after the method descriptions below):
write(ds: xr.Dataset, filename: str, config: Config, **kwargs)
read(filename: str, **kwargs) -> xr.Dataset
- Parameters
parameters (Dict, optional) – Parameters that were passed to the FileHandler when it was registered in the storage config file, defaults to {}.
Class Methods
read – Reads in the given file and converts it into an Xarray dataset for use in the pipeline.
write – Saves the given dataset to a file.
Method Descriptions
-
read
(self, filename: str, **kwargs) → xarray.Dataset¶ Reads in the given file and converts it into an Xarray dataset for use in the pipeline.
- Parameters
filename (str) – The path to the file to read in.
- Returns
An xr.Dataset object.
- Return type
xr.Dataset
-
write
(self, ds: xarray.Dataset, filename: str, config: tsdat.config.Config = None, **kwargs) → None¶ Saves the given dataset to a file.
- Parameters
ds (xr.Dataset) – The dataset to save.
filename (str) – The path to which the file should be written.
config (Config, optional) – Optional Config object, defaults to None
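As a concrete illustration, here is a minimal sketch of a custom handler built on this interface. The JsonHandler name and the JSON round-trip are hypothetical and not part of tsdat; the sketch only shows read() and write() implemented against the signatures above.

import json
import xarray as xr
from tsdat.config import Config
from tsdat.io import AbstractFileHandler

class JsonHandler(AbstractFileHandler):
    """Hypothetical handler that round-trips a dataset through xarray's dict format."""

    def read(self, filename: str, **kwargs) -> xr.Dataset:
        # Load the JSON dictionary and rebuild an xr.Dataset from it.
        with open(filename) as f:
            return xr.Dataset.from_dict(json.load(f))

    def write(self, ds: xr.Dataset, filename: str, config: Config = None, **kwargs) -> None:
        # Serialize the dataset as a dictionary; default=str handles non-JSON types such as datetimes.
        with open(filename, "w") as f:
            json.dump(ds.to_dict(), f, default=str)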
-
class tsdat.io.AwsStorage(parameters: Union[Dict, None] = None)
Bases: tsdat.io.DatastreamStorage
DatastreamStorage subclass for an AWS S3-based filesystem.
- Parameters
parameters (dict, optional) –
Dictionary of parameters that should be set automatically from the storage config file when this class is instantiated via the DatastreamStorage.from_config() method. Defaults to {}.
Key parameters that should be set in the config file include:
- retain_input_files
Whether the input files should be cleaned up after they are done processing
- root_dir
The bucket ‘key’ to use to prepend to all processed files created in the persistent store. Defaults to ‘root’
- temp_dir
The bucket ‘key’ to use to prepend to all temp files created in the S3 bucket. Defaults to ‘temp’
- bucket_name
The name of the S3 bucket to store to
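For illustration, a minimal sketch of constructing this class directly with the parameters listed above; in a pipeline these values would normally come from the storage config file via DatastreamStorage.from_config(), and the bucket name here is a made-up placeholder.

from tsdat.io import AwsStorage

storage = AwsStorage(parameters={
    "retain_input_files": True,
    "root_dir": "root",                # key prefix for processed files
    "temp_dir": "temp",                # key prefix for temp files
    "bucket_name": "my-tsdat-bucket",  # hypothetical S3 bucket
})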
Class Methods
delete – Deletes datastream data in the datastream store within the specified time range.
exists – Checks if any data exists in the datastream store for the provided datastream and time range.
fetch – Fetches files from the datastream store using the datastream_name, start_time, and end_time.
find – Finds all files of the given type from the datastream store matching the given criteria.
save_local_path – Given a path to a local file, save that file to the storage.
tmp – Provides access to a TemporaryStorage object for handling temporary files during processing.
Method Descriptions
-
delete
(self, datastream_name: str, start_time: str, end_time: str, filetype: int = None) → None¶ Deletes datastream data in the datastream store in between the specified time range.
- Parameters
datastream_name (str) – The datastream_name as defined by ME Data Standards.
start_time (str) – The start time or date to start searching for data (inclusive). Should be like “20210106” to search for data beginning on or after January 6th, 2021.
end_time (str) – The end time or date to stop searching for data (exclusive). Should be like “20210108” to search for data ending before January 8th, 2021.
filetype (str, optional) – A file type from the DatastreamStorage.file_filters keys. If no type is specified, all files will be deleted. Defaults to None.
-
exists
(self, datastream_name: str, start_time: str, end_time: str, filetype: int = None) → bool¶ Checks if any data exists in the datastream store for the provided datastream and time range.
- Parameters
datastream_name (str) – The datastream_name as defined by ME Data Standards.
start_time (str) – The start time or date to start searching for data (inclusive). Should be like “20210106” to search for data beginning on or after January 6th, 2021.
end_time (str) – The end time or date to stop searching for data (exclusive). Should be like “20210108” to search for data ending before January 8th, 2021.
filetype (str, optional) – A file type from the DatastreamStorage.file_filters keys. If none specified, all files will be checked. Defaults to None.
- Returns
True if data exists, False otherwise.
- Return type
bool
-
fetch
(self, datastream_name: str, start_time: str, end_time: str, local_path: str = None, filetype: int = None) → tsdat.io.DisposableLocalTempFileList¶ Fetches files from the datastream store using the datastream_name, start_time, and end_time to specify the file(s) to retrieve. If the local path is not specified, it is up to the subclass to determine where to put the retrieved file(s).
- Parameters
datastream_name (str) – The datastream_name as defined by ME Data Standards.
start_time (str) – The start time or date to start searching for data (inclusive). Should be like “20210106” to search for data beginning on or after January 6th, 2021.
end_time (str) – The end time or date to stop searching for data (exclusive). Should be like “20210108” to search for data ending before January 8th, 2021.
local_path (str, optional) – The path to the directory where the data should be stored. Defaults to None.
filetype (int, optional) – A file type from the DatastreamStorage.file_filters keys. If no type is specified, all files will be returned. Defaults to None.
- Returns
A list of paths where the retrieved files were stored in local storage. This is a context manager class, so this method should be called via the 'with' statement; all files referenced by the list will be cleaned up when it goes out of scope.
- Return type
DisposableLocalTempFileList:
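Because the returned list is a context manager, a typical call looks like the sketch below; the datastream name and dates are hypothetical.

# Retrieved files are cleaned up automatically when the 'with' block exits.
with storage.fetch("abc.buoy_z05-10min.a1", "20210106", "20210108") as local_files:
    for path in local_files:
        print(path)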
-
find
(self, datastream_name: str, start_time: str, end_time: str, filetype: str = None) → List[S3Path]¶ Finds all files of the given type from the datastream store with the given datastream_name and timestamps from start_time (inclusive) up to end_time (exclusive). Returns a list of paths to files that match the criteria.
- Parameters
datastream_name (str) – The datastream_name as defined by ME Data Standards.
start_time (str) – The start time or date to start searching for data (inclusive). Should be like “20210106.000000” to search for data beginning on or after January 6th, 2021.
end_time (str) – The end time or date to stop searching for data (exclusive). Should be like “20210108.000000” to search for data ending before January 8th, 2021.
filetype (str, optional) – A file type from the DatastreamStorage.file_filters keys. If no type is specified, all files will be returned. Defaults to None.
- Returns
A list of paths in datastream storage in ascending order
- Return type
List[str]
-
property
root
(self)¶
-
property
s3_client
(self)¶
-
property
s3_resource
(self)¶
-
save_local_path
(self, local_path: str, new_filename: str = None)¶ Given a path to a local file, save that file to the storage.
- Parameters
local_path (str) – Local path to the file to save. The file should be named according to ME Data Standards naming conventions so that this method can automatically parse the datastream, date, and time from the file name.
new_filename (str, optional) – If provided, the new filename to save as. This parameter should ONLY be provided if using a local path for dataset_or_path. Must also follow ME Data Standards naming conventions. Defaults to None.
- Returns
The path where this file was stored in storage. Path type is dependent upon the specific storage subclass.
- Return type
Any
-
property
temp_path
(self)¶
-
property tmp(self)
Each subclass should define the tmp property, which provides access to a TemporaryStorage object that is used to efficiently handle reading/writing temporary files used during the processing pipeline, or to perform filesystem actions on files other than processed datastream files that reside in the same filesystem as the DatastreamStorage. It is not intended to be used outside of the pipeline.
- Raises
NotImplementedError – [description]
-
class tsdat.io.CsvHandler(parameters: Union[Dict, None] = None)
Bases: tsdat.io.filehandlers.file_handlers.AbstractFileHandler
FileHandler to read from and write to CSV files. Takes a number of parameters that are passed in from the storage config file. Parameters specified in the config file should follow the following example:
parameters:
  write:
    to_dataframe:
      # Parameters here will be passed to xr.Dataset.to_dataframe()
    to_csv:
      # Parameters here will be passed to pd.DataFrame.to_csv()
  read:
    read_csv:
      # Parameters here will be passed to pd.read_csv()
    to_xarray:
      # Parameters here will be passed to pd.DataFrame.to_xarray()
- Parameters
parameters (Dict, optional) – Parameters that were passed to the FileHandler when it was registered in the storage config file, defaults to {}.
Class Methods
read – Reads in the given file and converts it into an Xarray dataset for use in the pipeline.
write – Saves the given dataset to a CSV file.
Method Descriptions
-
read
(self, filename: str, **kwargs) → xarray.Dataset¶ Reads in the given file and converts it into an Xarray dataset for use in the pipeline.
- Parameters
filename (str) – The path to the file to read in.
- Returns
An xr.Dataset object.
- Return type
xr.Dataset
-
write
(self, ds: xarray.Dataset, filename: str, config: tsdat.config.Config = None, **kwargs) → None¶ Saves the given dataset to a csv file.
- Parameters
ds (xr.Dataset) – The dataset to save.
filename (str) – The path to which the file should be written.
config (Config, optional) – Optional Config object, defaults to None
-
class tsdat.io.DatastreamStorage(parameters: Union[Dict, None] = None)
Bases: abc.ABC
DatastreamStorage is the base class for providing access to processed data files in a persistent archive. DatastreamStorage provides shortcut methods to find files based upon date, datastream name, file type, etc. This is the class that should be used to save and retrieve processed data files. Use the DatastreamStorage.from_config() method to construct the appropriate subclass instance based upon a storage config file.
default_file_type
file_filters
output_file_extensions
Class Methods
delete – Deletes datastream data in the datastream store within the specified time range.
exists – Checks if any data exists in the datastream store for the provided datastream and time range.
fetch – Fetches files from the datastream store using the datastream_name, start_time, and end_time.
find – Finds all files of the given type from the datastream store matching the given criteria.
from_config – Load a yaml config file which provides the storage constructor parameters.
save – Saves a local file to the datastream store.
save_local_path – Given a path to a local file, save that file to the storage.
tmp – Provides access to a TemporaryStorage object for handling temporary files during processing.
Method Descriptions
-
abstract
delete
(self, datastream_name: str, start_time: str, end_time: str, filetype: str = None) → None¶ Deletes datastream data in the datastream store in between the specified time range.
- Parameters
datastream_name (str) – The datastream_name as defined by ME Data Standards.
start_time (str) – The start time or date to start searching for data (inclusive). Should be like “20210106” to search for data beginning on or after January 6th, 2021.
end_time (str) – The end time or date to stop searching for data (exclusive). Should be like “20210108” to search for data ending before January 8th, 2021.
filetype (str, optional) – A file type from the DatastreamStorage.file_filters keys. If no type is specified, all files will be deleted. Defaults to None.
-
abstract
exists
(self, datastream_name: str, start_time: str, end_time: str, filetype: str = None) → bool¶ Checks if any data exists in the datastream store for the provided datastream and time range.
- Parameters
datastream_name (str) – The datastream_name as defined by ME Data Standards.
start_time (str) – The start time or date to start searching for data (inclusive). Should be like “20210106” to search for data beginning on or after January 6th, 2021.
end_time (str) – The end time or date to stop searching for data (exclusive). Should be like “20210108” to search for data ending before January 8th, 2021.
filetype (str, optional) – A file type from the DatastreamStorage.file_filters keys. If none specified, all files will be checked. Defaults to None.
- Returns
True if data exists, False otherwise.
- Return type
bool
-
abstract
fetch
(self, datastream_name: str, start_time: str, end_time: str, local_path: str = None, filetype: int = None)¶ Fetches files from the datastream store using the datastream_name, start_time, and end_time to specify the file(s) to retrieve. If the local path is not specified, it is up to the subclass to determine where to put the retrieved file(s).
- Parameters
datastream_name (str) – The datastream_name as defined by ME Data Standards.
start_time (str) – The start time or date to start searching for data (inclusive). Should be like “20210106” to search for data beginning on or after January 6th, 2021.
end_time (str) – The end time or date to stop searching for data (exclusive). Should be like “20210108” to search for data ending before January 8th, 2021.
local_path (str, optional) – The path to the directory where the data should be stored. Defaults to None.
filetype (int, optional) – A file type from the DatastreamStorage.file_filters keys. If no type is specified, all files will be returned. Defaults to None.
- Returns
A list of paths where the retrieved files were stored in local storage. This is a context manager class, so this method should be called via the 'with' statement; all files referenced by the list will be cleaned up when it goes out of scope.
- Return type
DisposableLocalTempFileList:
-
abstract
find
(self, datastream_name: str, start_time: str, end_time: str, filetype: str = None) → List[str]¶ Finds all files of the given type from the datastream store with the given datastream_name and timestamps from start_time (inclusive) up to end_time (exclusive). Returns a list of paths to files that match the criteria.
- Parameters
datastream_name (str) – The datastream_name as defined by ME Data Standards.
start_time (str) – The start time or date to start searching for data (inclusive). Should be like “20210106.000000” to search for data beginning on or after January 6th, 2021.
end_time (str) – The end time or date to stop searching for data (exclusive). Should be like “20210108.000000” to search for data ending before January 8th, 2021.
filetype (str, optional) – A file type from the DatastreamStorage.file_filters keys. If no type is specified, all files will be returned. Defaults to None.
- Returns
A list of paths in datastream storage in ascending order
- Return type
List[str]
-
static
from_config
(storage_config_file: str)¶ Load a yaml config file which provides the storage constructor parameters.
- Parameters
storage_config_file (str) – The path to the config file to load
- Returns
A subclass instance created from the config file.
- Return type
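A typical usage sketch; the config file path is hypothetical.

from tsdat.io import DatastreamStorage

# Instantiates FilesystemStorage, AwsStorage, etc. depending on the config file contents.
storage = DatastreamStorage.from_config("config/storage_config.yml")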
-
save
(self, dataset_or_path: Union[str, xarray.Dataset], new_filename: str = None) → List[Any]¶ Saves a local file to the datastream store.
- Parameters
dataset_or_path (Union[str, xr.Dataset]) – The dataset or local path to the file to save. The file should be named according to ME Data Standards naming conventions so that this method can automatically parse the datastream, date, and time from the file name.
new_filename (str, optional) – If provided, the new filename to save as. This parameter should ONLY be provided if using a local path for dataset_or_path. Must also follow ME Data Standards naming conventions. Defaults to None.
- Returns
A list of paths where the saved files were stored in storage. Path type is dependent upon the specific storage subclass.
- Return type
List[Any]
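For example, assuming storage is a DatastreamStorage instance and dataset is an xr.Dataset produced by the pipeline (or a path to a standards-named local file):

saved_paths = storage.save(dataset)
print(saved_paths)  # paths within the datastream store; type depends on the storage subclass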
-
abstract
save_local_path
(self, local_path: str, new_filename: str = None) → Any¶ Given a path to a local file, save that file to the storage.
- Parameters
local_path (str) – Local path to the file to save. The file should be named according to ME Data Standards naming conventions so that this method can automatically parse the datastream, date, and time from the file name.
new_filename (str, optional) – If provided, the new filename to save as. This parameter should ONLY be provided if using a local path for dataset_or_path. Must also follow ME Data Standards naming conventions. Defaults to None.
- Returns
The path where this file was stored in storage. Path type is dependent upon the specific storage subclass.
- Return type
Any
-
property tmp(self)
Each subclass should define the tmp property, which provides access to a TemporaryStorage object that is used to efficiently handle reading/writing temporary files used during the processing pipeline, or to perform filesystem actions on files other than processed datastream files that reside in the same filesystem as the DatastreamStorage. It is not intended to be used outside of the pipeline.
- Raises
NotImplementedError – [description]
-
-
class tsdat.io.DisposableLocalTempFile(filepath: str, disposable=True)
DisposableLocalTempFile is a context manager wrapper class for a temp file on the LOCAL FILESYSTEM. It will ensure that the file is deleted when it goes out of scope.
- Parameters
filepath (str) – Path to a local temp file that could be deleted when it goes out of scope.
disposable (bool, optional) – True if this file should be automatically deleted when it goes out of scope. Defaults to True.
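A short usage sketch; the file path is hypothetical, and it is assumed that entering the context yields the wrapped path.

from tsdat.io import DisposableLocalTempFile

with DisposableLocalTempFile("/tmp/example_intermediate.nc") as path:
    print(path)  # work with the temp file here
# The file has been deleted at this point (disposable defaults to True).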
Class Methods
Method Descriptions
-
__enter__
(self)¶
-
__exit__
(self, type, value, traceback)¶
-
class tsdat.io.DisposableLocalTempFileList(filepath_list: List[str], delete_on_exception=False, disposable=True)
Bases: list
Provides a context manager wrapper class for a list of temp files on the LOCAL FILESYSTEM. It ensures that if specified, the files will be auto-deleted when the list goes out of scope.
- Parameters
filepath_list (List[str]) – A list of local temp files
delete_on_exception (bool, optional) – Whether the local temp files should be deleted if an error was thrown during processing. Defaults to False.
disposable (bool, optional) – Whether the local temp files should be auto-deleted when they go out of scope. Defaults to True.
Class Methods
Method Descriptions
-
__enter__
(self)¶
-
__exit__
(self, type, value, traceback)¶
-
class tsdat.io.DisposableStorageTempFileList(filepath_list: List[str], storage, disposable_files: Union[List, None] = None)
Bases: list
Provides a context manager wrapper class for a list of temp files on the STORAGE FILESYSTEM. It will ensure that the specified files are deleted when the list goes out of scope.
- Parameters
filepath_list (List[str]) – A list of files in temporary storage area
storage (TemporaryStorage) – The temporary storage service used to clean up temporary files.
disposable_files (list, optional) – Which of the files from the filepath_list should be auto-deleted when the list goes out of scope. Defaults to []
Class Methods
Method Descriptions
-
__enter__
(self)¶
-
__exit__
(self, type, value, traceback)¶
-
class tsdat.io.FileHandler
Class to provide methods to read and write files with a variety of extensions.
FILEREADERS : Dict[str, AbstractFileHandler]
FILEWRITERS : Dict[str, AbstractFileHandler]
Class Methods
read – Reads in the given file and converts it into an xarray dataset object using the registered FileHandler for the provided filepath.
register_file_handler – Registers a FileHandler for reading from or writing to files matching one or more file patterns.
write – Calls the appropriate FileHandler to write the dataset to the provided filename.
Method Descriptions
-
static
read
(filename: str, **kwargs) → xarray.Dataset¶ Reads in the given file and converts it into an xarray dataset object using the registered FileHandler for the provided filepath.
- Parameters
filename (str) – The path to the file to read in.
- Returns
The raw file as an Xarray.Dataset object.
- Return type
xr.Dataset
-
static
register_file_handler
(method: Literal[read, write], patterns: Union[str, List[str]], handler: AbstractFileHandler)¶ Method to register a FileHandler for reading from or writing to files matching one or more provided file patterns.
- Parameters
method (Literal["read", "write"]) – The method the FileHandler should call if the pattern is matched. Must be one of "read", "write".
patterns (Union[str, List[str]]) – The file pattern(s) that determine if this FileHandler should be run on a given filepath.
handler (AbstractFileHandler) – The AbstractFileHandler to register.
-
static
write
(ds: xarray.Dataset, filename: str, config: tsdat.config.Config = None, **kwargs) → None¶ Calls the appropriate FileHandler to write the dataset to the provided filename.
- Parameters
ds (xr.Dataset) – The dataset to save.
filename (str) – The path to the file where the dataset should be written.
config (Config, optional) – Optional Config object. Defaults to None.
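Putting the two static methods together, a sketch of format-agnostic I/O; the filenames are hypothetical and dispatch relies on handlers having been registered for the matching extensions.

from tsdat.io import FileHandler

ds = FileHandler.read("input/buoy_data.csv")  # dispatched to the handler registered for *.csv
FileHandler.write(ds, "output/buoy_data.nc")  # dispatched to the handler registered for *.nc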
-
-
class tsdat.io.FilesystemStorage(parameters: Union[Dict, None] = None)
Bases: tsdat.io.DatastreamStorage
DatastreamStorage subclass for a local Linux-based filesystem.
TODO: rename to LocalStorage as this is more intuitive.
- Parameters
parameters (dict, optional) –
Dictionary of parameters that should be set automatically from the storage config file when this class is instantiated via the DatastreamStorage.from_config() method. Defaults to {}.
Key parameters that should be set in the config file include:
- retain_input_files
Whether the input files should be cleaned up after they are done processing
- root_dir
The root path under which processed files will be stored.
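As with AwsStorage, a minimal construction sketch using the parameters above (normally supplied via the storage config file); the directory is a placeholder.

from tsdat.io import FilesystemStorage

storage = FilesystemStorage(parameters={
    "retain_input_files": True,
    "root_dir": "/data/datastreams",  # hypothetical local root for processed files
})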
Class Methods
delete – Deletes datastream data in the datastream store within the specified time range.
exists – Checks if any data exists in the datastream store for the provided datastream and time range.
fetch – Fetches files from the datastream store using the datastream_name, start_time, and end_time.
find – Finds all files of the given type from the datastream store matching the given criteria.
save_local_path – Given a path to a local file, save that file to the storage.
tmp – Provides access to a TemporaryStorage object for handling temporary files during processing.
Method Descriptions
-
delete
(self, datastream_name: str, start_time: str, end_time: str, filetype: int = None) → None¶ Deletes datastream data in the datastream store in between the specified time range.
- Parameters
datastream_name (str) – The datastream_name as defined by ME Data Standards.
start_time (str) – The start time or date to start searching for data (inclusive). Should be like “20210106” to search for data beginning on or after January 6th, 2021.
end_time (str) – The end time or date to stop searching for data (exclusive). Should be like “20210108” to search for data ending before January 8th, 2021.
filetype (str, optional) – A file type from the DatastreamStorage.file_filters keys. If no type is specified, all files will be deleted. Defaults to None.
-
exists
(self, datastream_name: str, start_time: str, end_time: str, filetype: int = None) → bool¶ Checks if any data exists in the datastream store for the provided datastream and time range.
- Parameters
datastream_name (str) – The datastream_name as defined by ME Data Standards.
start_time (str) – The start time or date to start searching for data (inclusive). Should be like “20210106” to search for data beginning on or after January 6th, 2021.
end_time (str) – The end time or date to stop searching for data (exclusive). Should be like “20210108” to search for data ending before January 8th, 2021.
filetype (str, optional) – A file type from the DatastreamStorage.file_filters keys. If none specified, all files will be checked. Defaults to None.
- Returns
True if data exists, False otherwise.
- Return type
bool
-
fetch
(self, datastream_name: str, start_time: str, end_time: str, local_path: str = None, filetype: int = None) → tsdat.io.DisposableLocalTempFileList¶ Fetches files from the datastream store using the datastream_name, start_time, and end_time to specify the file(s) to retrieve. If the local path is not specified, it is up to the subclass to determine where to put the retrieved file(s).
- Parameters
datastream_name (str) – The datastream_name as defined by ME Data Standards.
start_time (str) – The start time or date to start searching for data (inclusive). Should be like “20210106” to search for data beginning on or after January 6th, 2021.
end_time (str) – The end time or date to stop searching for data (exclusive). Should be like “20210108” to search for data ending before January 8th, 2021.
local_path (str, optional) – The path to the directory where the data should be stored. Defaults to None.
filetype (int, optional) – A file type from the DatastreamStorage.file_filters keys. If no type is specified, all files will be returned. Defaults to None.
- Returns
A list of paths where the retrieved files were stored in local storage. This is a context manager class, so this method should be called via the 'with' statement; all files referenced by the list will be cleaned up when it goes out of scope.
- Return type
DisposableLocalTempFileList:
-
find
(self, datastream_name: str, start_time: str, end_time: str, filetype: str = None) → List[str]¶ Finds all files of the given type from the datastream store with the given datastream_name and timestamps from start_time (inclusive) up to end_time (exclusive). Returns a list of paths to files that match the criteria.
- Parameters
datastream_name (str) – The datastream_name as defined by ME Data Standards.
start_time (str) – The start time or date to start searching for data (inclusive). Should be like “20210106.000000” to search for data beginning on or after January 6th, 2021.
end_time (str) – The end time or date to stop searching for data (exclusive). Should be like “20210108.000000” to search for data ending before January 8th, 2021.
filetype (str, optional) – A file type from the DatastreamStorage.file_filters keys. If no type is specified, all files will be returned. Defaults to None.
- Returns
A list of paths in datastream storage in ascending order
- Return type
List[str]
-
save_local_path
(self, local_path: str, new_filename: str = None) → Any¶ Given a path to a local file, save that file to the storage.
- Parameters
local_path (str) – Local path to the file to save. The file should be named according to ME Data Standards naming conventions so that this method can automatically parse the datastream, date, and time from the file name.
new_filename (str, optional) – If provided, the new filename to save as. This parameter should ONLY be provided if using a local path for dataset_or_path. Must also follow ME Data Standards naming conventions. Defaults to None.
- Returns
The path where this file was stored in storage. Path type is dependent upon the specific storage subclass.
- Return type
Any
-
property tmp(self)
Each subclass should define the tmp property, which provides access to a TemporaryStorage object that is used to efficiently handle reading/writing temporary files used during the processing pipeline, or to perform filesystem actions on files other than processed datastream files that reside in the same filesystem as the DatastreamStorage. It is not intended to be used outside of the pipeline.
- Raises
NotImplementedError – [description]
-
class tsdat.io.NetCdfHandler(parameters: Union[Dict, None] = None)
Bases: tsdat.io.filehandlers.file_handlers.AbstractFileHandler
FileHandler to read from and write to netCDF files. Takes a number of parameters that are passed in from the storage config file. Parameters specified in the config file should follow the following example:
parameters:
  write:
    to_netcdf:
      # Parameters here will be passed to xr.Dataset.to_netcdf()
  read:
    load_dataset:
      # Parameters here will be passed to xr.load_dataset()
- Parameters
parameters (Dict, optional) – Parameters that were passed to the FileHandler when it was registered in the storage config file, defaults to {}.
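Outside of a pipeline the handler can also be used directly; the filenames below are hypothetical.

from tsdat.io import NetCdfHandler

handler = NetCdfHandler()
ds = handler.read("abc.buoy_z05.a1.20210106.000000.nc")
handler.write(ds, "copy_of_file.nc")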
Class Methods
read – Reads in the given file and converts it into an Xarray dataset for use in the pipeline.
write – Saves the given dataset to a netCDF file.
Method Descriptions
-
read
(self, filename: str, **kwargs) → xarray.Dataset¶ Reads in the given file and converts it into an Xarray dataset for use in the pipeline.
- Parameters
filename (str) – The path to the file to read in.
- Returns
An xr.Dataset object.
- Return type
xr.Dataset
-
write
(self, ds: xarray.Dataset, filename: str, config: tsdat.config.Config = None, **kwargs) → None¶ Saves the given dataset to a netCDF file.
- Parameters
ds (xr.Dataset) – The dataset to save.
filename (str) – The path to which the file should be written.
config (Config, optional) – Optional Config object, defaults to None
-
class tsdat.io.S3Path(bucket_name: str, bucket_path: str = '', region_name: str = None)
Bases: str
This class wraps a 'special' path string that lets us include the bucket name and region in the path, so that we can use it seamlessly in boto3 APIs. We are creating our own string to hold the region, bucket, and key (i.e., path), since boto3 needs all three in order to access a file.
Example:

    s3_client = boto3.client('s3', region_name='eu-central-1')
    s3_client.download_file(bucket, key, download_path)
- Parameters
bucket_name (str) – The S3 bucket name where this file is located
bucket_path (str, optional) – The key to access this file in the bucket
region_name (str, optional) – The AWS region where this file is located, defaults to None, which inherits the default configured region.
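A brief usage sketch; the bucket, key prefix, and filename are placeholders.

from tsdat.io import S3Path

path = S3Path("my-tsdat-bucket", "root/abc.buoy_z05.a1", region_name="us-west-2")
nc_file = path.join("abc.buoy_z05.a1.20210106.000000.nc")  # behaves like os.path.join
print(nc_file.bucket_name, nc_file.bucket_path)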
Class Methods
__str__ – Return str(self).
join – Joins segments in an S3 path. This method behaves exactly like os.path.join.
Method Descriptions
-
__str__
(self)¶ Return str(self).
-
property
bucket_name
(self)¶
-
property
bucket_path
(self)¶
-
join
(self, *args)¶ Joins segments in an S3 path. This method behaves exactly like os.path.join.
- Returns
A new S3Path with the additional segments added.
- Return type
-
property
region_name
(self)¶
-
class tsdat.io.SplitNetCdfHandler(parameters: Union[Dict, None] = None)
Bases: NetCdfHandler
FileHandler to read from and write to netCDF files, splitting written output into multiple netCDF files according to the 'time_interval' and 'time_unit' config parameters. Takes a number of parameters that are passed in from the storage config file. Parameters specified in the config file should follow the following example:
parameters:
  write:
    to_netcdf:
      # Parameters here will be passed to xr.Dataset.to_netcdf()
  read:
    load_dataset:
      # Parameters here will be passed to xr.load_dataset()
- Parameters
parameters (Dict, optional) – Parameters that were passed to the FileHandler when it was registered in the storage config file, defaults to {}.
Class Methods
read – Reads in the given file and converts it into an Xarray dataset for use in the pipeline.
write – Saves the given dataset to netCDF file(s) based on the 'time_interval' and 'time_unit' config parameters.
Method Descriptions
-
abstract
read
(self, filename: str, **kwargs)¶ Reads in the given file and converts it into an Xarray dataset for use in the pipeline.
- Parameters
filename (str) – The path to the file to read in.
- Returns
An xr.Dataset object.
- Return type
xr.Dataset
-
write
(self, ds: xarray.Dataset, filename: str, config: tsdat.config.Config = None, **kwargs) → None¶ Saves the given dataset to netCDF file(s) based on the ‘time_interval’ and ‘time_unit’ config parameters.
- Parameters
ds (xr.Dataset) – The dataset to save.
filename (str) – The path to which the file should be written.
config (Config, optional) – Optional Config object, defaults to None
-
class tsdat.io.TemporaryStorage(storage: DatastreamStorage)
Bases: abc.ABC
Each DatastreamStorage should contain a corresponding TemporaryStorage class which provides access to a TemporaryStorage object that is used to efficiently handle reading/writing temporary files used during the processing pipeline, or to perform filesystem actions on files other than processed datastream files that reside in the same filesystem as the DatastreamStorage.
TemporaryStorage methods return a context manager so that the created temporary files can be automatically removed when they go out of scope.
TemporaryStorage is a helper class intended to be used in the internals of pipeline implementations only. It is not meant as an external API for interacting with files in DatastreamStorage.
TODO: rename to a more intuitive name…
- Parameters
storage (DatastreamStorage) – A reference to the corresponding DatastreamStorage
Class Methods
clean – Clean any extraneous files from the temp working dirs.
create_temp_dir – Create a new, temporary directory under the local tmp area managed by TemporaryStorage.
delete – Remove a file from the storage temp area if the file exists.
extract_files – If provided a path to an archive file, extract the archive into a temp directory in the storage filesystem.
fetch – Fetch a file from temp storage to a local temp folder.
fetch_previous_file – Look in DatastreamStorage for the first processed file before the given date.
get_temp_filepath – Construct a filepath for a temporary file in the storage-approved local temp folder.
ignore_zip_check – Return true if this file should be excluded from the zip file check.
local_temp_folder – Default method to get a local temporary folder for use when retrieving files from temporary storage.
Method Descriptions
-
clean
(self)¶ Clean any extraneous files from the temp working dirs. Temp files could be in two places:
the local temp folder - used when fetching files from the store
the storage temp folder - used when extracting zip files in some stores (e.g., AWS)
This method removes the local temp folder. Child classes can extend this method to clean up their respective storage temp folders.
-
create_temp_dir
(self) → str¶ Create a new, temporary directory under the local tmp area managed by TemporaryStorage.
- Returns
Path to the local dir.
- Return type
str
-
abstract
delete
(self, file_path: str)¶ Remove a file from storage temp area if the file exists. If the file does not exist, this method will NOT raise an exception.
- Parameters
file_path (str) – The path of a file located in the same filesystem as the storage.
-
abstract extract_files(self, file_path: Union[str, List[str]]) → DisposableStorageTempFileList
If provided a path to an archive file, this function will extract the archive into a temp directory IN THE SAME FILESYSTEM AS THE STORAGE. This means, for example, that if the storage is in an S3 bucket, then the files will be extracted to a temp dir in that S3 bucket. This is to prevent local disk limitations when running via Lambda.
If the file is not an archive, then the same file will be returned.
This method supports zip, tar, and tar.gz file formats.
- Parameters
file_path (Union[str, List[str]]) – The path of a file or a list of files that should be processed together, located in the same filesystem as the storage.
- Returns
A list of paths to the files that were extracted. Files will be located in the temp area of the storage filesystem.
- Return type
-
abstract
fetch
(self, file_path: str, local_dir=None, disposable=True) → Union[DisposableLocalTempFile, str]¶ Fetch a file from temp storage to a local temp folder. If disposable is True, then a DisposableLocalTempFile will be returned so that it can be used with a context manager.
- Parameters
file_path (str) – The path of a file located in the same filesystem as the storage.
local_dir ([type], optional) – The destination folder for the file. If not specified, it will be created in the storage-approved local temp folder. Defaults to None.
disposable (bool, optional) – True if this file should be auto-deleted when it goes out of scope. Defaults to True.
- Returns
If disposable, return a DisposableLocalTempFile, otherwise return the path to the local file.
- Return type
Union[DisposableLocalTempFile, str]
-
abstract
fetch_previous_file
(self, datastream_name: str, start_time: str) → DisposableLocalTempFile¶ Look in DatastreamStorage for the first processed file before the given date.
- Parameters
datastream_name (str) – The datastream_name as defined by ME Data Standards.
start_time (str) – The start time or date to start searching for data (inclusive). Should be like “20210106” to search for data beginning on or after January 6th, 2021.
- Returns
If a previous file was found, return the local path to the fetched file. Otherwise return None. (Return value wrapped in DisposableLocalTempFile so it can be auto-deleted if needed.)
- Return type
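A usage sketch, e.g. when a pipeline needs the previously processed file for continuity checks; the datastream name and date are hypothetical, and storage is assumed to be a DatastreamStorage instance.

with storage.tmp.fetch_previous_file("abc.buoy_z05.a1", "20210106.000000") as prev_path:
    if prev_path is not None:
        print("previous file:", prev_path)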
-
get_temp_filepath
(self, filename: str = None, disposable: bool = True) → DisposableLocalTempFile¶ Construct a filepath for a temporary file that will be located in the storage-approved local temp folder and will be deleted when it goes out of scope.
- Parameters
filename (str, optional) – The filename to use for the temp file. If no filename is provided, one will be created. Defaults to None
disposable (bool, optional) – If true, then wrap in DisposableLocalTempfile so that the file will be removed when it goes out of scope. Defaults to True.
- Returns
Path to the local file. The file will be automatically deleted when it goes out of scope.
- Return type
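For example, to write an intermediate file that cleans itself up, assuming storage is a DatastreamStorage instance and ds is an xr.Dataset:

with storage.tmp.get_temp_filepath("intermediate.nc") as tmp_path:
    ds.to_netcdf(tmp_path)
# The temp file is removed once the block exits.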
-
ignore_zip_check
(self, filepath: str) → bool¶ Return true if this file should be excluded from the zip file check. We need this for Office documents, since they are actually zip files under the hood, so we don’t want to try to unzip them.
- Parameters
filepath (str) – the file we are potentially extracting
- Returns
whether we should check if it is a zip or not
- Return type
bool
-
property
local_temp_folder
(self) → str¶ Default method to get a local temporary folder for use when retrieving files from temporary storage. This method should work for all filesystems, but can be overridden if needed by subclasses.
- Returns
Path to local temp folder
- Return type
str
-
tsdat.io.register_filehandler(patterns: Union[str, List[str]]) → AbstractFileHandler
Python decorator to register an AbstractFileHandler in the FileHandler object. The FileHandler object will be used by tsdat pipelines to read and write raw, intermediate, and processed data.
This decorator can be used to work with a specific AbstractFileHandler without having to specify a config file. This is useful when using an AbstractFileHandler for analysis or for tests outside of a pipeline. For tsdat pipelines, handlers should always be specified via the storage config file.
Example Usage:
import xarray as xr
from tsdat.config import Config
from tsdat.io import register_filehandler, AbstractFileHandler

@register_filehandler(["*.nc", "*.cdf"])
class NetCdfHandler(AbstractFileHandler):

    def write(self, ds: xr.Dataset, filename: str, config: Config = None, **kwargs):
        ds.to_netcdf(filename)

    def read(self, filename: str, **kwargs) -> xr.Dataset:
        return xr.load_dataset(filename)
- Parameters
patterns (Union[str, List[str]]) – The patterns (regex) that should be used to match a filepath to the AbstractFileHandler provided.
- Returns
The original AbstractFileHandler class, after it has been registered for use in tsdat pipelines.
- Return type