
tar_reader

Classes:

Name Description
TarReader DataReader for reading from a tarred archive. Writing to this format is not supported.

Classes#

TarReader #

TarReader(parameters: Dict = None)

Bases: ArchiveReader

DataReader for reading from a tarred archive. Writing to this format is not supported.

This class requires that readers be specified in the parameters section of the storage configuration file. The structure of this nested readers section should mirror the structure of its parent readers section. To illustrate, consider the following configuration block:

readers:
  .*:
    tar:
      file_pattern: .*tar
      classname: tsdat.io.readers.TarReader
      parameters:
        # Parameters to specify how the TarReader should read/unpack the archive.
        # Parameters here are passed to Python's built-in open() function as
        # kwargs. The default value is shown below.
        open_tar_kwargs:
          mode: "rb"

        # Parameters here are passed to tarfile.open() as kwargs. Useful for
        # specifying the system encoding or compression algorithm to use for
        # unpacking the archive. These are optional.
        read_tar_kwargs:
          mode: "r:gz"


        # The readers section tells the TarReader which DataReaders should be
        # used to handle the unpacked files.
        readers:
          .*csv:
            classname: tsdat.io.readers.CSVReader
            parameters:  # Parameters specific to tsdat.io.readers.CSVReader
              read_csv_kwargs:
                sep: '\t'

        # Pattern(s) used to exclude certain files in the archive from being handled.
        # This parameter is optional, and the default value is shown below:
        exclude: ['.*__MACOSX/.*', '.*DS_Store']
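
To make the two kwargs groups in the configuration above more concrete, here is a minimal standard-library sketch of how they could map onto `open()` and `tarfile.open()`. It does not use tsdat itself; the helper names `list_archive_members` and `make_demo_archive` and the file `data/sample.csv` are hypothetical and exist only for illustration.

```python
import io
import os
import tarfile
import tempfile

# Hypothetical stand-ins for the two parameter groups in the configuration above.
open_tar_kwargs = {"mode": "rb"}    # forwarded to the built-in open()
read_tar_kwargs = {"mode": "r:gz"}  # forwarded to tarfile.open()


def list_archive_members(path: str) -> list:
    """Open `path` with both kwargs groups and return the regular-file names."""
    with open(path, **open_tar_kwargs) as fileobj:
        with tarfile.open(fileobj=fileobj, **read_tar_kwargs) as tar:
            return [member.name for member in tar if member.isfile()]


def make_demo_archive(directory: str) -> str:
    """Write a small gzip-compressed tar archive holding one tab-separated file."""
    path = os.path.join(directory, "demo.tar.gz")
    payload = b"time\tvalue\n0\t1.5\n"
    with tarfile.open(path, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="data/sample.csv")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return path


with tempfile.TemporaryDirectory() as tmp:
    members = list_archive_members(make_demo_archive(tmp))
```

If the archive is not gzip-compressed, a `read_tar_kwargs` mode such as `"r:gz"` would raise an error, so the value presumably needs to match how the archive was produced.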

Classes:

Name Description
Parameters

Methods:

Name Description
read

Attributes:

Name Type Description
parameters Parameters
Source code in tsdat/io/base/archive_reader.py
def __init__(self, parameters: Dict = None):  # type: ignore
    super().__init__(parameters=parameters)

    # Naively merge a list of regex patterns to exclude certain files from being
    # read. By default we exclude files that macOS creates when zipping a folder.
    exclude = [".*\\_\\_MACOSX/.*", ".*\\.DS_Store"]
    exclude.extend(getattr(self.parameters, "exclude", []))
    self.parameters.exclude = "(?:% s)" % "|".join(exclude)
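
The pattern merging in the constructor above can be tried in isolation. The following is a minimal sketch, assuming the same two default patterns; `build_exclude_pattern` and `is_excluded` are hypothetical helper names, not part of tsdat.

```python
import re
from typing import Iterable, List

# Defaults mirroring the two macOS patterns used in the constructor above.
DEFAULT_EXCLUDE: List[str] = [r".*__MACOSX/.*", r".*\.DS_Store"]


def build_exclude_pattern(extra: Iterable[str] = ()) -> str:
    """Join the default and user-supplied patterns into one alternation."""
    patterns = DEFAULT_EXCLUDE + list(extra)
    return "(?:%s)" % "|".join(patterns)


def is_excluded(pattern: str, filename: str) -> bool:
    # re.match anchors at the start of the string, which is why each
    # pattern begins with ".*" to allow a match anywhere in the path.
    return re.match(pattern, filename) is not None


# A user-supplied pattern is appended to the defaults rather than replacing them.
pattern = build_exclude_pattern([r".*\.txt"])
```

Because the patterns are joined naively with `|`, a user-supplied pattern that is not itself a valid regular expression would make the merged pattern fail to compile.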

Attributes#

parameters class-attribute instance-attribute #
parameters: Parameters = Parameters()

Classes#

Parameters #

Bases: BaseModel

Attributes:

Name Type Description
exclude List[str]
open_tar_kwargs Dict[str, Any]
read_tar_kwargs Dict[str, Any]
readers Dict[str, Any]
Attributes#
exclude class-attribute instance-attribute #
exclude: List[str] = []
open_tar_kwargs class-attribute instance-attribute #
open_tar_kwargs: Dict[str, Any] = {}
read_tar_kwargs class-attribute instance-attribute #
read_tar_kwargs: Dict[str, Any] = {}
readers class-attribute instance-attribute #
readers: Dict[str, Any] = {}

Functions#

read #
read(input_key: str) -> Dict[str, xr.Dataset]

Extracts the file into memory and uses registered DataReaders to read each relevant extracted file into its own xarray Dataset object. Returns a mapping like {filename: xr.Dataset}.

Parameters:

Name Type Description Default
input_key str The file to read in. It is used to open the tar file. required

Returns:

Type Description
Dict[str, Dataset] A mapping of {label: xr.Dataset}.


Source code in tsdat/io/readers/tar_reader.py
def read(self, input_key: str) -> Dict[str, xr.Dataset]:
    """------------------------------------------------------------------------------------
    Extracts the file into memory and uses registered `DataReaders` to read each relevant
    extracted file into its own xarray Dataset object. Returns a mapping like
    {filename: xr.Dataset}.

    Args:
        input_key (str): The file to read in. It is used to open the tar file.

    Returns:
        Dict[str, xr.Dataset]: A mapping of {label: xr.Dataset}.

    ------------------------------------------------------------------------------------
    """

    output: Dict[str, xr.Dataset] = {}

    # If we are reading from a string / filepath then add option to specify more
    # parameters for opening (i.e., mode or encoding options)
    if isinstance(input_key, str):  # Necessary for archiveReaders
        open_params = dict(mode="rb")
        open_params.update(self.parameters.open_tar_kwargs)
        fileobj = open(input_key, **open_params)  # type: ignore
    else:
        fileobj = input_key

    tar = tarfile.open(fileobj=fileobj, **self.parameters.read_tar_kwargs)  # type: ignore

    for info_obj in tar:  # type: ignore
        filename = info_obj.name  # type: ignore
        if re.match(self.parameters.exclude, filename):  # type: ignore
            continue

        for key in self.parameters.readers.keys():
            reader: Optional[DataReader] = self.parameters.readers.get(key, None)
            if reader:
                tar_bytes = BytesIO(tar.extractfile(filename).read())  # type: ignore
                data = reader.read(tar_bytes)  # type: ignore

                if isinstance(data, xr.Dataset):
                    data = {filename: data}  # type: ignore
                output.update(data)  # type: ignore

    return output
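
The in-memory extraction and dispatch loop above can be sketched with the standard library alone. As listed, the loop does not test each reader key against the filename before calling the reader; the sketch below adds that check as an assumption about the intended behavior. The names `read_tsv`, `read_archive`, and `build_demo_archive` are hypothetical, and plain functions stand in for DataReader objects.

```python
import io
import re
import tarfile
from typing import Callable, Dict


def read_tsv(buffer: io.BytesIO) -> list:
    """Hypothetical reader: split tab-separated bytes into rows of strings."""
    return [line.split("\t") for line in buffer.read().decode().splitlines()]


# Maps a filename pattern to the callable that handles matching files.
readers: Dict[str, Callable[[io.BytesIO], object]] = {r".*\.csv": read_tsv}
exclude = r"(?:.*__MACOSX/.*|.*\.DS_Store)"


def read_archive(fileobj: io.BytesIO) -> Dict[str, object]:
    output: Dict[str, object] = {}
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile() or re.match(exclude, member.name):
                continue
            for key, reader in readers.items():
                # Assumption: only dispatch when the reader's pattern matches.
                if re.match(key, member.name):
                    extracted = tar.extractfile(member)
                    output[member.name] = reader(io.BytesIO(extracted.read()))
    return output


def build_demo_archive() -> io.BytesIO:
    """Build a gzip-compressed tar archive in memory with one excluded file."""
    files = {"data/a.csv": b"t\tv\n0\t1\n", "data/.DS_Store": b"junk"}
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w:gz") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    raw.seek(0)
    return raw


result = read_archive(build_demo_archive())
```

Each member is copied into its own `BytesIO` before being handed to a reader, so nothing is written to disk; for large archives this means the whole member is held in memory at once.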