Module riid.data.converters
This module contains utilities for converting known datasets into SampleSets.
Expand source code Browse git
# Copyright 2021 National Technology & Engineering Solutions of Sandia, LLC (NTESS).
# Under the terms of Contract DE-NA0003525 with NTESS,
# the U.S. Government retains certain rights in this software.
"""This module contains utilities for converting known datasets into `SampleSet`s."""
import glob
from pathlib import Path
from typing import Callable
from joblib import Parallel, delayed
def _validate_and_create_output_dir(output_dir: str):
output_dir_path = Path(output_dir)
output_dir_path.mkdir(exist_ok=True)
if not output_dir_path.is_dir:
raise ValueError("`output_dir` already exists but is not a directory.")
def convert_directory(input_dir_path: str, conversion_func: Callable, file_ext: str,
n_jobs: int = 8, **kwargs):
"""Convert and save every file in a specified directory.
Conversion functions can be found in sub-modules:
- AIPT: `riid.data.converters.aipt.convert_and_save()`
- TopCoder: `riid.data.converters.topcoder.convert_and_save()`
Due to usage of parallel processing, be sure to run this function as follows:
```python
if __name__ == "__main__":
convert_directory(...)
```
Tip: for max utilization, considering setting `n_jobs` to `multiprocessing.cpu_count()`.
Args:
input_dir_path: directory path containing the input files
conversion_func: function used to convert a data file to a `SampleSet`
file_ext: file extension to read in for conversion
n_jobs: `joblib.Parallel` parameter to set the # of jobs
kwargs: additional keyword args passed to conversion_func
"""
input_path = Path(input_dir_path)
if not input_path.exists() or not input_path.is_dir():
print(f"No directory at provided input path: '{input_dir_path}'")
return
input_file_paths = sorted(glob.glob(f"{input_dir_path}/*.{file_ext}"))
Parallel(n_jobs, verbose=10)(
delayed(conversion_func)(path, **kwargs) for path in input_file_paths
)
Sub-modules
riid.data.converters.aipt-
This module provides tools for handling data related to the multi-lab Algorithm Improvement Program Team (AIPT).
riid.data.converters.topcoder-
This module provides tools for handling data related to the "Detecting Radiological Threats in Urban Areas" TopCoder Challenge. …
Functions
def convert_directory(input_dir_path: str, conversion_func: Callable, file_ext: str, n_jobs: int = 8, **kwargs)-
Convert and save every file in a specified directory.
Conversion functions can be found in sub-modules:
- AIPT:
convert_and_save() - TopCoder:
convert_and_save()
Due to usage of parallel processing, be sure to run this function as follows:
if __name__ == "__main__": convert_directory(...)Tip: for max utilization, considering setting
n_jobstomultiprocessing.cpu_count().Args
input_dir_path- directory path containing the input files
conversion_func- function used to convert a data file to a
SampleSet file_ext- file extension to read in for conversion
n_jobsjoblib.Parallelparameter to set the # of jobskwargs- additional keyword args passed to conversion_func
Expand source code Browse git
def convert_directory(input_dir_path: str, conversion_func: Callable, file_ext: str, n_jobs: int = 8, **kwargs): """Convert and save every file in a specified directory. Conversion functions can be found in sub-modules: - AIPT: `riid.data.converters.aipt.convert_and_save()` - TopCoder: `riid.data.converters.topcoder.convert_and_save()` Due to usage of parallel processing, be sure to run this function as follows: ```python if __name__ == "__main__": convert_directory(...) ``` Tip: for max utilization, considering setting `n_jobs` to `multiprocessing.cpu_count()`. Args: input_dir_path: directory path containing the input files conversion_func: function used to convert a data file to a `SampleSet` file_ext: file extension to read in for conversion n_jobs: `joblib.Parallel` parameter to set the # of jobs kwargs: additional keyword args passed to conversion_func """ input_path = Path(input_dir_path) if not input_path.exists() or not input_path.is_dir(): print(f"No directory at provided input path: '{input_dir_path}'") return input_file_paths = sorted(glob.glob(f"{input_dir_path}/*.{file_ext}")) Parallel(n_jobs, verbose=10)( delayed(conversion_func)(path, **kwargs) for path in input_file_paths ) - AIPT: