Data Ops

Classes and functions to aid in generating random testing and training sets.

Data

class ia.gaius.data_ops.Data(data_directories=None, dataset=None)

Bases: object

__init__(data_directories=None, dataset=None)

Supply either a list of data_directories, or a dataset.

prep(percent_of_dataset_chosen: float, percent_reserved_for_training: float, shuffle: bool = False)

DataRecords

class ia.gaius.data_ops.DataRecords(original_dataset: str | Path, DR: float, DF: float, shuffle: bool, folder: bool = True)

Bases: object

Splits data into random sets for training and testing.

__init__(original_dataset: str | Path, DR: float, DF: float, shuffle: bool, folder: bool = True)
Parameters:
  • original_dataset (str or Path, required) – location of the dataset to use for the training and testing sets

  • DR (float, required) – percentage of the total data to use for testing and training combined. 0 < DR < 100

  • DF (float, required) – percentage of the DR selection to use for training; the remainder of the DR selection is used for testing. 0 < DF < 100

  • shuffle (bool, required) – whether to shuffle the data when creating sets

  • folder (bool, optional) – set to True if the original dataset is a folder (default: True)

After creating an instance, use the train_sequences and test_sequences attributes to access the data sets.

Variables:
  • train_sequences – the files to use for training

  • test_sequences – the files to use for testing
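The DR/DF arithmetic described above can be sketched in plain Python. This is a minimal illustration, not the DataRecords implementation: the function name split_records, the seed parameter, and the choice to truncate fractional counts with int() are all assumptions made for the example.

```python
import random
from typing import List, Optional, Sequence, Tuple


def split_records(
    items: Sequence[str],
    dr: float,
    df: float,
    shuffle: bool,
    seed: Optional[int] = None,
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: return (train, test) lists chosen from items.

    dr is the percentage of all items to use; df is the percentage of
    that selection reserved for training. Rounding by truncation is an
    assumption; the library may round differently.
    """
    if not (0 < dr < 100 and 0 < df < 100):
        raise ValueError("dr and df must be strictly between 0 and 100")

    pool = list(items)
    if shuffle:
        # A seeded Random instance keeps the example reproducible.
        random.Random(seed).shuffle(pool)

    n_chosen = int(len(pool) * dr / 100)   # size of the DR selection
    chosen = pool[:n_chosen]
    n_train = int(n_chosen * df / 100)     # training share of the selection
    return chosen[:n_train], chosen[n_train:]


files = [f"file_{i}.json" for i in range(10)]
train, test = split_records(files, dr=50, df=80, shuffle=False)
# train holds file_0 .. file_3, test holds file_4
```

With 10 files, DR=50 selects 5 of them, and DF=80 assigns 4 of those 5 to training and the remaining 1 to testing.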

data_ops module

This module includes the classes above in addition to the following:

class ia.gaius.data_ops.PreparedData(data_directories=None, dataset=None, prep_enabled: bool = False)

Bases: Data

Subclass of Data signifying that train_sequences and test_sequences contain raw sequences rather than file paths to sequences.

Use the prep_enabled flag to control whether prep() is executed during training. Shuffling does not happen if prep_enabled=False.

prep(*args, **kwargs)

ia.gaius.data_ops.atoi(text: str)

Attempts to convert a string to an int.

ia.gaius.data_ops.natural_keys(text: str)

alist.sort(key=natural_keys) sorts in human order. See http://nedbatchelder.com/blog/200712/human_sorting.html (Toothy’s implementation in the comments).
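For readers unfamiliar with human (natural) sorting, the idea can be sketched as follows. This is an illustration of the general technique, not the library's source: the names atoi_sketch and natural_keys_sketch are hypothetical, and the actual atoi and natural_keys may differ in detail.

```python
import re
from typing import List, Union


def atoi_sketch(text: str) -> Union[int, str]:
    """Return text as an int when possible, otherwise unchanged."""
    try:
        return int(text)
    except ValueError:
        return text


def natural_keys_sketch(text: str) -> List[Union[int, str]]:
    """Split text into digit and non-digit chunks for use as a sort key.

    The capturing group in the pattern keeps the digit runs in the
    result, so "file10.txt" becomes ["file", 10, ".txt"].
    """
    return [atoi_sketch(chunk) for chunk in re.split(r"(\d+)", text)]


names = ["file10.txt", "file2.txt", "file1.txt"]
names.sort(key=natural_keys_sketch)
# names is now ['file1.txt', 'file2.txt', 'file10.txt']
```

A plain names.sort() would instead place "file10.txt" before "file2.txt", because strings are compared character by character.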

ia.gaius.data_ops.raw_in_count(filename: str)

ia.gaius.data_ops.validate_data(data: dict)

Validates whether the data is in the correct GAIuS-digestible format. Returns True if the data validates and False if it does not. As the example shows, some malformed inputs raise an Exception instead.

Parameters:
  • data (dict, required) – GDF to validate

Example

>>> from ia.gaius.data_ops import validate_data
>>> gdf = {'strings': ["hello"], 'vectors': [], 'emotives': {}}
>>> validate_data(gdf)
True
>>> bad_gdf = {'strings': []}
>>> validate_data(bad_gdf)
Traceback (most recent call last):
    ...
Exception: Dictionary requires "vectors", "emotives", and "strings" as keys!
>>> bad_gdf_2 = {'strings': ["hello"], 'vectors': [], 'emotives': ['utility|5']}
>>> validate_data(bad_gdf_2)
Traceback (most recent call last):
    ...
Exception: "emotives" must be a dict of <str, float>. Dict not provided!
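The two failure cases in the example suggest checks along the following lines. This is only a sketch reconstructed from the example output: validate_gdf_sketch is a hypothetical name, it covers just the two checks shown, and the real validate_data presumably performs further validation (for instance on the contents of each field).

```python
REQUIRED_KEYS = ("vectors", "emotives", "strings")


def validate_gdf_sketch(data: dict) -> bool:
    """Hypothetical stand-in covering only the two checks in the example."""
    # Check 1: all three top-level keys must be present.
    if not all(key in data for key in REQUIRED_KEYS):
        raise Exception(
            'Dictionary requires "vectors", "emotives", and "strings" as keys!'
        )
    # Check 2: emotives must be a dict, not a list or other type.
    if not isinstance(data["emotives"], dict):
        raise Exception(
            '"emotives" must be a dict of <str, float>. Dict not provided!'
        )
    return True


gdf = {"strings": ["hello"], "vectors": [], "emotives": {}}
result = validate_gdf_sketch(gdf)  # passes both checks
```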