Data Ops
Classes and functions to aid in random testing and training set generation
Data
DataRecords
- class ia.gaius.data_ops.DataRecords(original_dataset: str | Path, DR: float, DF: float, shuffle: bool, folder: bool = True)
Bases: object
Splits data into random sets for training and testing.
- __init__(original_dataset: str | Path, DR: float, DF: float, shuffle: bool, folder: bool = True)
- Parameters:
original_dataset (str or list, required) – location of dataset to use for training and testing sets
DR (float, required) – percentage of the total data to use for training and testing combined. 0 < DR < 100
DF (float, required) – percentage of the DR portion to use for training. The rest of the DR portion is used for testing. 0 < DF < 100
shuffle (bool, required) – whether to shuffle the data when creating sets
folder (bool, optional) – set if the original dataset is a folder
After constructing the class, read the member variables train_sequences and test_sequences to obtain the data sets.
- Variables:
train_sequences – the files to use for training
test_sequences – the files to use for testing
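The relationship between DR and DF described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name split_records, its seed argument, and the rounding behaviour are all assumptions, and DR and DF are treated as percentages because the documented range is 0 to 100.

```python
import random
from typing import List, Sequence, Tuple


def split_records(files: Sequence[str], dr: float, df: float,
                  shuffle: bool, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Hypothetical stand-in for the DR/DF split (not the ia.gaius code).

    dr: percentage of all files to keep (0 < dr < 100).
    df: percentage of the kept files used for training; the rest is for testing.
    """
    if not (0 < dr < 100 and 0 < df < 100):
        raise ValueError("dr and df must both lie strictly between 0 and 100")
    pool = list(files)
    if shuffle:
        # A seeded generator keeps this sketch reproducible (assumption).
        random.Random(seed).shuffle(pool)
    kept = pool[: int(len(pool) * dr / 100)]   # the DR portion
    n_train = int(len(kept) * df / 100)        # the DF share of that portion
    return kept[:n_train], kept[n_train:]


files = [f"record_{i}.json" for i in range(100)]
train_sequences, test_sequences = split_records(files, dr=50, df=80, shuffle=True)
print(len(train_sequences), len(test_sequences))  # 40 10
```

With 100 files, DR=50 keeps 50 of them, and DF=80 assigns 40 of those to training and the remaining 10 to testing.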
data_ops module
This module includes the classes above in addition to the following:
- class ia.gaius.data_ops.PreparedData(data_directories=None, dataset=None, prep_enabled: bool = False)
Bases: Data
Overloaded type for the Data class, signifying that train_sequences and test_sequences contain raw sequences rather than filepaths to sequences.
Use the prep_enabled flag to control whether prep() is executed during training. Shuffling does not happen if prep_enabled=False.
- prep(*args, **kwargs)
- ia.gaius.data_ops.atoi(text: str)
Attempt to convert string to int
- ia.gaius.data_ops.natural_keys(text: str)
alist.sort(key=natural_keys) sorts in human order. See http://nedbatchelder.com/blog/200712/human_sorting.html (Toothy’s implementation in the comments).
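To show what "human order" means in practice, here is a self-contained sketch of the recipe commonly attributed to the linked blog post. The two functions below are local stand-ins written for illustration; whether ia.gaius.data_ops implements atoi and natural_keys exactly this way is an assumption.

```python
import re


def atoi(text: str):
    # Digit-only chunks become ints; everything else stays a string.
    return int(text) if text.isdigit() else text


def natural_keys(text: str):
    # Split on runs of digits, keeping them via the capturing group,
    # so "file10" becomes ["file", 10, ""].
    return [atoi(chunk) for chunk in re.split(r"(\d+)", text)]


names = ["file10.json", "file2.json", "file1.json"]

plain = sorted(names)                    # lexicographic: "file10" sorts before "file2"
human = sorted(names, key=natural_keys)  # numeric chunks compared as integers

print(plain)  # ['file1.json', 'file10.json', 'file2.json']
print(human)  # ['file1.json', 'file2.json', 'file10.json']
```

The key function works because Python compares the resulting lists element by element, so the integer 2 sorts before the integer 10 even though the string "10" sorts before "2".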
- ia.gaius.data_ops.raw_in_count(filename: str)
- ia.gaius.data_ops.validate_data(data: dict)
Validates whether the data is in the correct GAIuS-digestible format. Returns True if the data validates. Returns False if the data does not validate.
- Parameters:
data (dict, required) – GDF to validate
Example
>>> gdf = {'strings': ["hello"], 'vectors': [], 'emotives': {}}
>>> validate_data(gdf)
True
>>> bad_gdf = {'strings': []}
>>> validate_data(bad_gdf)
Exception: Dictionary requires "vectors", "emotives", and "strings" as keys!
>>> bad_gdf_2 = {'strings': ["hello"], 'vectors': [], 'emotives': ['utility|5']}
>>> validate_data(bad_gdf_2)
Exception: "emotives" must be a dict of <str, float>. Dict not provided!
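Because the example above shows malformed input raising an exception, callers may want to guard the call. The sketch below uses a hypothetical stand-in, validate_gdf_sketch, that reproduces only the two checks visible in the example (required keys, and emotives being a dict). It is not the library's implementation, and the real validate_data may check more.

```python
def validate_gdf_sketch(data: dict) -> bool:
    """Hypothetical stand-in covering only the two checks shown in the example."""
    required = {"strings", "vectors", "emotives"}
    if not required.issubset(data):
        raise Exception('Dictionary requires "vectors", "emotives", and "strings" as keys!')
    if not isinstance(data["emotives"], dict):
        raise Exception('"emotives" must be a dict of <str, float>. Dict not provided!')
    return True


def is_valid(data: dict) -> bool:
    # Convert the raised exception into a plain boolean for callers
    # that only need a yes/no answer.
    try:
        return validate_gdf_sketch(data)
    except Exception:
        return False


good = {"strings": ["hello"], "vectors": [], "emotives": {}}
missing_keys = {"strings": []}
bad_emotives = {"strings": ["hello"], "vectors": [], "emotives": ["utility|5"]}

print(is_valid(good), is_valid(missing_keys), is_valid(bad_emotives))  # True False False
```

The same try/except wrapper could be placed around the real validate_data call when filtering a batch of records.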