Data Ops 

Classes and functions to aid in random testing and training set generation

Data 

class ia.gaius.data_ops.Data(data_directories=None, dataset=None)

Bases: object

__init__(data_directories=None, dataset=None): Supply either a list of data_directories, or a dataset.

prep(percent_of_dataset_chosen: float, percent_reserved_for_training: float, shuffle: bool = False)

DataRecords 

class ia.gaius.data_ops.DataRecords(original_dataset: str | Path, DR: float, DF: float, shuffle: bool, folder: bool = True)

Bases: object

Splits data into random sets for training and testing.

__init__(original_dataset: str | Path, DR: float, DF: float, shuffle: bool, folder: bool = True)

Parameters:

original_dataset (str or Path, required) – location of the dataset to use for the training and testing sets
DR (float, required) – percentage of the total data to use for training and testing combined. 0 < DR < 100
DF (float, required) – percentage of the DR portion to use for training. The rest of the DR portion is used for testing. 0 < DF < 100
shuffle (bool, required) – whether to shuffle the data when creating the sets
folder (bool, optional) – set to True if the original dataset is a folder

After creating an instance, use the member variables train_sequences and test_sequences to access the data sets

Variables:

train_sequences – the files to use for training
test_sequences – the files to use for testing
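
The relationship between DR and DF can be illustrated with a small standalone sketch. It does not use ia.gaius and is not the library's implementation: the function split_records, its seed argument, and the rounding-down behaviour are assumptions made purely for illustration.

```python
import random


def split_records(items, dr, df, shuffle=False, seed=None):
    """Hypothetical helper: pick dr% of items, then use df% of those for training."""
    if not (0 < dr < 100) or not (0 < df < 100):
        raise ValueError("dr and df must be percentages strictly between 0 and 100")
    pool = list(items)
    if shuffle:
        # A seeded generator keeps the shuffle reproducible (assumption, not library behaviour).
        random.Random(seed).shuffle(pool)
    n_chosen = int(len(pool) * dr / 100)  # size of the DR portion, rounded down
    n_train = int(n_chosen * df / 100)    # size of the training share, rounded down
    chosen = pool[:n_chosen]
    return chosen[:n_train], chosen[n_train:]


files = [f"seq_{i}.txt" for i in range(10)]
train, test = split_records(files, dr=80, df=75)
print(len(train), len(test))
```

With 10 files, DR=80 selects 8 of them, and DF=75 assigns 6 of those to training and the remaining 2 to testing; the 2 files outside the DR portion are not used at all.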

data_ops module 

This module includes the classes above in addition to the following:

class ia.gaius.data_ops.PreparedData(data_directories=None, dataset=None, prep_enabled: bool = False)

Bases: Data

Subclass of Data that signifies that train_sequences and test_sequences contain raw sequences rather than file paths to sequences

Use the prep_enabled flag to control whether prep() is executed during training. No shuffling happens if prep_enabled=False

prep(*args, **kwargs)

ia.gaius.data_ops.atoi(text: str): Attempt to convert string to int

ia.gaius.data_ops.natural_keys(text: str): Sort key for human (natural) ordering. alist.sort(key=natural_keys) sorts in human order. See http://nedbatchelder.com/blog/200712/human_sorting.html (Toothy’s implementation in the comments)
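
A minimal sketch of the human-sorting recipe referenced above. The two functions are local stand-ins written from the commonly cited form of that recipe; they share names with the library functions for readability, but the library's actual source may differ.

```python
import re


def atoi(text):
    # Local stand-in: convert digit-only chunks to int, leave other chunks as str.
    return int(text) if text.isdigit() else text


def natural_keys(text):
    # Split into digit and non-digit chunks so that numbers compare numerically.
    return [atoi(chunk) for chunk in re.split(r"(\d+)", text)]


names = ["file10.txt", "file2.txt", "file1.txt"]
plain = sorted(names)                      # lexicographic: "file10" sorts before "file2"
natural = sorted(names, key=natural_keys)  # human order: 1, 2, 10
print(plain)
print(natural)
```

Plain string sorting compares character by character, which is why numbered sequence files usually need a key like this to be processed in the intended order.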

ia.gaius.data_ops.raw_in_count(filename: str)

ia.gaius.data_ops.validate_data(data: dict)

Validates whether the data is in the correct GAIuS-digestible format. Returns True if the data validates. Returns False if the data does not validate.

Parameters: data (dict, required) – GDF to validate

Example

>>> gdf = {'strings':["hello"], 'vectors': [], 'emotives': {} }
>>> validate_data(gdf)
True
>>> bad_gdf = {'strings': []}
>>> validate_data(bad_gdf)
Exception: Dictionary requires "vectors", "emotives", and "strings" as keys!
>>> bad_gdf_2 = {'strings':["hello"], 'vectors': [], 'emotives': ['utility|5'] }
>>> validate_data(bad_gdf_2)
Exception: "emotives" must be a dict of <str, float>. Dict not provided!
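
The structure implied by the example above can be checked with a short standalone sketch. The function looks_like_gdf is hypothetical and is not the library's validate_data: it returns False instead of raising, and it assumes that "strings" and "vectors" must be lists, which the example suggests but the documentation does not state.

```python
def looks_like_gdf(data):
    """Hypothetical check mirroring the rules shown in the validate_data example."""
    required = {"strings", "vectors", "emotives"}
    if not isinstance(data, dict) or not required.issubset(data):
        return False
    # Assumption: strings and vectors are lists, as in the example GDF.
    if not isinstance(data["strings"], list) or not isinstance(data["vectors"], list):
        return False
    emotives = data["emotives"]
    if not isinstance(emotives, dict):
        return False
    # Emotives map str names to numeric values; bool is excluded on purpose.
    return all(
        isinstance(name, str)
        and isinstance(value, (int, float))
        and not isinstance(value, bool)
        for name, value in emotives.items()
    )


good = looks_like_gdf({"strings": ["hello"], "vectors": [], "emotives": {}})
missing_keys = looks_like_gdf({"strings": []})
bad_emotives = looks_like_gdf({"strings": ["hello"], "vectors": [], "emotives": ["utility|5"]})
print(good, missing_keys, bad_emotives)
```

The three calls reuse the inputs from the example: the first passes, the second lacks required keys, and the third supplies emotives as a list rather than a dict.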
