alex.components.nlg.tectotpl.tool.ml package

Submodules

alex.components.nlg.tectotpl.tool.ml.dataset module

Data set representation with ARFF input support.

class alex.components.nlg.tectotpl.tool.ml.dataset.Attribute(name, type_spec)[source]

Bases: object

This represents an attribute of the data set.

get_arff_type()[source]

Return the ARFF type of this attribute (numeric, string, or a list of values for nominal attributes).

num_values

Return the number of distinct values found in this attribute. Returns -1 for numeric attributes where the number of values is not known.

numeric_value(value)[source]

Return a numeric representation of the given value. Raise a ValueError if the given value does not conform to the attribute type.

soft_numeric_value(value, add_values)[source]

Same as numeric_value(), but does not raise exceptions for unknown numeric/string values: it either adds the value to the attribute's list of values or returns NaN, depending on the add_values setting.

value(numeric_val)[source]

Given a numeric (int/float) value, return the corresponding string value for string or nominal attributes, or the identical value for numeric attributes. Return None for missing nominal/string values and NaN for missing numeric values.

values_set()[source]

Return a set of all possible values for this attribute.
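
A minimal usage sketch of the numeric_value()/value() round-trip; the type_spec formats shown ('numeric' and a brace-delimited value list, mirroring ARFF declarations) are assumptions, not confirmed by this documentation:

    from alex.components.nlg.tectotpl.tool.ml.dataset import Attribute

    # Assumed type_spec formats, modeled on ARFF attribute declarations.
    length = Attribute('length', 'numeric')
    pos = Attribute('pos', '{NOUN,VERB,ADJ}')

    num = pos.numeric_value('VERB')    # numeric code of a nominal value
    orig = pos.value(num)              # maps back to 'VERB'
    # length.numeric_value('abc')      # would raise ValueError: not numeric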

class alex.components.nlg.tectotpl.tool.ml.dataset.DataSet[source]

Bases: object

ARFF relation data representation.

DENSE_FIELD = u'([^"\\\'][^,]*|\\\'[^\\\']*(\\\\\\\'[^\\\']*)*(?<!\\\\)\\\'|"[^"]*(\\\\"[^"]*)*(?<!\\\\)"),'
SPARSE_FIELD = u'([0-9]+)\\s+([^"\\\'\\s][^,]*|\\\'[^\\\']*(\\\\\\\'[^\\\']*)*\\\'|"[^"]*(\\\\"[^"]*)*"),'
SPEC_CHARS = u'[\\n\\r\\\'"\\\\\\t%]'
add_attrib(attrib, values=None)[source]

Add a new attribute to the data set, with pre-filled values (or missing, if not set).

append(other)[source]

Append the instances of another data set to this one. The attributes of both data sets must be compatible (of the same types).

as_bunch(target, mask_attrib=[], select_attrib=[])[source]

Return the data as a scikit-learn Bunch object. The target parameter specifies the class attribute.

as_dict(mask_attrib=[], select_attrib=[])[source]

Return the data as a list of dictionaries, which is useful as an input to DictVectorizer.

Attributes (given by name or index) listed in mask_attrib are not added to the dictionary, and neither are missing values. If mask_attrib is not set but select_attrib is, only the attributes listed in select_attrib are added to the dictionary.
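
For illustration, a sketch of feeding the result to scikit-learn's DictVectorizer ('train.arff' and the 'class' attribute name are hypothetical):

    from sklearn.feature_extraction import DictVectorizer
    from alex.components.nlg.tectotpl.tool.ml.dataset import DataSet

    ds = DataSet()
    ds.load_from_arff('train.arff')            # hypothetical file name

    dicts = ds.as_dict(mask_attrib=['class'])  # hide the class attribute
    X = DictVectorizer().fit_transform(dicts)  # sparse feature matrix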

attrib_as_vect(attrib, dtype=None)[source]

Return the specified attribute (by index or name) as a list of values. If the data type parameter is left as default, the type of the returned values depends on the attribute type (strings for nominal or string attributes, floats for numeric ones). Set the data type parameter to int or float to override the data type.

attrib_index(attrib_name)[source]

Given an attribute name, return its index; given a number, return it unchanged. Return -1 on failure.

delete_attrib(attribs)[source]

Given a list of attributes, delete them from the data set. Accepts a list of names or indexes, or one name, or one index.

filter(filter_func, keep_copy=True)[source]

Filter the data set using a filtering function and return a filtered data set.

The filtering function must take two arguments - current instance index and the instance itself in an attribute-value dictionary form - and return a boolean.

If keep_copy is set to False, filtered instances will be removed from the original data set.
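
A sketch of the documented signature, keeping instances whose (hypothetical) 'length' attribute is below 10; ds is a loaded DataSet as in the sketch above:

    # keep_copy=True (the default) leaves the original data set untouched.
    short_ds = ds.filter(lambda idx, inst: inst.get('length', 0) < 10)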

get_attrib(attrib)[source]

Given an attribute name or index, return the Attribute object.

get_headers()[source]

Return a copy of the headers of this data set (just the attribute list, relation name, and sparse/dense setting).

instance(index, dtype=u'dict', do_copy=True)[source]

Return the given instance as a dictionary (or a list, if specified).

If do_copy is set to False, do not create a copy of the list for dense instances (other types must be copied anyway).

is_empty

Return True if the data structures are empty.

load_from_arff(filename, encoding=u'UTF-8')[source]

Load an ARFF file/stream, filling the data structures.

load_from_dict(data, attrib_types={})[source]

Fill in values from a list of dictionaries (=instances). Attributes are assumed to be of string type unless specified otherwise in the attrib_types variable. Currently only capable of creating dense data sets.
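
A sketch of building a dense data set from dictionaries; the 'numeric' type string in attrib_types is an assumption mirroring ARFF type names:

    from alex.components.nlg.tectotpl.tool.ml.dataset import DataSet

    data = [{'word': 'hello', 'length': 5},
            {'word': 'world', 'length': 5}]
    ds = DataSet()
    # Unlisted attributes default to string type, as documented above.
    ds.load_from_dict(data, attrib_types={'length': 'numeric'})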

load_from_matrix(attr_list, matrix)[source]

Fill in values from a matrix.

load_from_vect(attrib, vect)[source]

Fill in values from a vector of values and an attribute (allow adding values for nominal attributes).

match_headers(other, add_values=False)[source]

Force this data set to have the same headers as the other data set. This handles differing values of nominal/numeric attributes (numeric values stay the same; values unknown in the other data set are set to NaN). In other cases, such as a differing number or type of attributes, an exception is raised.

merge(other)[source]

Merge two DataSet objects. The list of attributes will be concatenated. The two data sets must have the same number of instances and be either both sparse or both non-sparse.

Instance weights are left unchanged (from this data set).
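
To contrast the two row/column operations (a sketch; ds_a, ds_b and ds_c are compatible DataSet objects as described above):

    ds_a.append(ds_b)  # adds ds_b's instances (rows); attribute types must match
    ds_a.merge(ds_c)   # adds ds_c's attributes (columns); instance counts must match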

rename_attrib(old_name, new_name)[source]

Rename an attribute of this data set (find it by original name or by index).

save_to_arff(filename, encoding=u'UTF-8')[source]

Save the data set to an ARFF file.

separate_attrib(attribs)[source]

Given a list of attributes, delete them from the data set and return them as a new separate data set. Accepts a list of names or indexes, or one name, or one index.
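
A typical use is detaching the class attribute before vectorizing the remaining features (the 'class' attribute name is hypothetical):

    labels = ds.separate_attrib('class')  # 'class' leaves ds, returned as a new DataSet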

split(split_func, keep_copy=True)[source]

Split the data set using a splitting function and return a dictionary whose keys are the distinct return values of the splitting function and whose values are data sets containing the instances that produced the respective key.

The splitting function takes two arguments - the current instance index and the instance itself as an attribute-value dictionary. Its return value determines the split.

If keep_copy is set to False, ALL instances will be removed from the original data set.
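
A sketch, splitting on a hypothetical 'pos' attribute:

    # The keys of the result are the splitting function's return values.
    by_pos = ds.split(lambda idx, inst: inst['pos'])
    nouns = by_pos.get('NOUN')  # DataSet of instances where pos == 'NOUN'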

subset(*args, **kwargs)[source]

Return a data set representing a subset of this data set’s values.

Args can be a slice or [start, ] stop [, stride] to create a slice. No arguments result in a complete copy of the original.

Kwargs may contain just one key, copy: if copy is set to False, the sliced values are removed from the original data set.
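
For example (a sketch over a loaded DataSet ds):

    head = ds.subset(100)      # first 100 instances, as a copy
    mid = ds.subset(100, 200)  # instances 100 through 199
    # With copy=False, the sliced values would be removed from ds.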

value(instance, attr_idx)[source]

Return the value of the given instance and attribute.

class alex.components.nlg.tectotpl.tool.ml.dataset.DataSetIterator(dataset)[source]

Bases: object

An iterator over the instances of a data set.

next()[source]

Move to the next instance.
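
The Python 2 style next() suggests DataSetIterator backs DataSet's iteration protocol; assuming iter(ds) returns a DataSetIterator, iteration would look like:

    # A sketch; assumes DataSet.__iter__ yields a DataSetIterator.
    for inst in ds:
        print(inst)  # one instance per pass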

alex.components.nlg.tectotpl.tool.ml.model module

class alex.components.nlg.tectotpl.tool.ml.model.AbstractModel(config)[source]

Bases: object

Abstract ancestor of the different model classes.

check_classification_input(instances)[source]

Check classification input data format, convert to list if needed.

classify(instances)[source]

This must be implemented in derived classes.

evaluate(test_file, encoding=u'UTF-8', classif_file=None)[source]

Evaluate on the given test data file. Return accuracy. If classif_file is set, save the classification results to this file.

get_classes(data, dtype=int)[source]

Return a vector of class values from the given DataSet. If dtype is int, the integer values are returned. If dtype is None, the string values are returned.

static load_from_file(model_file)[source]

Load the model from a pickle file or stream (supports GZip compression).

load_training_set(filename, encoding=u'UTF-8')[source]

Load the given training data set into memory and, if configured to do so via the train_part parameter, use only a part of it.

save_to_file(model_file)[source]

Save the model to a pickle file or stream (supports GZip compression).

class alex.components.nlg.tectotpl.tool.ml.model.Model(config)[source]

Bases: alex.components.nlg.tectotpl.tool.ml.model.AbstractModel

PREDICTED = u'PREDICTED'

classify(instances)[source]

Classify a set of instances (possibly a single one).

construct_classifier(cfg)[source]

Given the config, construct the classifier (based on the 'classifier' or 'classifier_class'/'classifier_params' settings). Defaults to DummyClassifier.

static create_training_job(config, work_dir, train_file, name=None, memory=8, encoding=u'UTF-8')[source]

Submit a training process to the cluster, which will save the model to a pickle. Return the submitted job and the future location of the model pickle. train_file cannot be a stream; it must be an actual file.

train(train_file, encoding=u'UTF-8')[source]

Train the model on the specified training data file.

train_on_data(train)[source]

Train model on the specified training data set (which must be a loaded DataSet object).
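
Putting the pieces together, a sketch of the training/classification cycle; the config keys follow the settings named above, but the value formats (a dotted class path and a parameter dict) are assumptions, and the file names are hypothetical:

    from alex.components.nlg.tectotpl.tool.ml.model import Model

    config = {'classifier_class': 'sklearn.linear_model.LogisticRegression',
              'classifier_params': {'C': 1.0}}  # assumed value formats
    model = Model(config)
    model.train('train.arff')                   # hypothetical training file
    acc = model.evaluate('test.arff')           # returns accuracy
    model.save_to_file('model.pickle.gz')       # GZip-compressed pickle
    model2 = Model.load_from_file('model.pickle.gz')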

class alex.components.nlg.tectotpl.tool.ml.model.SplitModel(config)[source]

Bases: alex.components.nlg.tectotpl.tool.ml.model.AbstractModel

A model that’s actually composed of several Model-s.

classify(instances)[source]

Classify a set of instances.

train(train_file, work_dir, memory=8, encoding=u'UTF-8')[source]

Read the training data, split it, and train the individual models (in cluster jobs).

Module contents