alex.components.nlg.tectotpl.tool.ml package
Submodules
alex.components.nlg.tectotpl.tool.ml.dataset module
Data set representation with support for ARFF input.
class alex.components.nlg.tectotpl.tool.ml.dataset.Attribute(name, type_spec)
    Bases: object

    This represents an attribute of the data set.
    get_arff_type()
        Return the ARFF type of the given attribute (numeric, string, or a list of values for nominal attributes).
    num_values
        Return the number of distinct values found in this attribute. Returns -1 for numeric attributes, where the number of values is not known.
    numeric_value(value)
        Return a numeric representation of the given value. Raise a ValueError if the given value does not conform to the attribute type.
    soft_numeric_value(value, add_values)
        Same as numeric_value(), but does not raise exceptions for unknown numeric/string values. Depending on the add_values setting, it either adds the value to the list of known values or returns NaN.
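The difference can be illustrated with a standalone sketch of the nominal-value lookup. This is not the actual Attribute implementation; the value-to-index encoding is an assumption about how nominal attributes are commonly handled:

```python
import math

def soft_numeric_value(values, value, add_values):
    """Map a nominal value to its index; handle unknown values softly.

    `values` is a list of known nominal values. Unknown values are either
    appended (add_values=True) or mapped to NaN (add_values=False),
    instead of raising a ValueError.
    """
    if value in values:
        return float(values.index(value))
    if add_values:
        values.append(value)
        return float(len(values) - 1)
    return float('nan')

colors = ['red', 'green']
assert soft_numeric_value(colors, 'green', add_values=False) == 1.0
assert math.isnan(soft_numeric_value(colors, 'blue', add_values=False))
assert soft_numeric_value(colors, 'blue', add_values=True) == 2.0  # 'blue' is now known
```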
class alex.components.nlg.tectotpl.tool.ml.dataset.DataSet
    Bases: object

    ARFF relation data representation.
    DENSE_FIELD = u'([^"\\\'][^,]*|\\\'[^\\\']*(\\\\\\\'[^\\\']*)*(?<!\\\\)\\\'|"[^"]*(\\\\"[^"]*)*(?<!\\\\)"),'
    SPARSE_FIELD = u'([0-9]+)\\s+([^"\\\'\\s][^,]*|\\\'[^\\\']*(\\\\\\\'[^\\\']*)*\\\'|"[^"]*(\\\\"[^"]*)*"),'
    SPEC_CHARS = u'[\\n\\r\\\'"\\\\\\t%]'
    add_attrib(attrib, values=None)
        Add a new attribute to the data set, with pre-filled values (or missing values, if not set).
    append(other)
        Append the instances of another data set to this one. The attributes of both data sets must be compatible (of the same types).
    as_bunch(target, mask_attrib=[], select_attrib=[])
        Return the data as a scikit-learn Bunch object. The target parameter specifies the class attribute.
    as_dict(mask_attrib=[], select_attrib=[])
        Return the data as a list of dictionaries, which is useful as input to a DictVectorizer.

        Attributes (given by name or index) listed in mask_attrib are not added to the dictionaries; missing values are also skipped. If mask_attrib is not set but select_attrib is, only the attributes listed in select_attrib are added.
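The list-of-dictionaries format matches what scikit-learn's DictVectorizer expects. The following pure-Python sketch mimics that vectorization step on made-up instance data (no actual DataSet or scikit-learn involved):

```python
def vectorize(instances):
    """One-hot encode string features and pass numeric ones through,
    roughly what sklearn's DictVectorizer does with a list of dicts."""
    # Collect feature names: 'attr=value' for strings, plain 'attr' for numbers.
    names = sorted({
        '%s=%s' % (k, v) if isinstance(v, str) else k
        for inst in instances for k, v in inst.items()
    })
    index = {name: i for i, name in enumerate(names)}
    rows = []
    for inst in instances:
        row = [0.0] * len(names)
        for k, v in inst.items():
            if isinstance(v, str):
                row[index['%s=%s' % (k, v)]] = 1.0   # one-hot slot
            else:
                row[index[k]] = float(v)             # numeric pass-through
        rows.append(row)
    return names, rows

# Hypothetical instances, as as_dict() might produce them:
data = [{'pos': 'noun', 'len': 3}, {'pos': 'verb', 'len': 5}]
names, rows = vectorize(data)
```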
    attrib_as_vect(attrib, dtype=None)
        Return the specified attribute (by index or name) as a list of values. If dtype is left at its default, the type of the returned values depends on the attribute type (strings for nominal or string attributes, floats for numeric ones); set dtype to int or float to override this.
    attrib_index(attrib_name)
        Given an attribute name, return its number; given a number, return that number unchanged. Return -1 on failure.
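The name/number resolution can be sketched as follows (a hypothetical standalone helper, not the actual method):

```python
def attrib_index(attr_names, attrib):
    """Resolve an attribute name to its position; pass numbers through.

    Mirrors the documented behaviour: a name yields its index, a number
    is returned unchanged, and -1 signals failure.
    """
    if isinstance(attrib, int):
        return attrib
    try:
        return attr_names.index(attrib)
    except ValueError:
        return -1

cols = ['word', 'lemma', 'tag']
assert attrib_index(cols, 'lemma') == 1
assert attrib_index(cols, 2) == 2
assert attrib_index(cols, 'missing') == -1
```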
    delete_attrib(attribs)
        Delete the given attributes from the data set. Accepts a list of names or indexes, a single name, or a single index.
    filter(filter_func, keep_copy=True)
        Filter the data set using a filtering function and return the filtered data set.

        The filtering function must take two arguments, the current instance index and the instance itself as an attribute-value dictionary, and return a boolean.

        If keep_copy is set to False, the filtered instances are removed from the original data set.
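A minimal sketch of the filtering semantics over plain dictionaries, assuming (my reading of keep_copy=False) that the instances placed in the filtered set are the ones removed from the original:

```python
def filter_instances(instances, filter_func, keep_copy=True):
    """Keep instances for which filter_func(index, instance) is True.

    With keep_copy=False, the kept instances are also removed from the
    original list (an assumption about the documented in-place behaviour).
    """
    kept = [inst for i, inst in enumerate(instances) if filter_func(i, inst)]
    if not keep_copy:
        for inst in kept:
            instances.remove(inst)
    return kept

data = [{'len': 2}, {'len': 7}, {'len': 4}]
long_ones = filter_instances(data, lambda i, inst: inst['len'] > 3)
assert long_ones == [{'len': 7}, {'len': 4}]
assert len(data) == 3   # keep_copy=True leaves the original intact
```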
    get_headers()
        Return a copy of the headers of this data set (just the attribute list, relation name, and sparse/dense setting).
    instance(index, dtype=u'dict', do_copy=True)
        Return the given instance as a dictionary (or a list, if specified).

        If do_copy is set to False, no copy of the list is created for dense instances (other types must be copied anyway).
    is_empty
        Return True if the data structures are empty.
    load_from_arff(filename, encoding=u'UTF-8')
        Load an ARFF file or stream, filling in the data structures.
    load_from_dict(data, attrib_types={})
        Fill in values from a list of dictionaries (one per instance). Attributes are assumed to be of string type unless specified otherwise in attrib_types. Currently only capable of creating dense data sets.
    load_from_vect(attrib, vect)
        Fill in values from a vector of values and an attribute (allowing new values to be added for nominal attributes).
    match_headers(other, add_values=False)
        Force this data set to have headers equal to those of the other data set. This handles differing values of nominal/numeric attributes (numeric values stay the same; values unknown in the other data set are set to NaN). In other cases, such as a differing number or type of attributes, an exception is raised.
    merge(other)
        Merge two DataSet objects. The lists of attributes are concatenated. The two data sets must have the same number of instances and must be either both sparse or both dense.

        Instance weights are left unchanged (taken from this data set).
    rename_attrib(old_name, new_name)
        Rename an attribute of this data set (found by its original name or by index).
    separate_attrib(attribs)
        Delete the given attributes from the data set and return them as a new, separate data set. Accepts a list of names or indexes, a single name, or a single index.
    split(split_func, keep_copy=True)
        Split the data set using a splitting function. Return a dictionary whose keys are the distinct return values of the splitting function and whose values are data sets containing the instances that yielded those return values.

        The splitting function takes two arguments, the current instance index and the instance itself as an attribute-value dictionary; its return value determines the split.

        If keep_copy is set to False, ALL instances are removed from the original data set.
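The grouping behaviour can be sketched with plain dictionaries and lists standing in for DataSet objects:

```python
def split_instances(instances, split_func):
    """Group instances by the return value of split_func(index, instance)."""
    groups = {}
    for i, inst in enumerate(instances):
        groups.setdefault(split_func(i, inst), []).append(inst)
    return groups

# Hypothetical instances, split by their 'tag' attribute:
data = [{'tag': 'N'}, {'tag': 'V'}, {'tag': 'N'}]
by_tag = split_instances(data, lambda i, inst: inst['tag'])
assert sorted(by_tag) == ['N', 'V']
assert len(by_tag['N']) == 2
```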
    subset(*args, **kwargs)
        Return a data set representing a subset of this data set's values.

        The positional arguments can be a ready-made slice, or [start, ] stop [, stride] values from which a slice is created. No arguments result in a complete copy of the original.

        The keyword arguments may contain just one value: if copy is set to False, the sliced values are removed from the original data set.
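The slice handling described above can be sketched as follows (a standalone approximation over a plain list; the real method returns a DataSet):

```python
def subset(instances, *args, **kwargs):
    """Return a slice of the instances.

    Args follow the documented convention: a ready-made slice, or
    [start,] stop [, stride] values; no args means a full copy.
    With copy=False, the sliced values are removed from the original.
    """
    if not args:
        sl = slice(None)
    elif isinstance(args[0], slice):
        sl = args[0]
    else:
        sl = slice(*args)
    picked = instances[sl]
    if not kwargs.get('copy', True):
        del instances[sl]
    return picked

nums = list(range(10))
first_three = subset(nums, 3)        # like nums[:3]
evens = subset(nums, 0, 10, 2)       # start, stop, stride
assert first_three == [0, 1, 2]
assert evens == [0, 2, 4, 6, 8]
```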
alex.components.nlg.tectotpl.tool.ml.model module
class alex.components.nlg.tectotpl.tool.ml.model.AbstractModel(config)
    Bases: object

    Abstract ancestor of the different model classes.
    check_classification_input(instances)
        Check the classification input data format, converting it to a list if needed.
    evaluate(test_file, encoding=u'UTF-8', classif_file=None)
        Evaluate on the given test data file and return the accuracy. If classif_file is set, save the classification results to this file.
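Accuracy here is the usual fraction of correctly classified instances; a minimal sketch (the gold/predicted labels are made up for illustration):

```python
def accuracy(gold, predicted):
    """Fraction of instances where the predicted class matches the gold one."""
    assert len(gold) == len(predicted)
    hits = sum(1 for g, p in zip(gold, predicted) if g == p)
    return hits / float(len(gold))

gold = ['N', 'V', 'N', 'A']
pred = ['N', 'N', 'N', 'A']
acc = accuracy(gold, pred)   # 3 of 4 correct
assert acc == 0.75
```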
    get_classes(data, dtype=<type 'int'>)
        Return a vector of class values from the given DataSet. If dtype is int, the integer values are returned; if dtype is None, the string values are returned.
    static load_from_file(model_file)
        Load the model from a pickle file or stream (supports gzip compression).
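Supporting both plain and gzip-compressed pickles can be done along these lines (a generic stdlib sketch; the actual method's internals may differ):

```python
import gzip
import os
import pickle
import tempfile

def load_from_file(model_file):
    """Load a pickled object, transparently handling gzip compression.

    Gzip files start with the magic bytes 0x1f 0x8b, so we peek at the
    header to decide how to open the file.
    """
    with open(model_file, 'rb') as fh:
        magic = fh.read(2)
    opener = gzip.open if magic == b'\x1f\x8b' else open
    with opener(model_file, 'rb') as fh:
        return pickle.load(fh)

# Round-trip demo with a throwaway compressed pickle:
tmp = tempfile.NamedTemporaryFile(suffix='.pickle.gz', delete=False)
tmp.close()
with gzip.open(tmp.name, 'wb') as fh:
    pickle.dump({'weights': [0.1, 0.2]}, fh)
model = load_from_file(tmp.name)
os.unlink(tmp.name)
```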
class alex.components.nlg.tectotpl.tool.ml.model.Model(config)
    Bases: alex.components.nlg.tectotpl.tool.ml.model.AbstractModel
    PREDICTED = u'PREDICTED'
    construct_classifier(cfg)
        Given the configuration, construct the classifier (based on the 'classifier' or 'classifier_class'/'classifier_params' settings). Defaults to a DummyClassifier.
    static create_training_job(config, work_dir, train_file, name=None, memory=8, encoding=u'UTF-8')
        Submit a training process to the cluster which will save the model to a pickle. Return the submitted job and the future location of the model pickle. train_file cannot be a stream; it must be an actual file.
class alex.components.nlg.tectotpl.tool.ml.model.SplitModel(config)
    Bases: alex.components.nlg.tectotpl.tool.ml.model.AbstractModel

    A model that is actually composed of several Model objects.