renn.data package

Submodules

renn.data.data_utils module

Data utils.

renn.data.data_utils.column_parser(text_column)
    Returns a parser that parses a row of a CSV file containing labeled
    data, extracting the label and the text.

    The parser assumes the label is the zeroth element of the row, and
    the text is the text_column element.
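For illustration, here is a minimal sketch of the behavior described above. This is an assumed pure-Python reimplementation, not the library source (the actual parser may operate on TensorFlow tensors rather than Python strings):

```python
import csv
import io

def column_parser(text_column):
    """Sketch: build a parser returning (label, text) from one CSV row."""
    def parse(row):
        # The label is assumed to be the zeroth element of the row;
        # the text lives at index text_column.
        fields = next(csv.reader(io.StringIO(row)))
        return fields[0], fields[text_column]
    return parse

parse = column_parser(2)
label, text = parse('3,ignored,"A great movie!"')
```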
renn.data.datasets module

Datasets.

renn.data.datasets.ag_news(split, vocab_file, sequence_length=100, batch_size=64, transform_fn=<function identity>, filter_fn=None, data_dir=None)
    Loads the AG News dataset.
renn.data.datasets.goemotions(split, vocab_file, sequence_length=50, batch_size=64, emotions=None, transform=<function identity>, filter_fn=None, data_dir=None)
    Loads the GoEmotions dataset.
renn.data.datasets.imdb(split, vocab_file, sequence_length=1000, batch_size=64, transform=<function identity>, filter_fn=None, data_dir=None)
    Loads the IMDB reviews dataset.
renn.data.datasets.snli(split, vocab_file, sequence_length=75, batch_size=64, transform=<function identity>, filter_fn=None, data_dir=None)
    Loads the SNLI dataset.
renn.data.datasets.mnist(split, order='row', batch_size=64, transform=<function identity>, filter_fn=None, data_dir=None, classes=None)
    Loads the serialized MNIST dataset.

    Parameters:
        classes – the subset of classes to keep. If None, all classes are kept.
renn.data.datasets.yelp(split, num_classes, vocab_file, sequence_length=1000, batch_size=64, transform=<function identity>, filter_fn=None, data_dir=None)
    Loads the Yelp reviews dataset.
renn.data.synthetic module

Synthetic Datasets.

class renn.data.synthetic.Unordered(num_classes=3, batch_size=64, length_sampler='Constant', sampler_params={'value': 40})
    Bases: object

    Synthetic dataset representing unordered classes, mimicking
    text-classification datasets like AG News (unlike, say, star
    prediction or sentiment analysis, which feature ordered classes).
renn.data.tokenizers module

Text processing.

renn.data.wordpiece_tokenizer_learner_lib module

Algorithm for learning a wordpiece vocabulary.
class renn.data.wordpiece_tokenizer_learner_lib.Params(upper_thresh, lower_thresh, num_iterations, max_input_tokens, max_token_length, max_unique_chars, vocab_size, slack_ratio, include_joiner_token, joiner, reserved_tokens)
    Bases: tuple

    upper_thresh
        Alias for field number 0
    lower_thresh
        Alias for field number 1
    num_iterations
        Alias for field number 2
    max_input_tokens
        Alias for field number 3
    max_token_length
        Alias for field number 4
    max_unique_chars
        Alias for field number 5
    vocab_size
        Alias for field number 6
    slack_ratio
        Alias for field number 7
    include_joiner_token
        Alias for field number 8
    joiner
        Alias for field number 9
    reserved_tokens
        Alias for field number 10
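Since Params is a namedtuple, it can be constructed with keyword arguments and its fields accessed by name or by the field numbers listed above. A sketch using a stand-in namedtuple with the documented field order (the concrete values below are illustrative assumptions, not library defaults):

```python
from collections import namedtuple

# Stand-in for renn's Params namedtuple, fields in documented order.
Params = namedtuple('Params', [
    'upper_thresh', 'lower_thresh', 'num_iterations', 'max_input_tokens',
    'max_token_length', 'max_unique_chars', 'vocab_size', 'slack_ratio',
    'include_joiner_token', 'joiner', 'reserved_tokens'])

params = Params(upper_thresh=10000000, lower_thresh=10, num_iterations=4,
                max_input_tokens=5000000, max_token_length=50,
                max_unique_chars=1000, vocab_size=8000, slack_ratio=0.05,
                include_joiner_token=True, joiner='##',
                reserved_tokens=['<pad>', '<unk>'])
```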
renn.data.wordpiece_tokenizer_learner_lib.ensure_all_tokens_exist(input_tokens, output_tokens, include_joiner_token, joiner)
    Adds all tokens in input_tokens to output_tokens if not already present.

    Parameters:
        input_tokens – set of strings (tokens) we want to include
        output_tokens – string-to-int dictionary mapping token to count
        include_joiner_token – bool, whether to include the joiner token
        joiner – string used to indicate suffixes

    Returns: string-to-int dictionary with all tokens in input_tokens included
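One plausible reading of this entry, sketched in plain Python (an assumed reimplementation, not the library source; in particular, the treatment of joiner-prefixed variants is an assumption suggested by the include_joiner_token and joiner parameters):

```python
def ensure_all_tokens_exist(input_tokens, output_tokens,
                            include_joiner_token, joiner):
    # Sketch: guarantee every requested token appears in the counts,
    # giving newly added tokens a count of 1.
    output = dict(output_tokens)
    for token in input_tokens:
        if token not in output:
            output[token] = 1
        if include_joiner_token:
            # Assumed behavior: also ensure the suffix (joiner-prefixed)
            # variant exists.
            joined = joiner + token
            if joined not in output:
                output[joined] = 1
    return output
```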
renn.data.wordpiece_tokenizer_learner_lib.extract_char_tokens(word_counts)
    Extracts all single-character tokens from word_counts.

    Parameters:
        word_counts – list of (string, int) tuples

    Returns: set of single-character strings contained within word_counts
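The described behavior amounts to collecting every character that appears in any counted word. A minimal sketch (assumed reimplementation):

```python
def extract_char_tokens(word_counts):
    # Sketch: gather every single character appearing in the words.
    seen = set()
    for word, _ in word_counts:
        seen.update(word)
    return seen
```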
renn.data.wordpiece_tokenizer_learner_lib.filter_input_words(all_counts, allowed_chars, max_input_tokens)
    Filters out words with disallowed characters and limits words to max_input_tokens.

    Parameters:
        all_counts – list of (string, int) tuples
        allowed_chars – list of single-character strings
        max_input_tokens – int, maximum number of tokens accepted as input

    Returns: list of (string, int) tuples of filtered wordcounts
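A sketch of this filtering step (assumed reimplementation; keeping the most frequent words when truncating to max_input_tokens is an assumption):

```python
def filter_input_words(all_counts, allowed_chars, max_input_tokens):
    # Sketch: drop words containing disallowed characters, then keep the
    # max_input_tokens most frequent of the remaining words.
    allowed = set(allowed_chars)
    filtered = [(w, c) for w, c in all_counts
                if all(ch in allowed for ch in w)]
    filtered.sort(key=lambda pair: pair[1], reverse=True)
    return filtered[:max_input_tokens]
```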
renn.data.wordpiece_tokenizer_learner_lib.generate_final_vocabulary(reserved_tokens, char_tokens, curr_tokens)
    Generates final vocab given reserved, single-character, and current tokens.

    Parameters:
        reserved_tokens – list of strings (tokens) that must be included in vocab
        char_tokens – set of single-character strings
        curr_tokens – string-to-int dict mapping token to count

    Returns: list of strings representing final vocabulary
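A sketch of how the three token sources might be merged (assumed reimplementation; the exact ordering of the final vocabulary is an assumption):

```python
def generate_final_vocabulary(reserved_tokens, char_tokens, curr_tokens):
    # Sketch: reserved tokens first, then sorted single characters, then
    # current tokens in decreasing count order, with duplicates removed.
    vocab = list(reserved_tokens)
    vocab.extend(sorted(char_tokens))
    vocab.extend(sorted(curr_tokens, key=curr_tokens.get, reverse=True))
    seen = set()
    return [t for t in vocab if not (t in seen or seen.add(t))]
```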
renn.data.wordpiece_tokenizer_learner_lib.get_allowed_chars(all_counts, max_unique_chars)
    Gets the top max_unique_chars characters within our wordcounts.

    We want each character to be in the vocabulary so that we can keep
    splitting down to the character level if necessary. However, in order
    not to inflate our vocabulary with rare characters, we only keep the
    top max_unique_chars characters.

    Parameters:
        all_counts – list of (string, int) tuples
        max_unique_chars – int, maximum number of unique single-character tokens

    Returns: set of strings containing top max_unique_chars characters in all_counts
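A sketch of the character-frequency cutoff described above (assumed reimplementation; weighting each character by its word's count is an assumption):

```python
from collections import Counter

def get_allowed_chars(all_counts, max_unique_chars):
    # Sketch: tally character frequencies weighted by word counts and
    # keep only the max_unique_chars most common characters.
    char_counts = Counter()
    for word, count in all_counts:
        for ch in word:
            char_counts[ch] += count
    return {ch for ch, _ in char_counts.most_common(max_unique_chars)}
```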
renn.data.wordpiece_tokenizer_learner_lib.get_input_words(word_counts, reserved_tokens, max_token_length)
    Filters out words that are longer than max_token_length or are reserved.

    Parameters:
        word_counts – list of (string, int) tuples
        reserved_tokens – list of strings
        max_token_length – int, maximum length of a token

    Returns: list of (string, int) tuples of filtered wordcounts
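This filter is a straightforward comprehension; a sketch (assumed reimplementation):

```python
def get_input_words(word_counts, reserved_tokens, max_token_length):
    # Sketch: drop reserved words and words exceeding max_token_length.
    reserved = set(reserved_tokens)
    return [(w, c) for w, c in word_counts
            if w not in reserved and len(w) <= max_token_length]
```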
renn.data.wordpiece_tokenizer_learner_lib.get_search_threshs(word_counts, upper_thresh, lower_thresh)
    Clips the thresholds for binary search based on current word counts.

    The upper threshold parameter typically has a large default value that
    can result in many iterations of unnecessary search. Thus we clip the
    upper and lower bounds of search to the maximum and the minimum
    wordcount values.

    Parameters:
        word_counts – list of (string, int) tuples
        upper_thresh – int, upper threshold for binary search
        lower_thresh – int, lower threshold for binary search

    Returns:
        upper_search – int, clipped upper threshold for binary search
        lower_search – int, clipped lower threshold for binary search
renn.data.wordpiece_tokenizer_learner_lib.get_split_indices(word, curr_tokens, include_joiner_token, joiner)
    Gets indices for valid substrings of word, for iterations > 0.

    For iterations > 0, rather than considering every possible substring,
    we only want to consider starting points corresponding to the start of
    wordpieces in the current vocabulary.

    Parameters:
        word – string we want to split into substrings
        curr_tokens – string-to-int dict of tokens in vocab (from previous iteration)
        include_joiner_token – bool, whether to include the joiner token
        joiner – string used to indicate suffixes

    Returns: list of ints containing valid starting indices for word
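One plausible greedy longest-match sketch of this splitting scheme (assumed reimplementation; in this reading, non-initial subtokens are matched with the joiner prefix, and None signals an unsplittable word):

```python
def get_split_indices(word, curr_tokens, include_joiner_token, joiner):
    # Sketch: greedily match the longest known wordpiece at each position;
    # each accepted end index is a valid split point for the word.
    indices = []
    start = 0
    while start < len(word):
        end = len(word)
        while end > start:
            subtoken = word[start:end]
            if include_joiner_token and start > 0:
                subtoken = joiner + subtoken  # suffix pieces carry the joiner
            if subtoken in curr_tokens:
                break
            end -= 1
        if end == start:
            return None  # no known wordpiece covers this position
        indices.append(end)
        start = end
    return indices
```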
renn.data.wordpiece_tokenizer_learner_lib.learn(word_counts, params)
    Takes in wordcounts and returns a wordpiece vocabulary.

    Parameters:
        word_counts – list of (string, int) tuples
        params – Params namedtuple, parameters for learning

    Returns: string, final vocabulary with each word separated by a newline
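Note the documented return type: a single newline-separated string rather than a list. A sketch of that final step, with a stand-in search_fn replacing the library's internal threshold search (an assumption for illustration):

```python
def learn(word_counts, params, search_fn):
    # Sketch: search_fn stands in for the library's vocabulary search;
    # the documented output is one newline-separated string.
    vocab = search_fn(word_counts, params)
    return '\n'.join(vocab)

result = learn([('a', 3)], None, lambda wc, p: ['<unk>', 'a'])
```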
renn.data.wordpiece_tokenizer_learner_lib.learn_binary_search(word_counts, lower, upper, params)
    Performs binary search to find the wordcount frequency threshold.

    Given upper and lower bounds and a list of (word, count) tuples,
    performs binary search to find the threshold closest to producing a
    vocabulary of size vocab_size.

    Parameters:
        word_counts – list of (string, int) tuples
        lower – int, lower bound for binary search
        upper – int, upper bound for binary search
        params – Params namedtuple, parameters for learning

    Returns: list of strings, vocab that is closest to target vocab_size
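The search exploits the fact that raising the count threshold shrinks the vocabulary. A sketch (assumed reimplementation; vocab_for_thresh stands in for learn_with_thresh, and params is shown as a plain dict for brevity where the library uses the Params namedtuple):

```python
def learn_binary_search(word_counts, lower, upper, params, vocab_for_thresh):
    # Sketch: binary-search the frequency threshold, tracking whichever
    # vocabulary lands closest to the target vocab_size.
    target = params['vocab_size']
    best = None
    while lower <= upper:
        mid = (lower + upper) // 2
        vocab = vocab_for_thresh(word_counts, mid)
        if best is None or abs(len(vocab) - target) < abs(len(best) - target):
            best = vocab
        if len(vocab) > target:
            lower = mid + 1   # too many tokens: raise the threshold
        else:
            upper = mid - 1   # too few tokens: lower the threshold
    return best
```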
renn.data.wordpiece_tokenizer_learner_lib.learn_with_thresh(word_counts, thresh, params)
    Wordpiece learning algorithm to produce a vocab given a frequency threshold.

    Parameters:
        word_counts – list of (string, int) tuples
        thresh – int, frequency threshold for a token to be included in the vocab
        params – Params namedtuple, parameters for learning

    Returns: list of strings, vocabulary generated for the given thresh