renn.data package

Submodules

renn.data.data_utils module

Data utils

renn.data.data_utils.column_parser(text_column)[source]

Returns a parser which parses a row of a CSV file containing labeled data, extracting the label and the text.

This parser assumes the label is the zeroth element of the row and the text is the ‘text_column’ element.

renn.data.data_utils.readfile(filename, parse_row)[source]

Reads a CSV file containing labeled data, where the function parse_row() extracts a score and text from each row.
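
A minimal usage sketch combining column_parser and readfile. The file name, the pre-split row format, and the (score, text) return structure are assumptions for illustration; only the two signatures above are taken from this page.

  from renn.data import data_utils

  # Build a parser for rows whose label is in column 0 and text in column 2
  # (the integer index is an assumption about how text_column is interpreted).
  parse_row = data_utils.column_parser(text_column=2)

  # Read a labeled CSV; whether readfile returns a list or a generator of
  # (score, text) pairs is assumed here.
  for score, text in data_utils.readfile('reviews.csv', parse_row):
      print(score, text[:40])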

renn.data.data_utils.sentiment_relabel(num_classes)[source]

Returns a function which relabels the (initially five-class) sentiment labels, used to build reduced-class versions of the Yelp and Amazon datasets.
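
A sketch of how the relabeling function might be used; the exact mapping from five-star labels to the reduced label set is the library's choice and is not documented here.

  from renn.data import data_utils

  # Collapse five-class sentiment labels into two classes.
  relabel = data_utils.sentiment_relabel(num_classes=2)

  # Assumed call pattern: map an original label to its new label
  # (illustrative only; e.g. a five-star review mapping to the positive class).
  new_label = relabel(5)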

renn.data.datasets module

Datasets.

renn.data.datasets.ag_news(split, vocab_file, sequence_length=100, batch_size=64, transform_fn=<function identity>, filter_fn=None, data_dir=None)[source]

Loads the AG News dataset.
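
A sketch of loading the training split. The split name, the vocab file, and the assumption that the loader yields batches of tokenized, padded sequences are illustrative; only the signature above is documented.

  from renn.data import datasets

  # Assumes 'vocab.txt' is an existing vocab file (e.g. produced by
  # renn.data.tokenizers.build_vocab).
  train_dset = datasets.ag_news('train',
                                vocab_file='vocab.txt',
                                sequence_length=100,
                                batch_size=64)

  # Assumed iteration pattern; the exact batch structure depends on the loader.
  batch = next(iter(train_dset))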

renn.data.datasets.goemotions(split, vocab_file, sequence_length=50, batch_size=64, emotions=None, transform=<function identity>, filter_fn=None, data_dir=None)[source]

Loads the GoEmotions dataset.

renn.data.datasets.imdb(split, vocab_file, sequence_length=1000, batch_size=64, transform=<function identity>, filter_fn=None, data_dir=None)[source]

Loads the IMDB reviews dataset.

renn.data.datasets.snli(split, vocab_file, sequence_length=75, batch_size=64, transform=<function identity>, filter_fn=None, data_dir=None)[source]

Loads the SNLI dataset.

renn.data.datasets.tokenize_fun(tokenizer)[source]

Standard text processing function.

renn.data.datasets.mnist(split, order='row', batch_size=64, transform=<function identity>, filter_fn=None, data_dir=None, classes=None)[source]

Loads the serialized MNIST dataset.

Parameters: classes – the subset of classes to keep. If None, all classes are kept.
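
A sketch of loading a row-ordered (sequential) MNIST split restricted to a subset of digits; the split name and the call pattern are assumptions based only on the signature above.

  from renn.data import datasets

  # Keep only digits 0-4; passing classes=None would keep all ten classes.
  mnist_train = datasets.mnist('train',
                               order='row',
                               batch_size=64,
                               classes=[0, 1, 2, 3, 4])
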
renn.data.datasets.yelp(split, num_classes, vocab_file, sequence_length=1000, batch_size=64, transform=<function identity>, filter_fn=None, data_dir=None)[source]

Loads the Yelp reviews dataset.
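
A sketch of loading Yelp reviews collapsed to two classes (compare data_utils.sentiment_relabel above); the split name and vocab file are illustrative.

  from renn.data import datasets

  yelp_train = datasets.yelp('train',
                             num_classes=2,
                             vocab_file='vocab.txt',
                             sequence_length=1000,
                             batch_size=64)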

renn.data.datasets.dbpedia(split, num_classes, vocab_file, sequence_length=1000, batch_size=64, transform=<function identity>, filter_fn=None, data_dir=None)[source]

Loads the DBpedia text classification dataset.

renn.data.datasets.amazon(split, num_classes, vocab_file, sequence_length=250, batch_size=64, transform=<function identity>, filter_fn=None, data_dir=None)[source]

Loads the Amazon reviews dataset.

renn.data.synthetic module

Synthetic Datasets.

class renn.data.synthetic.Unordered(num_classes=3, batch_size=64, length_sampler='Constant', sampler_params={'value': 40})[source]

Bases: object

Synthetic dataset representing unordered classes, mimicking text-classification datasets like AG News (unlike, say, star prediction or sentiment analysis, which feature ordered classes).

label_batch(batch)[source]

Calculates class labels for a batch of sentences.

score(sentence, length)[source]

Calculates the score, i.e. the amount of accumulated evidence in the sentence, for each class.
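
A sketch of constructing the synthetic dataset. Only the constructor arguments and the two methods above are documented here; how batches of sentences are drawn from the object is not shown, so the sketch stops at construction.

  from renn.data import synthetic

  # Three unordered classes, with every synthetic "sentence" of length 40.
  dset = synthetic.Unordered(num_classes=3,
                             batch_size=64,
                             length_sampler='Constant',
                             sampler_params={'value': 40})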

renn.data.tokenizers module

Text processing.

renn.data.tokenizers.build_vocab(corpus_generator, vocab_size, split_fun=<method 'split' of 'str' objects>)[source]

Builds a vocab file from a text generator.

renn.data.tokenizers.load_tokenizer(vocab_file, default_value=-1)[source]

Loads a tokenizer from a vocab file.
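
A sketch tying the two functions together: build a vocabulary from an in-memory corpus, write it to a file, and load a tokenizer from that file. Whether build_vocab returns the vocabulary as a list (as assumed below) or writes the file itself is not documented here, and the tokenizer's call interface is likewise an assumption.

  from renn.data import tokenizers

  corpus = iter(['a tiny corpus of text', 'just a few lines of it'])

  # Assumed to return the learned vocabulary as a list of tokens.
  vocab = tokenizers.build_vocab(corpus, vocab_size=100)

  with open('vocab.txt', 'w') as f:
      f.write('\n'.join(vocab))

  # Out-of-vocabulary tokens map to default_value (-1 by default).
  tokenizer = tokenizers.load_tokenizer('vocab.txt')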

renn.data.wordpiece_tokenizer_learner_lib module

Algorithm for learning wordpiece vocabulary.

class renn.data.wordpiece_tokenizer_learner_lib.Params(upper_thresh, lower_thresh, num_iterations, max_input_tokens, max_token_length, max_unique_chars, vocab_size, slack_ratio, include_joiner_token, joiner, reserved_tokens)

Bases: tuple

include_joiner_token

Alias for field number 8

joiner

Alias for field number 9

lower_thresh

Alias for field number 1

max_input_tokens

Alias for field number 3

max_token_length

Alias for field number 4

max_unique_chars

Alias for field number 5

num_iterations

Alias for field number 2

reserved_tokens

Alias for field number 10

slack_ratio

Alias for field number 7

upper_thresh

Alias for field number 0

vocab_size

Alias for field number 6

renn.data.wordpiece_tokenizer_learner_lib.ensure_all_tokens_exist(input_tokens, output_tokens, include_joiner_token, joiner)[source]

Adds all tokens in input_tokens to output_tokens if not already present.

Parameters:
  • input_tokens – set of strings (tokens) we want to include
  • output_tokens – string to int dictionary mapping token to count
  • include_joiner_token – bool whether to include joiner token
  • joiner – string used to indicate suffixes
Returns:

string to int dictionary with all tokens in input_tokens included
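
A small sketch with toy counts; the tokens and counts are illustrative.

  from renn.data import wordpiece_tokenizer_learner_lib as learner

  output_tokens = {'the': 120, 'cat': 40}
  input_tokens = {'t', 'h', 'e', 'c', 'a'}

  # Returns a dict that also contains every character in input_tokens
  # (and, with include_joiner_token=True, its joiner-prefixed form, e.g. '##a').
  merged = learner.ensure_all_tokens_exist(input_tokens,
                                           output_tokens,
                                           include_joiner_token=True,
                                           joiner='##')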

renn.data.wordpiece_tokenizer_learner_lib.extract_char_tokens(word_counts)[source]

Extracts all single-character tokens from word_counts.

Parameters:word_counts – list of (string, int) tuples
Returns:set of single-character strings contained within word_counts
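
A small sketch with toy counts:

  from renn.data import wordpiece_tokenizer_learner_lib as learner

  word_counts = [('the', 120), ('cat', 40)]

  # Expected result: the set {'t', 'h', 'e', 'c', 'a'}.
  chars = learner.extract_char_tokens(word_counts)
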
renn.data.wordpiece_tokenizer_learner_lib.filter_input_words(all_counts, allowed_chars, max_input_tokens)[source]

Filters out words with disallowed characters and limits the number of words to max_input_tokens.

Parameters:
  • all_counts – list of (string, int) tuples
  • allowed_chars – list of single-character strings
  • max_input_tokens – int, maximum number of tokens accepted as input
Returns:

list of (string, int) tuples of filtered wordcounts

renn.data.wordpiece_tokenizer_learner_lib.generate_final_vocabulary(reserved_tokens, char_tokens, curr_tokens)[source]

Generates final vocab given reserved, single-character, and current tokens.

Parameters:
  • reserved_tokens – list of strings (tokens) that must be included in vocab
  • char_tokens – set of single-character strings
  • curr_tokens – string to int dict mapping token to count
Returns:

list of strings representing final vocabulary

renn.data.wordpiece_tokenizer_learner_lib.get_allowed_chars(all_counts, max_unique_chars)[source]

Get the top max_unique_chars characters within our wordcounts.

We want each character to be in the vocabulary so that we can keep splitting down to the character level if necessary. However, in order not to inflate our vocabulary with rare characters, we only keep the top max_unique_chars characters.

Parameters:
  • all_counts – list of (string, int) tuples
  • max_unique_chars – int, maximum number of unique single-character tokens
Returns:

set of strings containing top max_unique_chars characters in all_counts
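
A small sketch with toy counts; with max_unique_chars larger than the number of distinct characters, every character is kept.

  from renn.data import wordpiece_tokenizer_learner_lib as learner

  word_counts = [('the', 120), ('cat', 40), ('zebra', 1)]
  allowed = learner.get_allowed_chars(word_counts, max_unique_chars=10)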

renn.data.wordpiece_tokenizer_learner_lib.get_input_words(word_counts, reserved_tokens, max_token_length)[source]

Filters out words that are longer than max_token_length or are reserved.

Parameters:
  • word_counts – list of (string, int) tuples
  • reserved_tokens – list of strings
  • max_token_length – int, maximum length of a token
Returns:

list of (string, int) tuples of filtered wordcounts

renn.data.wordpiece_tokenizer_learner_lib.get_search_threshs(word_counts, upper_thresh, lower_thresh)[source]

Clips the thresholds for binary search based on current word counts.

The upper threshold parameter typically has a large default value that can result in many iterations of unnecessary search. Thus we clip the upper and lower bounds of search to the maximum and the minimum wordcount values.

Parameters:
  • word_counts – list of (string, int) tuples
  • upper_thresh – int, upper threshold for binary search
  • lower_thresh – int, lower threshold for binary search
Returns:

  • upper_search – int, clipped upper threshold for binary search
  • lower_search – int, clipped lower threshold for binary search
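
A small sketch; the overly wide default bounds get clipped to the observed count range.

  from renn.data import wordpiece_tokenizer_learner_lib as learner

  word_counts = [('the', 120), ('cat', 40), ('zebra', 1)]

  # Per the Returns entry above, the clipped upper bound comes first.
  upper_search, lower_search = learner.get_search_threshs(word_counts,
                                                          upper_thresh=10000000,
                                                          lower_thresh=1)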

renn.data.wordpiece_tokenizer_learner_lib.get_split_indices(word, curr_tokens, include_joiner_token, joiner)[source]

Gets indices for valid substrings of word, for iterations > 0.

For iterations > 0, rather than considering every possible substring, we only want to consider starting points corresponding to the start of wordpieces in the current vocabulary.

Parameters:
  • word – string we want to split into substrings
  • curr_tokens – string to int dict of tokens in vocab (from previous iteration)
  • include_joiner_token – bool whether to include joiner token
  • joiner – string used to indicate suffixes
Returns:

list of ints containing valid starting indices for word

renn.data.wordpiece_tokenizer_learner_lib.learn(word_counts, params)[source]

Takes in wordcounts and returns a wordpiece vocabulary.

Parameters:
  • word_counts – list of (string, int) tuples
  • params – Params namedtuple, parameters for learning
Returns:

string, final vocabulary with each word separated by newline
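
An end-to-end sketch of learning a tiny vocabulary from toy counts. The Params values and the reserved token are illustrative only, not recommended settings, and the form of the returned vocabulary follows the Returns entry above.

  from renn.data import wordpiece_tokenizer_learner_lib as learner

  word_counts = [('the', 120), ('cat', 40), ('cats', 12), ('catnip', 3)]

  params = learner.Params(upper_thresh=10000000,
                          lower_thresh=1,
                          num_iterations=4,
                          max_input_tokens=5000000,
                          max_token_length=50,
                          max_unique_chars=1000,
                          vocab_size=32,
                          slack_ratio=0.05,
                          include_joiner_token=True,
                          joiner='##',
                          reserved_tokens=['<unk>'])

  vocab = learner.learn(word_counts, params)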

renn.data.wordpiece_tokenizer_learner_lib.learn_binary_search(word_counts, lower, upper, params)[source]

Performs binary search to find the wordcount frequency threshold.

Given upper and lower bounds and a list of (word, count) tuples, performs binary search to find the threshold closest to producing a vocabulary of size vocab_size.

Parameters:
  • word_counts – list of (string, int) tuples
  • lower – int, lower bound for binary search
  • upper – int, upper bound for binary search
  • params – Params namedtuple, parameters for learning
Returns:

list of strings, vocab that is closest to target vocab_size

renn.data.wordpiece_tokenizer_learner_lib.learn_with_thresh(word_counts, thresh, params)[source]

Wordpiece learning algorithm to produce a vocab given frequency threshold.

Parameters:
  • word_counts – list of (string, int) tuples
  • thresh – int, frequency threshold for a token to be included in the vocab
  • params – Params namedtuple, parameters for learning
Returns:

list of strings, vocabulary generated for the given thresh

Module contents