renn.data package

Submodules

renn.data.data_utils module

Data utils.

renn.data.data_utils.column_parser(text_column)
    Returns a parser that parses a row of a CSV file containing labeled
    data, extracting the label and the text.

    The parser assumes the label is the zeroth element of the row, and
    the text is the text_column element.
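For illustration, here is a minimal sketch of the behavior described above. This is an assumed pure-Python reimplementation, not the library source (the actual parser may operate on TensorFlow tensors rather than Python strings):

```python
import csv
import io

def column_parser(text_column):
    """Sketch: build a parser returning (label, text) from one CSV row."""
    def parse(row):
        # The label is assumed to be the zeroth element of the row;
        # the text lives at index text_column.
        fields = next(csv.reader(io.StringIO(row)))
        return fields[0], fields[text_column]
    return parse

parse = column_parser(2)
label, text = parse('3,ignored,"A great movie!"')
```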
renn.data.datasets module

Datasets.

renn.data.datasets.ag_news(split, vocab_file, sequence_length=100, batch_size=64, transform_fn=<function identity>, filter_fn=None, data_dir=None)
    Loads the AG News dataset.
renn.data.datasets.goemotions(split, vocab_file, sequence_length=50, batch_size=64, emotions=None, transform=<function identity>, filter_fn=None, data_dir=None)
    Loads the GoEmotions dataset.
renn.data.datasets.imdb(split, vocab_file, sequence_length=1000, batch_size=64, transform=<function identity>, filter_fn=None, data_dir=None)
    Loads the IMDB reviews dataset.
renn.data.datasets.snli(split, vocab_file, sequence_length=75, batch_size=64, transform=<function identity>, filter_fn=None, data_dir=None)
    Loads the SNLI dataset.
renn.data.datasets.mnist(split, order='row', batch_size=64, transform=<function identity>, filter_fn=None, data_dir=None, classes=None)
    Loads the serialized MNIST dataset.

    Parameters:
        classes – the subset of classes to keep. If None, all classes are kept.
renn.data.datasets.yelp(split, num_classes, vocab_file, sequence_length=1000, batch_size=64, transform=<function identity>, filter_fn=None, data_dir=None)
    Loads the Yelp reviews dataset.
renn.data.synthetic module

Synthetic Datasets.

class renn.data.synthetic.Unordered(num_classes=3, batch_size=64, length_sampler='Constant', sampler_params={'value': 40})
    Bases: object

    Synthetic dataset representing unordered classes, mimicking
    text-classification datasets like AG News (unlike, say, star
    prediction or sentiment analysis, which feature ordered classes).
renn.data.tokenizers module

Text processing.

renn.data.wordpiece_tokenizer_learner_lib module

Algorithm for learning a wordpiece vocabulary.
class renn.data.wordpiece_tokenizer_learner_lib.Params(upper_thresh, lower_thresh, num_iterations, max_input_tokens, max_token_length, max_unique_chars, vocab_size, slack_ratio, include_joiner_token, joiner, reserved_tokens)
    Bases: tuple

    upper_thresh
        Alias for field number 0
    lower_thresh
        Alias for field number 1
    num_iterations
        Alias for field number 2
    max_input_tokens
        Alias for field number 3
    max_token_length
        Alias for field number 4
    max_unique_chars
        Alias for field number 5
    vocab_size
        Alias for field number 6
    slack_ratio
        Alias for field number 7
    include_joiner_token
        Alias for field number 8
    joiner
        Alias for field number 9
    reserved_tokens
        Alias for field number 10
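Since Params is a namedtuple, it can be constructed with keyword arguments and its fields accessed by name or by the field numbers listed above. A sketch using a stand-in namedtuple with the documented field order (the concrete values below are illustrative assumptions, not library defaults):

```python
from collections import namedtuple

# Stand-in for renn's Params namedtuple, fields in documented order.
Params = namedtuple('Params', [
    'upper_thresh', 'lower_thresh', 'num_iterations', 'max_input_tokens',
    'max_token_length', 'max_unique_chars', 'vocab_size', 'slack_ratio',
    'include_joiner_token', 'joiner', 'reserved_tokens'])

params = Params(upper_thresh=10000000, lower_thresh=10, num_iterations=4,
                max_input_tokens=5000000, max_token_length=50,
                max_unique_chars=1000, vocab_size=8000, slack_ratio=0.05,
                include_joiner_token=True, joiner='##',
                reserved_tokens=['<pad>', '<unk>'])
```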
renn.data.wordpiece_tokenizer_learner_lib.ensure_all_tokens_exist(input_tokens, output_tokens, include_joiner_token, joiner)
    Adds all tokens in input_tokens to output_tokens if not already present.

    Parameters:
        input_tokens – set of strings (tokens) we want to include
        output_tokens – string-to-int dictionary mapping token to count
        include_joiner_token – bool, whether to include the joiner token
        joiner – string used to indicate suffixes

    Returns: string-to-int dictionary with all tokens in input_tokens included
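One plausible reading of this entry, sketched in plain Python (an assumed reimplementation, not the library source; in particular, the treatment of joiner-prefixed variants is an assumption suggested by the include_joiner_token and joiner parameters):

```python
def ensure_all_tokens_exist(input_tokens, output_tokens,
                            include_joiner_token, joiner):
    # Sketch: guarantee every requested token appears in the counts,
    # giving newly added tokens a count of 1.
    output = dict(output_tokens)
    for token in input_tokens:
        if token not in output:
            output[token] = 1
        if include_joiner_token:
            # Assumed behavior: also ensure the suffix (joiner-prefixed)
            # variant exists.
            joined = joiner + token
            if joined not in output:
                output[joined] = 1
    return output
```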
renn.data.wordpiece_tokenizer_learner_lib.extract_char_tokens(word_counts)
    Extracts all single-character tokens from word_counts.

    Parameters:
        word_counts – list of (string, int) tuples

    Returns: set of single-character strings contained within word_counts
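The described behavior amounts to collecting every character that appears in any counted word. A minimal sketch (assumed reimplementation):

```python
def extract_char_tokens(word_counts):
    # Sketch: gather every single character appearing in the words.
    seen = set()
    for word, _ in word_counts:
        seen.update(word)
    return seen
```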
renn.data.wordpiece_tokenizer_learner_lib.filter_input_words(all_counts, allowed_chars, max_input_tokens)
    Filters out words with disallowed characters and limits words to max_input_tokens.

    Parameters:
        all_counts – list of (string, int) tuples
        allowed_chars – list of single-character strings
        max_input_tokens – int, maximum number of tokens accepted as input

    Returns: list of (string, int) tuples of filtered wordcounts
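A sketch of this filtering step (assumed reimplementation; keeping the most frequent words when truncating to max_input_tokens is an assumption):

```python
def filter_input_words(all_counts, allowed_chars, max_input_tokens):
    # Sketch: drop words containing disallowed characters, then keep the
    # max_input_tokens most frequent of the remaining words.
    allowed = set(allowed_chars)
    filtered = [(w, c) for w, c in all_counts
                if all(ch in allowed for ch in w)]
    filtered.sort(key=lambda pair: pair[1], reverse=True)
    return filtered[:max_input_tokens]
```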
renn.data.wordpiece_tokenizer_learner_lib.generate_final_vocabulary(reserved_tokens, char_tokens, curr_tokens)
    Generates final vocab given reserved, single-character, and current tokens.

    Parameters:
        reserved_tokens – list of strings (tokens) that must be included in vocab
        char_tokens – set of single-character strings
        curr_tokens – string-to-int dict mapping token to count

    Returns: list of strings representing final vocabulary
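A sketch of how the three token sources might be merged (assumed reimplementation; the exact ordering of the final vocabulary is an assumption):

```python
def generate_final_vocabulary(reserved_tokens, char_tokens, curr_tokens):
    # Sketch: reserved tokens first, then sorted single characters, then
    # current tokens in decreasing count order, with duplicates removed.
    vocab = list(reserved_tokens)
    vocab.extend(sorted(char_tokens))
    vocab.extend(sorted(curr_tokens, key=curr_tokens.get, reverse=True))
    seen = set()
    return [t for t in vocab if not (t in seen or seen.add(t))]
```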
renn.data.wordpiece_tokenizer_learner_lib.get_allowed_chars(all_counts, max_unique_chars)
    Gets the top max_unique_chars characters within our wordcounts.

    We want each character to be in the vocabulary so that we can keep
    splitting down to the character level if necessary. However, in order
    not to inflate our vocabulary with rare characters, we only keep the
    top max_unique_chars characters.

    Parameters:
        all_counts – list of (string, int) tuples
        max_unique_chars – int, maximum number of unique single-character tokens

    Returns: set of strings containing top max_unique_chars characters in all_counts
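A sketch of the character-frequency cutoff described above (assumed reimplementation; weighting each character by its word's count is an assumption):

```python
from collections import Counter

def get_allowed_chars(all_counts, max_unique_chars):
    # Sketch: tally character frequencies weighted by word counts and
    # keep only the max_unique_chars most common characters.
    char_counts = Counter()
    for word, count in all_counts:
        for ch in word:
            char_counts[ch] += count
    return {ch for ch, _ in char_counts.most_common(max_unique_chars)}
```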
renn.data.wordpiece_tokenizer_learner_lib.get_input_words(word_counts, reserved_tokens, max_token_length)
    Filters out words that are longer than max_token_length or are reserved.

    Parameters:
        word_counts – list of (string, int) tuples
        reserved_tokens – list of strings
        max_token_length – int, maximum length of a token

    Returns: list of (string, int) tuples of filtered wordcounts
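This filter is a straightforward comprehension; a sketch (assumed reimplementation):

```python
def get_input_words(word_counts, reserved_tokens, max_token_length):
    # Sketch: drop reserved words and words exceeding max_token_length.
    reserved = set(reserved_tokens)
    return [(w, c) for w, c in word_counts
            if w not in reserved and len(w) <= max_token_length]
```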
renn.data.wordpiece_tokenizer_learner_lib.get_search_threshs(word_counts, upper_thresh, lower_thresh)
    Clips the thresholds for binary search based on current word counts.

    The upper threshold parameter typically has a large default value that
    can result in many iterations of unnecessary search. Thus we clip the
    upper and lower bounds of search to the maximum and the minimum
    wordcount values.

    Parameters:
        word_counts – list of (string, int) tuples
        upper_thresh – int, upper threshold for binary search
        lower_thresh – int, lower threshold for binary search

    Returns:
        upper_search – int, clipped upper threshold for binary search
        lower_search – int, clipped lower threshold for binary search
renn.data.wordpiece_tokenizer_learner_lib.get_split_indices(word, curr_tokens, include_joiner_token, joiner)
    Gets indices for valid substrings of word, for iterations > 0.

    For iterations > 0, rather than considering every possible substring,
    we only want to consider starting points corresponding to the start of
    wordpieces in the current vocabulary.

    Parameters:
        word – string we want to split into substrings
        curr_tokens – string-to-int dict of tokens in vocab (from previous iteration)
        include_joiner_token – bool, whether to include the joiner token
        joiner – string used to indicate suffixes

    Returns: list of ints containing valid starting indices for word
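One plausible greedy longest-match sketch of this splitting scheme (assumed reimplementation; in this reading, non-initial subtokens are matched with the joiner prefix, and None signals an unsplittable word):

```python
def get_split_indices(word, curr_tokens, include_joiner_token, joiner):
    # Sketch: greedily match the longest known wordpiece at each position;
    # each accepted end index is a valid split point for the word.
    indices = []
    start = 0
    while start < len(word):
        end = len(word)
        while end > start:
            subtoken = word[start:end]
            if include_joiner_token and start > 0:
                subtoken = joiner + subtoken  # suffix pieces carry the joiner
            if subtoken in curr_tokens:
                break
            end -= 1
        if end == start:
            return None  # no known wordpiece covers this position
        indices.append(end)
        start = end
    return indices
```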
renn.data.wordpiece_tokenizer_learner_lib.learn(word_counts, params)
    Takes in wordcounts and returns a wordpiece vocabulary.

    Parameters:
        word_counts – list of (string, int) tuples
        params – Params namedtuple, parameters for learning

    Returns: string, final vocabulary with each word separated by a newline
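Note the documented return type: a single newline-separated string rather than a list. A sketch of that final step, with a stand-in search_fn replacing the library's internal threshold search (an assumption for illustration):

```python
def learn(word_counts, params, search_fn):
    # Sketch: search_fn stands in for the library's vocabulary search;
    # the documented output is one newline-separated string.
    vocab = search_fn(word_counts, params)
    return '\n'.join(vocab)

result = learn([('a', 3)], None, lambda wc, p: ['<unk>', 'a'])
```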
renn.data.wordpiece_tokenizer_learner_lib.learn_binary_search(word_counts, lower, upper, params)
    Performs binary search to find the wordcount frequency threshold.

    Given upper and lower bounds and a list of (word, count) tuples,
    performs binary search to find the threshold closest to producing a
    vocabulary of size vocab_size.

    Parameters:
        word_counts – list of (string, int) tuples
        lower – int, lower bound for binary search
        upper – int, upper bound for binary search
        params – Params namedtuple, parameters for learning

    Returns: list of strings, vocab that is closest to target vocab_size
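The search exploits the fact that raising the count threshold shrinks the vocabulary. A sketch (assumed reimplementation; vocab_for_thresh stands in for learn_with_thresh, and params is shown as a plain dict for brevity where the library uses the Params namedtuple):

```python
def learn_binary_search(word_counts, lower, upper, params, vocab_for_thresh):
    # Sketch: binary-search the frequency threshold, tracking whichever
    # vocabulary lands closest to the target vocab_size.
    target = params['vocab_size']
    best = None
    while lower <= upper:
        mid = (lower + upper) // 2
        vocab = vocab_for_thresh(word_counts, mid)
        if best is None or abs(len(vocab) - target) < abs(len(best) - target):
            best = vocab
        if len(vocab) > target:
            lower = mid + 1   # too many tokens: raise the threshold
        else:
            upper = mid - 1   # too few tokens: lower the threshold
    return best
```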
renn.data.wordpiece_tokenizer_learner_lib.learn_with_thresh(word_counts, thresh, params)
    Wordpiece learning algorithm to produce a vocab given a frequency threshold.

    Parameters:
        word_counts – list of (string, int) tuples
        thresh – int, frequency threshold for a token to be included in the vocab
        params – Params namedtuple, parameters for learning

    Returns: list of strings, vocabulary generated for the given thresh