Classes
    class BasicTokenizer
    class BertTokenizer
    class FullTokenizer
    class WordpieceTokenizer
Functions
    def validate_case_matches_checkpoint(do_lower_case, init_checkpoint)
    def convert_to_unicode(text)
    def printable_text(text)
    def load_vocab(vocab_file)
    def convert_by_vocab(vocab, items)
    def convert_tokens_to_ids(vocab, tokens)
    def convert_ids_to_tokens(inv_vocab, ids)
    def whitespace_tokenize(text)
    def _is_whitespace(char)
    def _is_control(char)
    def _is_punctuation(char)
def helpers.tokenization.validate_case_matches_checkpoint(do_lower_case, init_checkpoint)
Checks whether the casing config is consistent with the checkpoint name.
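A minimal sketch of the casing check, assuming the checkpoint path follows the Google BERT naming convention of containing "uncased" or "cased"; the actual function may be more thorough, but the idea is to raise a ValueError when the two settings disagree:

    def validate_case_matches_checkpoint(do_lower_case, init_checkpoint):
        # Sketch only: infer casing from the checkpoint name and fail loudly
        # on a mismatch. "uncased" contains "cased", so test it first.
        if not init_checkpoint:
            return
        name = init_checkpoint.lower()
        if "uncased" in name:
            checkpoint_is_cased = False
        elif "cased" in name:
            checkpoint_is_cased = True
        else:
            return  # cannot tell from the name; nothing to validate
        if do_lower_case and checkpoint_is_cased:
            raise ValueError(
                "do_lower_case=True but checkpoint %r looks cased." % init_checkpoint)
        if not do_lower_case and not checkpoint_is_cased:
            raise ValueError(
                "do_lower_case=False but checkpoint %r looks uncased." % init_checkpoint)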
def helpers.tokenization.convert_to_unicode(text)
Converts `text` to Unicode (if it's not already), assuming utf-8 input.
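A Python 3 sketch consistent with that contract (str is already Unicode; bytes are assumed to be UTF-8):

    def convert_to_unicode(text):
        # str passes through unchanged; bytes are decoded as UTF-8.
        if isinstance(text, str):
            return text
        if isinstance(text, bytes):
            return text.decode("utf-8", "ignore")
        raise ValueError("Unsupported string type: %s" % type(text))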
def helpers.tokenization.printable_text(text)
Returns text encoded in a way suitable for print or `tf.logging`.
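In Python 3 this reduces to the same str/bytes normalization as convert_to_unicode, so a sketch (under that assumption) could simply delegate:

    def printable_text(text):
        # Sketch: reuse the same normalization so the result is always a str
        # that print() and tf.logging can handle.
        return convert_to_unicode(text)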
def helpers.tokenization.load_vocab(vocab_file)
Loads a vocabulary file into a dictionary.
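A minimal sketch, assuming the usual BERT vocab format of one token per line, where the zero-based line number becomes the token id:

    import collections

    def load_vocab(vocab_file):
        vocab = collections.OrderedDict()
        with open(vocab_file, "r", encoding="utf-8") as reader:
            for index, line in enumerate(reader):
                # Strip the trailing newline; the line number is the id.
                vocab[line.strip()] = index
        return vocab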
def helpers.tokenization.convert_by_vocab(vocab, items)
Converts a sequence of [tokens|ids] using the vocab.
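The conversion is a plain dictionary lookup over the sequence, which is why one helper serves both directions; a sketch:

    def convert_by_vocab(vocab, items):
        # Works for token->id and id->token alike: `vocab` is just a dict,
        # either the forward vocab or its inverse.
        return [vocab[item] for item in items]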
def helpers.tokenization.convert_tokens_to_ids(vocab, tokens)
def helpers.tokenization.convert_ids_to_tokens(inv_vocab, ids)
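Both converters appear to be thin wrappers around convert_by_vocab, differing only in which lookup table they receive. A hypothetical round trip ("vocab.txt" is a placeholder path, and the example tokens are assumed to be in the vocab):

    vocab = load_vocab("vocab.txt")                # token -> id
    inv_vocab = {v: k for k, v in vocab.items()}   # id -> token

    ids = convert_tokens_to_ids(vocab, ["[CLS]", "hello", "[SEP]"])
    tokens = convert_ids_to_tokens(inv_vocab, ids)
    assert tokens == ["[CLS]", "hello", "[SEP]"]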
def helpers.tokenization.whitespace_tokenize(text)
Runs basic whitespace cleaning and splitting on a piece of text.
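This amounts to stripping the ends and splitting on runs of whitespace; a sketch:

    def whitespace_tokenize(text):
        text = text.strip()
        if not text:
            return []  # input was empty or all whitespace
        return text.split()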
def helpers.tokenization._is_whitespace(char)  [private]
Checks whether `char` is a whitespace character.

def helpers.tokenization._is_control(char)  [private]
Checks whether `char` is a control character.

def helpers.tokenization._is_punctuation(char)  [private]
Checks whether `char` is a punctuation character.
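A combined sketch of the three classifiers, assuming they follow the usual Unicode-category approach of the BERT reference code: tab, newline, and carriage return are treated as whitespace rather than control characters, and all non-alphanumeric ASCII symbols count as punctuation even when Unicode classes them otherwise:

    import unicodedata

    def _is_whitespace(char):
        # \t, \n, and \r are technically control characters but are treated
        # as whitespace here so they split tokens.
        if char in (" ", "\t", "\n", "\r"):
            return True
        return unicodedata.category(char) == "Zs"

    def _is_control(char):
        # \t, \n, and \r are handled as whitespace above, not as control.
        if char in ("\t", "\n", "\r"):
            return False
        return unicodedata.category(char) in ("Cc", "Cf")

    def _is_punctuation(char):
        cp = ord(char)
        # Treat all non-alphanumeric ASCII symbols (e.g. "$", "^", "`") as
        # punctuation even though Unicode puts them in other categories.
        if (33 <= cp <= 47) or (58 <= cp <= 64) or \
           (91 <= cp <= 96) or (123 <= cp <= 126):
            return True
        return unicodedata.category(char).startswith("P")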