Public Member Functions
    def __init__ (self, do_lower_case=True)
    def tokenize (self, text)

Public Attributes
    do_lower_case

Private Member Functions
    def _run_strip_accents (self, text)
    def _run_split_on_punc (self, text)
    def _tokenize_chinese_chars (self, text)
    def _is_chinese_char (self, cp)
    def _clean_text (self, text)
Runs basic tokenization (punctuation splitting, lower casing, etc.).
def helpers.tokenization.BasicTokenizer.__init__ (self, do_lower_case=True)

    Constructs a BasicTokenizer.

    Args:
        do_lower_case: Whether to lower case the input.
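A minimal construction sketch, assuming the class is imported from the helpers.tokenization module this page documents:

    from helpers.tokenization import BasicTokenizer

    # Default behaviour: input is lower-cased before tokenization.
    uncased_tokenizer = BasicTokenizer()

    # Preserve the original casing of the input.
    cased_tokenizer = BasicTokenizer(do_lower_case=False)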
def helpers.tokenization.BasicTokenizer.tokenize (self, text)

    Tokenizes a piece of text.
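Continuing the construction sketch above, a hedged usage example; the exact token stream depends on the implementation, but given the behaviour described on this page (lower casing, punctuation splitting, whitespace around CJK characters) one would expect output along these lines:

    tokens = uncased_tokenizer.tokenize("Hello, World! 你好")
    # Expected: lower-cased tokens, punctuation split off, and each CJK
    # character as its own token, e.g.
    # ['hello', ',', 'world', '!', '你', '好']
    print(tokens)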
def helpers.tokenization.BasicTokenizer._run_strip_accents (self, text) [private]

    Strips accents from a piece of text.
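A sketch of one common way to implement accent stripping, using Unicode NFD normalization and discarding combining marks; the actual implementation may differ:

    import unicodedata

    def _run_strip_accents(text):
        # Decompose characters so accents become separate combining marks.
        text = unicodedata.normalize("NFD", text)
        # Drop the combining marks (Unicode category "Mn"), keep base characters.
        return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

    # e.g. _run_strip_accents("café") -> "cafe"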
def helpers.tokenization.BasicTokenizer._run_split_on_punc (self, text) [private]

    Splits punctuation on a piece of text.
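A sketch of how punctuation splitting could work, written as a standalone function: walk the characters, emit each punctuation character as its own token, and start a new token after it. The _is_punctuation predicate is an assumed helper, approximated here with Unicode categories:

    import unicodedata

    def _is_punctuation(ch):
        # Assumed helper: any Unicode punctuation category ("P*").
        return unicodedata.category(ch).startswith("P")

    def _run_split_on_punc(text):
        output = []            # list of character lists, one per output token
        start_new_word = True
        for ch in text:
            if _is_punctuation(ch):
                output.append([ch])      # punctuation becomes its own token
                start_new_word = True
            else:
                if start_new_word:
                    output.append([])
                start_new_word = False
                output[-1].append(ch)
        return ["".join(chars) for chars in output]

    # e.g. _run_split_on_punc("hello,world!") -> ['hello', ',', 'world', '!']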
def helpers.tokenization.BasicTokenizer._tokenize_chinese_chars (self, text) [private]

    Adds whitespace around any CJK character.
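A sketch of the CJK padding step as a standalone function; it assumes a codepoint test like the _is_chinese_char sketch given under the next entry:

    def _tokenize_chinese_chars(text):
        # Surround every CJK character with spaces so that later
        # whitespace splitting yields one token per ideograph.
        output = []
        for ch in text:
            if _is_chinese_char(ord(ch)):   # see the _is_chinese_char sketch below
                output.extend([" ", ch, " "])
            else:
                output.append(ch)
        return "".join(output)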
def helpers.tokenization.BasicTokenizer._is_chinese_char (self, cp) [private]

    Checks whether CP is the codepoint of a CJK character.
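A sketch of the codepoint test; the ranges below are the Unicode CJK ideograph blocks typically used for this kind of check, though the exact set used here is an assumption:

    def _is_chinese_char(cp):
        # CJK Unified Ideographs plus extension and compatibility blocks.
        return (
            (0x4E00 <= cp <= 0x9FFF)        # CJK Unified Ideographs
            or (0x3400 <= cp <= 0x4DBF)     # Extension A
            or (0x20000 <= cp <= 0x2A6DF)   # Extension B
            or (0x2A700 <= cp <= 0x2B73F)   # Extension C
            or (0x2B740 <= cp <= 0x2B81F)   # Extension D
            or (0x2B820 <= cp <= 0x2CEAF)   # Extension E
            or (0xF900 <= cp <= 0xFAFF)     # CJK Compatibility Ideographs
            or (0x2F800 <= cp <= 0x2FA1F)   # Compatibility Ideographs Supplement
        )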
def helpers.tokenization.BasicTokenizer._clean_text (self, text) [private]

    Performs invalid character removal and whitespace cleanup on text.
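A sketch of the cleanup pass, with assumed helper predicates for control and whitespace characters built on unicodedata:

    import unicodedata

    def _is_control(ch):
        # Assumed helper: Unicode "C*" categories; tab/newline/CR are kept
        # and treated as whitespace instead.
        if ch in ("\t", "\n", "\r"):
            return False
        return unicodedata.category(ch).startswith("C")

    def _is_whitespace(ch):
        # Assumed helper: ASCII whitespace plus Unicode space separators ("Zs").
        if ch in (" ", "\t", "\n", "\r"):
            return True
        return unicodedata.category(ch) == "Zs"

    def _clean_text(text):
        output = []
        for ch in text:
            cp = ord(ch)
            # Drop NUL, the replacement character, and other control characters.
            if cp == 0 or cp == 0xFFFD or _is_control(ch):
                continue
            # Normalize every whitespace variant to a single space.
            output.append(" " if _is_whitespace(ch) else ch)
        return "".join(output)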
helpers.tokenization.BasicTokenizer.do_lower_case