
TransformersTokenizer

class TransformersTokenizer[source]

TransformersTokenizer(tokenizer:PreTrainedTokenizer)

fastai requires the tokenizer to accept a list of strings so it can be used with parallel_gen().

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

texts = ['This is a test', 'Just test']
transformers_tokenizer = TransformersTokenizer(tokenizer)
tok_texts = list(transformers_tokenizer(texts))

test_eq(tok_texts, [['[CLS]', 'this', 'is', 'a', 'test', '[SEP]'], ['[CLS]', 'just', 'test', '[SEP]']])
texts = ['This is a test', 'Just test']
# parallel_gen returns a generator of (index, tokens) pairs, e.g.
# (0, ['[CLS]', 'this', 'is', 'a', 'test', '[SEP]']), (1, ['[CLS]', 'just', 'test', '[SEP]'])
tok_texts = L(parallel_gen(TransformersTokenizer, texts, tokenizer=tokenizer)).sorted().itemgot(1)

test_eq(tok_texts, [['[CLS]', 'this', 'is', 'a', 'test', '[SEP]'], ['[CLS]', 'just', 'test', '[SEP]']])
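The interface parallel_gen() relies on can be sketched without the library itself: a callable that takes a list of strings and yields one token list per string. The classes below (SimpleBatchTokenizer, ToyTokenizer) are hypothetical names for illustration only, not the actual TransformersTokenizer implementation.

```python
# Minimal sketch of the interface TransformersTokenizer provides:
# callable on a list of strings, yielding one token list per string.

class ToyTokenizer:
    """Toy stand-in for a pretrained tokenizer (illustration only)."""
    def tokenize(self, text):
        # Mimic BERT-style output: special tokens around lowercased words.
        return ['[CLS]'] + text.lower().split() + ['[SEP]']

class SimpleBatchTokenizer:
    """Wraps a tokenizer so a list of strings maps to a stream of token lists."""
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
    def __call__(self, texts):
        for t in texts:
            yield self.tokenizer.tokenize(t)

batch_tok = SimpleBatchTokenizer(ToyTokenizer())
print(list(batch_tok(['This is a test', 'Just test'])))
# [['[CLS]', 'this', 'is', 'a', 'test', '[SEP]'], ['[CLS]', 'just', 'test', '[SEP]']]
```

Because the wrapper yields lazily, parallel_gen() can consume results as workers produce them rather than waiting for the whole batch.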

GPT2DecoderTokenizer

class GPT2DecoderTokenizer[source]

GPT2DecoderTokenizer(*inputs, **kwargs) :: GPT2Tokenizer

Adds the special tokens <|bos|> and <|pad|>. Prepends <|bos|> to the tokenized string and appends <|endoftext|> to it. Intended for the decoder side of machine translation models.

tokenizer = GPT2DecoderTokenizer.from_pretrained('distilgpt2')
sentence = 'The dog.'
test_eq( tokenizer.tokenize(sentence), ['<|bos|>', 'The', 'Ġdog', '.', '<|endoftext|>'] )
test_eq( tokenizer.encode(sentence), [50257, 464, 3290, 13, 50256] )
test_eq( tokenizer.encode(sentence, max_length=6, pad_to_max_length=True), [50257, 464, 3290, 13, 50256, 50258] )
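The pattern above can be sketched independently of GPT2Tokenizer: subclass a base tokenizer and wrap its output with begin/end markers. The class names below are hypothetical, chosen for illustration; the real GPT2DecoderTokenizer additionally registers the tokens with the vocabulary so encode() maps them to ids.

```python
# Sketch of the decorate-the-token-stream pattern (assumed, not the
# library's actual code): every tokenized sequence gets a beginning-of-
# sequence token in front and an end-of-text token appended.

class ToyBaseTokenizer:
    """Toy base tokenizer: whitespace split (illustration only)."""
    def tokenize(self, text):
        return text.split()

class DecoderTokenizer(ToyBaseTokenizer):
    bos_token = '<|bos|>'
    eos_token = '<|endoftext|>'

    def tokenize(self, text):
        # Wrap the base tokenization with the decoder's special tokens.
        return [self.bos_token] + super().tokenize(text) + [self.eos_token]

print(DecoderTokenizer().tokenize('The dog.'))
# ['<|bos|>', 'The', 'dog.', '<|endoftext|>']
```

Marking sequence boundaries this way lets a decoder learn where generation should start and stop, which is why the padding in the last test above is applied after the <|endoftext|> token.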