TransformersTokenizer
from fastcore.all import *   # provides L, parallel_gen, test_eq
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
texts = ['This is a test', 'Just test']
transformers_tokenizer = TransformersTokenizer(tokenizer)
tok_texts = list(transformers_tokenizer(texts))
test_eq(tok_texts, [['[CLS]', 'this', 'is', 'a', 'test', '[SEP]'], ['[CLS]', 'just', 'test', '[SEP]']])
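Since the wrapper returns token strings, you may still want the numeric input IDs; the underlying Hugging Face tokenizer can convert them directly. A minimal sketch, assuming the same bert-base-uncased tokenizer loaded above:

# Map the token strings back to vocabulary IDs with the wrapped HF tokenizer
ids = [tokenizer.convert_tokens_to_ids(toks) for toks in tok_texts]
# For bert-base-uncased, every sequence starts with 101 ([CLS]) and ends with 102 ([SEP])
test_eq(ids[0][0], 101)
test_eq(ids[0][-1], 102)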
texts = ['This is a test', 'Just test']
# parallel_gen yields (index, output) pairs, e.g. (0, ['[CLS]', 'this', 'is', 'a', 'test', '[SEP]']), (1, ['[CLS]', 'just', 'test', '[SEP]'])
tok_texts = L(parallel_gen(TransformersTokenizer, texts, tokenizer=tokenizer)).sorted().itemgot(1)
test_eq(tok_texts, [['[CLS]', 'this', 'is', 'a', 'test', '[SEP]'], ['[CLS]', 'just', 'test', '[SEP]']])
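parallel_gen yields its (index, output) pairs in whatever order the workers finish, so the .sorted() call restores the original text order before .itemgot(1) drops the indices. A plain-Python sketch of that unpacking, using the pairs from the comment above:

# Simulated out-of-order worker output: (index, tokens) pairs
pairs = [(1, ['[CLS]', 'just', 'test', '[SEP]']),
         (0, ['[CLS]', 'this', 'is', 'a', 'test', '[SEP]'])]
# Sort on the index, then keep only the token lists
ordered = [toks for _, toks in sorted(pairs)]
test_eq(ordered, tok_texts)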
GPT2DecoderTokenizer
tokenizer = GPT2DecoderTokenizer.from_pretrained('distilgpt2')
sentence = 'The dog.'
test_eq(tokenizer.tokenize(sentence), ['<|bos|>', 'The', 'Ġdog', '.', '<|endoftext|>'])
test_eq(tokenizer.encode(sentence), [50257, 464, 3290, 13, 50256])
test_eq(tokenizer.encode(sentence, max_length=6, pad_to_max_length=True), [50257, 464, 3290, 13, 50256, 50258])
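The two IDs above GPT-2's stock vocabulary (whose highest built-in ID, 50256, is '<|endoftext|>') suggest that GPT2DecoderTokenizer registers a beginning-of-sequence token and a padding token on top of the pretrained tokenizer. A rough sketch of how the same IDs arise with a plain Hugging Face tokenizer; the '<|pad|>' string is a guess, and this is an illustration, not the class's actual implementation:

from transformers import GPT2Tokenizer

base = GPT2Tokenizer.from_pretrained('distilgpt2')
# New special tokens are appended after the pretrained vocab, so (assuming
# insertion order) '<|bos|>' gets ID 50257 and '<|pad|>' gets ID 50258
base.add_special_tokens({'bos_token': '<|bos|>', 'pad_token': '<|pad|>'})
test_eq(base.bos_token_id, 50257)
test_eq(base.pad_token_id, 50258)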