Tokenization is a fundamental step in natural language processing (NLP): breaking text into smaller units called tokens. Depending on the tokenization strategy, these tokens can be words, characters, or subwords.

Types of Tokenization

  • Word Tokenization: Splits text into words, usually keeping punctuation marks as separate tokens. For example, "Hello, world!" becomes ["Hello", ",", "world", "!"].
  • Character Tokenization: Splits text into individual characters. For example, "Hello, world!" becomes ["H", "e", "l", "l", "o", ",", " ", "w", "o", "r", "l", "d", "!"].
  • Subword Tokenization: Splits text into subwords, units smaller than whole words (produced by algorithms such as Byte-Pair Encoding or WordPiece). This is useful for handling out-of-vocabulary words, since an unseen word can still be built from known pieces; see the sketch after this list.
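
The three strategies can be sketched in a few lines of Python. The regular expression below is a deliberately simple stand-in for a real word tokenizer, and the subword example assumes the Hugging Face transformers package and the bert-base-uncased vocabulary are available; its exact output depends on that vocabulary.

import re
from transformers import AutoTokenizer

text = "Hello, world!"

# Word tokenization: split into words, keeping punctuation as separate tokens.
print(re.findall(r"\w+|[^\w\s]", text))  # ['Hello', ',', 'world', '!']

# Character tokenization: every character (including spaces) becomes a token.
print(list(text))  # ['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!']

# Subword tokenization: a pretrained WordPiece tokenizer breaks rare words
# into pieces it knows (the output shown is indicative, not guaranteed).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Tokenization is key."))  # e.g. ['token', '##ization', 'is', 'key', '.']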

Tokenization in Practice

Tokenization is typically performed using libraries like NLTK or spaCy. Here's an example using spaCy:

import spacy

# Load spaCy's small English pipeline (install it first with
# `python -m spacy download en_core_web_sm`).
nlp = spacy.load("en_core_web_sm")

text = "Tokenization is key in NLP."
doc = nlp(text)                         # run the pipeline on the text
tokens = [token.text for token in doc]  # collect the text of each token
print(tokens)

This will output: ['Tokenization', 'is', 'key', 'in', 'NLP', '.']
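
NLTK, the other library mentioned above, offers an equivalent one-liner. This is a minimal sketch assuming NLTK is installed and its punkt tokenizer data can be downloaded.

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # one-time download of the tokenizer models
print(word_tokenize("Tokenization is key in NLP."))
# expected: ['Tokenization', 'is', 'key', 'in', 'NLP', '.']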

Further Reading

For more information on tokenization, you can check out our Introduction to NLP.

Tokenization in Different Languages

Different languages require different tokenization strategies. Chinese, for example, is written without spaces between words, so text is often tokenized at the character level (or segmented into words with a dedicated tool); this works because individual Chinese characters frequently function as morphemes and carry meaning on their own.
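
As a concrete illustration, the snippet below contrasts character-level tokenization (plain Python) with word-level segmentation using the jieba package; jieba is an assumption here, and its exact segmentation depends on the dictionary it ships with.

import jieba

text = "我爱自然语言处理"  # "I love natural language processing"

# Character-level: each Chinese character becomes a token.
print(list(text))  # ['我', '爱', '自', '然', '语', '言', '处', '理']

# Word-level segmentation (output is indicative; it depends on jieba's dictionary).
print(jieba.lcut(text))  # e.g. ['我', '爱', '自然语言', '处理']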

Chinese Characters