In the context of an LLM (Large Language Model), a token refers to a unit of text that the model processes. Each token typically represents a word or a subword in the input text. For example, in English text, words like "cat," "dog," or "house" would each be tokens. In some cases, tokens might represent parts of words or characters, especially in languages with complex morphology or in tokenization processes like byte-pair encoding (BPE) used in models like GPT. Each token is assigned a numerical representation or embedding by the model, which is then used for various natural language processing tasks.
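To make this concrete, here is a minimal sketch of BPE tokenization, assuming the `tiktoken` package (the open-source tokenizer library used with OpenAI's GPT models) is installed; the encoding name and sample text are just illustrative choices.

```python
# A minimal tokenization sketch, assuming the `tiktoken` package is installed.
import tiktoken

# Load a BPE encoding used by recent GPT models.
encoding = tiktoken.get_encoding("cl100k_base")

text = "The cat sat on the doormat."
token_ids = encoding.encode(text)

print(token_ids)                 # a list of integer token IDs
print(len(token_ids), "tokens")  # how many tokens the input was split into

# Each ID maps back to a piece of text (a whole word or a subword).
print([encoding.decode([t]) for t in token_ids])
```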
Embeddings refer to numerical representations of words, phrases, or sentences. At the end of the day, computers can only take numbers as input. Software engineers invented a way to convert strings of text into sequences of numbers so that computers can process words.
These representations capture semantic and syntactic relationships between words, allowing machine learning models to understand and process language more effectively.
In simpler terms, embeddings encode words or text into vectors of real numbers, where each dimension of the vector represents a different linguistic feature. For example, two words with similar meanings might have embeddings that are closer together in this high-dimensional space, while words with different meanings would be farther apart.
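The sketch below shows how that "closeness" is typically measured with cosine similarity. The vectors here are made up for illustration (real embeddings have hundreds or thousands of dimensions), and only NumPy is assumed.

```python
# A toy sketch of measuring similarity between embeddings with NumPy.
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction; values near 0 mean unrelated.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat   = np.array([0.8, 0.1, 0.6, 0.2])  # hypothetical embedding for "cat"
dog   = np.array([0.7, 0.2, 0.5, 0.3])  # hypothetical embedding for "dog"
house = np.array([0.1, 0.9, 0.0, 0.7])  # hypothetical embedding for "house"

print(cosine_similarity(cat, dog))    # high: similar meanings sit close together
print(cosine_similarity(cat, house))  # lower: different meanings sit farther apart
```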
Word embeddings are often learned from large text corpora using techniques like word2vec, GloVe, or FastText. These embeddings are then used as input to machine learning models for various NLP tasks such as sentiment analysis, machine translation, and named entity recognition. Embeddings can also be learned for larger text units like phrases or sentences, capturing their semantic content in a similar way.
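As a rough sketch of how such embeddings are learned, the example below trains word2vec on a tiny toy corpus, assuming the `gensim` library is installed; real models are trained on corpora with billions of words, and the parameter values here are arbitrary.

```python
# A minimal word2vec training sketch, assuming the `gensim` library is installed.
from gensim.models import Word2Vec

# Tiny toy corpus of pre-tokenized sentences.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["the", "house", "has", "a", "red", "roof"],
]

# Learn 50-dimensional word vectors from the corpus.
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["cat"])               # the learned embedding vector for "cat"
print(model.wv.most_similar("cat"))  # words whose vectors are closest to "cat"
```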
In most cases, each token is converted into a vector when generating embeddings: when a machine learning model processes text, especially in natural language processing tasks, it maps each token to its corresponding embedding vector.
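Conceptually, this lookup is just indexing into a matrix that has one row per token ID. The sketch below uses a randomly initialized matrix and made-up token IDs purely for illustration; in a real model the matrix is learned during training.

```python
# A simplified sketch of the token-to-vector lookup inside a model.
import numpy as np

vocab_size = 1000   # hypothetical vocabulary size
embedding_dim = 8   # hypothetical embedding size (real models use hundreds of dimensions)

# The embedding layer is a matrix with one row per token ID.
# Randomly initialized here; learned from data in a real model.
embedding_matrix = np.random.rand(vocab_size, embedding_dim)

token_ids = [42, 7, 911]               # hypothetical token IDs for a short sentence
vectors = embedding_matrix[token_ids]  # one row (vector) per token

print(vectors.shape)  # (3, 8): three tokens, each mapped to an 8-dimensional vector
```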
That's why LLMs use tokens as the unit of cost: they represent the unit of effort required to turn your text into vectors.
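As a rough sketch of what that means in practice, you can count the tokens in a prompt and multiply by a per-token price. The price below is purely hypothetical; check your provider's pricing page for real figures. The `tiktoken` package is assumed, as in the earlier example.

```python
# A rough token-cost estimate, assuming `tiktoken` and a hypothetical price.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

prompt = "Summarize the following article about application monitoring..."
token_count = len(encoding.encode(prompt))

price_per_1k_tokens = 0.001  # hypothetical price; real rates vary by provider and model
estimated_cost = (token_count / 1000) * price_per_1k_tokens

print(f"{token_count} tokens, estimated cost: ${estimated_cost:.6f}")
```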
We are also experimenting with Generative AI. Learn more at https://inspector.dev