In linguistics, the oldest way of thinking about “meaning” can usually be described by the relationship between “signifier” and “signified”:
- Signifier: refers to the formal part of the symbol, such as the spelling, pronunciation or text form of a word.
- Signified: refers to the concept or thing represented by the symbol, that is, the “meaning” or “object” associated with the word.
Explaining the meaning of words through the signifier and the signified follows the paradigm of denotational semantics: each symbol (such as the word “tree”) points directly to a concept or a physical object (the tree in reality, or the concept of a tree in the mind). On this view there is a relatively fixed correspondence between symbols and concepts, which captures how language expresses concrete or abstract things through perceptible symbols and thus forms the most basic account of linguistic meaning.
Denotational semantics is the simplest and most basic method of semantic interpretation in lexical semantics. Semantic analysis of vocabulary in this framework generally decomposes word meaning into relations such as:
1. Synonym
- Synonyms are a group of words with the same or very similar meanings. For example, “fast” and “speedy” can be regarded as synonyms in English.
- Synonyms are not necessarily interchangeable in every situation; they may differ in degree or in the scenarios where they are used, but their core meanings are very close.
2. Hypernym (upper word)
- A hypernym has a broader meaning and serves as a general term for a set of more specific words (lower words, or hyponyms).
- For example, “fruit” is the hypernym of “apple”, “banana”, and “pear”; “transportation” is the hypernym of “car”, “train”, and “airplane”.
- The relationship between hypernym and hyponym reflects the semantic hierarchy (taxonomy) structure: upper-level words represent more general and abstract concepts, and lower-level words represent more specific and subdivided concepts.
Computers represent all data in binary, so how can we express the meaning of words and the relationships between them?
The earliest idea was to arrange words into a dictionary and treat them as discrete, atomic symbols. A typical encoding is “one-hot” encoding, which represents each word as a vector with a 1 in exactly one element and 0 in all the others.
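As a concrete illustration, here is a minimal one-hot encoding sketch in Python over a toy vocabulary; the vocabulary and words are invented for the example, not taken from any real system:

```python
import numpy as np

# Toy vocabulary, purely for illustration.
vocab = ["tree", "bank", "river", "money"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return a vector with 1 at the word's index and 0 everywhere else."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("bank"))                      # [0. 1. 0. 0.]
# Distinct one-hot vectors are orthogonal: their dot product is always 0,
# so no similarity between words is encoded.
print(one_hot("bank") @ one_hot("river"))   # 0.0
```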
This encoding makes it difficult to store relationships between words, such as synonymy and hypernymy. When “bank” refers to a retail bank, its hypernym is “financial institution”; when it refers to a river bank, its hypernym comes from geography. A word can have multiple meanings and multiple hypernyms, so how do we represent those options in a computer? If we add symbolism, metaphor, and irony, it becomes hard to pin down the meaning of a word, or to choose a word that expresses the specific meaning we want to convey, especially when that meaning depends on context. The same word means different things in different contexts, and these specific meanings do not depend on the word itself but on its surroundings. In such cases, denotational semantics no longer suffices.
Computer scientists have tried several ways to deal with this problem: Word Sense Disambiguation, which looks at the words around “bank” to decide whether it is a financial bank or a river bank; Semantic Networks, which use graph data structures to represent semantic relationships between words such as synonyms, hypernyms, and hyponyms; and knowledge graphs, which store entities, concepts, and the relationships between them. However, all of these still rest on the paradigm of denotational semantics, and they remain dead ends as long as one-hot encoding dominates.
Therefore, several pioneering linguists, such as the Ukrainian-born American linguist Zellig Harris (the mentor of Noam Chomsky), tried to describe vocabulary statistically. The idea is captured in the famous line put forward by the British linguist J.R. Firth in 1957:
You shall know a word by the company it keeps
which means, in essence, that the meaning of a word lies in its context.
These claims developed into contextual semantics, a complement to denotational semantics that emphasizes the influence of context on word meaning and pays attention to implication, culture, and pragmatic factors. However, it is not easy to formalize and standardize, and carrying out logical analysis within it is challenging.
These forerunners played the role of theoretical scientists, while computer scientists (the experimentalists) implemented and verified their linguistic insights. The link between the two is mathematics: the work of formalization and standardization can only be completed by finding the right mathematical methods.
Some computer scientists invented new vector encodings that no longer treat a word as a single one-hot code but encode its context as well. Scott Deerwester and colleagues, for example, recorded how often words co-occur in certain contexts to build a “co-occurrence matrix”, the idea behind Latent Semantic Analysis. Because the vocabulary is large, this matrix is very sparse, so its dimensionality has to be reduced with mathematical tools. The method adopted is Singular Value Decomposition (SVD), which factors the large matrix into the product of three matrices: two orthogonal matrices and a rectangular diagonal matrix whose singular values are sorted by their contribution to the original matrix. The less important parts can therefore be discarded while the primary structure is kept, making the representation much smaller.
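To make the idea concrete, here is a small sketch of truncated SVD on a toy co-occurrence matrix. The words, contexts, and counts are invented for illustration; a real matrix would have tens of thousands of rows and columns:

```python
import numpy as np

# Toy word-by-context co-occurrence counts (rows = words, columns = context words).
# The numbers are made up purely for illustration.
words = ["apple", "banana", "car", "train"]
contexts = ["eat", "fruit", "drive", "rail"]
M = np.array([
    [8, 5, 0, 0],   # apple
    [7, 6, 0, 0],   # banana
    [0, 0, 9, 1],   # car
    [0, 0, 2, 8],   # train
], dtype=float)

# SVD: M = U * diag(s) * Vt, with singular values in s sorted in descending order.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keep only the top-k singular values to get a low-rank approximation.
k = 2
word_vectors = U[:, :k] * s[:k]   # dense k-dimensional word representations

for w, v in zip(words, word_vectors):
    print(w, np.round(v, 2))
```

Keeping only the largest singular values compresses each word into a short dense vector while preserving most of the co-occurrence structure.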
Subsequent computer scientists developed this idea further. The 2003 paper by David Blei, Andrew Ng, and Michael Jordan established a probabilistic relationship between word frequencies and the meaning of texts: Latent Dirichlet Allocation (LDA). The meaning of a text is a mixture of probabilities over many possible topics, and each topic is in turn a mixture of probabilities over words. This is essentially a generative probabilistic model grounded in Bayesian statistics, inferring conditional probabilities from observations. Its computation resembles the hidden Markov model (HMM), which infers hidden states from observed data, except that an HMM processes sequences whose hidden state changes with time (or sequence position), while LDA processes “bag of words” data, ignoring word order and attending only to co-occurrence. Although LDA and other probabilistic methods perform well on tasks such as topic modeling and lexical relationship analysis, they remain limited by their paradigm and cannot break through to the full complexity of natural-language semantics. All of these methods were progress, but none achieved a decisive result.
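Here is a hedged sketch of topic modeling in this spirit, using scikit-learn’s LatentDirichletAllocation on a tiny invented corpus; real topic modeling needs far more documents, and the two topics found here are only suggestive:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny toy corpus, invented for illustration.
docs = [
    "the bank approved the loan and the mortgage",
    "the river bank was flooded after the rain",
    "interest rates at the bank rose again",
    "we walked along the river and the shore",
]

# LDA works on bag-of-words counts: word order is ignored, only co-occurrence matters.
counts = CountVectorizer(stop_words="english").fit(docs)
X = counts.transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Each topic is a probability distribution over words; print the top words per topic.
terms = counts.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:4]
    print(f"topic {i}:", [terms[j] for j in top])
```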
At the same time, Geoffrey Hinton, later a Turing Award winner, made progress on another track. In a 1986 paper he proposed the idea of distributed semantic representation. A distributed representation uses a dense real-valued vector to represent a word or concept: no single dimension has a clear meaning on its own, but together the dimensions express semantics. A concept becomes a point in a high-dimensional space, and similar concepts lie close to each other in that space, which lets us do mathematics with synonyms and antonyms. Replacing the sparse one-hot vector with a dense vector brings semantically similar words closer together in the vector space. Hinton also improved the recurrent neural network (RNN). His ideas were radical, beautiful, and ahead of their time: computers should learn meaning on their own rather than having it assigned by hand, but the state of computing at the time made this impossible to realize.
In 2003, Yoshua Bengio, who later shared the Turing Award with Hinton, proposed the first neural network language model (NNLM), combining word embeddings with neural networks. RNNs had been used since the 1980s to process serialized information such as digital signals, and in 1997 LSTM added memory cells to the RNN so that long-distance dependencies could be captured. Bengio and colleagues used a multi-layer feed-forward network to learn word vectors and a language model jointly: words are represented as dense real-valued vectors, and the relationships between words are modeled through a context window. For the first time, word vectors were learned inside a neural network, with probabilities modeled through a nonlinear activation (tanh) and a softmax. Whereas traditional methods relied on manual feature engineering and could only draw linear decision boundaries, neural models learn their input representations automatically and have nonlinear decision boundaries that capture more complicated patterns. Context-based training yields, for each center word, weights indicating which surrounding words matter most for its meaning. NLP is a vast and difficult field full of irregularities, so this small step was significant progress in a critical area. Yet the method demanded a great deal of computing power, and it took nearly ten years, until GPUs were used at scale for neural network training, for its potential to be realized.
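The following PyTorch sketch shows the shape of a Bengio-style feed-forward NNLM. The sizes, the random data, and the single training step are assumptions made for the example, not the settings of the original paper:

```python
import torch
import torch.nn as nn

# A minimal sketch: embed a fixed window of previous words, pass them through a
# tanh hidden layer, and predict the next word with a softmax over the vocabulary.
vocab_size, embed_dim, context_size, hidden_dim = 1000, 64, 3, 128

class NNLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)          # learned word vectors
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)              # scores over the vocabulary

    def forward(self, context_ids):                               # (batch, context_size)
        x = self.embed(context_ids).flatten(1)                    # concatenate context embeddings
        h = torch.tanh(self.hidden(x))                            # nonlinear hidden layer
        return self.out(h)                                        # logits; softmax lives in the loss

model = NNLM()
loss_fn = nn.CrossEntropyLoss()                                   # applies log-softmax internally
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One fake training step on random data, just to show the training loop shape.
context = torch.randint(0, vocab_size, (32, context_size))
target = torch.randint(0, vocab_size, (32,))
loss = loss_fn(model(context), target)
loss.backward()
optimizer.step()
print(float(loss))
```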
Out of confidence in this field, and after watching the related technologies mature, Tomas Mikolov and his team invented a breakthrough algorithm in 2013: word2vec, which converts text into vectors. Word2vec is essentially a way of categorizing and representing words; by placing them in a high-dimensional vector space it makes words computable. This discovery changed the direction of NLP development. When these PhDs saw that the trained word vectors could be added and subtracted to carry out analogy reasoning, famously king - man + woman ≈ queen, they must have been thrilled.
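With the gensim library and its pre-trained Google News vectors (this assumes the `word2vec-google-news-300` model is available through gensim’s downloader, and it is a large download on first use), the analogy can be reproduced in a few lines:

```python
import gensim.downloader as api

# Load pre-trained word2vec vectors (a sizable download the first time).
wv = api.load("word2vec-google-news-300")

# The classic analogy: king - man + woman is closest to queen.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Similar words sit close together in the vector space.
print(wv.similarity("fast", "speedy"))
```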
To make this training feasible, Mikolov introduced Negative Sampling and Hierarchical Softmax to normalize the probabilities cheaply, significantly reducing the computational complexity. This allowed word2vec to be trained on large-scale corpora such as Google News.
As an analogy, suppose the Dewey Decimal system did not exist and a library received a large number of uncategorized books. How should we organize them on the shelves?
Words themselves are symbols, just like book titles. The categorization and placement of books are analogous to abstract meanings, which can only be understood by discovering the relationships among the books. Neighboring books share a sort of mysterious connection — they might be by the same author, part of the same series, or cover similar themes — but they are not necessarily in order or directly adjacent to each other.
Mikolov invented the Skip-Gram model, which predicts the surrounding words (the context) from the current word. It’s like using the title of one book to guess which other books are likely to be placed nearby. The CBOW (Continuous Bag of Words) model does the opposite: it predicts the current word from the surrounding words. The model is trained on a large corpus by comparing these predictions with actual occurrences; when a prediction is inaccurate, partial derivatives (gradients) are used to update the model’s weights, making it more accurate over time.
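A minimal sketch of how Skip-Gram training pairs are extracted from a sentence, assuming a window size of 2 (the sentence and the window are arbitrary choices for the example):

```python
# Build (center, context) training pairs for Skip-Gram from one toy sentence.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, center in enumerate(sentence):
    # Context = words within `window` positions on either side of the center word.
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))   # Skip-Gram: predict context from center
            # CBOW would instead group all context words to predict the center word.

print(pairs[:6])
```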
When training word embedding models, Negative Sampling significantly improves efficiency. Instead of comparing each book only with the dozen or so books around it, it is cheaper to also check how the book differs from a few randomly selected books placed far away, since distant books are unlikely to be sequels or to share related content. By measuring similarity and dissimilarity this way, we can move the books to the right shelves. That is the essence of Negative Sampling. At the same time, Mikolov’s algorithm uses sparse matrix updates, meaning that only a small portion of the parameters is modified at each step: we never need to read and move every single book in the library, only rearrange a small part of it. The use of hash tables for word vector lookups is similar to assigning labels to books for easy cataloging. RNNs combined with word embeddings proved incredibly powerful, achieving remarkable success in tasks such as machine translation, sentiment analysis, and dialogue systems.
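Here is a hedged PyTorch sketch of the negative-sampling objective on random toy embeddings; all sizes and indices are arbitrary assumptions, and it shows only the shape of the computation, not a full training loop:

```python
import torch
import torch.nn.functional as F

# Score the true (center, context) pair high and a few random "negative" words low.
vocab_size, dim, num_negatives = 1000, 64, 3
center_vecs = torch.randn(vocab_size, dim, requires_grad=True)   # "input" embeddings
context_vecs = torch.randn(vocab_size, dim, requires_grad=True)  # "output" embeddings

center_id, context_id = 10, 42
negative_ids = torch.randint(0, vocab_size, (num_negatives,))    # random far-away "books"

pos_score = center_vecs[center_id] @ context_vecs[context_id]
neg_scores = context_vecs[negative_ids] @ center_vecs[center_id]

# Maximize sigmoid(pos) and sigmoid(-neg); equivalently minimize this loss.
loss = -F.logsigmoid(pos_score) - F.logsigmoid(-neg_scores).sum()
loss.backward()
print(float(loss))
```

Only the embedding rows for the center word, the true context word, and the handful of negatives receive nonzero gradients, which is what makes the updates sparse and cheap.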
One year later, Stanford professor Christopher Manning and his colleagues added global co-occurrence statistics on top of word2vec’s ideas in the GloVe model and achieved another state-of-the-art result. Almost seven years later, while teaching the Winter 2021 offering of CS224N (NLP), Manning attributed much of GloVe’s advantage to its training data (Wikipedia turned out to be better than Google News).
In 2014, Bengio’s team first proposed adding an Attention mechanism to RNNs. Attention imitates the way humans read and understand, deciding which pieces of context matter more than the rest: an attention score assigns different weights to the context, giving higher probability to the more relevant parts. This lets the model capture semantic and contextual dependencies more accurately in word prediction, translation, and text generation, and it alleviates the accumulation and amplification of errors in plain RNNs. It was immediately applied with great success to machine translation: seq2seq neural machine translation trained by a team of dozens over months outperformed the statistical models that hundreds of people had developed over years, and translation software companies switched to Neural Machine Translation across the board.
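A bare-bones sketch of the attention idea over RNN encoder states follows; the shapes and values are random stand-ins, and real models learn a scoring function rather than using a plain dot product:

```python
import torch
import torch.nn.functional as F

# Score each encoder state against the current decoder state, turn the scores into
# weights with a softmax, and build a weighted context vector.
seq_len, dim = 5, 8
encoder_states = torch.randn(seq_len, dim)   # one vector per source word
decoder_state = torch.randn(dim)             # current decoder state

scores = encoder_states @ decoder_state       # higher score = more relevant source word
weights = F.softmax(scores, dim=0)            # attention weights sum to 1
context = weights @ encoder_states            # weighted sum of encoder states

print(weights)           # which source positions the model "attends" to
print(context.shape)
```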
In 2017, the Google Brain team proposed abandoning the RNN altogether and designing the Transformer architecture around the Self-Attention mechanism alone, removing the restriction that words must be processed sequentially. When all words can be processed simultaneously, massive parallel training becomes possible. The architecture brought forward significant innovations, including:
- Positional Encoding. Injecting information about each word’s position in the sequence lets the model capture ordering relationships. Self-attention captures long-distance dependencies while this positional information is preserved, so the model handles long text and complex semantic structure better (see the sketch after this list).
- Unsupervised pre-training, which no longer relies on large-scale annotated data and therefore reduces training costs.
- Self-attention is extended to multiple heads, so that different levels of information in the text, both syntactic and semantic, and even some logical inference, can be captured.
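The sketch below puts two of these ingredients together in PyTorch: sinusoidal positional encoding and one single-head scaled dot-product self-attention step. The sequence length, model dimension, and random weights are assumptions for illustration only:

```python
import math
import torch
import torch.nn.functional as F

seq_len, d_model = 6, 16
x = torch.randn(seq_len, d_model)            # token embeddings for one toy sentence

# (1) Positional encoding: sines and cosines of different frequencies per dimension.
pos = torch.arange(seq_len).unsqueeze(1).float()
div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(pos * div)
pe[:, 1::2] = torch.cos(pos * div)
x = x + pe                                   # inject word-order information

# (2) Self-attention: every word attends to every other word at once.
Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv
attn = F.softmax(Q @ K.T / math.sqrt(d_model), dim=-1)   # (seq_len, seq_len) weights
out = attn @ V                               # each row mixes information from all positions
print(out.shape)
```

Because the attention weights for all word pairs come out of a single matrix product, every position is processed in parallel, which is what makes massive parallel training possible.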
Transformer-based neural machine translation further improved translation accuracy over RNNs with attention. Transformer-based language models, trained at ever larger scale, are what we now call large language models.
In 2018, OpenAI published GPT, a unidirectional decoder and a breakthrough large language model for text generation. It extended the Transformer to text generation and established a new paradigm, unsupervised pre-training plus supervised fine-tuning, allowing models to generate coherent, natural long text. In the same year, Google published BERT, a bidirectional encoder and a breakthrough in semantic understanding and context modeling. It dramatically improved performance on tasks such as question answering (QA), sentiment analysis, reading comprehension, and text classification, achieving state-of-the-art results on GLUE, SQuAD, and other benchmarks.
In 2022, ChatGPT was born and aroused enormous public enthusiasm. Built on GPT-3.5, it optimized dialogue strategies and contextual coherence through Reinforcement Learning from Human Feedback (RLHF). Dialogue fine-tuning makes ChatGPT better at multi-turn and emotionally aware conversation and lets it dynamically maintain semantic consistency across the context. It has become a milestone for conversational AI and generative AI.
At present, pragmatics is a hot spot for further NLP research. It expands the boundaries of semantics and considers the influence of the speaker’s intention, cultural background, and social relations on the meaning of words. For example, take the question “Can you open the window?”
Semantic reading: a question about the listener’s ability. Pragmatic reading: a request that the listener open the window.
Compared with semantics, pragmatics emphasizes implicit meaning and inferred intention: metaphor, irony, puns, euphemism, and similar phenomena. These depend on context, cultural background, social relations, the speaker’s intention, and the coherence of the dialogue. Large language models have attracted so much attention because they not only handle semantics but also show some achievements in pragmatics.
Intent Recognition, for example, is a key issue in natural language understanding (NLU), and dialogue logic involves contextual consistency, callbacks to earlier turns, and logical reasoning. The existing implementations are relatively simple: the ChatGPT client resends the entire dialogue history to the model, and because the context window is fixed, long-distance logical relationships get lost. Perhaps we could instead send back only the “most important” dialogue concepts to avoid losing those long-distance relationships.
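For concreteness, here is a crude sketch of the resend-the-history pattern described above, with naive truncation when an assumed token budget is exceeded; the word-count stand-in for tokens and the budget are both invented for the example:

```python
# Maintain a list of chat turns and drop the oldest ones when the budget is exceeded.
MAX_TOKENS = 200

history = []   # list of {"role": ..., "content": ...} messages

def add_turn(role: str, content: str) -> None:
    history.append({"role": role, "content": content})
    # Rough token estimate: count whitespace-separated words (a real system would
    # use the model's tokenizer).
    while sum(len(m["content"].split()) for m in history) > MAX_TOKENS:
        history.pop(0)   # long-distance context is lost right here

add_turn("user", "Can you open the window?")
add_turn("assistant", "Sure, opening it now.")
print(history)
```

Anything removed by `pop(0)` is gone for good, which is exactly where long-distance logical relationships disappear.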
For the further development of pragmatics, we pin our hopes on emergence in large language models. Emergence is the phenomenon in a complex system where interactions among simple elements produce more advanced, more complex behavior at the level of the whole. How did free amino acids become proteins and eventually organic molecules capable of replicating themselves? The intelligence of a swarm of bees far exceeds that of a single bee. Billions of connected human neurons give rise to complex consciousness and thought. The behavior of the whole cannot be obtained by simply adding up the individual elements; yet given a huge number of repeated attempts, events of extremely low probability eventually occur. This is a research hotspot of complex systems science and nonlinear dynamics.
Massive parameters and training data give LLMs nonlinear, complex abilities in language understanding and generation: complex reasoning, analogy, dialogue logic, commonsense reasoning, and other capabilities are emerging. However, we still know very little about the internal nature of large language models, and our ability to explain them is very limited, resting largely on speculation. We hope a new generation of NLP scholars can make further breakthroughs.
Originally published at http://rayhu007.wordpress.com on January 1, 2025.