CanLII

Article snapshot taken from Wikipedia, available under the Creative Commons Attribution-ShareAlike license.

The Canadian Legal Information Institute (CanLII; French: Institut canadien d'information juridique) is a non-profit organization created and funded by the Federation of Law Societies of Canada in 2001 on behalf of its 14 member societies. CanLII is a member of the Free Access to Law Movement, which includes the primary stakeholders involved in free, open publication of law throughout the world.

CanLII offers free public access to over 2.4 million documents across more than 300 case law and legislative databases. The official websites of provincial governments, which provide access to primary legislative documents, are linked to CanLII online. The CanLII database is one of the most comprehensive collections of Canadian federal, provincial and territorial legislation. It is used by lawyers, legal professionals and

A self-supervised and semi-supervised training process. The largest and most capable LLMs are artificial neural networks built with a decoder-only transformer-based architecture, enabling efficient processing and generation of large-scale text data. Modern models can be fine-tuned for specific tasks, or be guided by prompt engineering. These models acquire predictive power regarding syntax, semantics, and ontologies inherent in human language corpora, but they also inherit inaccuracies and biases present in

A 12-billion-parameter LLM computational cost is 72,300 A100-GPU-hours, while in 2020 the cost of training a 1.5-billion-parameter LLM (which was two orders of magnitude smaller than the state of the art in 2020) was between $80,000 and $1,600,000. Since 2020, large sums were invested in increasingly large models. For example, training of GPT-2 (i.e. a 1.5-billion-parameter model) in 2019 cost $50,000, while training of

A commentary program including law reviews, e-books, articles, public legal education materials, and reports. In June 2020, CanLII started actively promoting the CanLII guest writer program. As of February 2024, CanLII is piloting the use of a large language model to generate artificial-intelligence case summaries. Other websites will often use CanLII as their primary source when referring to Canadian case law, and as of

A few cases. For example, in the instruction "Write an essay about the main themes represented in Hamlet," an initial naive completion might be "If you submit the essay after March 17, your grade will be reduced by 10% for each day of delay," based on the frequency of this textual sequence in the corpus. The largest LLMs may be too expensive to train and use directly. For such models, mixture of experts (MoE) can be applied,

A further LLM. With the increasing proportion of LLM-generated content on the web, data cleaning in the future may include filtering out such content. LLM-generated content can pose a problem if the content is similar to human text (making filtering difficult) but of lower quality (degrading performance of models trained on it). Training of the largest language models might need more linguistic data than is naturally available, or that

A high computational cost. BERT was originally published by Google researchers Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. The design has its origins in pre-training contextual representations, including semi-supervised sequence learning, generative pre-training, ELMo, and ULMFiT. Unlike previous models, BERT is a deeply bidirectional, unsupervised language representation, pre-trained using only

A line of research pursued by Google researchers since 2017 to train models reaching up to 1 trillion parameters. Most results previously achievable only by (costly) fine-tuning can be achieved through prompt engineering, although limited to the scope of a single conversation (more precisely, limited to the scope of a context window). In order to find out which tokens are relevant to each other within

A long-term memory of its previous contexts, and the memory can be retrieved in the same way as retrieval-augmented generation. Multiple such agents can interact socially. Typically, LLMs are trained with single- or half-precision floating point numbers (float32 and float16). One float16 has 16 bits, or 2 bytes, and so one billion parameters require 2 gigabytes. The largest models typically have 100 billion parameters, requiring 200 gigabytes to load, which places them outside
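
As a quick back-of-the-envelope check of these figures, the sketch below recomputes the memory needed just to hold the weights at a few precisions (parameter counts are illustrative):

```python
# Rough memory footprint of model weights at different numeric precisions.
# Parameter counts here are illustrative, not tied to any specific model.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Return approximate gigabytes needed just to hold the weights."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for n in (1e9, 100e9):  # 1 billion and 100 billion parameters
    for dtype in ("float32", "float16", "int8"):
        print(f"{n:.0e} params @ {dtype}: {weight_memory_gb(n, dtype):,.0f} GB")
# 1e9 params at float16 -> 2 GB; 100e9 params at float16 -> 200 GB,
# matching the figures in the text.
```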

A matter of experimentation and domain-specific considerations. Given a segment from its training dataset, a model may be pre-trained either to predict how the segment continues, or to predict what is missing in the segment. It can be either autoregressive (predicting the continuation) or masked (filling in the missing tokens). Models may also be trained on auxiliary tasks which test their understanding of the data distribution, such as Next Sentence Prediction (NSP), in which pairs of sentences are presented and

A model with just 60% of its parameters (66M), while preserving 95% of its benchmark scores. Similarly, TinyBERT (2019) is a distilled model with just 28% of its parameters. ALBERT (2019) shared parameters across layers, and experimented with independently varying the hidden size and the word-embedding layer's output size as two hyperparameters. It also replaced the next sentence prediction task with

A pair of pretrained language model and image encoder to perform better on visual question answering than models trained from scratch. The Google PaLM model was fine-tuned into the multimodal model PaLM-E using this tokenization method, and applied to robotic control. LLaMA models have also been made multimodal using the tokenization method, allowing image and video inputs. GPT-4 can use both text and image as inputs (although

A plain text corpus. Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary, whereas BERT takes into account the context for each occurrence of a given word. For instance, whereas the vector for "running" will have the same word2vec vector representation for both of its occurrences in the sentences "He is running a company" and "He

A portmanteau of "Reason + Act", constructs an agent out of an LLM, using the LLM as a planner. The LLM is prompted to "think out loud". Specifically, the language model is prompted with a textual description of the environment, a goal, a list of possible actions, and a record of the actions and observations so far. It generates one or more thoughts before generating an action, which is then executed in
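
A minimal sketch of such a ReAct-style loop is below; call_llm, execute_action, and the prompt wording are hypothetical stand-ins, not any particular framework's API:

```python
# Sketch of a ReAct-style agent loop: prompt with environment, goal, actions,
# and the history so far; parse a thought and an action; execute; repeat.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an actual model call here")

def execute_action(action: str) -> str:
    raise NotImplementedError("plug in the environment here")

def parse_thought_and_action(reply: str) -> tuple[str, str]:
    """Pull the 'Thought:' and 'Action:' lines out of the model's reply."""
    thought = action = ""
    for line in reply.splitlines():
        if line.startswith("Thought:"):
            thought = line[len("Thought:"):].strip()
        elif line.startswith("Action:"):
            action = line[len("Action:"):].strip()
    return thought, action

def react_agent(environment: str, goal: str, actions: list[str], max_steps: int = 5):
    history = []  # record of thoughts, actions, and observations so far
    for _ in range(max_steps):
        prompt = (
            f"Environment: {environment}\n"
            f"Goal: {goal}\n"
            f"Possible actions: {', '.join(actions)}\n"
            f"History so far: {history}\n"
            "Think step by step, then output one line starting with 'Thought:' "
            "and one line starting with 'Action:'."
        )
        thought, action = parse_thought_and_action(call_llm(prompt))
        observation = execute_action(action)
        history.append({"thought": thought, "action": action, "observation": observation})
    return history
```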

A result of this training process, BERT learns contextual, latent representations of tokens in their context, similar to ELMo and GPT-2. It found applications for many natural language processing tasks, such as coreference resolution and polysemy resolution. It is an evolutionary step over ELMo, and spawned the study of "BERTology", which attempts to interpret what is learned by BERT. BERT

A sequence of vectors using self-supervised learning. It uses the encoder-only transformer architecture. It is notable for its dramatic improvement over previous state-of-the-art models, and as an early example of a large language model. As of 2020, BERT is a ubiquitous baseline in natural language processing (NLP) experiments. BERT is trained by masked token prediction and next sentence prediction. As

A single input vector (x_input = x_position + x_token), DeBERTa keeps them separate as a tuple ((x_position, x_token)). Then, at each self-attention layer, DeBERTa computes three distinct attention matrices, rather than
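
A minimal NumPy sketch of the disentangled-attention idea follows; the shapes, weight matrices, and scaling are illustrative rather than the official DeBERTa implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 64                        # sequence length and head dimension (illustrative)
content = rng.normal(size=(n, d))   # token-content representations
position = rng.normal(size=(n, d))  # (relative-)position representations, kept separate

# Separate query/key projections for content and position (hypothetical weights).
Wq_c, Wk_c = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Wq_p, Wk_p = rng.normal(size=(d, d)), rng.normal(size=(d, d))

Qc, Kc = content @ Wq_c, content @ Wk_c
Qp, Kp = position @ Wq_p, position @ Wk_p

# Three distinct attention components instead of one:
# content-to-content, content-to-position, and position-to-content.
scores = (Qc @ Kc.T + Qc @ Kp.T + Qp @ Kc.T) / np.sqrt(3 * d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)            # softmax per query token
output = weights @ (content @ rng.normal(size=(d, d)))    # value projection of content
print(output.shape)  # (8, 64)
```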

A small amount of finetuning (for BERT LARGE, 1 hour on 1 Cloud TPU) allowed it to achieve state-of-the-art performance on a number of natural language understanding tasks. In the original paper, all parameters of BERT are finetuned, and the paper recommended that, for downstream applications that perform text classification, the output at the [CLS] input token be fed into a linear-softmax layer to produce
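
A sketch of that linear-softmax classification head in isolation is shown below (NumPy; the encoder output is a random placeholder and the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_classes, seq_len = 768, 3, 128  # BERT BASE hidden size; 3-way classification as an example

# Placeholder for the encoder output; in practice this comes from the finetuned BERT stack.
encoder_output = rng.normal(size=(seq_len, hidden_size))
cls_vector = encoder_output[0]                   # output at the [CLS] position

# Newly initialized linear-softmax head, trained during finetuning.
W = rng.normal(scale=0.02, size=(hidden_size, num_classes))
b = np.zeros(num_classes)

logits = cls_vector @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # softmax over class labels
print(probs)
```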

A visual guide. While quantized models are typically frozen, and only pre-quantized models are fine-tuned, quantized models can still be fine-tuned. Multimodality means "having several modalities", and a "modality" refers to a type of input or output, such as video, image, audio, text, proprioception, etc. There have been many AI models trained specifically to ingest one modality and output another modality, such as AlexNet for image to label, visual question answering for image-text to text, and speech recognition for speech to text. A common method to create multimodal models out of an LLM

Is an "encoder-only" transformer architecture. At a high level, BERT consists of four modules: a tokenizer, an embedding layer, a stack of Transformer encoder blocks, and a task head. The task head is necessary for pre-training, but it is often unnecessary for so-called "downstream tasks", such as question answering or sentiment classification. Instead, one removes the task head and replaces it with a newly initialized module suited for the task, and finetunes the new module. The latent vector representation of

Is available only via API, with no option to download the model for local execution. But it was the 2022 consumer-facing, browser-based ChatGPT that captured the imagination of the general public and caused considerable media hype and online buzz. The 2023 GPT-4 was praised for its increased accuracy and as a "holy grail" for its multimodal capabilities. OpenAI did not reveal the high-level architecture and

Is finite, then fine-tuning may be done just once. If the number of tools can grow arbitrarily, as with online API services, then the LLM can be fine-tuned to be able to read API documentation and call APIs correctly. A simpler form of tool use is retrieval-augmented generation: the augmentation of an LLM with document retrieval. Given a query, a document retriever is called to retrieve the most relevant documents. This

Is later found that more diverse training objectives are generally better. As an illustrative example, consider the sentence "my dog is cute". It would first be divided into tokens like "my₁ dog₂ is₃ cute₄". Then a random token in the sentence would be picked. Let it be the 4th one, "cute₄". Next, there would be three possibilities: the selected token could be replaced with the [MASK] token ("my dog is [MASK]"), replaced with a random token ("my dog is happy"), or left unchanged ("my dog is cute"). After processing the input text, the model's 4th output vector
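
Below is a minimal sketch of this selection-and-corruption step, using the standard BERT recipe (80% [MASK], 10% random token, 10% unchanged); the vocabulary and sentence are toy examples:

```python
import random

random.seed(0)
tokens = ["my", "dog", "is", "cute"]
vocab = ["my", "dog", "is", "cute", "happy"]

def corrupt(tokens, index):
    """Apply BERT's standard corruption rule to the token selected for prediction:
    80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    out = list(tokens)
    r = random.random()
    if r < 0.8:
        out[index] = "[MASK]"
    elif r < 0.9:
        out[index] = random.choice(vocab)
    # else: leave the token unchanged
    return out, tokens[index]   # corrupted input and the prediction target

# In training, roughly 15% of positions are selected; here we pick the 4th token
# ("cute") to mirror the running example in the text.
masked_input, target = corrupt(tokens, index=3)
print(masked_input, "-> predict:", target)
```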

Is longer than its context window, only the parts inside the context window are taken into account when generating the next answer, or the model needs to apply some algorithm to summarize the parts of the conversation that are too distant. The shortcomings of making a context window larger include higher computational cost and possibly diluting the focus on local context, while making it smaller can cause a model to miss an important long-range dependency. Balancing them is

Is not jagged, the shorter texts must be "padded" until they match the length of the longest one. How many tokens are, on average, needed per word depends on the language of the dataset. As an example, consider a tokenizer based on byte-pair encoding. In the first step, all unique characters (including blanks and punctuation marks) are treated as an initial set of n-grams (i.e. an initial set of uni-grams). Successively,

Is passed to its decoder layer, which outputs a probability distribution over its 30,000-dimensional vocabulary space. Given two spans of text, the model predicts if these two spans appeared sequentially in the training corpus, outputting either [IsNext] or [NotNext]. The first span starts with a special token [CLS] (for "classify"). The two spans are separated by a special token [SEP] (for "separate"). After processing

Is running a marathon", BERT will provide a contextualized embedding that will be different according to the sentence. On October 25, 2019, Google announced that they had started applying BERT models for English-language search queries within the US. On December 9, 2019, it was reported that BERT had been adopted by Google Search for over 70 languages. In October 2020, almost every single English-based query

Is the embedding layer, which contains three components: token embeddings, position embeddings, and segment type embeddings. The three embedding vectors are added together to form the initial token representation as a function of these three pieces of information. After embedding, the vector representation is normalized using a LayerNorm operation, outputting a 768-dimensional vector for each input token. After this,
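
A minimal sketch of such an embedding layer (sum of the three lookups followed by LayerNorm) is given below; the lookup tables are randomly initialized here and the token ids are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_positions, num_segments, hidden = 30000, 512, 2, 768

# Three lookup tables (randomly initialized here; learned in practice).
token_emb = rng.normal(scale=0.02, size=(vocab_size, hidden))
position_emb = rng.normal(scale=0.02, size=(max_positions, hidden))
segment_emb = rng.normal(scale=0.02, size=(num_segments, hidden))

def embed(token_ids, segment_ids):
    """Sum the three embeddings, then apply LayerNorm, giving one 768-d vector per token."""
    positions = np.arange(len(token_ids))
    x = token_emb[token_ids] + position_emb[positions] + segment_emb[segment_ids]
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + 1e-12)   # LayerNorm without learned scale/shift, for brevity

vectors = embed(token_ids=[101, 2026, 3899, 102], segment_ids=[0, 0, 0, 0])
print(vectors.shape)   # (4, 768)
```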

Is to "tokenize" the output of a trained encoder. Concretely, one can construct an LLM that can understand images as follows: take a trained LLM, and take a trained image encoder E. Make a small multilayer perceptron f, so that for any image y, the post-processed vector f(E(y)) has

Is usually done by encoding the query and the documents into vectors, then finding the documents with vectors (usually stored in a vector database) most similar to the vector of the query. The LLM then generates an output based on both the query and the context included from the retrieved documents. An LLM is typically not an autonomous agent by itself, as it lacks the ability to interact with dynamic environments, recall past behaviors, and plan future actions, but can be transformed into one by integrating modules like profiling, memory, planning, and action. The ReAct pattern,
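
A minimal sketch of this retrieve-then-generate flow is below; embed and call_llm are hypothetical stand-ins for a real embedding model and LLM, and cosine similarity stands in for the vector database lookup:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical text-embedding call; replace with a real embedding model."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real model."""
    raise NotImplementedError

def retrieve(query: str, documents: list[str], k: int = 3) -> list[str]:
    """Return the k documents whose vectors are most similar (cosine) to the query vector."""
    q = embed(query)
    doc_vectors = [embed(d) for d in documents]   # usually precomputed and kept in a vector database
    scores = [float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q))) for v in doc_vectors]
    ranked = sorted(zip(scores, documents), reverse=True)
    return [d for _, d in ranked[:k]]

def rag_answer(query: str, documents: list[str]) -> str:
    context = "\n\n".join(retrieve(query, documents))
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)
```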

The Shan language from Myanmar. Even more widespread languages such as Portuguese and German have "a premium of 50%" compared to English. Greedy tokenization also causes subtle problems with text completion. In the context of training LLMs, datasets are typically cleaned by removing toxic passages from the dataset, discarding low-quality data, and de-duplication. Cleaned datasets can increase training efficiency and lead to improved downstream performance. A trained LLM can be used to clean datasets for training

The data on which they are trained. Before 2017, there were a few language models that were large compared to the capacities then available. In the 1990s, the IBM alignment models pioneered statistical language modelling. A smoothed n-gram model in 2001, trained on 0.3 billion words, achieved state-of-the-art perplexity at the time. In the 2000s, as Internet use became prevalent, some researchers constructed Internet-scale language datasets ("web as corpus"), upon which they trained statistical language models. By 2009, statistical language models dominated over symbolic language models in most language processing tasks, as they can usefully ingest large datasets. After neural networks became dominant in image processing around 2012, they were applied to language modelling as well. Google converted its translation service to Neural Machine Translation in 2016. As it

The sentence-order prediction (SOP) task, where the model must distinguish the correct order of two consecutive text segments from their reversed order. ELECTRA (2020) applied the idea of generative adversarial networks to the MLM task. Instead of masking out tokens, a small language model generates random plausible substitutions, and a larger network identifies these replaced tokens. The small model aims to fool

The 10th Edition of the Canadian Guide to Uniform Legal Citation, is the designated preferred citation in the absence of official court-issued neutral citations.

Large language model

A large language model (LLM) is a type of computational model designed for natural language processing tasks such as language generation. As language models, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during

The Llama 3 70-billion-parameter model is the most powerful open LLM according to the LMSYS Chatbot Arena Leaderboard, being more powerful than GPT-3.5 but not as powerful as GPT-4. As of 2024, the largest and most capable models are all based on the Transformer architecture. Some recent implementations are based on other architectures, such as recurrent neural network variants and Mamba (a state space model). Because machine learning algorithms process numbers rather than text,

The PaLM (i.e. a 540-billion-parameter model) in 2022 cost $8 million, and Megatron-Turing NLG 530B (in 2021) cost around $11 million. For Transformer-based LLMs, training cost is much higher than inference cost. It costs 6 FLOPs per parameter to train on one token, whereas it costs 1 to 2 FLOPs per parameter to infer on one token. There are certain tasks that, in principle, cannot be solved by any LLM, at least not without
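
Applying those rules of thumb, a rough cost estimate looks like this (model and token counts are illustrative, not taken from the text):

```python
# Rough cost estimate from the 6-FLOPs-per-parameter-per-token training rule
# and the 1-2 FLOPs-per-parameter-per-token inference rule quoted above.
n_params = 1.5e9          # a GPT-2-sized model (illustrative)
n_train_tokens = 40e9     # illustrative training-set size
n_inference_tokens = 1e3  # one longish generated reply

train_flops = 6 * n_params * n_train_tokens
infer_flops = 2 * n_params * n_inference_tokens   # upper end of the 1-2 FLOPs range

print(f"training:  ~{train_flops:.1e} FLOPs")
print(f"inference: ~{infer_flops:.1e} FLOPs per 1k generated tokens")
```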

The context. For example, the word fine can have two different meanings depending on the context (I feel fine today, She has fine blond hair). BERT considers the words surrounding the target word fine from both the left and the right side. However, this comes at a cost: because the encoder-only architecture lacks a decoder, BERT can't be prompted and can't generate text, while bidirectional models in general do not work effectively without

The end of each episode, the LLM is given the record of the episode, and prompted to think up "lessons learned", which would help it perform better in a subsequent episode. These "lessons learned" are given to the agent in subsequent episodes. Monte Carlo tree search can use an LLM as a rollout heuristic. When a programmatic world model is not available, an LLM can also be prompted with a description of

The environment to act as a world model. For open-ended exploration, an LLM can be used to score observations for their "interestingness", which can be used as a reward signal to guide a normal (non-LLM) reinforcement learning agent. Alternatively, it can propose increasingly difficult tasks for curriculum learning. Instead of outputting individual actions, an LLM planner can also construct "skills", or functions for complex action sequences. The skills can be stored and later invoked, allowing increasing levels of abstraction in planning. LLM-powered agents can keep

The environment. The linguistic description of the environment given to the LLM planner can even be the LaTeX code of a paper describing the environment. In the DEPS ("Describe, Explain, Plan and Select") method, an LLM is first connected to the visual world via image descriptions; it is then prompted to produce plans for complex tasks and behaviors based on its pretrained knowledge and the environmental feedback it receives. The Reflexion method constructs an agent that learns over multiple episodes. At

The feed-forward/filter size is always 4H. By varying these two numbers, one obtains an entire family of BERT models. For BERT, the notation for the encoder stack is written as L/H. For example, BERT BASE is written as 12L/768H, BERT LARGE as 24L/1024H, and BERT TINY as 2L/128H. BERT was pre-trained simultaneously on two tasks. In masked language modeling, 15% of tokens would be randomly selected for the masked-prediction task, and

The general public, with usage averaging over 30,000 visits per day. The case law database is reportedly growing at a rate of approximately 120,000 new cases each year, 20% of which are historic cases included to enrich existing databases. In April 2014, CanLII launched CanLII Connects, a legal community-sourced publication and discussion platform for case law summaries and commentaries. In March 2018, CanLII launched

The initial set of uni-grams. A token vocabulary based on the frequencies extracted from mainly English corpora uses as few tokens as possible for an average English word. An average word in another language encoded by such an English-optimized tokenizer is, however, split into a suboptimal number of tokens. The GPT-2 tokenizer can use up to 15 times more tokens per word for some languages, for example for

The label outputs. The original code base defined the final linear layer as a "pooler layer", in analogy with global pooling in computer vision, even though it simply discards all output tokens except the one corresponding to [CLS]. BERT was trained on the BookCorpus (800M words) and a filtered version of English Wikipedia (2,500M words) without lists, tables, and headers. Training BERT BASE on 4 cloud TPUs (16 TPU chips total) took 4 days, at an estimated cost of 500 USD. Training BERT LARGE on 16 cloud TPUs (64 TPU chips total) took 4 days. Language models like ELMo, GPT-2, and BERT spawned

The large model. DeBERTa (2020) is a significant architectural variant, with disentangled attention. Its key idea is to treat the positional and token encodings separately throughout the attention mechanism. Instead of combining the positional encoding (x_position) and token encoding (x_token) into

The model is directly fed into this new module, allowing for sample-efficient transfer learning. This section describes the embedding used by BERT BASE. The other one, BERT LARGE, is similar, just larger. The tokenizer of BERT is WordPiece, which is a sub-word strategy like byte-pair encoding. Its vocabulary size is 30,000, and any token not appearing in its vocabulary is replaced by [UNK] ("unknown"). The first layer

The model must predict whether they appear consecutively in the training corpus. During training, a regularization loss is also used to stabilize training. However, the regularization loss is usually not applied during testing and evaluation. Substantial infrastructure is necessary for training the largest models. Advances in software and hardware have reduced the cost substantially since 2020, such that in 2023 training of

The most frequent pair of adjacent characters is merged into a bi-gram and all instances of the pair are replaced by it. All occurrences of adjacent pairs of (previously merged) n-grams that most frequently occur together are then again merged into even lengthier n-grams, until a vocabulary of prescribed size is obtained (in the case of GPT-3, the size is 50257). After a tokenizer is trained, any text can be tokenized by it, as long as it does not contain characters not appearing in
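
A toy byte-pair-encoding trainer along these lines might look as follows (a sketch on a tiny string, not a production tokenizer):

```python
from collections import Counter

def train_bpe(text: str, vocab_size: int):
    """Tiny byte-pair-encoding trainer: repeatedly merge the most frequent adjacent pair."""
    tokens = list(text)                           # initial set of uni-grams: the characters
    vocab = set(tokens)
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter(zip(tokens, tokens[1:]))  # count adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]       # most frequent adjacent pair
        merged = a + b
        merges.append((a, b))
        vocab.add(merged)
        # replace every occurrence of the pair with the merged n-gram
        new_tokens, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                new_tokens.append(merged)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens
    return merges, vocab

merges, vocab = train_bpe("low lower lowest low low", vocab_size=20)
print(merges[:5])
```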

The naturally occurring data is of insufficient quality. In these cases, synthetic data might be used. Microsoft's Phi series of LLMs is trained on textbook-like data generated by another LLM. Reinforcement learning from human feedback (RLHF) through algorithms, such as proximal policy optimization, is used to further fine-tune a model based on a dataset of human preferences. Using "self-instruct" approaches, LLMs have been able to bootstrap correct responses, replacing any naive responses, starting from human-generated corrections of

The number of parameters of GPT-4. Competing language models have for the most part been attempting to equal the GPT series, at least in terms of number of parameters. Since 2022, source-available models have been gaining popularity, especially at first with BLOOM and LLaMA, though both have restrictions on the field of use. Mistral AI's models Mistral 7B and Mixtral 8x7B have the more permissive Apache License. As of June 2024, the instruction fine-tuned variant of

The number of input tokens, and that the maximum number of output tokens differs from the input and is often smaller. For example, the GPT-4 Turbo model has a maximum output of 4096 tokens. The length of a conversation that the model can take into account when generating its next answer is limited by the size of the context window as well. If the length of a conversation, for example with ChatGPT,

The range of most consumer electronics. Post-training quantization aims to decrease the space requirement by lowering the precision of the parameters of a trained model, while preserving most of its performance. The simplest form of quantization simply truncates all numbers to a given number of bits. It can be improved by using a different quantization codebook per layer. Further improvement can be done by applying different precisions to different parameters, with higher precision for particularly important parameters ("outlier weights"). See for
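
A sketch of the simplest variant, symmetric per-tensor int8 quantization with a single scale factor, is shown below; real schemes add per-layer or per-channel codebooks and special handling of outlier weights:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: store int8 values plus one scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
print("max absolute error:", np.abs(w - dequantize(q, scale)).max())
# A per-layer (or per-channel) scale corresponds to the "codebook" refinement above;
# keeping outlier weights in higher precision refines this further.
```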

The relationships represented by attention weights. The high performance of the BERT model could also be attributed to the fact that it is bidirectionally trained. This means that BERT, based on the Transformer model architecture, applies its self-attention mechanism to learn information from a text from the left and right side during training, and consequently gains a deep understanding of

The representation vectors are passed forward through 12 Transformer encoder blocks, and are decoded back to 30,000-dimensional vocabulary space using a basic affine transformation layer. The encoder stack of BERT has two free parameters: L, the number of layers, and H, the hidden size. There are always H/64 self-attention heads, and
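
Putting the L/H notation together with the H/64 heads rule and the 4H feed-forward size mentioned above, the derived sizes can be tabulated as follows:

```python
# Derived sizes for the BERT family, using the L/H notation from the text:
# H/64 self-attention heads and a feed-forward size of 4H.
configs = {"BERT TINY": (2, 128), "BERT BASE": (12, 768), "BERT LARGE": (24, 1024)}

for name, (layers, hidden) in configs.items():
    heads = hidden // 64
    ffn = 4 * hidden
    print(f"{name}: {layers}L/{hidden}H -> {heads} heads, feed-forward size {ffn}")
# e.g. BERT BASE: 12L/768H -> 12 heads, feed-forward size 3072
```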

The right side, thus being difficult to prompt. As an illustrative example, if one wishes to use BERT to continue a sentence fragment "Today, I went to", then naively one would mask out all the tokens as "Today, I went to [MASK] [MASK] [MASK] ... [MASK]." where the number of [MASK] is the length of the sentence one wishes to extend to. However, this constitutes a dataset shift, as during training, BERT has never seen sentences with that many tokens masked out. Consequently, its performance degrades. More sophisticated techniques allow text generation, but at

The same dimensions as an encoded token. That is an "image token". Then, one can interleave text tokens and image tokens. The compound model is then fine-tuned on an image-text dataset. This basic construction can be applied with more sophistication to improve the model. The image encoder may be frozen to improve stability. Flamingo demonstrated the effectiveness of the tokenization method, finetuning
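
A minimal sketch of this image-token construction is below; the encoder is a random stand-in and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
image_feature_dim, llm_embedding_dim = 1024, 4096   # illustrative sizes

def image_encoder(image: np.ndarray) -> np.ndarray:
    """Stand-in for a trained image encoder E; returns one feature vector per image patch."""
    num_patches = 16
    return rng.normal(size=(num_patches, image_feature_dim))

# Small multilayer perceptron f mapping encoder features into the LLM's token-embedding space.
W1 = rng.normal(scale=0.02, size=(image_feature_dim, llm_embedding_dim))
W2 = rng.normal(scale=0.02, size=(llm_embedding_dim, llm_embedding_dim))

def image_to_tokens(image: np.ndarray) -> np.ndarray:
    features = image_encoder(image)                  # E(y)
    hidden = np.maximum(features @ W1, 0.0)          # ReLU
    return hidden @ W2                               # f(E(y)): one "image token" embedding per patch

image_tokens = image_to_tokens(np.zeros((224, 224, 3)))
print(image_tokens.shape)   # (16, 4096) -- ready to interleave with text-token embeddings
```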

The scope of the context window, the attention mechanism calculates "soft" weights for each token, more precisely for its embedding, by using multiple attention heads, each with its own "relevance" for calculating its own soft weights. For example, the small (i.e. 117M-parameter) GPT-2 model had twelve attention heads and a context window of only 1k tokens. In its medium version, it has 345M parameters and contains 24 layers, each with 12 attention heads. For
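
For reference, a bare-bones multi-head attention computation (NumPy, no masking or learned output projection) that produces such per-head soft weights might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 16, 768, 12
d_head = d_model // n_heads                  # 64 per head

x = rng.normal(size=(seq_len, d_model))      # token embeddings inside the context window
Wq, Wk, Wv = (rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(3))

def split_heads(t):
    return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)   # (heads, seq, d_head)

Q, K, V = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)

# Each head computes its own "soft" relevance weights between every pair of tokens.
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)           # (heads, seq, seq)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)                # softmax per query token
context = weights @ V                                         # weighted sum of values
print(context.shape)   # (12, 16, 64): one context vector per head per token
```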

The study of "BERTology", which attempts to interpret what is learned by these models. Their performance on these natural language understanding tasks is not yet well understood. Several research publications in 2018 and 2019 focused on investigating the relationship between BERT's output and carefully chosen input sequences, analysis of internal vector representations through probing classifiers, and

The text must be converted to numbers. In the first step, a vocabulary is decided upon, then integer indices are arbitrarily but uniquely assigned to each vocabulary entry, and finally, an embedding is associated with each integer index. Algorithms include byte-pair encoding (BPE) and WordPiece. There are also special tokens serving as control characters, such as [MASK] for a masked-out token (as used in BERT), and [UNK] ("unknown") for characters not appearing in

The time now? It is ", where a separate program interpreter would need to execute code to get the system time on the computer, so that the LLM can include it in its reply. This basic strategy can be made more sophisticated with multiple attempts at generated programs, and other sampling strategies. Generally, in order to get an LLM to use tools, one must fine-tune it for tool use. If the number of tools
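
A minimal sketch of such a tool-use loop is below; call_llm and the CALC(...) convention are hypothetical, and a restricted AST evaluator stands in for the separate program interpreter:

```python
import ast
import operator as op

def call_llm(prompt: str) -> str:
    """Hypothetical model call; imagine it returns e.g. 'CALC(354 * 139)'."""
    raise NotImplementedError

# Tiny, restricted arithmetic evaluator standing in for the "separate program interpreter".
_OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def safe_eval(expr: str) -> float:
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)

def answer_with_tools(user_input: str) -> str:
    reply = call_llm(user_input)
    if reply.startswith("CALC(") and reply.endswith(")"):
        result = safe_eval(reply[len("CALC("):-1])
        # Feed the tool result back so the model can phrase the final answer.
        return call_llm(f"{user_input}\nTool result: {result}\nAnswer:")
    return reply
```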

The training objective was to predict the masked token given its context. In more detail, the selected token is replaced with a [MASK] token with 80% probability, replaced with a random token with 10% probability, or left unchanged with 10% probability. The reason not all selected tokens are masked is to avoid the dataset shift problem. The dataset shift problem arises when the distribution of inputs seen during training differs significantly from the distribution encountered during inference. A trained BERT model might be applied to word representation (like Word2Vec), where it would be run over sentences not containing any [MASK] tokens. It

The training with gradient descent, a batch size of 512 was utilized. The largest models, such as Google's Gemini 1.5, presented in February 2024, can have a context window of up to 1 million tokens (a context window of 10 million was also "successfully tested"). Other models with large context windows include Anthropic's Claude 2.1, with a context window of up to 200k tokens. Note that this maximum refers to

The two spans, the first output vector (the vector coding for [CLS]) is passed to a separate neural network for the binary classification into [IsNext] and [NotNext]. BERT is meant as a general pretrained model for various applications in natural language processing. That is, after pre-training, BERT can be fine-tuned with fewer resources on smaller datasets to optimize its performance on specific tasks such as natural language inference and text classification, and sequence-to-sequence-based language generation tasks such as question answering and conversational response generation. The original BERT paper published results demonstrating that

The use of external tools or additional software. An example of such a task is responding to the user's input '354 * 139 = ', provided that the LLM has not already encountered a continuation of this calculation in its training corpus. In such cases, the LLM needs to resort to running program code that calculates the result, which can then be included in its response. Another example is "What is

The vision component was not released to the public until GPT-4V); Google DeepMind's Gemini is also multimodal. Mistral introduced its own multimodal Pixtral 12B model in September 2024. The following four hyper-parameters characterize an LLM:

BERT (language model)

Bidirectional encoder representations from transformers (BERT) is a language model introduced in October 2018 by researchers at Google. It learns to represent text as

The vocabulary. Also, some special symbols are used to denote special text formatting. For example, "Ġ" denotes a preceding whitespace in RoBERTa and GPT, and "##" denotes continuation of a preceding word in BERT. For example, the BPE tokenizer used by GPT-3 (Legacy) would split the text tokenizer: texts -> series of numerical "tokens" into a sequence of such sub-word tokens. Tokenization also compresses the datasets. Because LLMs generally require input to be an array that

Was before transformers, it was done by seq2seq deep LSTM networks. At the 2017 NeurIPS conference, Google researchers introduced the transformer architecture in their landmark paper "Attention Is All You Need". This paper's goal was to improve upon 2014 seq2seq technology, and it was based mainly on the attention mechanism developed by Bahdanau et al. in 2014. The following year, in 2018, BERT

Was introduced and quickly became "ubiquitous". Though the original transformer has both encoder and decoder blocks, BERT is an encoder-only model. Although decoder-only GPT-1 was introduced in 2018, it was GPT-2 in 2019 that caught widespread attention because OpenAI at first deemed it too powerful to release publicly, out of fear of malicious use. GPT-3 in 2020 went a step further and as of 2024

Was originally implemented in the English language at two model sizes, BERT BASE (110 million parameters) and BERT LARGE (340 million parameters). Both were trained on the Toronto BookCorpus (800M words) and English Wikipedia (2,500M words). The weights were released on GitHub. On March 11, 2020, 24 smaller models were released, the smallest being BERT TINY with just 4 million parameters. BERT

Was processed by a BERT model. The BERT models were influential and inspired many variants. RoBERTa (2019) was an engineering improvement. It preserves BERT's architecture (slightly larger, at 355M parameters), but improves its training, changing key hyperparameters, removing the next-sentence prediction task, and using much larger mini-batch sizes. DistilBERT (2019) distills BERT BASE to
