Summary

Top 10 papers analyzed

The most recent state-of-the-art large language model covered here is GPT-3 (Generative Pre-trained Transformer 3), released in June 2020 by OpenAI. GPT-3 is a massive neural network of 175 billion parameters, more than ten times larger than its predecessor GPT-2. It is pre-trained on a large, diverse text corpus and evaluated across many tasks, including language translation, reading comprehension, and question answering, and it has set new benchmarks on many natural language processing (NLP) tasks such as language modeling, text completion, and text classification.

Other pre-trained models are also publicly available, such as BERT (Bidirectional Encoder Representations from Transformers), RoBERTa (A Robustly Optimized BERT Pretraining Approach), and T5 (Text-to-Text Transfer Transformer). These models have been trained on large text corpora and have achieved state-of-the-art results on various NLP benchmarks.

Researchers continue to explore which pre-training design criteria yield better downstream performance on specific tasks, such as biomedical natural language processing (BioNLP). One approach is to pretrain new models on specialized corpora, as demonstrated by the authors of Summary 1, who also showed that their base model can be improved by knowledge distillation from a large model, although a gap between the two remains.

Consensus Meter

Yes - 0%
No - 0%
Non-conclusive - 0%

The linear model performs very poorly, at perplexity 115, even compared to the 67.6 of a Kneser-Ney 5-gram model, even though the former has access to more context. Surprisingly, introducing gated linear units is enough to reach 61 perplexity on Google Billion Word, surpassing both Kneser-Ney 5-gram models and the cited non-linear neural model.
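
The gated linear unit mentioned above multiplies a linear transform of the input by a sigmoid gate. A minimal NumPy sketch (all shapes, names, and weights here are illustrative, not from the paper):

```python
import numpy as np

# Minimal sketch of a gated linear unit: the linear path (x W + b) is
# modulated elementwise by sigmoid gates sigma(x V + c).
def glu(x, W, V, b, c):
    linear = x @ W + b
    gate = 1.0 / (1.0 + np.exp(-(x @ V + c)))   # sigmoid gate in (0, 1)
    return linear * gate

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))                     # batch of 2, 4 features
W, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
b, c = np.zeros(3), np.zeros(3)
out = glu(x, W, V, b, c)                        # shape (2, 3)
```

Because the gate lies strictly between 0 and 1, the output can never exceed the linear path in magnitude; the gates control how much of each linear unit passes through.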

Published By:

YN Dauphin, A Fan, M Auli… - … conference on machine …, 2017 - proceedings.mlr.press

Cited By:

1911

Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words similar to those forming an already seen sentence. The conditional probabilities P(W_t = i | h_t) are computed as follows, denoting by h_t the history before W_t and by L_t the short list of words for the prediction of W_t. If i ∈ L_t, the probability is

P_NN(W_t = i | W_t ∈ L_t, h_t) · P_trigram(W_t ∈ L_t | h_t),

otherwise it is P_trigram(W_t = i | h_t). Here P_NN(W_t = i | W_t ∈ L_t, h_t) are the normalized scores of the words computed by the neural network, where the softmax is normalized only over the words in the short list L_t, and

P_trigram(W_t ∈ L_t | h_t) = Σ_{i ∈ L_t} P_trigram(i | h_t),

with P_trigram(i | h_t) standing for the next-word probabilities computed by the smoothed trigram.
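
The short-list factorization quoted above can be sketched in a few lines of Python; the probability tables and words below are toy values, not from the paper:

```python
# Toy sketch of the short-list mixture: p_nn holds the neural network's
# probabilities, normalized only over the short list L_t; p_trigram
# holds smoothed-trigram probabilities for every word in the vocabulary.
def shortlist_prob(word, short_list, p_nn, p_trigram):
    p_in_list = sum(p_trigram[w] for w in short_list)  # P_trigram(W_t in L_t | h_t)
    if word in short_list:
        return p_nn[word] * p_in_list
    return p_trigram[word]

p_trigram = {"cat": 0.5, "dog": 0.3, "fish": 0.2}
p_nn = {"cat": 0.7, "dog": 0.3}          # normalized over the short list only
short_list = {"cat", "dog"}

p_cat = shortlist_prob("cat", short_list, p_nn, p_trigram)    # 0.7 * 0.8 = 0.56
p_fish = shortlist_prob("fish", short_list, p_nn, p_trigram)  # 0.2
```

Note that the mixture still sums to 1 over the whole vocabulary: the neural model redistributes the trigram's short-list mass, while out-of-list words keep their trigram probability.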

Published By:

Y Bengio, R Ducharme… - Advances in neural …, 2000 - proceedings.neurips.cc

Cited By:

9714

Model: We use a Transformer-based architecture for our LMs. The model largely follows the details of the OpenAI GPT model, with the four sizes of Table 2:

Parameters   Layers   d_model
117M         12       768
345M         24       1024
762M         36       1280
1542M        48       1600

The smallest model is equivalent to the original GPT, and the second smallest is equivalent to the largest model from BERT. Our largest model, which we call GPT-2, has over an order of magnitude more parameters than GPT. The learning rate of each model was manually tuned for the best perplexity on a 5% held-out sample of WebText.
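
Table 2's size/depth/width pairs can be captured as a small config map, together with a rough rule-of-thumb parameter estimate (about 12 · layers · d_model² for the Transformer blocks plus a 50,257-token embedding; an approximation, not the paper's exact accounting):

```python
# The four GPT-2 model sizes from Table 2.
gpt2_configs = {
    "117M":  {"layers": 12, "d_model": 768},
    "345M":  {"layers": 24, "d_model": 1024},
    "762M":  {"layers": 36, "d_model": 1280},
    "1542M": {"layers": 48, "d_model": 1600},
}

VOCAB = 50_257  # GPT-2's BPE vocabulary size

def approx_params(layers, d_model):
    """Rule-of-thumb parameter count for a GPT-style Transformer:
    ~12 d_model^2 per block, plus the token embedding matrix."""
    return 12 * layers * d_model ** 2 + VOCAB * d_model
```

Under this approximation each nominal size in the table is recovered to within roughly 10%, which is why the names track the parameter counts.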

Published By:

A Radford, J Wu, R Child, D Luan, D Amodei… - OpenAI …, 2019 - life-extension.github.io

Cited By:

5492

Index Terms: language modeling, recurrent neural networks, LSTM neural networks

1. Neural network language models

Although there are several differences among the neural network language models that have been successfully applied so far, all of them share some basic principles: the input words are encoded by 1-of-K coding, where K is the number of words in the vocabulary.

Published By:

M Sundermeyer, R Schlüter, H Ney - Thirteenth annual conference …, 2012 - isca-speech.org

Cited By:

2346

We follow a fine-tuning strategy, where we modify the pretrained base model to perform the new task and then train the entire model end-to-end. Self-supervised language models, on the other hand, have resulted in significant improvements over prior work [12-14, 43]. In this work, we develop a model and proxy tasks for learning joint visual-linguistic representations, extending the popular BERT model.
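
The end-to-end fine-tuning recipe, pretrained weights as initialization plus a fresh task head with gradients flowing into both, can be sketched in NumPy. This is a hypothetical two-matrix toy, not the paper's model; all shapes and data are illustrative:

```python
import numpy as np

# Toy fine-tuning: keep "pretrained" encoder weights as the starting
# point, attach a freshly initialized task head, and let gradient
# descent update BOTH matrices end-to-end (no frozen base).
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))                                         # downstream inputs
y = np.tanh(x @ rng.normal(size=(4, 3))) @ rng.normal(size=(3, 1))  # toy targets

W_base = rng.normal(size=(4, 3)) * 0.5     # stands in for pretrained weights
W_head = np.zeros((3, 1))                  # new task head

mse_before = float(np.mean((np.tanh(x @ W_base) @ W_head - y) ** 2))

lr = 0.2
for _ in range(500):
    h = np.tanh(x @ W_base)                # encoder forward
    err = (h @ W_head - y) / len(x)        # scaled residual
    grad_head = h.T @ err
    grad_base = x.T @ ((err @ W_head.T) * (1 - h ** 2))
    W_head -= lr * grad_head               # head learns...
    W_base -= lr * grad_base               # ...and so does the base

mse_after = float(np.mean((np.tanh(x @ W_base) @ W_head - y) ** 2))
```

The point of the sketch is the gradient path: `grad_base` is nonzero, so fine-tuning adapts the pretrained weights rather than treating them as a fixed feature extractor.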

Published By:

J Lu, D Batra, D Parikh, S Lee - Advances in neural …, 2019 - proceedings.neurips.cc

Cited By:

2076

The difference in parameters is greater for non-PTB corpora, as the size of the word model scales faster with |V|. For example, on Arabic the small/large word models have 35M/121M parameters, while the corresponding character models have 29M/69M parameters. While our model requires additional convolution operations over characters and is thus slower than a comparable word-level model, which can perform a simple lookup at the input layer, we found that the difference was manageable with optimized GPU implementations: for example, on PTB the large character-level model trained at 1,500 tokens/sec, compared to 3,000 tokens/sec for the word-level model.
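
The |V|-dependent scaling can be made concrete: a word-level model carries an input embedding and an output softmax that both grow linearly with the vocabulary, which is why the word/character size gap widens on large-vocabulary corpora. The sizes below are toy values, not the paper's exact architecture:

```python
# Vocabulary-dependent parameters of a word-level LM: an input
# embedding (|V| x d_embed) plus an output softmax (d_hidden x |V|).
def vocab_dependent_params(vocab_size, d_embed, d_hidden):
    return vocab_size * d_embed + d_hidden * vocab_size

small_vocab = vocab_dependent_params(10_000, 650, 650)    # 13,000,000
large_vocab = vocab_dependent_params(100_000, 650, 650)   # 130,000,000
```

A 10x larger vocabulary means 10x more parameters in these two matrices alone; a character model's convolution filters are vocabulary-independent.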

Published By:

Y Kim, Y Jernite, D Sontag, A Rush - … of the AAAI conference on artificial …, 2016 - ojs.aaai.org

Cited By:

1932

4.1 Pretraining New Models

In addition to these publicly available models, we also pretrain new models on the corpora in Section 3 and examine which design criteria are important for strong downstream performance on BioNLP tasks. Finally, we demonstrate that our base model can be further improved by knowledge distillation from our large model, although there remains a gap between the distillation-improved base model and our large model.
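
Knowledge distillation as referenced above is commonly implemented as a cross-entropy between temperature-softened teacher and student distributions. A standard Hinton-style sketch follows; the paper's exact objective may differ, and all names and logits are illustrative:

```python
import math

def _softmax(logits, T):
    """Temperature-softened softmax (max-shifted for stability)."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy of the student against the teacher's soft targets."""
    p_teacher = _softmax(teacher_logits, T)
    p_student = _softmax(student_logits, T)
    return -sum(p * math.log(q) for p, q in zip(p_teacher, p_student))
```

The loss is minimized when the student's softened distribution matches the teacher's, which is how the small model absorbs the large model's behavior; in practice it is usually mixed with the ordinary hard-label loss.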

Published By:

P Lewis, M Ott, J Du, V Stoyanov - … 3rd Clinical Natural Language …, 2020 - aclanthology.org

Cited By:

87

3.1 Comparison to off-the-shelf tools

We compare the performance of HunFlair in a cross-corpus setting to five other state-of-the-art biomedical NER tools, using three gold-standard corpora: CRAFT, BioNLP13 Cancer Genetics, and PDR. None of these was used in the training of either HunFlair or any competitor tool, and we checked that there are no significant textual overlaps between these corpora and any of HunFlair's training corpora. 'HunFlair' refers to the HunFlair model without pretraining on gold-standard corpora.

Published By:

L Weber, M Sänger, J Münchmeyer, M Habibi… - …, 2021 - academic.oup.com

Cited By:

56

To study the dependence of ML performance on model size, we train 8 different sizes of model, from 125 million parameters to 175 billion parameters, with the last being the model we call GPT-3. To train the larger models without running out of memory, we use a mixture of model parallelism within each matrix multiply and model parallelism across the layers of the network.
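
Model parallelism within a matrix multiply can be illustrated by sharding the weight matrix column-wise across devices: each shard is computed independently and the results are concatenated, recovering the unsharded product exactly. This is a toy NumPy sketch, not the paper's implementation:

```python
import numpy as np

# Split the weight matrix column-wise across two "devices", compute
# each shard independently, then concatenate. (Parallelism across the
# layers of the network would instead place whole layers on different
# devices.)
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))          # activations (batch of 2)
W = rng.normal(size=(8, 6))          # full weight matrix

W0, W1 = W[:, :3], W[:, 3:]          # column shards for device 0 / device 1
y_sharded = np.concatenate([x @ W0, x @ W1], axis=1)
y_full = x @ W                       # unsharded reference
```

Because each output column depends on the whole input but only on its own weight columns, the shards need no communication until the concatenation step.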

Published By:

T Brown, B Mann, N Ryder… - Advances in neural …, 2020 - proceedings.neurips.cc

Cited By:

8540

A study from the Australian National University found that medical language processing tools developed on generic English do not always translate effectively to a clinical environment. The research aimed to adapt language models to specific clinical contexts to improve the accuracy of human-computer interaction in clinical informatics. The study found two successful methods of adapting language models to extract patient information: using domain-specific word representations, and applying transfer learning to carry knowledge from general to clinical English. However, the current lack of data privacy legislation means that few medical records are available for machine learning, limiting progress in the field. The study showed statistically significant performance differences between the machine learning system developed and its baseline: domain-specific word representations improved macro-averaged F1 by 3.4%, and applying transfer learning models yielded a further 7.1% improvement. The successful methods were trialled on three independent datasets covering a total of 101 patient reports, split into training, validation, and test sets.
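
Macro-averaged F1, the metric reported above, averages per-label F1 scores with equal weight, so rare labels count as much as frequent ones. A minimal sketch with hypothetical labels:

```python
# Macro-averaged F1: compute F1 per label, then take the unweighted mean.
def macro_f1(y_true, y_pred, labels):
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"], ["a", "b"])
```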

Published By:

L Zhou, H Suominen, T Gedeon - JMIR medical informatics, 2019 - medinform.jmir.org

Cited By:

16