Summary
The latest state-of-the-art AI language models are transformer-based architectures that largely follow the details of the OpenAI GPT model, with varying numbers of layers, d_model sizes, and parameter counts. The smallest model is equivalent to the original GPT, while the largest, called GPT-2, has over an order of magnitude more parameters than GPT. The context of the language model can be seeded with example question-answer pairs, and training is performed on a filtered version of the Common Crawl dataset; the sheer size of web data has enabled deep learning models to achieve high accuracy on specific NLP and computer vision benchmarks. There are also multilingual variants of these models, such as mBERT and mT5, as well as models like GPT-3 that are trained with some amount of multilingual data; pretraining of this kind reduces the amount of labeled data needed for various supervised tasks. However, there are risks associated with large LMs: they model their training data very closely and can be prompted to output specific information from that training data. For low-resource languages, it is often beneficial to leverage data in similar but higher-resource languages, especially when they share a significant fraction of their vocabularies. Two methods have been proposed to learn cross-lingual language models: one unsupervised, relying only on monolingual data, and one supervised, leveraging parallel data with a new cross-lingual language model objective. Cross-lingual classification and low-resource language modeling have also been evaluated, with promising results. Overall, the latest AI language models have significantly advanced the field of NLP, enabling strong accuracy and new capabilities in text processing, translation, and generation. However, as with any technology, caution must be exercised to avoid unintended consequences such as biased or harmful output.
We propose two methods to learn cross-lingual language models: one unsupervised, relying only on monolingual data, and one supervised, leveraging parallel data with a new cross-lingual language model objective. Low-resource language modeling: for low-resource languages, it is often beneficial to leverage data in similar but higher-resource languages, especially when they share a significant fraction of their vocabularies. Cross-lingual classification: in Table 1, we evaluate two types of pretrained cross-lingual encoders: an unsupervised cross-lingual language model that uses the MLM objective on monolingual corpora only, and a supervised cross-lingual language model that combines both the MLM and the TLM losses using additional parallel data. Low-resource language model: in Table 4, we investigate the impact of cross-lingual language modeling for improving the perplexity of a Nepali language model.
Published By:
A Conneau, G Lample - Advances in neural information …, 2019 - proceedings.neurips.cc
Cited By:
972
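The TLM loss mentioned in the excerpt above extends masked language modeling to parallel data: a sentence pair is concatenated and tokens are masked in both languages, so the model can attend across languages to fill in the blanks. Below is a minimal, illustrative sketch of that masking step; the whitespace tokenization, mask probability, and separator token are assumptions for illustration, not the authors' exact implementation.

```python
import random

MASK, SEP = "[MASK]", "[SEP]"

def tlm_mask(src_tokens, tgt_tokens, mask_prob=0.15, seed=None):
    """Concatenate a parallel sentence pair and randomly mask tokens in both
    languages, in the spirit of a translation language modeling (TLM) objective.
    Returns the masked input sequence and the targets for masked positions."""
    rng = random.Random(seed)
    tokens = src_tokens + [SEP] + tgt_tokens      # simple separator; real models use special tokens
    inputs, targets = [], []
    for tok in tokens:
        if tok != SEP and rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)                   # the model must predict the original token here
        else:
            inputs.append(tok)
            targets.append(None)                  # no loss at unmasked positions
    return inputs, targets

# Example: an English/French parallel pair
masked, gold = tlm_mask("the cat sleeps".split(), "le chat dort".split(), seed=0)
```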
Model: We use a Transformer-based architecture for our LMs. The model largely follows the details of the OpenAI GPT model. The four model sizes (Table 2) are: 117M parameters (12 layers, dmodel 768), 345M (24 layers, dmodel 1024), 762M (36 layers, dmodel 1280), and 1542M (48 layers, dmodel 1600). The smallest model is equivalent to the original GPT, and the second smallest is equivalent to the largest model from BERT. Our largest model, which we call GPT-2, has over an order of magnitude more parameters than GPT. The learning rate of each model was manually tuned for the best perplexity on a 5% held-out sample of WebText. Similar to translation, the context of the language model is seeded with example question-answer pairs, which helps the model infer the short-answer style of the dataset. Interesting learned functionality in generative models has been documented before, such as the cells in an RNN language model performing line-width tracking and quote/comment detection (Karpathy et al.).
Published By:
A Radford, J Wu, R Child, D Luan, D Amodei… - OpenAI …, 2019 - life-extension.github.io
Cited By:
5444
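Seeding the context with example question-answer pairs, as described in the excerpt above, amounts to building a prompt of QA examples followed by the new question, so the model continues in the same short-answer style. A minimal sketch of that prompt construction follows; the "Q:"/"A:" formatting and the final generate call are illustrative assumptions, not the paper's exact setup.

```python
def build_qa_prompt(examples, question):
    """Seed the language model context with example question-answer pairs
    so it infers the short-answer style when completing the last question."""
    lines = []
    for q, a in examples:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    lines.append(f"Q: {question}")
    lines.append("A:")                       # the model continues from here
    return "\n".join(lines)

examples = [
    ("What is the capital of France?", "Paris"),
    ("Who wrote Hamlet?", "Shakespeare"),
]
prompt = build_qa_prompt(examples, "What is the tallest mountain on Earth?")
# The prompt is then fed to the language model, e.g. model.generate(prompt)  (hypothetical call)
```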
To promote natural language understanding, we propose to incorporate explicit contextual semantics from pre-trained semantic role labeling and introduce an improved language representation model, Semantics-aware BERT (SemBERT), which is capable of explicitly absorbing contextual semantics over a BERT backbone. Our model consists of three components: 1) an off-the-shelf semantic role labeler to annotate the input sentences with a variety of semantic role labels; 2) a sequence encoder in which a pre-trained language model builds representations for the raw input text while the semantic role labels are mapped to embeddings in parallel; and 3) a semantic integration component that fuses the text representation with the contextual explicit semantic embedding to obtain the joint representation for downstream tasks. Explicit contextual semantics: although distributed representations, including the latest pre-trained contextual language models, have already been strengthened by semantics to some extent in a linguistic sense, we argue that such implicit semantics may not be enough to support a powerful contextual representation for NLU, according to our observation of the semantically incomplete answer spans generated by BERT on SQuAD, which motivates us to directly introduce explicit semantics. Since SemBERT absorbs contextual semantics in a deep processing way, we wonder whether a simple and straightforward way of integrating such semantic information may still work; thus we concatenate the SRL embedding with the BERT subword embeddings for a direct comparison, where the semantic role labels are copied to the number of subwords for each original word, without CNN and pooling for word-level alignment.
Published By:
Z Zhang, Y Wu, H Zhao, Z Li, S Zhang, X Zhou… - … on Artificial Intelligence, 2020 - ojs.aaai.org
Cited By:
279
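The "simple and straightforward" baseline in the SemBERT excerpt above concatenates semantic role label embeddings with BERT subword embeddings, with each word's label copied to all of its subwords. A rough PyTorch sketch of that concatenation follows; the tensor shapes, embedding size, and the assumption that subword-aligned label ids are provided are illustrative choices, not the authors' code.

```python
import torch
import torch.nn as nn

class SrlConcatBaseline(nn.Module):
    """Concatenate SRL label embeddings with BERT subword hidden states
    (the simple comparison baseline described in the excerpt, not full SemBERT)."""

    def __init__(self, num_srl_labels, srl_dim=32):
        super().__init__()
        self.srl_embedding = nn.Embedding(num_srl_labels, srl_dim)

    def forward(self, bert_subword_states, srl_label_ids):
        # bert_subword_states: (batch, seq_len, bert_dim) hidden states from BERT
        # srl_label_ids: (batch, seq_len) word-level SRL labels already copied
        #                to each subword of the original word
        srl_vecs = self.srl_embedding(srl_label_ids)               # (batch, seq_len, srl_dim)
        return torch.cat([bert_subword_states, srl_vecs], dim=-1)  # (batch, seq_len, bert_dim + srl_dim)

# Usage: joint = SrlConcatBaseline(num_srl_labels=30)(bert_states, srl_ids)
```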
Microsoft's open-source library DeepSpeed introduces new techniques that improve the scale, speed, cost, and usability of large-model training. The library includes the Zero Redundancy Optimizer (ZeRO), a parallelized optimizer that reduces the memory required for model and data parallelism while greatly increasing the number of parameters that can be trained; this memory optimization paves the way for training models with trillions of parameters. DeepSpeed provides lightweight APIs compatible with PyTorch, letting users apply its training techniques, including optimized kernels, distributed training, mixed precision, and checkpointing, with only a few lines of changes to their PyTorch model. In practice, DeepSpeed has enabled the training of models with over 100 billion parameters, a scale that was previously out of reach, and its techniques open new possibilities for researchers and developers in the field of artificial intelligence.
Published By:
J Rasley, S Rajbhandari, O Ruwase, Y He - Proceedings of the 26th …, 2020 - dl.acm.org
Cited By:
217
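The "few lines of code changes" mentioned above typically amount to wrapping an existing PyTorch model with deepspeed.initialize and driving the forward/backward/step loop through the returned engine. The sketch below shows that pattern, assuming a toy model, an existing data_loader, and a JSON config file named ds_config.json; exact arguments and config options depend on the DeepSpeed version, so treat this as an illustration rather than a definitive recipe.

```python
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)   # stand-in for a real PyTorch model

# deepspeed.initialize wraps the model according to ds_config.json
# (batch size, ZeRO stage, mixed precision, etc. live in that config file).
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

for batch, labels in data_loader:      # data_loader is assumed to exist
    batch = batch.to(model_engine.device)
    labels = labels.to(model_engine.device)
    outputs = model_engine(batch)
    loss = torch.nn.functional.mse_loss(outputs, labels)
    model_engine.backward(loss)        # DeepSpeed handles loss scaling / parallelism here
    model_engine.step()
```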
A number of these models also have multilingual variants, such as mBERT and mT5, or are trained with some amount of multilingual data, such as GPT-3, where 7% of the training data was not in English. While training the word embeddings required a large amount of data, it reduced the amount of labeled data necessary for training on the various supervised tasks. The training set for GPT-3 was a filtered version of the Common Crawl dataset, developed by training a classifier to pick out those documents [...]. Unfathomable training data: the size of data available on the web has enabled deep learning models to achieve high accuracy on specific benchmarks in NLP and computer vision applications. Finally, we note that there are risks associated with the fact that LMs with extremely large numbers of parameters model their training data very closely and can be prompted to output specific information from that training data.
Published By:
EM Bender, T Gebru, A McMillan-Major… - Proceedings of the 2021 …, 2021 - dl.acm.org
Cited By:
1327
Highlights: AI-powered language models are promising in drug discovery and development, and a 'fit-for-purpose' selection is the key to positioning AI-powered language models in drug discovery and development. Keywords: artificial intelligence, language models, natural language processing, drug discovery, drug development, COVID-19. About the authors: Zhichao Liu is a technical leader in the Artificial Intelligence Research Force, Division of Bioinformatics & Biostatistics, FDA/NCTR; his background spans chemistry, biology, and computer science. Weida Tong is the Director of the Division of Bioinformatics and Biostatistics at FDA/NCTR. He has published over 300 peer-reviewed papers from his roles in supervising and leading the FDA-led community-wide MicroArray and Sequencing Quality Control consortium to analyze the technical performance and practical utility of emerging genomic technologies, with emphasis on regulatory application and precision medicine; addressing drug safety concerns related to drug-induced liver injury; developing machine learning and AI for digital health and drug repositioning; and conducting molecular modeling and QSAR studies on various toxicological endpoints, such as carcinogenicity.
Published By:
Z Liu, RA Roberts, M Lal-Nag, X Chen, R Huang… - Drug Discovery …, 2021 - Elsevier
Cited By:
17
In such challenging scenarios, recent studies have often used meta-learning to simulate the few-shot task, thus neglecting implicit common linguistic features across tasks. Our approach is based on the insight that good generalization from a few examples relies on both a generic model initialization and an effective strategy for adapting this model to newly arising tasks. Among the baselines: Prototypical Networks (2017) is a deep metric-based method that uses sample averages as class prototypes; MAML is a model-agnostic method that is compatible with any model trained with gradient descent and applicable to a variety of learning problems; Relation Network is a metric-based few-shot learning model that uses a neural network as the distance measure and computes class vectors by summing sample vectors in the support set; ROBUSTTC-FSL is an approach that combines adaptive metric methods by clustering the tasks; and Induction-Network-Routing is a recent state-of-the-art method that learns generalized class-wise representations by combining the dynamic routing algorithm with a typical meta-learning framework. To evaluate the proposed model objectively against the baselines, note that for ARSC the support set for testing is fixed; therefore, we need to run the test episode once for each of the target tasks.
Published By:
S Deng, N Zhang, Z Sun, J Chen, H Chen - … on Artificial Intelligence, 2020 - ojs.aaai.org
Cited By:
26
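The metric-based baselines in the excerpt above (class prototypes as sample averages, distances used to classify queries) reduce to a few lines of tensor code. Below is an illustrative sketch of prototype computation and nearest-prototype classification with Euclidean distance; the shapes, the toy episode, and the choice of distance are assumptions, not the exact setup of any of the cited methods.

```python
import torch

def class_prototypes(support_embeddings, support_labels, num_classes):
    """Compute one prototype per class as the average of its support embeddings."""
    dim = support_embeddings.size(-1)
    protos = torch.zeros(num_classes, dim)
    for c in range(num_classes):
        protos[c] = support_embeddings[support_labels == c].mean(dim=0)
    return protos

def classify(query_embeddings, prototypes):
    """Assign each query to the class with the nearest prototype (Euclidean distance)."""
    dists = torch.cdist(query_embeddings, prototypes)   # (num_queries, num_classes)
    return dists.argmin(dim=1)

# Example: a 2-way, 3-shot episode with 5-dimensional embeddings
support = torch.randn(6, 5)
labels = torch.tensor([0, 0, 0, 1, 1, 1])
queries = torch.randn(4, 5)
pred = classify(queries, class_prototypes(support, labels, num_classes=2))
```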
Whereas feed-forward networks only exploit a fixed context length to predict the next word of a sequence, standard recurrent neural networks can, conceptually, take into account all of the predecessor words. Index terms: language modeling, recurrent neural networks, LSTM neural networks. Whenever the gradient of the error function of the neural network is propagated back through a unit of the network, it gets scaled by a certain factor. Neural network language models: although there are several differences among the neural network language models that have been successfully applied so far, all of them share some basic principles: the input words are encoded by 1-of-K coding, where K is the number of words in the vocabulary.
Published By:
M Sundermeyer, R Schlüter, H Ney - Thirteenth annual conference …, 2012 - isca-speech.org
Cited By:
2345
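The 1-of-K (one-hot) input encoding described in the excerpt above maps each word to a vector of vocabulary size K with a single 1 at the word's index. A minimal sketch, using a toy vocabulary chosen purely for illustration:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]            # toy vocabulary, K = 5
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word, k=len(vocab)):
    """1-of-K coding: a length-K vector with a 1 at the word's index."""
    vec = np.zeros(k)
    vec[word_to_index[word]] = 1.0
    return vec

# Each input word to the neural LM becomes such a vector; the first weight
# matrix then effectively selects a learned embedding column per word.
x = one_hot("cat")   # array([0., 1., 0., 0., 0.])
```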
Figure 2 shows that our approach closes the previously significant gap between models that use the full softmax and models with the usually less accurate hierarchical softmax. In comparison, the largest model we have trained reaches 31.9 test perplexity versus 30.6 for that approach, but only requires 2 weeks of training on 8 GPUs compared to 3 weeks on 32 GPUs for the LSTM. Note that these results can be improved further by using mixtures of experts or ensembles of these models. The linear model performs very poorly, at perplexity 115, even compared to the 67.6 of a Kneser-Ney 5-gram model, despite having access to more context. Surprisingly, the introduction of gated linear units is enough to reach 61 perplexity on Google Billion Word, which surpasses both Kneser-Ney 5-gram models and the cited non-linear neural model.
Published By:
YN Dauphin, A Fan, M Auli… - … conference on machine …, 2017 - proceedings.mlr.press
Cited By:
1907
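The gated linear unit mentioned in the excerpt above computes two linear projections of the input and uses a sigmoid of one to gate the other: GLU(X) = (XW + b) * sigmoid(XV + c), where * is element-wise. Below is a small PyTorch sketch of that gating, using 1-D convolutions as the linear maps in the spirit of gated convolutional LMs; the layer sizes and causal padding scheme are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

class GatedLinearUnit(nn.Module):
    """Gated linear unit: one convolution produces the values, a second
    produces the gates, and the output is values * sigmoid(gates)."""

    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.pad = kernel_size - 1          # left padding so a position never sees future tokens
        self.values = nn.Conv1d(in_channels, out_channels, kernel_size)
        self.gates = nn.Conv1d(in_channels, out_channels, kernel_size)

    def forward(self, x):                   # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))
        return self.values(x) * torch.sigmoid(self.gates(x))

# Example: a batch of 2 sequences, 16 channels, 10 time steps -> (2, 32, 10)
out = GatedLinearUnit(16, 32)(torch.randn(2, 16, 10))
```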
Unlike BERT, which is used mainly for NLU tasks, UniLM can be configured, using different self-attention masks, to aggregate context for different types of language models, and thus can be used for both NLU and NLG tasks. [EOS] not only marks the sentence boundary in NLU tasks but is also used for the model to learn when to terminate the decoding process in NLG tasks. The pre-trained model, used as an encoder-decoder model, can be easily adapted to a wide range of conditional text generation tasks, such as abstractive summarization. Models in the first block only use 10K training examples; the others are abstractive models.
Published By:
L Dong, N Yang, W Wang, F Wei… - Advances in neural …, 2019 - proceedings.neurips.cc
Cited By:
1088
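The different self-attention masks mentioned in the UniLM excerpt control which positions each token may attend to: a bidirectional mask allows all positions, a left-to-right mask allows only earlier positions, and a sequence-to-sequence mask lets the source segment attend bidirectionally while the target segment attends only leftward. A small illustrative sketch of constructing such masks (1 = may attend) follows; the segment lengths are arbitrary and this is not UniLM's actual code.

```python
import torch

def bidirectional_mask(n):
    # Every token may attend to every other token (BERT-style NLU).
    return torch.ones(n, n)

def left_to_right_mask(n):
    # Each token may attend only to itself and earlier tokens (GPT-style LM).
    return torch.tril(torch.ones(n, n))

def seq2seq_mask(src_len, tgt_len):
    # Source tokens attend bidirectionally within the source; target tokens
    # attend to the whole source plus earlier target tokens (encoder-decoder-style NLG).
    n = src_len + tgt_len
    mask = torch.zeros(n, n)
    mask[:, :src_len] = 1.0                                     # every position sees the source
    mask[src_len:, src_len:] = torch.tril(torch.ones(tgt_len, tgt_len))
    mask[:src_len, src_len:] = 0.0                              # source never sees the target
    return mask

print(seq2seq_mask(2, 3))   # 5x5 mask combining bidirectional and left-to-right regions
```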