LLM for beginners


The following two parts are intended to help beginners in large language models (LLMs) get a quick overview of the field.

  1. A collection of core papers on large language models, aimed at helping newcomers quickly grasp the main LLM techniques.
  2. Several reference materials on optimization, machine learning, and pre-LLM NLP techniques.

1. Core Paper Collection
======


1.1 Distributed Word Representation


Efficient Estimation of Word Representations in Vector Space

Distributed Representations of Words and Phrases and their Compositionality

GloVe: Global Vectors for Word Representation
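
The shared idea in these papers is that word meaning can be captured by dense vectors learned from co-occurrence statistics. As a hedged illustration (not taken from any of the papers above), here is a minimal skip-gram sketch using the gensim library; the toy corpus and hyperparameters are placeholders.

```python
# Minimal word2vec (skip-gram) sketch with gensim; the corpus and
# hyperparameters are illustrative placeholders, not from the papers above.
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["dogs", "and", "cats", "are", "animals"],
]

# sg=1 selects the skip-gram objective (predict context words from the center word).
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv.most_similar("king", topn=3))   # nearest neighbours in vector space
print(model.wv["queen"][:5])                   # first few dimensions of one word vector
```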

1.2 Contextual Word Representations and MLM Pretraining


BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

ELMo: Deep contextualized word representations

Contextual Word Representations: A Contextual Introduction

The Illustrated BERT, ELMo, and co.

Jurafsky and Martin, Chapter 11 (Fine-Tuning and Masked Language Models)
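
Masked language modeling trains a bidirectional encoder to recover randomly masked tokens from their context. Below is a minimal sketch of querying a pretrained BERT-style model through the Hugging Face transformers fill-mask pipeline; the checkpoint name and example sentence are just illustrative choices.

```python
# Minimal masked-language-model query via Hugging Face transformers;
# the checkpoint and example sentence are illustrative placeholders.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the token behind [MASK] using both left and right context.
for pred in fill_mask("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```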

1.3 Generative Pretraining


GPT-2: Language Models are Unsupervised Multitask Learners

GPT-3: Language Models are Few-Shot Learners

LLaMA: Open and Efficient Foundation Language Models
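
These models are trained purely to predict the next token, and text is produced by sampling from that distribution one token at a time. A minimal sketch with a small GPT-2 checkpoint follows; the prompt and sampling settings are placeholders.

```python
# Minimal autoregressive generation sketch with a small GPT-2 checkpoint;
# prompt and sampling settings are placeholders.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

out = generator(
    "Large language models are",
    max_new_tokens=30,   # generate up to 30 new tokens after the prompt
    do_sample=True,      # sample instead of greedy decoding
    top_p=0.9,           # nucleus sampling
)
print(out[0]["generated_text"])
```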

1.4 Instruction Tuning and Alignment


InstructGPT: Aligning language models to follow instructions

Scaling Instruction-Finetuned Language Models

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Alpaca: A Strong, Replicable Instruction-Following Model

Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality

Direct Preference Optimization: Your Language Model is Secretly a Reward Model
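
Of the alignment papers above, DPO is the one that reduces to a simple closed-form loss: it pushes the policy's log-probability margin between a chosen and a rejected response away from the reference model's margin. A minimal PyTorch sketch of that loss is below; the sequence-level log-probabilities are assumed to be precomputed, and beta is a placeholder hyperparameter.

```python
# Minimal sketch of the DPO objective; assumes summed sequence log-probs
# for the chosen/rejected responses are already computed for both the
# policy and the frozen reference model. beta is a placeholder value.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit reward of each response = beta * log(pi / pi_ref).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between chosen and rejected rewards.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with fake log-probabilities for a batch of 2 preference pairs.
loss = dpo_loss(torch.tensor([-5.0, -6.0]), torch.tensor([-7.0, -6.5]),
                torch.tensor([-5.5, -6.2]), torch.tensor([-6.8, -6.4]))
print(loss.item())
```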

1.5 Efficient Finetuning Techniques


Parameter-Efficient Transfer Learning for NLP

LoRA: Low-Rank Adaptation of Large Language Models

QLoRA: Efficient Finetuning of Quantized LLMs
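
LoRA freezes the pretrained weight matrix and learns only a low-rank update on top of it, so just a small fraction of parameters is trained. A minimal PyTorch sketch of a LoRA-wrapped linear layer follows; the rank, alpha, and layer sizes are placeholder choices rather than values from the paper.

```python
# Minimal LoRA sketch: a frozen linear layer plus a trainable low-rank
# update W x + (alpha / r) * B A x. Sizes and hyperparameters are placeholders.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # freeze the pretrained weights
        self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init => no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(768, 768)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # trainable params only
```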

1.6 Acceleration and Efficiency


FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters

Gradient Checkpointing: Training Deep Nets with Sublinear Memory Cost

What is Gradient Accumulation?
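
The last two entries are about fitting large models into limited GPU memory: gradient checkpointing recomputes activations during the backward pass, and gradient accumulation simulates a large batch by summing gradients over several small batches before a single optimizer step. A minimal PyTorch sketch of gradient accumulation follows; the model, data, and accumulation factor are toy placeholders.

```python
# Minimal gradient-accumulation sketch: accumulate gradients over
# `accum_steps` micro-batches, then take one optimizer step.
# Model, data, and hyperparameters are toy placeholders.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
accum_steps = 4

optimizer.zero_grad()
for step in range(16):
    x, y = torch.randn(8, 10), torch.randn(8, 1)
    loss = loss_fn(model(x), y) / accum_steps   # scale so gradients average correctly
    loss.backward()                             # gradients add up across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```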

1.7 Deployment and Speed-Up Inference


vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention

Fast Inference from Transformers via Speculative Decoding
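
vLLM speeds up serving by managing the KV cache in pages and exposes a simple offline generation API. The sketch below shows what basic usage looks like, assuming vLLM is installed and the (placeholder) checkpoint can be downloaded from the Hugging Face Hub.

```python
# Minimal vLLM offline-inference sketch; the checkpoint name and sampling
# settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")            # any causal LM checkpoint works here
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

outputs = llm.generate(["The key idea of PagedAttention is"], params)
for out in outputs:
    print(out.outputs[0].text)
```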

2. References for Basic ML Techniques

======

2.1 Optimization and Neural Network Basics


Stanford SLP book notes on Neural Networks, Backpropagation

HKUST Prof. Kim’s PyTorchZeroToAll Tutorial

Deep Learning Practical Methodology
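
These references cover the mechanics of backpropagation and training in general. As a quick hedged companion, the sketch below shows the same idea through PyTorch autograd on a single scalar function; it is just one possible illustration, not part of the references themselves.

```python
# Tiny autograd sketch: PyTorch builds a computation graph and
# backpropagation fills in d(loss)/d(parameter) automatically.
import torch

w = torch.tensor(2.0, requires_grad=True)
x, y = torch.tensor(3.0), torch.tensor(7.0)

loss = (w * x - y) ** 2      # simple squared error
loss.backward()              # backpropagate through the graph

print(loss.item())           # (2*3 - 7)^2 = 1
print(w.grad.item())         # d/dw = 2*(w*x - y)*x = -6
```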


2.2 Language Model and Neural Network Architectures


Stanford CS224N notes on Language Models, RNN, GRU and LSTM

Stanford CS224N notes on Self-Attention & Transformers

The Annotated Transformer

The Illustrated Transformer
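
A compact way to internalize the material above is to write out scaled dot-product attention directly. The sketch below is a minimal single-head version with random tensors standing in for the learned query/key/value projections of a real Transformer layer.

```python
# Minimal single-head scaled dot-product attention; random tensors stand in
# for learned query/key/value projections.
import math
import torch
import torch.nn.functional as F

seq_len, d_k = 5, 16
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)
V = torch.randn(seq_len, d_k)

scores = Q @ K.T / math.sqrt(d_k)       # similarity of each query to each key
weights = F.softmax(scores, dim=-1)     # each row sums to 1: an attention distribution
output = weights @ V                    # weighted sum of values

print(weights.shape, output.shape)      # (5, 5) and (5, 16)
```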


2.3 Word Vectors and Tokenizers


Stanford CS224N notes on Word Vectors

Hugging Face’s Summary of the Tokenizers

A Chinese-language summary of tokenizers on Zhihu
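
Modern LLM tokenizers split text into subword units (BPE, WordPiece, etc.) rather than whole words. The sketch below shows this with a Hugging Face tokenizer; the checkpoint and example sentence are placeholders.

```python
# Minimal subword-tokenization sketch with a Hugging Face tokenizer;
# the checkpoint and example sentence are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization splits unbelievably long words into subwords."
print(tokenizer.tokenize(text))          # subword pieces ('##' marks continuations)
print(tokenizer(text)["input_ids"])      # integer ids fed to the model
```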