Large Language Model (LLM) Training Algorithm

A [[Large Language Model (LLM) Training Algorithm]] is a [[deep neural model training algorithm]] that can be implemented by an [[LLM training system]] (to optimize [[large language model parameter]]s) to support [[LLM training task]]s.

== References ==

=== 2025 ===
* ([[2025_LLMPostTrainingADeepDiveIntoRea|Kumar et al., 2025]]) ⇒ [[author::Komal Kumar]], [[author::Tajamul Ashraf]], [[author::Omkar Thawakar]], [[author::Rao Muhammad Anwer]], [[author::Hisham Cholakkal]], [[author::Mubarak Shah]], [[author::Ming-Hsuan Yang]], [[author::Phillip H. S. Torr]], [[author::Salman Khan]], and [[author::Fahad Shahbaz Khan]]. ([[year::2025]]). “LLM Post-Training: A Deep Dive Into Reasoning Large Language Models.” [http://dx.doi.org/10.48550/arXiv.2502.21321 doi:10.48550/arXiv.2502.21321]
** NOTES:
**# [[Post-Training LLM Algorithm Taxonomy]]: [[Kumar et al., 2025|The paper]] establishes a clear [[taxonomy]] of [[post-training algorithm]]s ([[Figure 1]]), demonstrating how [[LLM training algorithm]]s extend beyond initial [[pre-training]] to include [[fine-tuning]] ([[SFT]]), [[reinforcement learning]] ([[PPO]], [[DPO]], [[GRPO]]), and [[test-time scaling]], showcasing the complete [[optimization lifecycle]] for [[LLM parameter]]s.
**# [[Parameter-Efficient Training Algorithm]]s: [[Kumar et al., 2025|The paper]]'s coverage of [[LoRA]], [[QLoRA]], and [[adapter method]]s ([[Section 4.7]] and [[Table 2]]) illustrates how modern [[LLM training algorithm]]s can optimize selective subsets of [[parameter]]s rather than all [[weight]]s, directly confirming this [[wiki]]'s categorization of "[[Parameter-Efficient Training Algorithm]]s" (a minimal [[LoRA]] sketch appears after this list).
**# [[Reinforcement Learning for Sequential Decision-Making]]: [[Kumar et al., 2025|The paper]]'s explanation of how [[RL algorithm]]s ([[Section]]s 3.1-3.2) adapt to [[token-by-token generation]] frames [[LLM training]] as a [[sequential decision process]] with specialized [[advantage function]]s and [[credit assignment mechanism]]s, extending beyond the traditional [[gradient descent approach]]es in this [[wiki]] (see the advantage-function sketch after this list).
**# [[Process vs. Outcome Reward Optimization]]: The comparison between [[Process Reward Model]]s and [[Outcome Reward Model]]s ([[Section]]s 3.1.3-3.1.4) demonstrates a unique aspect of [[LLM training algorithm]]s not explicitly covered in this [[wiki]]: [[optimization]] can target either [[intermediate reasoning step]]s or [[final output]]s.
**# [[Hybrid Training-Inference Algorithm]]s: [[Kumar et al., 2025|The paper]]'s extensive coverage of [[test-time scaling method]]s ([[Section 5]]) reveals that modern [[LLM training algorithm]]s can span the traditional [[training-inference boundary]], with [[technique]]s like [[Monte Carlo Tree Search]] and [[Chain-of-Thought]] representing [[algorithmic approach]]es that continue [[model optimization]] during [[deployment]] (see the best-of-N sketch after this list).
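The following is a minimal sketch (not taken from [[Kumar et al., 2025]], and assuming [[PyTorch]]) of the low-rank adapter idea behind [[LoRA]] referenced in note 2: the pretrained weight matrix is frozen and only two small rank-<code>r</code> matrices are trained, so the optimizer updates a selective subset of parameters.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)              # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus low-rank path; only lora_A and lora_B receive gradients.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Only the adapter parameters are handed to the optimizer.
layer = LoRALinear(nn.Linear(1024, 1024))
optimizer = torch.optim.AdamW(
    [p for p in layer.parameters() if p.requires_grad], lr=1e-4)
</syntaxhighlight>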
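Note 3's framing of [[LLM training]] as a [[sequential decision process]] can be illustrated with a short sketch, assuming a [[GRPO]]-style group-relative [[advantage function]] and a [[PPO]]-style clipped token-level objective; the tensor shapes and function names below are illustrative assumptions, not the paper's implementation.

<syntaxhighlight lang="python">
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: each completion's reward is normalized within its
    sampled group, so no learned value network is needed.

    rewards: (num_prompts, group_size), one scalar reward per completion.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def clipped_token_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                       advantages: torch.Tensor, mask: torch.Tensor,
                       clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped objective applied token by token (credit assignment).

    logp_new, logp_old, mask: (num_completions, seq_len);
    advantages: (num_completions,), broadcast to every generated token.
    """
    ratio = torch.exp(logp_new - logp_old)
    adv = advantages.unsqueeze(1)                            # broadcast over tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    per_token = -torch.min(unclipped, clipped)               # negate to minimize
    return (per_token * mask).sum() / mask.sum()             # mean over response tokens
</syntaxhighlight>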
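For notes 4 and 5, a best-of-N loop is one of the simplest [[test-time scaling method]]s and makes the [[Process Reward Model]] versus [[Outcome Reward Model]] distinction concrete; the <code>sample_completion</code>, <code>step_reward</code>, and <code>outcome_reward</code> callables below are hypothetical placeholders, and the paper also covers richer search procedures such as [[Monte Carlo Tree Search]].

<syntaxhighlight lang="python">
from typing import Callable, List

def best_of_n(prompt: str,
              sample_completion: Callable[[str], List[str]],        # hypothetical sampler: returns reasoning steps
              step_reward: Callable[[str, List[str]], List[float]], # hypothetical process reward model
              outcome_reward: Callable[[str, str], float],          # hypothetical outcome reward model
              n: int = 8,
              use_process_rewards: bool = True) -> List[str]:
    """Sample n candidate reasoning chains and keep the highest-scoring one.

    A process reward model scores every intermediate step (averaged here);
    an outcome reward model scores only the final answer. Assumes the
    sampler returns at least one step per chain.
    """
    best_score, best_steps = float("-inf"), []
    for _ in range(n):
        steps = sample_completion(prompt)                    # one sampled chain of thought
        if use_process_rewards:
            step_scores = step_reward(prompt, steps)         # one score per reasoning step
            score = sum(step_scores) / max(len(step_scores), 1)
        else:
            score = outcome_reward(prompt, steps[-1])        # score the final answer only
        if score > best_score:
            best_score, best_steps = score, steps
    return best_steps
</syntaxhighlight>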


----
