Softmax function: Difference between revisions

Browse history interactively ← Previous editContent deleted Content addedVisual WikitextInline

Revision as of 10:07, 15 August 2023 edit17.72.102.139 (talk) →Definition Tag: Reverted← Previous edit		Latest revision as of 23:32, 19 December 2024 edit undoDavid Eppstein (talk \| contribs)Autopatrolled, Administrators226,286 edits Lacra Pavel
(36 intermediate revisions by 22 users not shown)
Line 14:		Line 14:
	\|section=6.2.2.3 Softmax Units for Multinoulli Output Distributions		\|section=6.2.2.3 Softmax Units for Multinoulli Output Distributions
	\|pages=180–184		\|pages=180–184
	}}</ref>{{rp\|184}} or '''normalized exponential function''',<ref name="bishop" />{{rp\|198}} converts a vector of {{mvar\|K}} real numbers into a ] of {{mvar\|K}} possible outcomes. It is a generalization of the ] to multiple dimensions, and used in ]. The softmax function is often used as the last ] of a ] to normalize the output of a network to a ] over predicted output classes~~, based on ]~~.		}}</ref>{{rp\|184}} or '''normalized exponential function''',<ref name="bishop" />{{rp\|198}} converts a vector of {{mvar\|K}} real numbers into a ] of {{mvar\|K}} possible outcomes. It is a generalization of the ] to multiple dimensions, and is used in ]. The softmax function is often used as the last ] of a ] to normalize the output of a network to a ] over predicted output classes.

	== Definition ==		== Definition ==
Line 20:		Line 20:
	The softmax function takes as input a vector {{mvar\|z}} of {{mvar\|K}} real numbers, and normalizes it into a ] consisting of {{mvar\|K}} probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative, or greater than one; and might not sum to 1; but after applying softmax, each component will be in the ] <math>(0, 1)</math>, and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.		The softmax function takes as input a vector {{mvar\|z}} of {{mvar\|K}} real numbers, and normalizes it into a ] consisting of {{mvar\|K}} probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative, or greater than one; and might not sum to 1; but after applying softmax, each component will be in the ] <math>(0, 1)</math>, and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.

	~~The~~ standard (unit) softmax function <math>\sigma : \R^K \~~mapsto~~ ^K</math> where <math>K ~~\ge~~ 1</math> is ~~defined~~ by ~~the~~ ~~formula~~		Formally, the standard (unit) softmax function <math>\sigma\colon \R^K \to (0, 1)^K</math>, where <math>K > 1</math>, takes a vector <math>\mathbf{z} = (z_1, \dotsc, z_K) \in \R^K</math> and computes each component of vector <math>\sigma(\mathbf{z}) \in (0, 1)^K</math> with

	<math display="block">\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} \ ~~\ \text{ for } i = 1~~, ~~\dotsc, K \text{ and } \mathbf{z} = (z_1, \dotsc, z_K) \in \R^K~~.</math>		<math display="block">\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}\,.</math>

	In words, it applies the standard ] to each element <math>z_i</math> of the input vector <math>\mathbf z</math> and normalizes these values by dividing by the sum of all these exponentials. The normalization ensures that the sum of the components of the output vector <math>\sigma(\mathbf z)</math> is 1. The term "softmax" derives from the amplifying effects of the exponential on any maxima in the input vector. For example, the standard softmax of <math>(1,2,8)</math> is approximately <math>(0.001,0.002,0.997)</math>, which amounts to assigning almost all of the total unit weight in the result to the position of the vector's maximal element (of 8).		In words, the softmax applies the standard ] to each element <math>z_i</math> of the input vector <math>\mathbf z</math> (consisting of <math>K</math> real numbers), and normalizes these values by dividing by the sum of all these exponentials. The normalization ensures that the sum of the components of the output vector <math>\sigma(\mathbf z)</math> is 1. The term "softmax" derives from the amplifying effects of the exponential on any maxima in the input vector. For example, the standard softmax of <math>(1,2,8)</math> is approximately <math>(0.001,0.002,0.997)</math>, which amounts to assigning almost all of the total unit weight in the result to the position of the vector's maximal element (of 8).

	In general, instead of {{mvar\|e}} a different ] {{math\|b > 0}} can be used. If {{math\|~~0 <~~ b < 1}}, ~~smaller~~ input components will result in larger output probabilities, and ~~decreasing~~ the value of {{mvar\|b}} will create probability distributions that are more concentrated around the positions of the ~~smallest~~ input values. Conversely~~, as above~~, if {{math\|b > 1}} ~~larger~~ input components will result in larger output probabilities, and ~~increasing~~ the value of {{mvar\|b}} will create probability distributions that are more concentrated around the positions of the ~~largest~~ input values. Writing <math>b = e^\beta</math> or <math>b = e^{-\beta}</math>{{efn\|Positive {{mvar\|β}} corresponds to the maximum convention, and is usual in machine learning, corresponding to the highest score having highest probability. The negative {{math\|−β}} corresponds to the minimum convention, and is conventional in thermodynamics, corresponding to the lowest energy state having the highest probability; this matches the convention in the ], interpreting {{mvar\|β}} as ].}} (for real {{mvar\|β}}){{efn\|1=The notation {{mvar\|β}} is for the ], which is inverse ]: <math>\beta = 1/T</math>, <math>T = 1/\beta.</math>}} yields the expressions:{{efn\|For <math>\beta = 0</math> (] zero, infinite temperature), <math>b = e^\beta = e^0 = 1</math>, and this becomes the constant function {{tmath\|(1/n, \dots, 1/n)}}, corresponding to the ].}}		In general, instead of {{mvar\|e}} a different ] {{math\|b > 0}} can be used. As above, if {{math\|b > 1}} then larger input components will result in larger output probabilities, and increasing the value of {{mvar\|b}} will create probability distributions that are more concentrated around the positions of the largest input values. Conversely, if {{math\|0 < b < 1}} then smaller input components will result in larger output probabilities, and decreasing the value of {{mvar\|b}} will create probability distributions that are more concentrated around the positions of the smallest input values. Writing <math>b = e^\beta</math> or <math>b = e^{-\beta}</math>{{efn\|Positive {{mvar\|β}} corresponds to the maximum convention, and is usual in machine learning, corresponding to the highest score having highest probability. The negative {{math\|−β}} corresponds to the minimum convention, and is conventional in thermodynamics, corresponding to the lowest energy state having the highest probability; this matches the convention in the ], interpreting {{mvar\|β}} as ].}} (for real {{mvar\|β}}){{efn\|1=The notation {{mvar\|β}} is for the ], which is inverse ]: <math>\beta = 1/T</math>, <math>T = 1/\beta.</math>}} yields the expressions:{{efn\|For <math>\beta = 0</math> (] zero, infinite temperature), <math>b = e^\beta = e^0 = 1</math>, and this becomes the constant function {{tmath\|(1/n, \dots, 1/n)}}, corresponding to the ].}}

	<math display="block">\sigma(\mathbf{z})_i = \frac{e^{\beta z_i}}{\sum_{j=1}^K e^{\beta z_j}} \text{ or } \sigma(\mathbf{z})_i = \frac{e^{-\beta z_i}}{\sum_{j=1}^K e^{-\beta z_j}} \text{ for } i = 1, \dotsc , K .</math>		<math display="block">\sigma(\mathbf{z})_i = \frac{e^{\beta z_i}}{\sum_{j=1}^K e^{\beta z_j}} \text{ or } \sigma(\mathbf{z})_i = \frac{e^{-\beta z_i}}{\sum_{j=1}^K e^{-\beta z_j}} \text{ for } i = 1, \dotsc , K .</math>

			{{anchor\|Temperature}}
⚫	In some fields, the base is fixed, corresponding to a fixed scale,{{efn\|In statistical mechanics, fixing {{mvar\|β}} is interpreted as having coldness and temperature of 1.}} while in others the parameter {{mvar\|β}} is varied.
			A value proportional to the reciprocal of {{mvar\|β}} is sometimes referred to as the ''temperature'': <math display=inline>\beta = 1 / kT</math>, where {{mvar\|k}} is typically 1 or the ] and {{mvar\|T}} is the temperature. A higher temperature results in a more uniform output distribution (i.e. with higher ]; it is "more random"), while a lower temperature results in a sharper output distribution, with one value dominating.

		⚫	In some fields, the base is fixed, corresponding to a fixed scale,{{efn\|In statistical mechanics, fixing {{mvar\|β}} is interpreted as having coldness and temperature of 1.}} while in others the parameter {{mvar\|β}} (or {{mvar\|T}}) is varied.

	== Interpretations ==		== Interpretations ==
	=== Smooth arg max ===		=== Smooth arg max ===
	{{See also\|Arg max}}		{{See also\|Arg max}}
	The ~~name "softmax" is misleading. The~~ function is ~~not a ] (that is, a ] to the ] function), but is rather~~ a smooth approximation to the ] function: the function whose value is the ''index'' of a vector's largest element. In ~~fact~~, the term "softmax" is also used for the closely related ] function, which is a smooth maximum. For this reason, some prefer the more accurate term "softargmax", ~~but~~ the term "softmax" is conventional in machine learning.<ref name="sako2018"/>{{sfn\|Goodfellow\|Bengio\|Courville\|2016\|pp=183–184\|ps=: The name "softmax" can be somewhat confusing. The function is more closely related to the arg max function than the max function. The term "soft" derives from the fact that the softmax function is continuous and differentiable. The arg max function, with its result represented as a one-hot vector, is not continuous or differentiable. The softmax function thus provides a "softened" version of the arg max. The corresponding soft version of the maximum function is <math>\operatorname{softmax}(\mathbf{z})^\top \mathbf{z}</math>. It would perhaps be better to call the softmax function "softargmax," but the current name is an entrenched convention.}} This section uses the term "softargmax" ~~to emphasize this~~ ~~interpretation~~.		The Softmax function is a smooth approximation to the ] function: the function whose value is the ''index'' of a vector's largest element. The name "softmax" may be misleading. Softmax is not a ] (that is, a ] to the ] function). The term "softmax" is also used for the closely related ] function, which is a smooth maximum. For this reason, some prefer the more accurate term "softargmax", though the term "softmax" is conventional in machine learning.<ref name="sako2018"/>{{sfn\|Goodfellow\|Bengio\|Courville\|2016\|pp=183–184\|ps=: The name "softmax" can be somewhat confusing. The function is more closely related to the arg max function than the max function. The term "soft" derives from the fact that the softmax function is continuous and differentiable. The arg max function, with its result represented as a one-hot vector, is not continuous nor differentiable. The softmax function thus provides a "softened" version of the arg max. The corresponding soft version of the maximum function is <math>\operatorname{softmax}(\mathbf{z})^\top \mathbf{z}</math>. It would perhaps be better to call the softmax function "softargmax," but the current name is an entrenched convention.}} This section uses the term "softargmax" for clarity.

	Formally, instead of considering the arg max as a function with categorical output <math>1, \dots, n</math> (corresponding to the index), consider the arg max function with ] representation of the output (assuming there is a unique maximum arg):		Formally, instead of considering the arg max as a function with categorical output <math>1, \dots, n</math> (corresponding to the index), consider the arg max function with ] representation of the output (assuming there is a unique maximum arg):
	:<math>\operatorname{arg\,max}(z_1,\, \dots,\, z_n) = (y_1,\, \dots,\, y_n) = (0,\, \dots,\, 0,\, 1,\, 0,\, \dots,\, 0),</math>		<math display="block">\operatorname{arg\,max}(z_1,\, \dots,\, z_n) = (y_1,\, \dots,\, y_n) = (0,\, \dots,\, 0,\, 1,\, 0,\, \dots,\, 0),</math>
	where the output coordinate <math>y_i = 1</math> if and only if <math>i</math> is the arg max of <math>(z_1, \dots, z_n)</math>, meaning <math>z_i</math> is the unique maximum value of <math>(z_1,\, \dots,\, z_n)</math>. For example, in this encoding <math>\operatorname{arg\,max}(1, 5, 10) = (0, 0, 1),</math> since the third argument is the maximum.		where the output coordinate <math>y_i = 1</math> if and only if <math>i</math> is the arg max of <math>(z_1, \dots, z_n)</math>, meaning <math>z_i</math> is the unique maximum value of <math>(z_1,\, \dots,\, z_n)</math>. For example, in this encoding <math>\operatorname{arg\,max}(1, 5, 10) = (0, 0, 1),</math> since the third argument is the maximum.

Line 81:		Line 84:

	== Applications ==		== Applications ==
	The softmax function is used in various ] methods, such as ] (also known as softmax regression)<ref name="bishop">{{cite book \|first=Christopher M. \|last=Bishop \|year=2006 \|title=Pattern Recognition and Machine Learning \|publisher=Springer \|isbn=0-387-31073-8 }}</ref>{{rp\|206–209}} , multiclass ], ]s, and ]s.<ref>ai-faq </ref> Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of {{mvar\|K}} distinct ]s, and the predicted probability for the {{mvar\|j}}th class given a sample vector {{math\|'''x'''}} and a weighting vector {{math\|'''w'''}} is:		The softmax function is used in various ] methods, such as ] (also known as softmax regression),<ref name="bishop">{{cite book \|first=Christopher M. \|last=Bishop \|year=2006 \|title=Pattern Recognition and Machine Learning \|publisher=Springer \|isbn=0-387-31073-8 }}</ref>{{rp\|206–209}}<ref>{{Cite web \|title=Unsupervised Feature Learning and Deep Learning Tutorial \|url=http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/ \|access-date=2024-03-25 \|website=ufldl.stanford.edu}}</ref> multiclass ], ]s, and ]s.<ref>ai-faq </ref> Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of {{mvar\|K}} distinct ]s, and the predicted probability for the {{mvar\|j}}th class given a sample vector {{math\|'''x'''}} and a weighting vector {{math\|'''w'''}} is:

	:<math>P(y=j\mid \mathbf{x}) = \frac{e^{\mathbf{x}^\mathsf{T}\mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^\mathsf{T}\mathbf{w}_k}}</math>		<math display="block">P(y=j\mid \mathbf{x}) = \frac{e^{\mathbf{x}^\mathsf{T}\mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^\mathsf{T}\mathbf{w}_k}}</math>

	This can be seen as the ] of {{mvar\|K}} linear functions <math>\mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_1, \ldots, \mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_K</math> and the softmax function (where <math>\mathbf{x}^\mathsf{T}\mathbf{w}</math> denotes the inner product of <math>\mathbf{x}</math> and <math>\mathbf{w}</math>). The operation is equivalent to applying a linear operator defined by <math>\mathbf{w}</math> to vectors <math>\mathbf{x}</math>, thus transforming the original, probably highly-dimensional, input to vectors in a {{mvar\|K}}-dimensional space <math>\mathbb{R}^K</math>.		This can be seen as the ] of {{mvar\|K}} linear functions <math>\mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_1, \ldots, \mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_K</math> and the softmax function (where <math>\mathbf{x}^\mathsf{T}\mathbf{w}</math> denotes the inner product of <math>\mathbf{x}</math> and <math>\mathbf{w}</math>). The operation is equivalent to applying a linear operator defined by <math>\mathbf{w}</math> to vectors <math>\mathbf{x}</math>, thus transforming the original, probably highly-dimensional, input to vectors in a {{mvar\|K}}-dimensional space <math>\mathbb{R}^K</math>.
Line 96:		Line 99:
	This expression is symmetrical in the indexes <math>i, k</math> and thus may also be expressed as		This expression is symmetrical in the indexes <math>i, k</math> and thus may also be expressed as

	:<math> \frac{\partial}{\partial q_k}\sigma(\textbf{q}, i) = \sigma(\textbf{q}, k)(\delta_{ik} - \sigma(\textbf{q}, i)).</math>		<math display="block"> \frac{\partial}{\partial q_k}\sigma(\textbf{q}, i) = \sigma(\textbf{q}, k)(\delta_{ik} - \sigma(\textbf{q}, i)).</math>

	Here, the ] is used for simplicity (cf. the derivative of a ], being expressed via the function itself).		Here, the ] is used for simplicity (cf. the derivative of a ], being expressed via the function itself).

	In ~~order to achieve~~ stable numerical computations of the ~~derivative,~~ ~~one often subtracts a constant~~ from the input vector. In ~~theory~~, ~~this does~~ not ~~change~~ the output, ~~and neither~~ the derivative. ~~But~~ it is ~~more~~ ~~stable~~ as it ~~can~~ ~~control explicitly the largest~~ value computed ~~in each exponent~~.		To ensure stable numerical computations subtracting the maximum value from the input vector is common. This approach, while not altering the output or the derivative theoretically, enhances stability by directly controlling the maximum exponent value computed.

	If the function is scaled with the parameter <math>\beta</math>, then these expressions must be multiplied by <math>\beta</math>.		If the function is scaled with the parameter <math>\beta</math>, then these expressions must be multiplied by <math>\beta</math>.
Line 113:		Line 116:

	== Computational complexity and remedies ==		== Computational complexity and remedies ==
	In neural network applications, the number {{mvar\|K}} of possible outcomes is often large, e.g. in case of ] that predict the most likely outcome out of a vocabulary which might contain millions of possible words.<ref name=":0">{{Cite journal \|last1=Onal \|first1=Kezban Dilek \|last2=Zhang \|first2=Ye \|last3=Altingovde \|first3=Ismail Sengor \|last4=Rahman \|first4=Md Mustafizur \|last5=Karagoz \|first5=Pinar \|last6=Braylan \|first6=Alex \|last7=Dang \|first7=Brandon \|last8=Chang \|first8=Heng-Lu \|last9=Kim \|first9=Henna \|last10=McNamara \|first10=Quinten \|last11=Angert \|first11=Aaron \|date=2018-06-01 \|title=Neural information retrieval: at the end of the early years \|journal=Information Retrieval Journal \|language=en \|volume=21 \|issue=2 \|pages=111–182 \|doi=10.1007/s10791-017-9321-y \|s2cid=21684923 \|issn=1573-7659\|doi-access=free }}</ref> This can make the calculations for the softmax layer (i.e. the matrix multiplications to determine the <math>z_i</math>, followed by the application of the softmax function itself) computationally expensive.<ref name=":0" /><ref name=":1">{{Cite journal \|last1=Chen \|first1=Wenlin \|last2=Grangier \|first2=David \|last3=Auli \|first3=Michael \|date=August 2016 \|title=Strategies for Training Large Vocabulary Neural Language Models \|url=https://aclanthology.org/P16-1186 \|journal=Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) \|location=Berlin, Germany \|publisher=Association for Computational Linguistics \|pages=1975–1985 \|doi=10.18653/v1/P16-1186\|s2cid=6035643 \|doi-access=free }}</ref> What's more, the ] ] method for training such a neural network involves calculating the softmax for every training example, and the number of training examples can also become large. The computational effort for the softmax became a major limiting factor in the development of larger neural language models, motivating various remedies to reduce training times.<ref name=":0" /><ref name=":1" />		In neural network applications, the number {{mvar\|K}} of possible outcomes is often large, e.g. in case of ] that predict the most likely outcome out of a vocabulary which might contain millions of possible words.<ref name=":0">{{Cite journal \|last1=Onal \|first1=Kezban Dilek \|last2=Zhang \|first2=Ye \|last3=Altingovde \|first3=Ismail Sengor \|last4=Rahman \|first4=Md Mustafizur \|last5=Karagoz \|first5=Pinar \|last6=Braylan \|first6=Alex \|last7=Dang \|first7=Brandon \|last8=Chang \|first8=Heng-Lu \|last9=Kim \|first9=Henna \|last10=McNamara \|first10=Quinten \|last11=Angert \|first11=Aaron \|date=2018-06-01 \|title=Neural information retrieval: at the end of the early years \|journal=Information Retrieval Journal \|language=en \|volume=21 \|issue=2 \|pages=111–182 \|doi=10.1007/s10791-017-9321-y \|s2cid=21684923 \|issn=1573-7659\|doi-access=free \|hdl=11245.1/008d6e8f-df13-4abf-8ae9-6ff2e17377f3 \|hdl-access=free }}</ref> This can make the calculations for the softmax layer (i.e. the matrix multiplications to determine the <math>z_i</math>, followed by the application of the softmax function itself) computationally expensive.<ref name=":0" /><ref name=":1">{{Cite journal \|last1=Chen \|first1=Wenlin \|last2=Grangier \|first2=David \|last3=Auli \|first3=Michael \|date=August 2016 \|title=Strategies for Training Large Vocabulary Neural Language Models \|url=https://aclanthology.org/P16-1186 \|journal=Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) \|location=Berlin, Germany \|publisher=Association for Computational Linguistics \|pages=1975–1985 \|doi=10.18653/v1/P16-1186\|s2cid=6035643 \|doi-access=free \|arxiv=1512.04906 }}</ref> What's more, the ] ] method for training such a neural network involves calculating the softmax for every training example, and the number of training examples can also become large. The computational effort for the softmax became a major limiting factor in the development of larger neural language models, motivating various remedies to reduce training times.<ref name=":0" /><ref name=":1" />

	Approaches that reorganize the softmax layer for more efficient calculation include the '''hierarchical softmax''' and the '''differentiated softmax'''.<ref name=":0" /> The hierarchical softmax (introduced by Morin and ] in 2005) uses a binary tree structure where the outcomes (vocabulary words) are the leaves and the intermediate nodes are suitably selected "classes" of outcomes, forming ].<ref name=":1" /><ref name=":2">{{Cite journal \|last1=Morin \|first1=Frederic \|last2=Bengio \|first2=Yoshua \|date=2005-01-06 \|title=Hierarchical Probabilistic Neural Network Language Model \|url=https://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf \|journal=International Workshop on Artificial Intelligence and Statistics \|language=en \|publisher=PMLR \|pages=246–252}}</ref> The desired probability (softmax value) of a leaf (outcome) can then be calculated as the product of the probabilities of all nodes on the path from the root to that leaf.<ref name=":1" /> Ideally, when the tree is balanced, this would reduce the ] from <math>O(K)</math> to <math>O(\log_2 K)</math>.<ref name=":2" /> In practice, results depend on choosing a good strategy for clustering the outcomes into classes.<ref name=":1" /><ref name=":2" /> A ] was used for this in Google's ] models (introduced in 2013) to achieve scalability.<ref name=":0" />		Approaches that reorganize the softmax layer for more efficient calculation include the '''hierarchical softmax''' and the '''differentiated softmax'''.<ref name=":0" /> The hierarchical softmax (introduced by Morin and ] in 2005) uses a binary tree structure where the outcomes (vocabulary words) are the leaves and the intermediate nodes are suitably selected "classes" of outcomes, forming ].<ref name=":1" /><ref name=":2">{{Cite journal \|last1=Morin \|first1=Frederic \|last2=Bengio \|first2=Yoshua \|date=2005-01-06 \|title=Hierarchical Probabilistic Neural Network Language Model \|url=https://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf \|journal=International Workshop on Artificial Intelligence and Statistics \|language=en \|publisher=PMLR \|pages=246–252}}</ref> The desired probability (softmax value) of a leaf (outcome) can then be calculated as the product of the probabilities of all nodes on the path from the root to that leaf.<ref name=":1" /> Ideally, when the tree is balanced, this would reduce the ] from <math>O(K)</math> to <math>O(\log_2 K)</math>.<ref name=":2" /> In practice, results depend on choosing a good strategy for clustering the outcomes into classes.<ref name=":1" /><ref name=":2" /> A ] was used for this in Google's ] models (introduced in 2013) to achieve scalability.<ref name=":0" />
Line 125:		Line 128:

	More generally, softmax is invariant under translation by the same value in each coordinate: adding <math>\mathbf{c} = (c,\, \dots,\, c)</math> to the inputs <math>\mathbf{z}</math> yields <math>\sigma(\mathbf{z} + \mathbf{c}) = \sigma(\mathbf{z})</math>, because it multiplies each exponent by the same factor, <math>e^c</math> (because <math>e^{z_i + c} = e^{z_i} \cdot e^c</math>), so the ratios do not change:		More generally, softmax is invariant under translation by the same value in each coordinate: adding <math>\mathbf{c} = (c,\, \dots,\, c)</math> to the inputs <math>\mathbf{z}</math> yields <math>\sigma(\mathbf{z} + \mathbf{c}) = \sigma(\mathbf{z})</math>, because it multiplies each exponent by the same factor, <math>e^c</math> (because <math>e^{z_i + c} = e^{z_i} \cdot e^c</math>), so the ratios do not change:
		⚫	<math display="block">\sigma(\mathbf{z} + \mathbf{c})_j = \frac{e^{z_j + c}}{\sum_{k=1}^K e^{z_k + c}} = \frac{e^{z_j} \cdot e^c}{\sum_{k=1}^K e^{z_k} \cdot e^c} = \sigma(\mathbf{z})_j.</math>

⚫	: <math>\sigma(\mathbf{z} + \mathbf{c})_j = \frac{e^{z_j + c}}{\sum_{k=1}^K e^{z_k + c}} = \frac{e^{z_j} \cdot e^c}{\sum_{k=1}^K e^{z_k} \cdot e^c} = \sigma(\mathbf{z})_j.</math>

	Geometrically, softmax is constant along diagonals: this is the dimension that is eliminated, and corresponds to the softmax output being independent of a translation in the input scores (a choice of 0 score). One can normalize input scores by assuming that the sum is zero (subtract the average: <math>\mathbf{c}</math> where <math display="inline">c = \frac{1}{n} \sum z_i</math>), and then the softmax takes the hyperplane of points that sum to zero, <math display="inline">\sum z_i = 0</math>, to the open simplex of positive values that sum to 1<math display="inline">\sum \sigma(\mathbf{z})_i = 1</math>, analogously to how the exponent takes 0 to 1, <math>e^0 = 1</math> and is positive.		Geometrically, softmax is constant along diagonals: this is the dimension that is eliminated, and corresponds to the softmax output being independent of a translation in the input scores (a choice of 0 score). One can normalize input scores by assuming that the sum is zero (subtract the average: <math>\mathbf{c}</math> where <math display="inline">c = \frac{1}{n} \sum z_i</math>), and then the softmax takes the hyperplane of points that sum to zero, <math display="inline">\sum z_i = 0</math>, to the open simplex of positive values that sum to 1<math display="inline">\sum \sigma(\mathbf{z})_i = 1</math>, analogously to how the exponent takes 0 to 1, <math>e^0 = 1</math> and is positive.
Line 136:		Line 138:
	The softmax function is also the gradient of the ] function, a ]:		The softmax function is also the gradient of the ] function, a ]:

	: <math>\frac{\partial}{\partial z_i} \operatorname{LSE}(\mathbf{z}) = \frac{\exp z_i}{\sum_{j=1}^{K} \exp z_j} = \sigma(\mathbf{z})_i, \quad \text{ for } i = 1, \dotsc , K, \quad \mathbf{z} = (z_1,\, \dotsc,\, z_K) \in\R^K,</math>		<math display="block">\frac{\partial}{\partial z_i} \operatorname{LSE}(\mathbf{z}) = \frac{\exp z_i}{\sum_{j=1}^{K} \exp z_j} = \sigma(\mathbf{z})_i, \quad \text{ for } i = 1, \dotsc , K, \quad \mathbf{z} = (z_1,\, \dotsc,\, z_K) \in\R^K,</math>

	where the LogSumExp function is defined as <math>\operatorname{LSE}(z_1,\, \dots,\, z_n) = \log\left(\exp(z_1) + \cdots + \exp(z_n)\right)</math>.		where the LogSumExp function is defined as <math>\operatorname{LSE}(z_1,\, \dots,\, z_n) = \log\left(\exp(z_1) + \cdots + \exp(z_n)\right)</math>.
Line 153:		Line 155:
	}}</ref>		}}</ref>

	The use of the softmax in ] is credited to ~~{{harvtxt\|~~Luce~~\|1959}}~~,<ref name="Gao">{{cite arXiv\|eprint=1704.00805\|last1=Gao\|first1=Bolin\|title=On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning\|last2=Pavel\|first2=Lacra\|class=math.OC\|year=2017}}</ref>{{rp\|1}} who used the axiom of ] in ] to deduce the softmax in ] for relative preferences.		The use of the softmax in ] is credited to ],<ref name="Gao">{{cite arXiv\|eprint=1704.00805\|last1=Gao\|first1=Bolin\|title=On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning\|last2=Pavel\|first2=Lacra\|author2-link=Lacra Pavel\|class=math.OC\|year=2017}}</ref>{{rp\|1}} who used the axiom of ] in ] to deduce the softmax in ] for relative preferences.{{Cn\|date=March 2024}}

	In machine learning, the term "softmax" is credited to John S. Bridle in two 1989 conference papers, {{harvtxt\|Bridle\|1990a}}:<ref name="Gao"/>{{rp\|1}} and {{harvtxt\|Bridle\|1990b}}:<ref name="sako2018"/>		In machine learning, the term "softmax" is credited to John S. Bridle in two 1989 conference papers, {{harvtxt\|Bridle\|1990a}}:<ref name="Gao"/>{{rp\|1}} and {{harvtxt\|Bridle\|1990b}}:<ref name="sako2018"/>
Line 176:		Line 178:

	Given a set of unconstrained values, {{tmath\|V_j(x)}}, we can ensure both conditions by using a Normalised Exponential transformation:		Given a set of unconstrained values, {{tmath\|V_j(x)}}, we can ensure both conditions by using a Normalised Exponential transformation:
	: <math>Q_j(x) = \left. e^{V_j(x)} \right/ \sum_k e^{V_k(x)} </math>		<math display="block">Q_j(x) = \left. e^{V_j(x)} \right/ \sum_k e^{V_k(x)} </math>
	This transformation can be considered a multi-input generalisation of the logistic, operating on the whole output layer. It preserves the rank order of its input values, and is a differentiable generalisation of the 'winner-take-all' operation of picking the maximum value. For this reason we like to refer to it as '''softmax'''.<ref>{{cite conference		This transformation can be considered a multi-input generalisation of the logistic, operating on the whole output layer. It preserves the rank order of its input values, and is a differentiable generalisation of the 'winner-take-all' operation of picking the maximum value. For this reason we like to refer to it as '''softmax'''.<ref>{{cite conference
	\|first=John S. \|last=Bridle		\|first=John S. \|last=Bridle
Line 189:		Line 191:

	== Example ==		== Example ==
	~~If we take~~ an input of {{math\|}}, the softmax of ~~that is~~ {{math\|}}. The output has most of its weight where the "4" was in the original input. This is what the function is normally used for: to highlight the largest values and suppress values which are significantly below the maximum value. But note: ~~softmax~~ is ~~not~~ ~~scale~~ ~~invariant~~, ~~so if~~ the ~~input~~ ~~were~~ {{math\|}} ~~(which sums to 1.6)~~ the softmax ~~would~~ be {{math\|}}. This shows that ~~for~~ ~~values between 0 and 1 softmax, in fact,~~ de-~~emphasizes~~ the maximum value ~~(note that 0.169 is not only less than 0.475, it is also less than the initial proportion of {{math\|1=0.4/1.6=0.25}})~~.		With an input of {{math\|(1, 2, 3, 4, 1, 2, 3)}}, the softmax is approximately {{math\|(0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175)}}. The output has most of its weight where the "4" was in the original input. This is what the function is normally used for: to highlight the largest values and suppress values which are significantly below the maximum value. But note: a change of ''temperature'' changes the output. When the temperature is multiplied by 10, the inputs are effectively {{math\|(0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3)}} and the softmax is approximately {{math\|(0.125, 0.138, 0.153, 0.169, 0.125, 0.138, 0.153)}}. This shows that high temperatures de-emphasize the maximum value.

	Computation of this example using ] code:		Computation of this example using ] code:
Line 195:		Line 197:
	<syntaxhighlight lang="pycon">		<syntaxhighlight lang="pycon">
	>>> import numpy as np		>>> import numpy as np
	>>> a =		>>> z = np.array()
	>>> ~~np.exp(a)~~ / np.~~sum(np.exp(a))~~		>>> beta = 1.0
			>>> np.exp(beta * z) / np.sum(np.exp(beta * z))
	array([0.02364054, 0.06426166, 0.1746813, 0.474833, 0.02364054,		array([0.02364054, 0.06426166, 0.1746813, 0.474833, 0.02364054,
	0.06426166, 0.1746813])		0.06426166, 0.1746813])
	</syntaxhighlight>		</syntaxhighlight>

			== Alternatives ==
	Here is an example of ] code:
			The softmax function generates probability predictions densely distributed over its ]. Other functions like ] or α-] can be used when sparse probability predictions are desired.<ref>"Speeding Up Entmax" by Maxat Tezekbayev, Vassilina Nikoulina, Matthias Gallé, Zhenisbek Assylbekov, https://arxiv.org/abs/2111.06832v3</ref> Also the ] can be used when sampling from a discrete-discrete distribution needs to be mimicked in a differentiable manner.

	<syntaxhighlight lang="jlcon">
	julia> A = ; # semicolon to suppress interactive output

	julia> exp.(A) ./ sum(exp, A)
	7-element Array{Float64,1}:
	0.0236405
	0.0642617
	0.174681
	0.474833
	0.0236405
	0.0642617
	0.174681
	</syntaxhighlight>

	Here is an example of ] code:

	<syntaxhighlight lang="rout">
	> z <- c(1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0)
	> softmax <- exp(z)/sum(exp(z))
	> softmax
	0.02364054 0.06426166 0.17468130 0.47483300 0.02364054 0.06426166 0.17468130
	</syntaxhighlight>

	Here is an example of ] code:<ref>{{Cite web\|url=https://github.com/elixir-nx/nx/tree/main/nx#examples\|title = Nx/Nx at main · elixir-nx/Nx\| website=] }}</ref>
	<syntaxhighlight lang="iex">
	iex> t = Nx.tensor(, ])
	iex> Nx.divide(Nx.exp(t), Nx.sum(Nx.exp(t)))

	#Nx.Tensor<
	f64
	[
	,

	]
	>
	</syntaxhighlight>

	Here is an example of ] code:

	<syntaxhighlight lang="raku">
	> my @z = ;
	> say @z.map: {exp($_)/sum(@z.map: {exp($_)})}

	(0.023640543021591385 0.06426165851049616 0.17468129859572226 0.4748329997443803 0.023640543021591385 0.06426165851049616 0.17468129859572226)
	</syntaxhighlight>

	== See also ==		== See also ==
Line 269:		Line 227:
	}}		}}

			{{Artificial intelligence navbox}}
	{{Differentiable computing}}

	]		]

Latest revision as of 23:32, 19 December 2024

Smooth approximation of one-hot arg max This article is about the smooth approximation of one-hot arg max. For the smooth approximation of max, see LogSumExp. "Softmax" redirects here. For the Korean video game company, see ESA (company).

Machine learning and data mining
Part of a series on
Paradigms Supervised learning Unsupervised learning Semi-supervised learning Self-supervised learning Reinforcement learning Meta-learning Online learning Batch learning Curriculum learning Rule-based learning Neuro-symbolic AI Neuromorphic engineering Quantum machine learning
Problems Classification Generative modeling Regression Clustering Dimensionality reduction Density estimation Anomaly detection Data cleaning AutoML Association rules Semantic analysis Structured prediction Feature engineering Feature learning Learning to rank Grammar induction Ontology learning Multimodal learning
Supervised learning (classification • regression) Apprenticeship learning Decision trees Ensembles Bagging Boosting Random forest k-NN Linear regression Naive Bayes Artificial neural networks Logistic regression Perceptron Relevance vector machine (RVM) Support vector machine (SVM)
Clustering BIRCH CURE Hierarchical k-means Fuzzy Expectation–maximization (EM) DBSCAN OPTICS Mean shift
Dimensionality reduction Factor analysis CCA ICA LDA NMF PCA PGD t-SNE SDL
Structured prediction Graphical models Bayes net Conditional random field Hidden Markov
Anomaly detection RANSAC k-NN Local outlier factor Isolation forest
Artificial neural network Autoencoder Deep learning Feedforward neural network Recurrent neural network LSTM GRU ESN reservoir computing Boltzmann machine Restricted GAN Diffusion model SOM Convolutional neural network U-Net LeNet AlexNet DeepDream Neural radiance field Transformer Vision Mamba Spiking neural network Memtransistor Electrochemical RAM (ECRAM)
Reinforcement learning Q-learning SARSA Temporal difference (TD) Multi-agent Self-play
Learning with humans Active learning Crowdsourcing Human-in-the-loop RLHF
Model diagnostics Coefficient of determination Confusion matrix Learning curve ROC curve
Mathematical foundations Kernel machines Bias–variance tradeoff Computational learning theory Empirical risk minimization Occam learning PAC learning Statistical learning VC theory
Journals and conferences ECML PKDD NeurIPS ICML ICLR IJCAI ML JMLR
Related articles Glossary of artificial intelligence List of datasets for machine-learning research List of datasets in computer vision and image processing Outline of machine learning
v t e

The softmax function, also known as softargmax or normalized exponential function, converts a vector of K real numbers into a probability distribution of K possible outcomes. It is a generalization of the logistic function to multiple dimensions, and is used in multinomial logistic regression. The softmax function is often used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes.

Definition

The softmax function takes as input a vector z of K real numbers, and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative, or greater than one; and might not sum to 1; but after applying softmax, each component will be in the interval $(0,1)$ , and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.

Formally, the standard (unit) softmax function $\sigma \colon \mathbb {R} ^{K}\to (0,1)^{K}$ , where $K>1$ , takes a vector $\mathbf {z} =(z_{1},\dotsc ,z_{K})\in \mathbb {R} ^{K}$ and computes each component of vector $\sigma (\mathbf {z} )\in (0,1)^{K}$ with

$\sigma (\mathbf {z} )_{i}={\frac {e^{z_{i}}}{\sum _{j=1}^{K}e^{z_{j}}}}\,.$

In words, the softmax applies the standard exponential function to each element $z_{i}$ of the input vector $\mathbf {z}$ (consisting of $K$ real numbers), and normalizes these values by dividing by the sum of all these exponentials. The normalization ensures that the sum of the components of the output vector $\sigma (\mathbf {z} )$ is 1. The term "softmax" derives from the amplifying effects of the exponential on any maxima in the input vector. For example, the standard softmax of $(1,2,8)$ is approximately $(0.001,0.002,0.997)$ , which amounts to assigning almost all of the total unit weight in the result to the position of the vector's maximal element (of 8).

In general, instead of e a different base b > 0 can be used. As above, if b > 1 then larger input components will result in larger output probabilities, and increasing the value of b will create probability distributions that are more concentrated around the positions of the largest input values. Conversely, if 0 < b < 1 then smaller input components will result in larger output probabilities, and decreasing the value of b will create probability distributions that are more concentrated around the positions of the smallest input values. Writing $b=e^{\beta }$ or $b=e^{-\beta }$ (for real β) yields the expressions:

$\sigma (\mathbf {z} )_{i}={\frac {e^{\beta z_{i}}}{\sum _{j=1}^{K}e^{\beta z_{j}}}}{\text{ or }}\sigma (\mathbf {z} )_{i}={\frac {e^{-\beta z_{i}}}{\sum _{j=1}^{K}e^{-\beta z_{j}}}}{\text{ for }}i=1,\dotsc ,K.$

A value proportional to the reciprocal of β is sometimes referred to as the temperature: ${\textstyle \beta =1/kT}$ , where k is typically 1 or the Boltzmann constant and T is the temperature. A higher temperature results in a more uniform output distribution (i.e. with higher entropy; it is "more random"), while a lower temperature results in a sharper output distribution, with one value dominating.

In some fields, the base is fixed, corresponding to a fixed scale, while in others the parameter β (or T) is varied.

Interpretations

Smooth arg max

Probability theory

In probability theory, the output of the softargmax function can be used to represent a categorical distribution – that is, a probability distribution over K different possible outcomes.

Statistical mechanics

In statistical mechanics, the softargmax function is known as the Boltzmann distribution (or Gibbs distribution): the index set ${1,\,\dots ,\,k}$ are the microstates of the system; the inputs $z_{i}$ are the energies of that state; the denominator is known as the partition function, often denoted by Z; and the factor β is called the coldness (or thermodynamic beta, or inverse temperature).

Applications

The softmax function is used in various multiclass classification methods, such as multinomial logistic regression (also known as softmax regression), multiclass linear discriminant analysis, naive Bayes classifiers, and artificial neural networks. Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of K distinct linear functions, and the predicted probability for the jth class given a sample vector x and a weighting vector w is:

$P(y=j\mid \mathbf {x} )={\frac {e^{\mathbf {x} ^{\mathsf {T}}\mathbf {w} _{j}}}{\sum _{k=1}^{K}e^{\mathbf {x} ^{\mathsf {T}}\mathbf {w} _{k}}}}$

This can be seen as the composition of K linear functions $\mathbf {x} \mapsto \mathbf {x} ^{\mathsf {T}}\mathbf {w} _{1},\ldots ,\mathbf {x} \mapsto \mathbf {x} ^{\mathsf {T}}\mathbf {w} _{K}$ and the softmax function (where $\mathbf {x} ^{\mathsf {T}}\mathbf {w}$ denotes the inner product of $\mathbf {x}$ and $\mathbf {w}$ ). The operation is equivalent to applying a linear operator defined by $\mathbf {w}$ to vectors $\mathbf {x}$ , thus transforming the original, probably highly-dimensional, input to vectors in a K-dimensional space $\mathbb {R} ^{K}$ .

Neural networks

The standard softmax function is often used in the final layer of a neural network-based classifier. Such networks are commonly trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression.

Since the function maps a vector and a specific index $i$ to a real value, the derivative needs to take the index into account:

${\frac {\partial }{\partial q_{k}}}\sigma ({\textbf {q}},i)=\sigma ({\textbf {q}},i)(\delta _{ik}-\sigma ({\textbf {q}},k)).$

This expression is symmetrical in the indexes $i,k$ and thus may also be expressed as

${\frac {\partial }{\partial q_{k}}}\sigma ({\textbf {q}},i)=\sigma ({\textbf {q}},k)(\delta _{ik}-\sigma ({\textbf {q}},i)).$

Here, the Kronecker delta is used for simplicity (cf. the derivative of a sigmoid function, being expressed via the function itself).

To ensure stable numerical computations subtracting the maximum value from the input vector is common. This approach, while not altering the output or the derivative theoretically, enhances stability by directly controlling the maximum exponent value computed.

If the function is scaled with the parameter $\beta$ , then these expressions must be multiplied by $\beta$ .

See multinomial logit for a probability model which uses the softmax activation function.

Reinforcement learning

In the field of reinforcement learning, a softmax function can be used to convert values into action probabilities. The function commonly used is: $P_{t}(a)={\frac {\exp(q_{t}(a)/\tau )}{\sum _{i=1}^{n}\exp(q_{t}(i)/\tau )}}{\text{,}}$

where the action value $q_{t}(a)$ corresponds to the expected reward of following action a and $\tau$ is called a temperature parameter (in allusion to statistical mechanics). For high temperatures ( $\tau \to \infty$ ), all actions have nearly the same probability and the lower the temperature, the more expected rewards affect the probability. For a low temperature ( $\tau \to 0^{+}$ ), the probability of the action with the highest expected reward tends to 1.

Computational complexity and remedies

In neural network applications, the number K of possible outcomes is often large, e.g. in case of neural language models that predict the most likely outcome out of a vocabulary which might contain millions of possible words. This can make the calculations for the softmax layer (i.e. the matrix multiplications to determine the $z_{i}$ , followed by the application of the softmax function itself) computationally expensive. What's more, the gradient descent backpropagation method for training such a neural network involves calculating the softmax for every training example, and the number of training examples can also become large. The computational effort for the softmax became a major limiting factor in the development of larger neural language models, motivating various remedies to reduce training times.

Approaches that reorganize the softmax layer for more efficient calculation include the hierarchical softmax and the differentiated softmax. The hierarchical softmax (introduced by Morin and Bengio in 2005) uses a binary tree structure where the outcomes (vocabulary words) are the leaves and the intermediate nodes are suitably selected "classes" of outcomes, forming latent variables. The desired probability (softmax value) of a leaf (outcome) can then be calculated as the product of the probabilities of all nodes on the path from the root to that leaf. Ideally, when the tree is balanced, this would reduce the computational complexity from $O(K)$ to $O(\log _{2}K)$ . In practice, results depend on choosing a good strategy for clustering the outcomes into classes. A Huffman tree was used for this in Google's word2vec models (introduced in 2013) to achieve scalability.

A second kind of remedies is based on approximating the softmax (during training) with modified loss functions that avoid the calculation of the full normalization factor. These include methods that restrict the normalization sum to a sample of outcomes (e.g. Importance Sampling, Target Sampling).

Mathematical properties

Geometrically the softmax function maps the vector space $\mathbb {R} ^{K}$ to the boundary of the standard $(K-1)$ -simplex, cutting the dimension by one (the range is a $(K-1)$ -dimensional simplex in $K$ -dimensional space), due to the linear constraint that all output sum to 1 meaning it lies on a hyperplane.

Along the main diagonal $(x,\,x,\,\dots ,\,x),$ softmax is just the uniform distribution on outputs, $(1/n,\dots ,1/n)$ : equal scores yield equal probabilities.

More generally, softmax is invariant under translation by the same value in each coordinate: adding $\mathbf {c} =(c,\,\dots ,\,c)$ to the inputs $\mathbf {z}$ yields $\sigma (\mathbf {z} +\mathbf {c} )=\sigma (\mathbf {z} )$ , because it multiplies each exponent by the same factor, $e^{c}$ (because $e^{z_{i}+c}=e^{z_{i}}\cdot e^{c}$ ), so the ratios do not change: $\sigma (\mathbf {z} +\mathbf {c} )_{j}={\frac {e^{z_{j}+c}}{\sum _{k=1}^{K}e^{z_{k}+c}}}={\frac {e^{z_{j}}\cdot e^{c}}{\sum _{k=1}^{K}e^{z_{k}}\cdot e^{c}}}=\sigma (\mathbf {z} )_{j}.$

Geometrically, softmax is constant along diagonals: this is the dimension that is eliminated, and corresponds to the softmax output being independent of a translation in the input scores (a choice of 0 score). One can normalize input scores by assuming that the sum is zero (subtract the average: $\mathbf {c}$ where ${\textstyle c={\frac {1}{n}}\sum z_{i}}$ ), and then the softmax takes the hyperplane of points that sum to zero, ${\textstyle \sum z_{i}=0}$ , to the open simplex of positive values that sum to 1 ${\textstyle \sum \sigma (\mathbf {z} )_{i}=1}$ , analogously to how the exponent takes 0 to 1, $e^{0}=1$ and is positive.

By contrast, softmax is not invariant under scaling. For instance, $\sigma {\bigl (}(0,\,1){\bigr )}={\bigl (}1/(1+e),\,e/(1+e){\bigr )}$ but $\sigma {\bigl (}(0,2){\bigr )}={\bigl (}1/\left(1+e^{2}\right),\,e^{2}/\left(1+e^{2}\right){\bigr )}.$

The standard logistic function is the special case for a 1-dimensional axis in 2-dimensional space, say the x-axis in the (x, y) plane. One variable is fixed at 0 (say $z_{2}=0$ ), so $e^{0}=1$ , and the other variable can vary, denote it $z_{1}=x$ , so ${\textstyle e^{z_{1}}/\sum _{k=1}^{2}e^{z_{k}}=e^{x}/\left(e^{x}+1\right),}$ the standard logistic function, and ${\textstyle e^{z_{2}}/\sum _{k=1}^{2}e^{z_{k}}=1/\left(e^{x}+1\right),}$ its complement (meaning they add up to 1). The 1-dimensional input could alternatively be expressed as the line $(x/2,\,-x/2)$ , with outputs $e^{x/2}/\left(e^{x/2}+e^{-x/2}\right)=e^{x}/\left(e^{x}+1\right)$ and $e^{-x/2}/\left(e^{x/2}+e^{-x/2}\right)=1/\left(e^{x}+1\right).$

The softmax function is also the gradient of the LogSumExp function, a smooth maximum:

${\frac {\partial }{\partial z_{i}}}\operatorname {LSE} (\mathbf {z} )={\frac {\exp z_{i}}{\sum _{j=1}^{K}\exp z_{j}}}=\sigma (\mathbf {z} )_{i},\quad {\text{ for }}i=1,\dotsc ,K,\quad \mathbf {z} =(z_{1},\,\dotsc ,\,z_{K})\in \mathbb {R} ^{K},$

where the LogSumExp function is defined as $\operatorname {LSE} (z_{1},\,\dots ,\,z_{n})=\log \left(\exp(z_{1})+\cdots +\exp(z_{n})\right)$ .

History

The softmax function was used in statistical mechanics as the Boltzmann distribution in the foundational paper Boltzmann (1868), formalized and popularized in the influential textbook Gibbs (1902).

The use of the softmax in decision theory is credited to R. Duncan Luce, who used the axiom of independence of irrelevant alternatives in rational choice theory to deduce the softmax in Luce's choice axiom for relative preferences.

In machine learning, the term "softmax" is credited to John S. Bridle in two 1989 conference papers, Bridle (1990a): and Bridle (1990b):

We are concerned with feed-forward non-linear networks (multi-layer perceptrons, or MLPs) with multiple outputs. We wish to treat the outputs of the network as probabilities of alternatives (e.g. pattern classes), conditioned on the inputs. We look for appropriate output non-linearities and for appropriate criteria for adaptation of the parameters of the network (e.g. weights). We explain two modifications: probability scoring, which is an alternative to squared error minimisation, and a normalised exponential (softmax) multi-input generalisation of the logistic non-linearity.

For any input, the outputs must all be positive and they must sum to unity. ...
Given a set of unconstrained values, ⁠ $V_{j}(x)$ ⁠, we can ensure both conditions by using a Normalised Exponential transformation: $Q_{j}(x)=\left.e^{V_{j}(x)}\right/\sum _{k}e^{V_{k}(x)}$ This transformation can be considered a multi-input generalisation of the logistic, operating on the whole output layer. It preserves the rank order of its input values, and is a differentiable generalisation of the 'winner-take-all' operation of picking the maximum value. For this reason we like to refer to it as softmax.

Example

With an input of (1, 2, 3, 4, 1, 2, 3), the softmax is approximately (0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175). The output has most of its weight where the "4" was in the original input. This is what the function is normally used for: to highlight the largest values and suppress values which are significantly below the maximum value. But note: a change of temperature changes the output. When the temperature is multiplied by 10, the inputs are effectively (0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3) and the softmax is approximately (0.125, 0.138, 0.153, 0.169, 0.125, 0.138, 0.153). This shows that high temperatures de-emphasize the maximum value.

Computation of this example using Python code:

>>> import numpy as np
>>> z = np.array()
>>> beta = 1.0
>>> np.exp(beta * z) / np.sum(np.exp(beta * z)) 
array([0.02364054, 0.06426166, 0.1746813, 0.474833, 0.02364054,
       0.06426166, 0.1746813])

Alternatives

The softmax function generates probability predictions densely distributed over its support. Other functions like sparsemax or α-entmax can be used when sparse probability predictions are desired. Also the Gumbel-softmax reparametrization trick can be used when sampling from a discrete-discrete distribution needs to be mimicked in a differentiable manner.

Notes

Positive β corresponds to the maximum convention, and is usual in machine learning, corresponding to the highest score having highest probability. The negative −β corresponds to the minimum convention, and is conventional in thermodynamics, corresponding to the lowest energy state having the highest probability; this matches the convention in the Gibbs distribution, interpreting β as coldness.
The notation β is for the thermodynamic beta, which is inverse temperature: $\beta =1/T$ , $T=1/\beta .$
For $\beta =0$ (coldness zero, infinite temperature), $b=e^{\beta }=e^{0}=1$ , and this becomes the constant function ⁠ $(1/n,\dots ,1/n)$ ⁠, corresponding to the discrete uniform distribution.
In statistical mechanics, fixing β is interpreted as having coldness and temperature of 1.

References

Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). "6.2.2.3 Softmax Units for Multinoulli Output Distributions". Deep Learning. MIT Press. pp. 180–184. ISBN 978-0-26203561-3.
^ Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer. ISBN 0-387-31073-8.
^ Sako, Yusaku (2018-06-02). "Is the term "softmax" driving you nuts?". Medium.
Goodfellow, Bengio & Courville 2016, pp. 183–184: The name "softmax" can be somewhat confusing. The function is more closely related to the arg max function than the max function. The term "soft" derives from the fact that the softmax function is continuous and differentiable. The arg max function, with its result represented as a one-hot vector, is not continuous nor differentiable. The softmax function thus provides a "softened" version of the arg max. The corresponding soft version of the maximum function is $\operatorname {softmax} (\mathbf {z} )^{\top }\mathbf {z}$ . It would perhaps be better to call the softmax function "softargmax," but the current name is an entrenched convention.
LeCun, Yann; Chopra, Sumit; Hadsell, Raia; Ranzato, Marc’Aurelio; Huang, Fu Jie (2006). "A Tutorial on Energy-Based Learning" (PDF). In Gökhan Bakır; Thomas Hofmann; Bernhard Schölkopf; Alexander J. Smola; Ben Taskar; S.V.N Vishwanathan (eds.). Predicting Structured Data. Neural Information Processing series. MIT Press. ISBN 978-0-26202617-8.
"Unsupervised Feature Learning and Deep Learning Tutorial". ufldl.stanford.edu. Retrieved 2024-03-25.
ai-faq What is a softmax activation function?
Sutton, R. S. and Barto A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998. Softmax Action Selection
^ Onal, Kezban Dilek; Zhang, Ye; Altingovde, Ismail Sengor; Rahman, Md Mustafizur; Karagoz, Pinar; Braylan, Alex; Dang, Brandon; Chang, Heng-Lu; Kim, Henna; McNamara, Quinten; Angert, Aaron (2018-06-01). "Neural information retrieval: at the end of the early years". Information Retrieval Journal. 21 (2): 111–182. doi:10.1007/s10791-017-9321-y. hdl:11245.1/008d6e8f-df13-4abf-8ae9-6ff2e17377f3. ISSN 1573-7659. S2CID 21684923.
^ Chen, Wenlin; Grangier, David; Auli, Michael (August 2016). "Strategies for Training Large Vocabulary Neural Language Models". Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics: 1975–1985. arXiv:1512.04906. doi:10.18653/v1/P16-1186. S2CID 6035643.
^ Morin, Frederic; Bengio, Yoshua (2005-01-06). "Hierarchical Probabilistic Neural Network Language Model" (PDF). International Workshop on Artificial Intelligence and Statistics. PMLR: 246–252.
Boltzmann, Ludwig (1868). "Studien über das Gleichgewicht der lebendigen Kraft zwischen bewegten materiellen Punkten" [Studies on the balance of living force between moving material points]. Wiener Berichte. 58: 517–560.
Gibbs, Josiah Willard (1902). Elementary Principles in Statistical Mechanics.
^ Gao, Bolin; Pavel, Lacra (2017). "On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning". arXiv:1704.00805 .
Bridle, John S. (1990a). Soulié F.F.; Hérault J. (eds.). Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition. Neurocomputing: Algorithms, Architectures and Applications (1989). NATO ASI Series (Series F: Computer and Systems Sciences). Vol. 68. Berlin, Heidelberg: Springer. pp. 227–236. doi:10.1007/978-3-642-76153-9_28.
Bridle, John S. (1990b). D. S. Touretzky (ed.). Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters. Advances in Neural Information Processing Systems 2 (1989). Morgan-Kaufmann.
"Speeding Up Entmax" by Maxat Tezekbayev, Vassilina Nikoulina, Matthias Gallé, Zhenisbek Assylbekov, https://arxiv.org/abs/2111.06832v3

Artificial intelligence

Concepts

Applications

Implementations

Audio–visual	AlexNet WaveNet Human image synthesis HWR OCR Speech synthesis 15.ai ElevenLabs Speech recognition Whisper Facial recognition AlphaFold Text-to-image models Aurora DALL-E Firefly Flux Ideogram Imagen Midjourney Stable Diffusion Text-to-video models Dream Machine Gen-3 Alpha Hailuo AI Kling Sora Veo Music generation Suno AI Udio
Text	Word2vec Seq2seq GloVe BERT T5 Llama Chinchilla AI PaLM GPT 1 2 3 J ChatGPT 4 4o o1 o3 Claude Gemini chatbot Grok LaMDA BLOOM Project Debater IBM Watson IBM Watsonx Granite PanGu-Σ
Decisional	AlphaGo AlphaZero OpenAI Five Self-driving car MuZero Action selection AutoGPT Robot control