Inductive probability
{{short description|Determining the probability of future events based on past events}}
'''Inductive probability''' attempts to give the ] of future events based on past events. It is the basis for ], and gives the mathematical basis for ] and the perception of patterns. It is a source of ] about the world.


There are three sources of knowledge: ], communication, and deduction. Communication relays information found using other methods. Deduction establishes new facts based on existing facts. Inference establishes new facts from data. Its basis is ].


Information describing the world is written in a language. For example, a simple mathematical language of propositions may be chosen. Sentences may be written down in this language as strings of characters. But in the computer it is possible to encode these sentences as strings of bits (1s and 0s). Then the language may be encoded so that the most commonly used sentences are the shortest. This internal language implicitly represents probabilities of statements.


] says the "simplest theory, consistent with the data is most likely to be correct". The "simplest theory" is interpreted as the representation of the theory written in this internal language. The theory with the shortest encoding in this internal language is most likely to be correct.
The basis of inference is Bayes' law. But this law is sometimes hard to apply and understand. A simpler way to understand inference is in terms of quantities of information.


== History ==


Probability and statistics were focused on ]s and tests of significance. Probability was formal, well defined, but limited in scope. In particular, its application was limited to situations that could be defined as an experiment or trial, with a well-defined population.


] is named after the Rev. ] (1701–1761). ] broadened the application of probability to many situations where a population was not well defined. But Bayes' theorem always depended on prior probabilities to generate new probabilities. It was unclear where these prior probabilities should come from.
] developed ] circa 1964, which gave an explanation for what randomness is and how patterns in the data may be represented by computer programs that give shorter representations of the data.

] and D. M. Boulton developed ] circa 1968. Later ] developed the ] circa 1978. These methods allow ] to be related to probability, in a way that can be compared to the application of Bayes' theorem, but which gives a source and explanation for the role of prior probabilities.

] combined ] with the work of Ray Solomonoff and ] to give a theory for the ] behavior of an ], circa 1998.

== Inferred probability ==

===Minimum description/message length===

The program with the shortest length that matches the data is the most likely to predict future data. This is the thesis behind the ]<ref>{{cite journal|last=Wallace|first=Chris| author2=Boulton |title=An information measure for classification |journal=Computer Journal|year=1968|volume=11|issue=2|pages=185–194|doi=10.1093/comjnl/11.2.185|doi-access=free}}</ref> and ]<ref>{{Cite journal | last1 = Rissanen | first1 = J. | title = Modeling by shortest data description | doi = 10.1016/0005-1098(78)90005-5 | journal = Automatica | volume = 14 | issue = 5 | pages = 465–658 | year = 1978 }}</ref> methods.

At first sight ] appears different from the minimum message/description length principle. On closer inspection it turns out to be the same. Bayes' theorem is about conditional probabilities, and states the probability that event ''B'' happens given that event ''A'' happens:

:<math>P(A \land B) = P(B) \cdot P(A | B) = P(A) \cdot P(B | A)</math>

becomes in terms of message length ''L'',
:<math>L(A \land B) = L(B) + L(A | B) = L(A) + L(B | A).</math>

This means that if all the information is given describing an event then the length of the information may be used to give the raw probability of the event. So if the information describing the occurrence of ''A'' is given, along with the information describing ''B'' given ''A'', then all the information describing ''A'' and ''B'' has been given.<ref>
{{cite web
|last=Allison
|first=Lloyd
|title=Minimum Message Length (MML) – LA's MML introduction
|url=http://www.csse.monash.edu.au/~lloyd/tildeMML
}}</ref>
<ref>
{{cite web
|last=Oliver
|first=J. J.
|last2=Baxter
|first2=Rohan A.
|title=MML and Bayesianism: Similarities and Differences (Introduction to Minimum Encoding Inference – Part II)
|year=1994
|url=http://citeseerx.ist.psu.edu/viewdoc/similar;jsessionid=65475C44F4C425AFE77BCAE59D49CE92?doi=10.1.1.1.7367&type=ab}}</ref>
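The correspondence between message lengths and probabilities can be checked numerically. The following is an illustrative sketch (the probabilities are made-up numbers), using the relation <math>L(x) = -\log_2 P(x)</math>:

```python
import math

def message_length(p):
    """Length in bits of an optimal code for an event of probability p."""
    return -math.log2(p)

# Hypothetical probabilities for two events A and B.
p_a = 0.5           # P(A)
p_b_given_a = 0.25  # P(B | A)
p_ab = p_a * p_b_given_a  # P(A and B) = P(A) P(B | A)

# The additive form of Bayes' theorem: L(A and B) = L(A) + L(B | A)
assert math.isclose(message_length(p_ab),
                    message_length(p_a) + message_length(p_b_given_a))
```

Multiplying probabilities becomes adding message lengths because the logarithm turns products into sums.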

====Overfitting====

] occurs when the model matches the random noise and not the pattern in the data. For example, take the situation where a curve is fitted to a set of points. If a polynomial with many terms is fitted then it can more closely represent the data. Then the fit will be better, and the information needed to describe the deviations from the fitted curve will be smaller. Smaller information length means higher probability.

However, the information needed to describe the curve must also be considered. The total information for a curve with many terms may be greater than for a curve with fewer terms, that has not as good a fit, but needs less information to describe the polynomial.
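The trade-off can be sketched with a toy calculation. The bit counts below are hypothetical numbers, chosen only to illustrate how the total description length penalizes an overfitted model:

```python
def total_length(model_bits, residual_bits):
    """Two-part description length: bits to state the model plus bits for the deviations."""
    return model_bits + residual_bits

# A 10-term polynomial fits the noise almost exactly: large model, tiny residuals.
overfitted = total_length(model_bits=10 * 32, residual_bits=5)

# A 2-term line leaves larger deviations: small model, more residual bits.
simple = total_length(model_bits=2 * 32, residual_bits=60)

# The simpler model gives the shorter total encoding, hence higher probability.
assert simple < overfitted
```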

===Inference based on program complexity===

] is also inductive inference. A bit string ''x'' is observed. Then consider all programs that generate strings starting with ''x''. Cast in the form of inductive inference, the programs are theories that imply the observation of the bit string ''x''.

The method used here to give probabilities for inductive inference is based on ].

====Detecting patterns in the data====

If all the bits are 1, then people infer that there is a bias in the coin and that it is more likely that the next bit is also 1. This is described as learning from, or detecting, a pattern in the data.

Such a pattern may be represented by a ]. A short computer program may be written that produces a series of bits which are all 1. If the length of the program ''K'' is <math>L(K)</math> bits then its prior probability is,
:<math>P(K) = 2^{-L(K)}</math>

The length of the shortest program that represents the string of bits is called the ].

Kolmogorov complexity is not computable. This is related to the ]. When searching for the shortest program some programs may go into an infinite loop.
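As an informal sketch (not the formal definition), a highly patterned string has a short generating description, and the prior <math>P(K) = 2^{-L(K)}</math> favours shorter programs:

```python
# A thousand-character string of 1s, and a short program-like description of it.
data = "1" * 1000
description = 'print("1" * 1000)'  # stand-in for a short generating program

# The description is far shorter than the data it generates.
assert len(description) < len(data)

# Under the prior P(K) = 2^(-L(K)), shorter descriptions get higher probability.
def prior(length_bits):
    return 2.0 ** -length_bits

assert prior(8 * len(description)) > prior(8 * len(data))
```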

====Considering all theories====

The Greek philosopher ] is quoted as saying "If more than one theory is consistent with the observations, keep all theories".<ref>Li, M. and Vitanyi, P., ''An Introduction to Kolmogorov Complexity and Its Applications'', 3rd Edition, Springer Science and Business Media, N.Y., 2008, p 347</ref>

As in a crime novel all theories must be considered in determining the likely murderer, so with inductive probability all programs must be considered in determining the likely future bits arising from the stream of bits.

Programs that are already longer than ''n'', the length of the observed bit string ''x'', have no predictive power. The raw (or prior) probability that the pattern of bits is random (has no pattern) is <math>2^{-n}</math>.

Each program that produces the sequence of bits, but is shorter than ''n'', is a theory/pattern about the bits with a probability of <math>2^{-k}</math> where ''k'' is the length of the program.

The probability of receiving a sequence of bits ''y'' after receiving a series of bits ''x'' is then the ] of receiving ''y'' given ''x'', which is the probability of ''x'' with ''y'' appended, divided by the probability of ''x''.<ref>Solomonoff, R., "A Preliminary Report on a General Theory of Inductive Inference", Report V-131, Zator Co., Cambridge, Ma., Feb 4, 1960, revised Nov. 1960.</ref><ref>Solomonoff, R., "A Formal Theory of Inductive Inference, Part I", ''Information and Control'', Vol 7, No. 1, pp 1–22, March 1964.</ref><ref>Solomonoff, R., "A Formal Theory of Inductive Inference, Part II", ''Information and Control'', Vol 7, No. 2, pp 224–254, June 1964.</ref>
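This prediction rule can be illustrated with a toy model. The "programs" below are hand-picked stand-ins, not a real enumeration; each is a pair of an assumed length ''k'' in bits and a generator for its output:

```python
# A toy universe of "programs": pairs of (length k in bits, output generator).
# These are hand-picked stand-ins, not a real enumeration of programs.
programs = [
    (3, lambda n: "1" * n),                # short program: all 1s
    (5, lambda n: ("10" * n)[:n]),         # alternating 1s and 0s
    (9, lambda n: ("111" + "0" * n)[:n]),  # three 1s, then 0s
]

def prob(prefix):
    """Sum of 2^-k over programs whose output starts with this prefix."""
    return sum(2.0 ** -k for k, gen in programs if gen(len(prefix)) == prefix)

x = "111"
# Conditional probability of the next bit: P(x + "1") / P(x)
p_next_is_one = prob(x + "1") / prob(x)
assert p_next_is_one > 0.5  # the shortest matching program dominates the prediction
```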

====Universal priors====

The programming language affects the predictions of the next bit in the string. The language acts as a ]. This is particularly a problem where the programming language codes for numbers and other data types. Intuitively we think that 0 and 1 are simple numbers, and that prime numbers are somehow more complex than numbers that may be composite.

Using the ] gives an unbiased estimate (a universal prior) of the prior probability of a number. As a thought experiment an ] may be fitted with a data input device giving a series of numbers, after applying some transformation function to the raw numbers. Another agent might have the same input device with a different transformation function. The agents do not see or know about these transformation functions. There appears to be no rational basis for preferring one function over another. A universal prior ensures that although two agents may have different initial probability distributions for the data input, the difference will be bounded by a constant.

So universal priors do not eliminate an initial bias, but they reduce and limit it. Whenever we describe an event in a language, either using a natural language or other, the language has encoded in it our prior expectations. So some reliance on prior probabilities is inevitable.

A problem arises where an intelligent agent's prior expectations interact with the environment to form a self-reinforcing feedback loop. This is the problem of bias or prejudice. Universal priors reduce but do not eliminate this problem.

===Universal artificial intelligence===

The theory of ] applies ] to inductive probabilities. The theory shows how the best actions to optimize a reward function may be chosen. The result is a theoretical model of intelligence.<ref>{{cite book |last=Hutter|first=Marcus |title=Sequential Decisions Based on Algorithmic Probability |year=1998|publisher=Springer |isbn=3-540-22139-5}}</ref>

It is a fundamental theory of intelligence, which optimizes the agent's behavior in,
* Exploring the environment; performing actions to get responses that broaden the agent's knowledge.
* Competing or co-operating with another agent; games.
* Balancing short and long term rewards.

In general no agent will always provide the best actions in all situations. A particular choice made by an agent may be wrong, and the environment may provide no way for the agent to recover from an initial bad choice. However the agent is ] in the sense that no other agent will do better than this agent in this environment, without doing worse in another environment. No other agent may, in this sense, be said to be better.

At present the theory is limited by incomputability (the ]). Approximations may be used to avoid this. Processing speed and ] remain the primary limiting factors for ].

== Probability ==


Probability is the representation of uncertain or partial knowledge about the truth of statements. Probabilities are subjective and personal estimates of likely outcomes based on past experience and inferences made from the data.
This description of probability may seem strange at first. In natural language we refer to "the probability" that the sun will rise tomorrow. We do not refer to "your probability" that the sun will rise. But in order for inference to be correctly modeled probability must be personal, and the act of inference generates new posterior probabilities from prior probabilities.


Probabilities are personal because they are conditional on the knowledge of the individual. Probabilities are subjective because they always depend, to some extent, on prior probabilities assigned by the individual. Subjective should not be taken here to mean vague or undefined.


The term ] is used to refer to the holder of the probabilities. The intelligent agent may be a human or a machine. If the intelligent agent does not interact with the environment then the probability will converge over time to the frequency of the event.
If however the agent uses the probability to interact with the environment there may be a feedback, so that two agents in the identical environment starting with only slightly different priors, end up with completely different probabilities. In this case optimal ] as in ] Universal Artificial Intelligence will give ] performance for the agent. This means that no other intelligent agent could do better in one environment without doing worse in another environment.


=== Comparison to deductive probability ===


In deductive probability theories, probabilities are absolutes, independent of the individual making the assessment. But deductive probabilities are based on,
* Shared knowledge.
* Assumed facts, that should be inferred from the data.


For example, in a trial the participants are aware of the outcomes of all the previous history of trials. They also assume that each outcome is equally probable. Together this allows a single unconditional value of probability to be defined.


But in reality each individual does not have the same information. And in general the probability of each outcome is not equal. The dice may be loaded, and this loading needs to be inferred from the data.
=== Probability as estimation ===


The ] has played a key role in probability theory. It says that if N statements are symmetric so that one condition cannot be preferred over another then all statements are equally probable.<ref>{{cite web|last1=Carnap|first1=Rudolf|title=Statistical and Inductive Probability|url=http://fitelson.org/probability/carnap_saip.pdf|author1-link=Carnap}}</ref>


Taken seriously, in evaluating probability this principle leads to contradictions. Suppose there are 3 bags of gold in the distance and you are asked to select one. Because of the distance you cannot see the bag sizes. Using the principle of indifference you estimate that each bag has an equal amount of gold: each bag has one third of the gold.


Now, while you are not looking, someone takes one of the bags and divides it into 3 bags. Now there are 5 bags of gold. The principle of indifference now says each bag has one fifth of the gold. A bag that was estimated to have one third of the gold is now estimated to have one fifth of the gold.


Taken as a value associated with the bag, the values are different and therefore contradictory. But taken as estimates given under particular scenarios, both values are separate estimates given under different circumstances and there is no reason to believe they are equal.


Estimates of prior probabilities are particularly suspect. Estimates will be constructed that do not follow any consistent frequency distribution. For this reason prior probabilities are considered as estimates of probabilities rather than probabilities.
* The estimation procedure used to give the probability.


===Combining probability approaches===

Inductive probability combines two different approaches to probability:
* Probability and information
* Probability and frequency

Each approach gives a slightly different viewpoint. Information theory is used in relating probabilities to quantities of information. This approach is often used in giving estimates of prior probabilities.

] defines probabilities as objective statements about how often an event occurs. This approach may be stretched by defining the ] to be over ]s. Statements about possible worlds define ].

== Probability and information ==

Whereas logic represents only two values, true and false, as the values of a statement, probability associates a number between 0.0 and 1.0 with each statement. If the probability of a statement is 0, the statement is false. If the probability of a statement is 1, the statement is true.


In considering some data as a string of bits, the prior probabilities of a 1 and a 0 are equal. Therefore, each extra bit halves the probability of a sequence of bits.
This leads to the conclusion that,
:<math>P(x) = 2^{-L(x)}</math>
where <math>P(x)</math> is the probability of the string of bits <math>x</math> and <math>L(x)</math> is its length.
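A quick check of this prior: over all bit strings of a fixed length, the probabilities <math>2^{-L(x)}</math> sum to 1.

```python
from itertools import product

def prior(bits):
    """P(x) = 2^(-L(x)) for a string of bits x."""
    return 2.0 ** -len(bits)

# Over all bit strings of length n the priors sum to 1.
n = 4
total = sum(prior("".join(b)) for b in product("01", repeat=n))
assert abs(total - 1.0) < 1e-12  # each extra bit halves the probability
```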


The prior probability of any statement is calculated from the number of bits needed to state it. See also ].


=== Combining information ===


Two statements <math>A</math> and <math>B</math> may be represented by two separate encodings. Then the length of the encoding is,


: <math>L(A \land B) = L(A) + L(B)</math>


or in terms of probability,


: <math>P(A \land B) = P(A) P(B)</math>


But this law is not always true because there may be a shorter method of encoding <math>B</math> if we assume <math>A</math>. So the above probability law applies only if <math>A</math> and <math>B</math> are "independent".
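This can be seen with a general-purpose compressor standing in for the encoding (an illustration, not the formal construction): when <math>B</math> closely resembles <math>A</math>, encoding <math>B</math> after <math>A</math> is cheaper than encoding the two separately.

```python
import zlib

a = b"the quick brown fox jumps over the lazy dog. " * 4
b = b"the quick brown fox jumps over the lazy dog! " * 4  # nearly the same as a

separate = len(zlib.compress(a)) + len(zlib.compress(b))  # L(A) + L(B)
joint = len(zlib.compress(a + b))                         # encode B after A

# Because B is predictable from A, the joint encoding is shorter:
# the two statements are not independent.
assert joint < separate
```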


===The internal language of information===

The primary use of the information approach to probability is to provide estimates of the complexity of statements. Recall that Occam's razor states that "All things being equal, the simplest theory is the most likely to be correct". In order to apply this rule, first there needs to be a definition of what "simplest" means. Information theory defines simplest to mean having the shortest encoding.

Knowledge is represented as ]. Each statement is a ] ]. Expressions are encoded by a function that takes a description (as against the value) of the expression and encodes it as a bit string.

The length of the encoding of a statement gives an estimate of the probability of a statement. This probability estimate will often be used as the prior probability of a statement.

Technically this estimate is not a probability because it is not constructed from a frequency distribution. The probability estimates given by it do not always obey ]. Applying the law of total probability to various scenarios will usually give a more accurate estimate of the prior probability than the estimate from the length of the statement.

====Encoding expressions====

An expression is constructed from sub-expressions,
* Constants (including function identifiers).
* Applications of functions.
* ].

A ] must distinguish the 3 cases. The length of each code is based on the frequency of each type of sub-expression.

Initially constants are all assigned the same length/probability. Later, constants may be assigned a probability using the Huffman code based on the number of uses of the function identifier in all expressions recorded so far. In using a Huffman code the goal is to estimate probabilities, not to compress the data.

The length of a function application is the length of the function identifier constant plus the sum of the sizes of the expressions for each parameter.

The length of a quantifier is the length of the expression being quantified over.
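The scheme above might be sketched as follows. The frequency counts and identifier sizes are hypothetical, and ideal (fractional-bit) code lengths stand in for actual Huffman codes:

```python
import math

# Hypothetical counts of each kind of sub-expression recorded so far.
counts = {"constant": 50, "application": 30, "quantifier": 20}
total = sum(counts.values())

def tag_bits(kind):
    """Ideal code length of the tag distinguishing the three cases."""
    return -math.log2(counts[kind] / total)

def length(expr):
    """Encoded length of an expression given as nested tuples."""
    kind = expr[0]
    if kind == "constant":     # ("constant", identifier_bits)
        return tag_bits("constant") + expr[1]
    if kind == "application":  # ("application", function_id_bits, [arguments])
        return tag_bits("application") + expr[1] + sum(length(a) for a in expr[2])
    if kind == "quantifier":   # ("quantifier", body)
        return tag_bits("quantifier") + length(expr[1])

# Length of f(x, y) with hypothetical 4-bit identifiers.
f_xy = ("application", 4, [("constant", 4), ("constant", 4)])
assert length(f_xy) > length(("constant", 4))
```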

====Distribution of numbers====

No explicit representation of natural numbers is given. However natural numbers may be constructed by applying the successor function to 0, and then applying other arithmetic functions. A distribution of natural numbers is implied by this, based on the complexity of constructing each number.

Rational numbers are constructed by the division of natural numbers. The simplest representation has no common factors between the numerator and the denominator. This allows the probability distribution of natural numbers to be extended to rational numbers.
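A toy version of the implied distribution: if the numeral for ''n'' is built from ''n'' applications of the successor function to 0, longer constructions get lower prior probability. The per-symbol cost here is a made-up constant.

```python
# Hypothetical cost: 2 bits per symbol; the numeral for n uses n successors plus "0".
def numeral_bits(n):
    return 2 * (n + 1)

def prior(n):
    """Prior implied by construction complexity: P(n) = 2^(-L(n))."""
    return 2.0 ** -numeral_bits(n)

# Simpler (smaller) numbers get shorter constructions and higher probability.
assert prior(0) > prior(1) > prior(5)
```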

== Probability and frequency ==

The probability of an ] may be interpreted as the frequencies of ] where the statement is true divided by the total number of outcomes. If the outcomes form a continuum the frequency may need to be replaced with a ].

Events are sets of outcomes. Statements may be related to events. A Boolean statement B about outcomes defines a set of outcomes b,
: <math> b = \{x : B(x)\} </math>
=== Conditional probability ===

Each probability is always associated with the state of knowledge at a particular point in the argument. Probabilities before an inference are known as prior probabilities, and probabilities after are known as posterior probabilities.

Probability depends on the facts known. The truth of a fact limits the domain of outcomes to the outcomes consistent with the fact. Prior probabilities are the probabilities before a fact is known. Posterior probabilities are after a fact is known. The posterior probabilities are said to be conditional on the fact. The probability that <math>B</math> is true given that <math>A</math> is true is written as <math>P(B | A)</math>.

All probabilities are in some sense conditional. The prior probability of <math>B</math> is,
: <math>P(B) = P(B | \top)</math>

=== The frequentist approach applied to possible worlds ===

In the ], probabilities are defined as the ratio of the number of ] within an event to the total number of outcomes. In the ] model each possible world is an outcome, and statements about possible worlds define events. The probability of a statement being true is the number of possible worlds where the statement is true divided by the total number of possible worlds.

The total number of worlds may be infinite. In this case, instead of counting the elements of the set, a ] must be used. In general the cardinality <math>|S|</math>, where <math>S</math> is a set, is a measure.

The probability of a statement <math>A</math> being true about possible worlds is then,
: <math> P(A) = \frac{|\{x : A(x)\}|}{|\{x : \top\}|} </math>


For a conditional probability,
: <math> P(B | A) = \frac{|\{x : A(x) \land B(x)\}|}{|\{x : A(x)\}|} </math>

then

: <math> \begin{align} P(A \land B) &= \frac{|\{x : A(x) \land B(x)\}|}{|\{x : \top\}|} \\
&= \frac{|\{x : A(x) \land B(x)\}|}{|\{x : A(x)\}|} \frac{|\{x : A(x)\}|}{|\{x : \top\}|} \\
&= P(A) P(B | A)
\end{align}</math>


Using symmetry this equation may be written out as Bayes' law.
: <math> P(A \land B) = P(A) P(B | A) = P(B) P(A | B)</math>


This law describes the relationship between prior and posterior probabilities when new facts are learnt.


Written as quantities of information ] becomes,
: <math>L(A \land B) = L(A) + L(B | A) = L(B) + L(A | B)</math>


Two statements A and B are said to be independent if knowing the truth of A does not change the probability of B. Mathematically this is,
: <math>P(B) = P(B | A)</math>


then ] reduces to,
: <math>P(A \land B) = P(A) P(B)</math>
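Bayes' law can be verified by direct counting over a small set of possible worlds. Here each world is an assumed toy outcome: a roll of two dice.

```python
from itertools import product

# Possible worlds: all 36 outcomes of rolling two dice.
worlds = list(product(range(1, 7), repeat=2))

def P(pred):
    """Probability as the fraction of possible worlds where the statement is true."""
    return sum(1 for w in worlds if pred(w)) / len(worlds)

def P_given(b, a):
    """Conditional probability: count within the worlds where a holds."""
    return sum(1 for w in worlds if a(w) and b(w)) / sum(1 for w in worlds if a(w))

A = lambda w: w[0] + w[1] == 7   # statement A: the dice sum to 7
B = lambda w: w[0] == 3          # statement B: the first die shows 3

# Bayes' law: P(A and B) = P(A) P(B | A) = P(B) P(A | B)
p_ab = P(lambda w: A(w) and B(w))
assert abs(p_ab - P(A) * P_given(B, A)) < 1e-12
assert abs(p_ab - P(B) * P_given(A, B)) < 1e-12
```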


=== The law of total probability ===


For a set of mutually exclusive possibilities <math>A_i</math>, the sum of the posterior probabilities must be 1.
: <math>\sum_i{P(A_i | B)} = 1</math>


Substituting using Bayes' theorem gives the ]
: <math>\sum_i{P(B | A_i)P(A_i)} = \sum_i{P(A_i | B)P(B)}</math>


: <math>P(B) = \sum_i{P(B|A_i) P(A_i)}</math> : <math>P(B) = \sum_i{P(B | A_i) P(A_i)}</math>


This result is used to give the extended form of Bayes' theorem,
: <math>P(A_i | B) = \frac{P(B | A_i) P(A_i)}{\sum_j{P(B | A_j) P(A_j)}}</math>

This is the usual form of Bayes' theorem used in practice, because it guarantees the sum of all the posterior probabilities for <math>A_i</math> is 1.
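The normalization in the extended form can be carried out directly. A minimal Python sketch, with invented priors and likelihoods:

```python
# Posteriors from the extended form of Bayes' theorem.  The priors and
# likelihoods below are invented for illustration.
priors = [0.5, 0.3, 0.2]         # P(A_i), mutually exclusive, sum to 1
likelihoods = [0.9, 0.5, 0.1]    # P(B|A_i)

# P(B) by the law of total probability
p_b = sum(p * l for p, l in zip(priors, likelihoods))

# P(A_i|B) = P(B|A_i) P(A_i) / sum_j P(B|A_j) P(A_j)
posteriors = [p * l / p_b for p, l in zip(priors, likelihoods)]
print(posteriors)
```

The denominator is P(B) given by the law of total probability, which is why the posteriors always sum to 1.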


For mutually exclusive possibilities, the probabilities add.
:<math> P(A \lor B) = P(A) + P(B), \qquad \text{if } P(A \land B) = 0 </math>

Using
: <math> A \lor B = (A \land \neg (A \land B)) \lor (B \land \neg (A \land B)) \lor (A \land B)</math>
Then the alternatives
: <math> A \land \neg (A \land B), \quad B \land \neg (A \land B), \quad A \land B </math>
are all mutually exclusive. Also,

: <math> (A \land \neg (A \land B)) \lor (A \land B) = A </math>
: <math> P(A \land \neg (A \land B)) + P(A \land B) = P(A) </math>
: <math> P(A \land \neg (A \land B)) = P(A) - P(A \land B) </math>

so, putting it all together,

: <math> \begin{align}
P(A \lor B) &= P((A \land \neg (A \land B)) \lor (B \land \neg (A \land B)) \lor (A \land B)) \\
&= P(A \land \neg (A \land B)) + P(B \land \neg (A \land B)) + P(A \land B) \\
&= P(A) - P(A \land B) + P(B) - P(A \land B) + P(A \land B) \\
&= P(A) + P(B) - P(A \land B)
\end{align}</math>
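The addition law can be checked mechanically by counting over a finite sample space. A Python sketch, with two arbitrary example predicates:

```python
# Check P(A or B) = P(A) + P(B) - P(A and B) by counting over a small
# finite sample space.  The predicates A and B are arbitrary examples.
space = range(100)
A = lambda x: x % 2 == 0          # even numbers
B = lambda x: x % 3 == 0          # multiples of 3

def count(pred):
    return sum(1 for x in space if pred(x))

n_or = count(lambda x: A(x) or B(x))
n_and = count(lambda x: A(x) and B(x))

# 67 = 50 + 34 - 17
print(n_or, count(A) + count(B) - n_and)
```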


=== Negation ===

As,
: <math> A \lor \neg A = \top </math>
then
: <math> P(A) + P(\neg A) = 1</math>


Implication is related to conditional probability by the following equation,
:<math>A \to B \iff P(B | A) = 1</math>


Derivation,
:<math>\begin{align}
A \to B & \iff P(A \to B) = 1 \\
&\iff P(A \land B \lor \neg A) = 1 \\
&\iff P(A \land B) + P(\neg A) = 1 \\
&\iff P(A \land B) = P(A) \\
&\iff P(A) \cdot P(B | A) = P(A) \\
&\iff P(B | A) = 1
\end{align}</math>

===Probability priors from encoding length===

Knowledge may be represented as statements. Each statement is a Boolean expression. Expressions are encoded by a function that takes a description (as against the value) of the expression and encodes it as a bit string.

The length of the encoding of a statement gives an estimate of the probability of the statement. This probability estimate will often be used as the prior probability of a statement.

Technically this estimate is not a probability because it is not constructed from a frequency distribution. The probability estimates given by it do not always obey the law of total probability. Applying the law of total probability to various scenarios will usually give a more accurate probability estimate of the prior probability than the estimate from the length of the statement.

====Encoding expressions====

An expression is constructed from sub-expressions,
* Constants (including function identifier).
* Application of functions.
* Quantifiers.

A code must distinguish the 3 cases. The length of each code is based on the frequency of each type of sub-expression.

Initially constants are all assigned the same length/probability. Later constants may be assigned a probability based on the number of uses of the function id in the expression.

The length of a function application is the length of the function identifier constant plus the sum of the sizes of the expressions for each parameter.

The length of a quantifier is the length of the expression being quantified over.

====Distribution of numbers====

No explicit representation of natural numbers is given. However natural numbers may be constructed by applying the successor function to 0, and then applying other arithmetic functions. A distribution of natural numbers is implied by this, based on the complexity of constructing each number.

Rational numbers are constructed by the division of natural numbers. The simplest representation has no common factors between the numerator and the denominator. This allows the probability distribution of natural numbers to be extended to rational numbers.
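As an illustrative sketch of such a distribution, the following Python code assigns each natural number a weight <math>2^{-L(n)}</math>, where <math>L(n)</math> is the size of the shortest construction of ''n'' from 0, a successor function and multiplication. The one-bit-per-symbol cost model is invented for illustration:

```python
from functools import lru_cache

# Toy illustration of a prior over natural numbers based on construction
# complexity.  The language (0, successor S, multiplication) and the
# one-bit-per-symbol cost model are invented for illustration.
@lru_cache(maxsize=None)
def min_size(n):
    # size in symbols of the shortest expression constructing n
    if n == 0:
        return 1                      # the constant 0
    best = 1 + min_size(n - 1)        # S applied to an expression for n-1
    a = 2
    while a * a <= n:                 # try n = a * b with a <= b
        if n % a == 0:
            best = min(best, 1 + min_size(a) + min_size(n // a))
        a += 1
    return best

# Composite numbers can be cheaper than nearby primes:
# 12 = 3*4 costs 10 symbols, while the prime 13 costs 11.
weights = {n: 2.0 ** -min_size(n) for n in range(20)}
print(min_size(12), min_size(13))
```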


== Bayesian hypothesis testing == == Bayesian hypothesis testing ==
Bayes' theorem may be used to estimate the probability of a hypothesis or theory H, given some facts F. The posterior probability of H is then

: <math>P(H | F) = \frac{P(H)P(F | H)}{P(F)}</math>

or in terms of information,
: <math>P(H | F) = 2^{-(L(H) + L(F | H) - L(F))} </math>


By assuming the hypothesis is true, a simpler representation of the statement F may be given. The length of the encoding of this simpler representation is <math>L(F | H).</math>

<math>L(H) + L(F | H) </math> represents the amount of information needed to represent the facts F, if H is true. <math>L(F)</math> is the amount of information needed to represent F without the hypothesis H. The difference is how much the representation of the facts has been compressed by assuming that H is true. This is the evidence that the hypothesis H is true.

If <math>L(F)</math> is estimated from encoding length then the probability obtained will not be between 0 and 1. The value obtained is proportional to the probability, without being a good probability estimate. The number obtained is sometimes referred to as a relative probability, being how much more probable the theory is than not holding the theory.

If a full set of mutually exclusive hypotheses that provide evidence is known, a proper estimate may be given for the prior probability <math>P(F)</math>.


Probabilities may be calculated from the extended form of Bayes' theorem. Given all mutually exclusive hypotheses <math>H_i</math> which give evidence, such that,
: <math>L(H_i) + L(F | H_i) < L(F)</math>

and also the hypothesis R, that none of the hypotheses is true, then,
: <math> \begin{align}
P(H_i | F) &= \frac{P(H_i) P(F | H_i)}{P(F | R) + \sum_j{P(H_j) P(F | H_j)}} \\
P(R | F) &= \frac{P(F | R)}{P(F | R) + \sum_j{P(H_j) P(F | H_j)}}
\end{align}</math>


In terms of information,

: <math>\begin{align}
P(H_i | F) &= \frac{2^{-(L(H_i) + L(F | H_i))}}{2^{-L(F | R)} + \sum_j 2^{-(L(H_j) + L(F | H_j))}} \\
P(R | F) &= \frac{2^{-L(F | R)}}{2^{-L(F | R)} + \sum_j{2^{-(L(H_j) + L(F | H_j))}}}
\end{align}</math>

In most situations it is a good approximation to assume that <math>F</math> is independent of <math>R</math>, which means <math>P(F | R) = P(F)</math>, giving,

: <math>\begin{align}
P(H_i | F) &\approx \frac{2^{-(L(H_i) + L(F | H_i))}}{2^{-L(F)} + \sum_j{2^{-(L(H_j) + L(F | H_j))}}} \\
P(R | F) &\approx \frac{2^{-L(F)}}{2^{-L(F)} + \sum_j{2^{-(L(H_j) + L(F | H_j))}}}
\end{align}</math>
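These approximations translate directly into code. A Python sketch, with invented bit lengths:

```python
# Posterior probabilities from encoding lengths, using
#   P(H_i|F) ~ 2^-(L(H_i)+L(F|H_i)) / (2^-L(F) + sum_j 2^-(L(H_j)+L(F|H_j)))
# The bit lengths below are invented for illustration.
L_F = 20.0                    # bits to encode the facts F directly
hypotheses = {"H1": (8, 6),   # name: (L(H), L(F|H))
              "H2": (12, 10)}

denom = 2.0 ** -L_F + sum(2.0 ** -(lh + lfh) for lh, lfh in hypotheses.values())
posterior = {h: 2.0 ** -(lh + lfh) / denom for h, (lh, lfh) in hypotheses.items()}
p_none = 2.0 ** -L_F / denom  # P(R|F): no listed hypothesis holds

print(posterior, p_none)
```

A hypothesis only gains support when its total code length L(H) + L(F|H) is below L(F), as with H1 here (14 bits versus 20).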


==Boolean inductive inference== ==Boolean inductive inference==


Abductive inference<ref>{{cite book |title=Abduction |year=2017 |publisher=Metaphysics Research Lab, Stanford University |url=http://plato.stanford.edu/entries/abduction/ }}</ref><ref>{{cite journal |first1=Niki |last1=Pfeifer |first2=Gernot D. |last2=Kleiter |title=Inference in conditional probability logic |journal=Kybernetika |date=2006 |volume=42 |issue=4 |pages=391–404}}</ref><ref>{{cite web |title=Conditional Probability |url=http://artint.info/html/ArtInt_142.html |work=Artificial Intelligence - Foundations of computational agents}}</ref><ref>{{cite web |title=Introduction to the theory of Inductive Logic Programming (ILP) |url=http://www.cs.ox.ac.uk/activities/machlearn/ilp_theory.html}}</ref> starts with a set of facts ''F'' which is a statement (Boolean expression). Abduction is of the form,
:''A theory T implies the statement F. As the theory T is simpler than F, abduction says that there is a probability that the theory T is implied by F''. :''A theory T implies the statement F. As the theory T is simpler than F, abduction says that there is a probability that the theory T is implied by F''.


The theory ''T'', also called an explanation of the condition ''F'', is an answer to the ubiquitous factual "why" question. For example, the condition ''F'' is "Why do apples fall?". The answer is a theory ''T'' that implies that apples fall;
:<math>F = G \frac{m_1 m_2}{r^2}</math>


Inductive inference is of the form,
:''All observed objects in a class C have a property P. Therefore there is a probability that all objects in a class C have a property P.''

In terms of abductive inference, ''all objects in a class C or set have a property P'' is a theory that implies the observed condition, ''All observed objects in a class C have a property P''.


So inductive inference is a special case of abductive inference. In common usage the term inductive inference is often used to refer to both abductive and inductive inference.

===Generalization and specialization===

Inductive inference is related to generalization. Generalizations may be formed from statements by replacing a specific value with membership of a category, or by replacing membership of a category with membership of a broader category. In deductive logic, generalization is a powerful method of generating new theories that may be true. In inductive inference generalization generates theories that have a probability of being true.

The opposite of generalization is specialization. Specialization is used in applying a general rule to a specific case. Specializations are created from generalizations by replacing membership of a category by a specific value, or by replacing a category with a sub category.

The classification of living things and objects forms the basis for generalization and specialization. The ability to identify, recognize and classify is the basis for generalization. Perceiving the world as a collection of objects appears to be a key aspect of human intelligence. It is the object oriented model, in the non-programming sense.

The object oriented model is constructed from our perception. In particular, vision is based on the ability to compare two images and calculate how much information is needed to morph or map one image into another. Stereo vision uses this mapping to construct 3D images from pairs of images.

Inductive logic programming is a means of constructing a theory that implies a condition. Plotkin's <ref>{{cite journal|first1=Gordon D.|last1=Plotkin|title=A Note on Inductive Generalization|editor1-first=B.|editor1-last=Meltzer|editor2-first=D.|editor2-last=Michie|publisher=Edinburgh University Press|journal=Machine Intelligence|volume=5|pages=153–163|year=1970}}</ref><ref>{{cite journal|first1=Gordon D.|last1=Plotkin|title=A Further Note on Inductive Generalization|editor1-first=B.|editor1-last=Meltzer|editor2-first=D.|editor2-last=Michie|publisher=Edinburgh University Press|journal=Machine Intelligence|volume=6|pages=101–124|year=1971}}</ref> "''relative least general generalization (rlgg)''" approach constructs the simplest generalization consistent with the condition.


===Newton's use of induction=== ===Newton's use of induction===


Isaac Newton used inductive arguments in constructing his law of universal gravitation.<ref>Isaac Newton: "In philosophy particular propositions are inferred from the phenomena and afterwards rendered general by induction": "Principia", Book 3, General Scholium, at p.392 in Volume 2 of Andrew Motte's English translation published 1729.</ref> Starting with the statement,
* The center of an apple falls towards the center of the Earth.

Generalizing by replacing apple with object, and Earth with another object gives, in a two body system,
* The center of an object falls towards the center of another object.




Implication is related to conditional probability as,
:<math>T \to F \iff P(F | T) = 1</math>


So,
: <math>P(F \mid T) = 1</math> : <math>P(F | T) = 1</math>
: <math>L(F \mid T) = 0</math> : <math>L(F | T) = 0</math>


This result may be used in the probabilities given for Bayesian hypothesis testing. For a single theory, H = T and,
: <math>P(T | F) = \frac{P(T)}{P(F)}</math>

or in terms of information, the relative probability is,
: <math>P(T | F) = 2^{-(L(T) - L(F))} </math>


Note that this estimate for P(T|F) is not a true probability. If <math>L(T_i) < L(F)</math> then the theory has evidence to support it. Then for a set of theories <math>T_i = H_i</math>, such that <math>L(T_i) < L(F)</math>,

: <math>P(T_i | F) = \frac{P(T_i)}{P(F | R) + \sum_j{P(T_j)}}</math>
: <math>P(R | F) = \frac{P(F | R)}{P(F | R) + \sum_j{P(T_j)}}</math>

giving,
: <math>P(T_i | F) \approx \frac{2^{-L(T_i)}}{2^{-L(F)} + \sum_j{2^{-L(T_j)}}}</math>
: <math>P(R | F) \approx \frac{2^{-L(F)}}{2^{-L(F)} + \sum_j{2^{-L(T_j)}}}</math>

==Inference based on program complexity==

Solomonoff's theory of inductive inference is also inductive inference. A bit string x is observed. Then consider all programs that generate strings starting with x. Cast in the form of inductive inference, the programs are theories that imply the observation of the bit string ''x''.

The method used here to give probabilities for inductive inference is based on Solomonoff's theory of inductive inference.

===Detecting patterns in the data===

If a series of coin flips is recorded as a string of bits and all the bits are 1, then people infer that there is a bias in the coin and that it is more likely that the next bit is 1 also. This is described as learning from, or detecting a pattern in the data.

Such a pattern may be represented by a computer program. A short computer program may be written that produces a series of bits which are all 1. If the length of the program ''K'' is <math>L(K)</math> bits then its prior probability is,
:<math>P(K) = 2^{-L(K)}</math>

The length of the shortest program that represents the string of bits is called the Kolmogorov complexity.

Kolmogorov complexity is not computable. This is related to the halting problem. When searching for the shortest program some programs may go into an infinite loop.

===Considering all theories===

The Greek philosopher ] is quoted as saying "If more than one theory is consistent with the observations, keep all theories".<ref>Li, M. and Vitanyi, P., ''An Introduction to Kolmogorov Complexity and Its Applications'', 3rd Edition, Springer Science and Business Media, N.Y., 2008, p 347</ref>

As in a crime novel all theories must be considered in determining the likely murderer, so with inductive probability all programs must be considered in determining the likely future bits arising from the stream of bits.

Programs that are already longer than ''n'' have no predictive power. The raw (or prior) probability that the pattern of bits is random (has no pattern) is <math>2^{-n}</math>.

Each program that produces the sequence of bits, but is shorter than the ''n'' is a theory/pattern about the bits with a probability of <math>2^{-k}</math> where ''k'' is the length of the program.

The probability of receiving a sequence of bits ''y'' after receiving a series of bits x is then the conditional probability of receiving ''y'' given ''x'', which is the probability of x with y appended, divided by the probability of x.<ref>Solomonoff, R., "A Preliminary Report on a General Theory of Inductive Inference", Report V-131, Zator Co., Cambridge, Ma. Feb 4, 1960, revision, Nov., 1960.</ref><ref>Solomonoff, R., "A Formal Theory of Inductive Inference, Part I", ''Information and Control'', Vol 7, No. 1 pp 1–22, March 1964.</ref><ref>Solomonoff, R., "A Formal Theory of Inductive Inference, Part II", ''Information and Control'', Vol 7, No. 2 pp 224–254, June 1964.</ref>
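A toy Python sketch of this prediction rule; a small invented set of programs stands in for the enumeration of all programs:

```python
# Toy sketch of prediction by weighting programs.  Real Solomonoff
# induction enumerates all programs; a small invented set stands in for
# them here.  Each entry: (program length in bits, first n output bits).
programs = [
    (3, lambda n: "1" * n),                       # "print ones forever"
    (5, lambda n: ("10" * n)[:n]),                # "alternate 1 and 0"
    (9, lambda n: ("1110" + "0" * n)[:n]),        # "1110 then zeros"
]

def p(prefix):
    # total weight 2^-length of programs whose output starts with prefix
    return sum(2.0 ** -k for k, out in programs if out(len(prefix)) == prefix)

x = "111"                         # bits observed so far
p_next_1 = p(x + "1") / p(x)      # probability the next bit is 1
print(p_next_1)
```

The shortest consistent program ("all ones", 3 bits) dominates the prediction, but the longer program that outputs "1110" retains a small weight.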

===Universal priors===

The programming language affects the predictions of the next bit in the string. The language acts as a prior probability. This is particularly a problem where the programming language codes for numbers and other data types. Intuitively we think that 0 and 1 are simple numbers, and that prime numbers are somehow more complex than numbers that may be factorized.

Using the universal Turing machine gives an unbiased estimate (a universal prior) of the prior probability of a number. As a thought experiment an intelligent agent may be fitted with a data input device giving a series of numbers, after applying some transformation function to the raw numbers. Another agent might have the same input device with a different transformation function. The agents do not see or know about these transformation functions. Then there appears no rational basis for preferring one function over another. A universal prior ensures that although two agents may have different initial probability distributions for the data input, the difference will be bounded by a constant.

So universal priors do not eliminate an initial bias, but they reduce and limit it. Whenever we describe an event in a language, either using a natural language or other, the language has encoded in it our prior expectations. So some reliance on prior probabilities is inevitable.

A problem arises where an intelligent agent's prior expectations interact with the environment to form a self reinforcing feedback loop. This is the problem of bias or prejudice. Universal priors reduce but do not eliminate this problem.

==Minimum description/message length==

The program with the shortest length that matches the data is the most likely to predict future data. This is the thesis behind the minimum message length<ref>{{cite journal |last=Wallace |first=Chris |author2=Boulton |title=An information measure for classification |journal=Computer Journal |year=1968 |volume=11 |issue=2 |pages=185–194}}</ref> and minimum description length<ref>{{cite journal |last=Rissanen |first=J. |title=Modeling by shortest data description |journal=Automatica |year=1978 |volume=14 |issue=5 |pages=465–471 |doi=10.1016/0005-1098(78)90005-5}}</ref> methods.

At first sight Bayes' theorem appears different from the minimum message/description length principle. At closer inspection it turns out to be the same. Bayes' theorem is about conditional probabilities. What is the probability that event ''B'' happens if firstly event ''A'' happens?

:<math>P(A \land B) = P(B) \cdot P(A | B) = P(A) \cdot P(B | A)</math>

becomes, in terms of message length ''L'',
:<math>L(A \land B) = L(B) + L(A | B) = L(A) + L(B | A)</math>

What this means is that in describing an event, if all the information is given describing the event then the length of the information may be used to give the raw probability of the event. So if the information describing the occurrence of ''A'' is given, along with the information describing ''B'' given ''A'', then all the information describing ''A'' and ''B'' has been given.<ref>
{{cite web
|last=Allison
|first=Lloyd
|title=Minimum Message Length (MML) – LA's MML introduction
|url=http://www.csse.monash.edu.au/~lloyd/tildeMML
}}</ref>
<ref>
{{cite web
|last=Oliver
|first=J. J.
|last2=Baxter
|first2=Rohan A.
|title=MML and Bayesianism: Similarities and Differences (Introduction to Minimum Encoding Inference – Part II)
|url=http://citeseerx.ist.psu.edu/viewdoc/similar;jsessionid=65475C44F4C425AFE77BCAE59D49CE92?doi=10.1.1.1.7367&type=ab}}</ref>
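The correspondence is just <math>L(E) = -\log_2 P(E)</math>. A small numeric check in Python, with invented probabilities:

```python
import math

# Bayes' theorem P(A and B) = P(A) P(B|A) becomes additive in message
# lengths: L(A and B) = L(A) + L(B|A), with L(E) = -log2 P(E).
# The probabilities are invented for illustration.
p_a = 0.25            # P(A)
p_b_given_a = 0.5     # P(B|A)

L = lambda p: -math.log2(p)
lhs = L(p_a * p_b_given_a)       # L(A and B)
rhs = L(p_a) + L(p_b_given_a)    # L(A) + L(B|A)
print(lhs, rhs)                  # 3.0 3.0
```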

===Overfitting===

Overfitting occurs when the model matches the random noise and not the pattern in the data. For example, take the situation where a curve is fitted to a set of points. If a polynomial with many terms is fitted then it can more closely represent the data. Then the fit will be better, and the information needed to describe the deviations from the fitted curve will be smaller. Smaller information length means more probable.

However, the information needed to describe the curve must also be considered. The total information for a curve with many terms may be greater than for a curve with fewer terms which does not fit as well, but needs less information to describe the polynomial.
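The trade-off can be sketched numerically as a two-part description length: bits for the model parameters plus bits for the residuals. The cost model below (16 bits per parameter, a crude residual code) is invented for illustration:

```python
import math

# Two-part description length: bits for the model plus bits for the
# residuals.  The cost model (16 bits per parameter, a crude residual
# code) is invented for illustration.
data = [10, 11, 9, 10, 12, 8, 10, 11]

def residual_bits(residuals):
    # crude code: about log2(1 + |r|) + 1 bits per residual
    return sum(math.log2(1 + abs(r)) + 1 for r in residuals)

# Model 1: a single mean parameter, residuals coded
mean = sum(data) / len(data)
cost_simple = 16 + residual_bits(x - mean for x in data)

# Model 2: one parameter per point ("overfitted"), residuals are zero
cost_overfit = 16 * len(data) + residual_bits(0 for _ in data)

print(cost_simple, cost_overfit)
```

The overfitted model leaves nothing to code in the residuals, yet its total description length is far larger, so the simpler model is the more probable one.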

==Universal artificial intelligence==

The theory of universal artificial intelligence applies decision theory to inductive probabilities. The theory shows how the best actions to optimize a reward function may be chosen. The result is a theoretical model of intelligence.<ref>{{cite book |last=Hutter |first=Marcus |title=Sequential Decisions Based on Algorithmic Probability |year=1998 |publisher=Springer |isbn=3-540-22139-5}}</ref>

It is a fundamental theory of intelligence, which optimizes the agent's behavior in,
* Exploring the environment; performing actions to get responses that broaden the agent's knowledge.
* Competing or co-operating with another agent; games.
* Balancing short and long term rewards.

In general no agent will always provide the best actions in all situations. A particular choice made by an agent may be wrong, and the environment may provide no way for the agent to recover from an initial bad choice. However the agent is Pareto optimal in the sense that no other agent will do better than this agent in this environment, without doing worse in another environment. No other agent may, in this sense, be said to be better.

At present the theory is limited by incomputability (the halting problem). Approximations may be used to avoid this. Processing speed and combinatorial explosion remain the primary limiting factors for artificial intelligence.



==Derivations== ==Derivations==


Make a list of all the shortest programs <math>K_i</math> that each produce a distinct infinite string of bits, and satisfy the relation,

:<math>T_n(R(K_i)) = x</math> :<math>T_n(R(K_i)) = x</math>
where <math>R(K_i)</math> is the result of running the program <math>K_i</math> and <math>T_n</math> truncates the string after ''n'' bits.

The problem is to calculate the probability that the source is produced by program <math>K_i,</math> given that the truncated source after n bits is ''x''. This is represented by the conditional probability,

:<math>P(s = R(K_i) | T_n(s) = x)</math>


Using the extended form of Bayes' theorem,


:<math>P(s = R(K_i) |T_n(s) = x) = \frac{P(T_n(s) = x|s = R(K_i))P(s = R(K_i))}{\sum_j P(T_n(s) = x|s = R(K_j)) P(s = R(K_j))}.</math>


The extended form relies on the law of total probability. This means that the <math>s = R(K_i) </math> must be distinct possibilities, which is given by the condition that each <math>K_i</math> produce a different infinite string. Also one of the conditions <math>s = R(K_i) </math> must be true. This must be true, as in the limit as <math>n \to \infty,</math> there is always at least one program that produces <math>T_n(s)</math>.


As <math>K_i</math> are chosen so that <math>T_n(R(K_i)) = x,</math> then,
:<math>P(T_n(s) = x | s = R(K_i)) = 1 </math>


The a priori probability of the string being produced from the program, given no information about the string, is based on the size of the program,
:<math>P(s = R(K_i)) = 2^{-I(K_i)}</math> :<math>P(s = R(K_i)) = 2^{-I(K_i)}</math>


giving,
:<math>P(s = R(K_i) | T_n(s) = x) = \frac{2^{-I(K_i)}}{\sum_j 2^{-I(K_j)}}.</math>


Programs that are the same or longer than the length of ''x'' provide no predictive power. Separate them out giving,
:<math>P(s = R(K_i) | T_n(s) = x) = \frac{2^{-I(K_i)}}{\sum_{j:I(K_j)<n} 2^{-I(K_j)}+\sum_{j:I(K_j)\geqslant n} 2^{-I(K_j)}}.</math>


Then identify the two probabilities as,
:<math>P(x \text{ has pattern}) = \sum_{j:I(K_j)<n} 2^{-I(K_j)}</math>
:<math>P(x \text{ is random}) = \sum_{j:I(K_j)\geqslant n} 2^{-I(K_j)}</math>


But the prior probability that ''x'' is a random set of bits is <math>2^{-n}</math>. So,
:<math>P(s = R(K_i) | T_n(s) = x) = \frac{2^{-I(K_i)}}{2^{-n} + \sum_{j:I(K_j)<n} 2^{-I(K_j)}}.</math>


The probability that the source is random, or unpredictable is, The probability that the source is random, or unpredictable is,
:<math>P(\operatorname{random}(s)\mid T_n(s) = x) = \frac{2^{-n}}{2^{-n} + \sum\limits_{j:I(K_j)<n} 2^{-I(K_j)}}\cdot</math> :<math>P(\operatorname{random}(s) | T_n(s) = x) = \frac{2^{-n}}{2^{-n} + \sum_{j:I(K_j)<n} 2^{-I(K_j)}}.</math>


===A model for inductive inference=== ===A model for inductive inference===
Line 510: Line 476:
A model of how worlds are constructed is used in determining the probabilities of theories, A model of how worlds are constructed is used in determining the probabilities of theories,
* A random bit string is selected. * A random bit string is selected.
* A condition is constructed from the bit string, using the ]. * A condition is constructed from the bit string.
* A world is constructed that is consistent with the condition. * A world is constructed that is consistent with the condition.


Line 525: Line 491:


] may be applied ] may be applied
:<math>P(A_i\mid B) = \frac{P(B\mid A_i)\,P(A_i)}{\sum\limits_j P(B\mid A_j)\,P(A_j)}\cdot</math> :<math>P(A_i | B) = \frac{P(B | A_i)\,P(A_i)}{\sum_j P(B | A_j)\,P(A_j)},</math>
where, where,
:<math>B = E(C)</math> :<math>B = E(C)</math>
:<math>A_i = E(t)</math> :<math>A_i = E(t)</math>


To apply Bayes' theorem the following must hold, To apply Bayes' theorem the following must hold: <math>A_i</math> is a ] of the event space.
* <math>A_i</math> is a ] of the event space.


For <math>T(C)</math> to be a partition, no bit string ''n'' may belong to two theories. To prove this assume they can and derive a contradiction, For <math>T(C)</math> to be a partition, no bit string ''n'' may belong to two theories. To prove this assume they can and derive a contradiction,
:<math>N \in T \and N \in M \and N \ne M \and n \in E(N) \and n \in E(M)</math> :<math>(N \in T) \land (N \in M) \land (N \ne M) \land (n \in E(N) \land n \in E(M))</math>
:<math>\implies N \ne M \and R(n) \equiv N \and R(n) \equiv M</math> :<math>\implies (N \ne M) \land R(n) \equiv N \land R(n) \equiv M</math>
:<math>\implies \operatorname{false}</math> :<math>\implies \bot</math>


Secondly prove that ''T'' includes all outcomes consistent with the condition. As all theories consistent with ''C'' are included then <math>R(w)</math> must be in this set. Secondly prove that ''T'' includes all outcomes consistent with the condition. As all theories consistent with ''C'' are included then <math>R(w)</math> must be in this set.


So Bayes theorem may be applied as specified giving, So Bayes theorem may be applied as specified giving,
:<math>\forall t \in T(C), P(E(t) \mid E(C)) = \frac{P(E(t)) \cdot P(E(C) \mid E(t))}{\sum_{j \in T(C)} P(E(j)) \cdot P(E(C) \mid E(j))} </math> :<math>\forall t \in T(C), P(E(t) | E(C)) = \frac{P(E(t)) \cdot P(E(C) | E(t))}{\sum_{j \in T(C)} P(E(j)) \cdot P(E(C) | E(j))} </math>


Using the ], the definition of <math>T(C)</math> implies, Using the ], the definition of <math>T(C)</math> implies,
:<math>\forall t \in T(C), P(E(C) \mid E(t)) = 1</math> :<math>\forall t \in T(C), P(E(C) | E(t)) = 1</math>


The probability of each theory in T is given by, The probability of each theory in ''T'' is given by,
:<math> \forall t \in T(C), P(E(t)) = \sum_{n: R(n) \equiv t} 2^{-L(n)}</math> :<math> \forall t \in T(C), P(E(t)) = \sum_{n: R(n) \equiv t} 2^{-L(n)}</math>


so, so,
:<math>\forall t \in T(C), P(E(t) \mid E(C)) = \frac{\sum_{n: R(n) \equiv t} 2^{-L(n)}}{\sum_{j \in T(C)} \sum_{m: R(m) \equiv j} 2^{-L(m)})} </math> :<math>\forall t \in T(C), P(E(t) | E(C)) = \frac{\sum_{n: R(n) \equiv t} 2^{-L(n)}}{\sum_{j \in T(C)} \sum_{m: R(m) \equiv j} 2^{-L(m)}} </math>


Finally the probabilities of the events may be identified with the probabilities of the condition which the outcomes in the event satisfy, Finally the probabilities of the events may be identified with the probabilities of the condition which the outcomes in the event satisfy,
:<math>\forall t \in T(C), P(E(t) \mid E(C)) = P(t \mid C)</math> :<math>\forall t \in T(C), P(E(t) | E(C)) = P(t | C)</math>


giving giving
:<math>\forall t \in T(C), P(t \mid C) = \frac{\sum_{n: R(n) \equiv t} 2^{-L(n)}}{\sum_{j \in T(C)} \sum_{m: R(m) \equiv j} 2^{-L(m)}} </math> :<math>\forall t \in T(C), P(t | C) = \frac{\sum_{n: R(n) \equiv t} 2^{-L(n)}}{\sum_{j \in T(C)} \sum_{m: R(m) \equiv j} 2^{-L(m)}} </math>


This is the probability of the theory ''t'' after observing that the condition ''C'' holds. This is the probability of the theory ''t'' after observing that the condition ''C'' holds.
Line 563: Line 528:


Theories that are less probable than the condition ''C'' have no predictive power. Separate them out giving, Theories that are less probable than the condition ''C'' have no predictive power. Separate them out giving,
:<math>\forall t \in T(C), P(t \mid C) = \frac{P(E(t))}{(\sum_{j : j \in T(C) \and P(E(j)) > P(E(C))} P(E(j))) + (\sum_{j : j \in T(C) \and P(E(j)) \le P(E(C))} P(j))} </math> :<math>\forall t \in T(C), P(t | C) = \frac{P(E(t))}{(\sum_{j : j \in T(C) \land P(E(j)) > P(E(C))} P(E(j))) + (\sum_{j : j \in T(C) \land P(E(j)) \le P(E(C))} P(j))} </math>


The probability of the theories without predictive power on ''C'' is the same as the probability of ''C''. So, The probability of the theories without predictive power on ''C'' is the same as the probability of ''C''. So,
:<math>P(E(C)) = \sum_{j : j \in T(C) \and P(E(j)) \le P(E(C))} P(j)</math> :<math>P(E(C)) = \sum_{j : j \in T(C) \land P(E(j)) \le P(E(C))} P(j)</math>


So the probability So the probability
:<math>\forall t \in T(C), P(t \mid C) = \frac{P(E(t))}{P(E(C)) + \sum_{j : j \in T(C) \and P(E(j)) > P(E(C))} P(E(j))} </math> :<math>\forall t \in T(C), P(t | C) = \frac{P(E(t))}{P(E(C)) + \sum_{j : j \in T(C) \land P(E(j)) > P(E(C))} P(E(j))} </math>


and the probability of no prediction for C, written as <math>\operatorname{random}(C)</math>, and the probability of no prediction for C, written as <math>\operatorname{random}(C)</math>,
:<math>P(\operatorname{random}(C) \mid C) = \frac{P(E(C))}{P(E(C)) + \sum_{j : j \in T(C) \and P(E(j)) > P(E(C))} P(E(j))} </math> :<math>P(\text{random}(C) | C) = \frac{P(E(C))}{P(E(C)) + \sum_{j : j \in T(C) \land P(E(j)) > P(E(C))} P(E(j))} </math>


The probability of a condition was given as, The probability of a condition was given as,
Line 579: Line 544:
Bit strings for theories that are more complex than the bit string given to the agent as input have no predictive power. There probabilities are better included in the ''random'' case. To implement this a new definition is given as ''F'' in, Bit strings for theories that are more complex than the bit string given to the agent as input have no predictive power. There probabilities are better included in the ''random'' case. To implement this a new definition is given as ''F'' in,


:<math> \forall t, P(F(t, c)) = \sum_{n: R(n) \equiv t \and L(n) < L(c)} 2^{-L(n)}</math> :<math> \forall t, P(F(t, c)) = \sum_{n: R(n) \equiv t \land L(n) < L(c)} 2^{-L(n)}</math>


Using ''F'', an improved version of the abductive probabilities is, Using ''F'', an improved version of the abductive probabilities is,
:<math>\forall t \in T(C), P(t \mid C) = \frac{P(F(t, c))}{P(F(C, c)) + \sum_{j : j \in T(C) \and P(F(j, c)) > P(F(C, c))} P(E(j, c))} </math> :<math>\forall t \in T(C), P(t | C) = \frac{P(F(t, c))}{P(F(C, c)) + \sum_{j : j \in T(C) \land P(F(j, c)) > P(F(C, c))} P(E(j, c))} </math>
:<math>P(\operatorname{random}(C) \mid C) = \frac{P(F(C, c))}{P(F(C, c)) + \sum_{j : j \in T(C) \and P(F(j, c)) > P(F(C, c))} P(F(j, c))} </math> :<math>P(\operatorname{random}(C) | C) = \frac{P(F(C, c))}{P(F(C, c)) + \sum_{j : j \in T(C) \land P(F(j, c)) > P(F(C, c))} P(F(j, c))} </math>


==Key people== ==Key people==
Line 592: Line 557:
* ] * ]
* ] * ]
* D. M. Boulton
* ] * ]
* ] * ]
Line 597: Line 563:
==See also== ==See also==


* ] * ]
* ]
* ] * ]
* ] * ]
* ]
* ]
* ]
* ]
* ]
* ]
* ] * ]
* ] * ]
* ]
* ] * ]
* ] * ]
Line 611: Line 583:
==External links== ==External links==
* Rathmanner, S and Hutter, M., "A Philosophical Treatise of Universal Induction" in Entropy 2011, 13, 1076–1136: A very clear philosophical and mathematical analysis of Solomonoff's Theory of Inductive Inference. * Rathmanner, S and Hutter, M., "A Philosophical Treatise of Universal Induction" in Entropy 2011, 13, 1076–1136: A very clear philosophical and mathematical analysis of Solomonoff's Theory of Inductive Inference.
* ], , Springer-Verlag (Information Science and Statistics), ISBN 0-387-23795-X, May 2005 – , and . * ], , Springer-Verlag (Information Science and Statistics), {{ISBN|0-387-23795-X}}, May 2005 – , and .


] ]
] ]
] ]
] ]
] ]
]

Latest revision as of 03:30, 19 July 2024

Determining the probability of future events based on past events

Inductive probability attempts to give the probability of future events based on past events. It is the basis for inductive reasoning, and gives the mathematical basis for learning and the perception of patterns. It is a source of knowledge about the world.

There are three sources of knowledge: inference, communication, and deduction. Communication relays information found using other methods. Deduction establishes new facts based on existing facts. Inference establishes new facts from data. Its basis is Bayes' theorem.

Information describing the world is written in a language. For example, a simple mathematical language of propositions may be chosen. Sentences may be written down in this language as strings of characters. But in the computer it is possible to encode these sentences as strings of bits (1s and 0s). Then the language may be encoded so that the most commonly used sentences are the shortest. This internal language implicitly represents probabilities of statements.

Occam's razor says the "simplest theory, consistent with the data is most likely to be correct". The "simplest theory" is interpreted as the representation of the theory written in this internal language. The theory with the shortest encoding in this internal language is most likely to be correct.

History

Probability and statistics was focused on probability distributions and tests of significance. Probability was formal, well defined, but limited in scope. In particular its application was limited to situations that could be defined as an experiment or trial, with a well defined population.

Bayes's theorem is named after Rev. Thomas Bayes 1701–1761. Bayesian inference broadened the application of probability to many situations where a population was not well defined. But Bayes' theorem always depended on prior probabilities, to generate new probabilities. It was unclear where these prior probabilities should come from.

Ray Solomonoff developed algorithmic probability which gave an explanation for what randomness is and how patterns in the data may be represented by computer programs, that give shorter representations of the data circa 1964.

Chris Wallace and D. M. Boulton developed minimum message length circa 1968. Later Jorma Rissanen developed the minimum description length circa 1978. These methods allow information theory to be related to probability, in a way that can be compared to the application of Bayes' theorem, but which give a source and explanation for the role of prior probabilities.

Marcus Hutter combined decision theory with the work of Ray Solomonoff and Andrey Kolmogorov to give a theory for the Pareto optimal behavior for an Intelligent agent, circa 1998.

Minimum description/message length

The program with the shortest length that matches the data is the most likely to predict future data. This is the thesis behind the minimum message length and minimum description length methods.

At first sight Bayes' theorem appears different from the minimimum message/description length principle. At closer inspection it turns out to be the same. Bayes' theorem is about conditional probabilities, and states the probability that event B happens if firstly event A happens:

P ( A B ) = P ( B ) P ( A | B ) = P ( A ) P ( B | A ) {\displaystyle P(A\land B)=P(B)\cdot P(A|B)=P(A)\cdot P(B|A)}

becomes in terms of message length L,

L ( A B ) = L ( B ) + L ( A | B ) = L ( A ) + L ( B | A ) . {\displaystyle L(A\land B)=L(B)+L(A|B)=L(A)+L(B|A).}

This means that if all the information is given describing an event then the length of the information may be used to give the raw probability of the event. So if the information describing the occurrence of A is given, along with the information describing B given A, then all the information describing A and B has been given.

Overfitting

Overfitting occurs when the model matches the random noise and not the pattern in the data. For example, take the situation where a curve is fitted to a set of points. If a polynomial with many terms is fitted then it can more closely represent the data. Then the fit will be better, and the information needed to describe the deviations from the fitted curve will be smaller. Smaller information length means higher probability.

However, the information needed to describe the curve must also be considered. The total information for a curve with many terms may be greater than for a curve with fewer terms, that has not as good a fit, but needs less information to describe the polynomial.

Inference based on program complexity

Solomonoff's theory of inductive inference is also inductive inference. A bit string x is observed. Then consider all programs that generate strings starting with x. Cast in the form of inductive inference, the programs are theories that imply the observation of the bit string x.

The method used here to give probabilities for inductive inference is based on Solomonoff's theory of inductive inference.

Detecting patterns in the data

If all the bits are 1, then people infer that there is a bias in the coin and that it is more likely also that the next bit is 1 also. This is described as learning from, or detecting a pattern in the data.

Such a pattern may be represented by a computer program. A short computer program may be written that produces a series of bits which are all 1. If the length of the program K is L ( K ) {\displaystyle L(K)} bits then its prior probability is,

P ( K ) = 2 L ( K ) {\displaystyle P(K)=2^{-L(K)}}

The length of the shortest program that represents the string of bits is called the Kolmogorov complexity.

Kolmogorov complexity is not computable. This is related to the halting problem. When searching for the shortest program some programs may go into an infinite loop.

Considering all theories

The Greek philosopher Epicurus is quoted as saying "If more than one theory is consistent with the observations, keep all theories".

As in a crime novel all theories must be considered in determining the likely murderer, so with inductive probability all programs must be considered in determining the likely future bits arising from the stream of bits.

Programs that are already longer than n have no predictive power. The raw (or prior) probability that the pattern of bits is random (has no pattern) is 2 n {\displaystyle 2^{-n}} .

Each program that produces the sequence of bits, but is shorter than the n is a theory/pattern about the bits with a probability of 2 k {\displaystyle 2^{-k}} where k is the length of the program.

The probability of receiving a sequence of bits y after receiving a series of bits x is then the conditional probability of receiving y given x, which is the probability of x with y appended, divided by the probability of x.

Universal priors

The programming language affects the predictions of the next bit in the string. The language acts as a prior probability. This is particularly a problem where the programming language codes for numbers and other data types. Intuitively we think that 0 and 1 are simple numbers, and that prime numbers are somehow more complex than numbers that may be composite.

Using the Kolmogorov complexity gives an unbiased estimate (a universal prior) of the prior probability of a number. As a thought experiment an intelligent agent may be fitted with a data input device giving a series of numbers, after applying some transformation function to the raw numbers. Another agent might have the same input device with a different transformation function. The agents do not see or know about these transformation functions. Then there appears no rational basis for preferring one function over another. A universal prior insures that although two agents may have different initial probability distributions for the data input, the difference will be bounded by a constant.

So universal priors do not eliminate an initial bias, but they reduce and limit it. Whenever we describe an event in a language, either using a natural language or other, the language has encoded in it our prior expectations. So some reliance on prior probabilities are inevitable.

A problem arises where an intelligent agent's prior expectations interact with the environment to form a self reinforcing feed back loop. This is the problem of bias or prejudice. Universal priors reduce but do not eliminate this problem.

Universal artificial intelligence

The theory of universal artificial intelligence applies decision theory to inductive probabilities. The theory shows how the best actions to optimize a reward function may be chosen. The result is a theoretical model of intelligence.

It is a fundamental theory of intelligence, which optimizes the agents behavior in,

  • Exploring the environment; performing actions to get responses that broaden the agents knowledge.
  • Competing or co-operating with another agent; games.
  • Balancing short and long term rewards.

In general no agent will always provide the best actions in all situations. A particular choice made by an agent may be wrong, and the environment may provide no way for the agent to recover from an initial bad choice. However the agent is Pareto optimal in the sense that no other agent will do better than this agent in this environment, without doing worse in another environment. No other agent may, in this sense, be said to be better.

At present the theory is limited by incomputability (the halting problem). Approximations may be used to avoid this. Processing speed and combinatorial explosion remain the primary limiting factors for artificial intelligence.

Probability

Probability is the representation of uncertain or partial knowledge about the truth of statements. Probabilities are subjective and personal estimates of likely outcomes based on past experience and inferences made from the data.

This description of probability may seem strange at first. In natural language we refer to "the probability" that the sun will rise tomorrow. We do not refer to "your probability" that the sun will rise. But in order for inference to be correctly modeled probability must be personal, and the act of inference generates new posterior probabilities from prior probabilities.

Probabilities are personal because they are conditional on the knowledge of the individual. Probabilities are subjective because they always depend, to some extent, on prior probabilities assigned by the individual. Subjective should not be taken here to mean vague or undefined.

The term intelligent agent is used to refer to the holder of the probabilities. The intelligent agent may be a human or a machine. If the intelligent agent does not interact with the environment then the probability will converge over time to the frequency of the event.

If however the agent uses the probability to interact with the environment there may be a feedback, so that two agents in the identical environment starting with only slightly different priors, end up with completely different probabilities. In this case optimal decision theory as in Marcus Hutter's Universal Artificial Intelligence will give Pareto optimal performance for the agent. This means that no other intelligent agent could do better in one environment without doing worse in another environment.

Comparison to deductive probability

In deductive probability theories, probabilities are absolutes, independent of the individual making the assessment. But deductive probabilities are based on,

  • Shared knowledge.
  • Assumed facts, that should be inferred from the data.

For example, in a trial the participants are aware the outcome of all the previous history of trials. They also assume that each outcome is equally probable. Together this allows a single unconditional value of probability to be defined.

But in reality each individual does not have the same information. And in general the probability of each outcome is not equal. The dice may be loaded, and this loading needs to be inferred from the data.

Probability as estimation

The principle of indifference has played a key role in probability theory. It says that if N statements are symmetric so that one condition cannot be preferred over another then all statements are equally probable.

Taken seriously, in evaluating probability this principle leads to contradictions. Suppose there are 3 bags of gold in the distance and one is asked to select one. Then because of the distance one cannot see the bag sizes. You estimate using the principle of indifference that each bag has equal amounts of gold, and each bag has one third of the gold.

Now, while one of us is not looking, the other takes one of the bags and divide it into 3 bags. Now there are 5 bags of gold. The principle of indifference now says each bag has one fifth of the gold. A bag that was estimated to have one third of the gold is now estimated to have one fifth of the gold.

Taken as a value associated with the bag the values are different therefore contradictory. But taken as an estimate given under a particular scenario, both values are separate estimates given under different circumstances and there is no reason to believe they are equal.

Estimates of prior probabilities are particularly suspect. Estimates will be constructed that do not follow any consistent frequency distribution. For this reason prior probabilities are considered as estimates of probabilities rather than probabilities.

A full theoretical treatment would associate with each probability,

  • The statement
  • Prior knowledge
  • Prior probabilities
  • The estimation procedure used to give the probability.

Combining probability approaches

Inductive probability combines two different approaches to probability.

  • Probability and information
  • Probability and frequency

Each approach gives a slightly different viewpoint. Information theory is used in relating probabilities to quantities of information. This approach is often used in giving estimates of prior probabilities.

Frequentist probability defines probabilities as objective statements about how often an event occurs. This approach may be stretched by defining the trials to be over possible worlds. Statements about possible worlds define events.

Probability and information

Whereas logic represents only two values; true and false as the values of statement, probability associates a number in to each statement. If the probability of a statement is 0, the statement is false. If the probability of a statement is 1 the statement is true.

In considering some data as a string of bits the prior probabilities for a sequence of 1s and 0s, the probability of 1 and 0 is equal. Therefore, each extra bit halves the probability of a sequence of bits. This leads to the conclusion that,

P ( x ) = 2 L ( x ) {\displaystyle P(x)=2^{-L(x)}}

Where P ( x ) {\displaystyle P(x)} is the probability of the string of bits x {\displaystyle x} and L ( x ) {\displaystyle L(x)} is its length.

The prior probability of any statement is calculated from the number of bits needed to state it. See also information theory.

Combining information

Two statements A {\displaystyle A} and B {\displaystyle B} may be represented by two separate encodings. Then the length of the encoding is,

L ( A B ) = L ( A ) + L ( B ) {\displaystyle L(A\land B)=L(A)+L(B)}

or in terms of probability,

P ( A B ) = P ( A ) P ( B ) {\displaystyle P(A\land B)=P(A)P(B)}

But this law is not always true because there may be a shorter method of encoding B {\displaystyle B} if we assume A {\displaystyle A} . So the above probability law applies only if A {\displaystyle A} and B {\displaystyle B} are "independent".

The internal language of information

The primary use of the information approach to probability is to provide estimates of the complexity of statements. Recall that Occam's razor states that "All things being equal, the simplest theory is the most likely to be correct". In order to apply this rule, first there needs to be a definition of what "simplest" means. Information theory defines simplest to mean having the shortest encoding.

Knowledge is represented as statements. Each statement is a Boolean expression. Expressions are encoded by a function that takes a description (as against the value) of the expression and encodes it as a bit string.

The length of the encoding of a statement gives an estimate of the probability of a statement. This probability estimate will often be used as the prior probability of a statement.

Technically this estimate is not a probability because it is not constructed from a frequency distribution. The probability estimates given by it do not always obey the law of total of probability. Applying the law of total probability to various scenarios will usually give a more accurate probability estimate of the prior probability than the estimate from the length of the statement.

Encoding expressions

An expression is constructed from sub expressions,

  • Constants (including function identifier).
  • Application of functions.
  • quantifiers.

A Huffman code must distinguish the 3 cases. The length of each code is based on the frequency of each type of sub expressions.

Initially constants are all assigned the same length/probability. Later constants may be assigned a probability using the Huffman code based on the number of uses of the function id in all expressions recorded so far. In using a Huffman code the goal is to estimate probabilities, not to compress the data.

The length of a function application is the length of the function identifier constant plus the sum of the sizes of the expressions for each parameter.

The length of a quantifier is the length of the expression being quantified over.

Distribution of numbers

No explicit representation of natural numbers is given. However natural numbers may be constructed by applying the successor function to 0, and then applying other arithmetic functions. A distribution of natural numbers is implied by this, based on the complexity of constructing each number.

Rational numbers are constructed by the division of natural numbers. The simplest representation has no common factors between the numerator and the denominator. This allows the probability distribution of natural numbers may be extended to rational numbers.

Probability and frequency

The probability of an event may be interpreted as the frequencies of outcomes where the statement is true divided by the total number of outcomes. If the outcomes form a continuum the frequency may need to be replaced with a measure.

Events are sets of outcomes. Statements may be related to events. A Boolean statement B about outcomes defines a set of outcomes b,

b = { x : B ( x ) } {\displaystyle b=\{x:B(x)\}}

Conditional probability

Each probability is always associated with the state of knowledge at a particular point in the argument. Probabilities before an inference are known as prior probabilities, and probabilities after are known as posterior probabilities.

Probability depends on the facts known. The truth of a fact limits the domain of outcomes to the outcomes consistent with the fact. Prior probabilities are the probabilities before a fact is known. Posterior probabilities are after a fact is known. The posterior probabilities are said to be conditional on the fact. the probability that B {\displaystyle B} is true given that A {\displaystyle A} is true is written as: P ( B | A ) . {\displaystyle P(B|A).}

All probabilities are in some sense conditional. The prior probability of B {\displaystyle B} is,

P ( B ) = P ( B | ) {\displaystyle P(B)=P(B|\top )}

The frequentist approach applied to possible worlds

In the frequentist approach, probabilities are defined as the ratio of the number of outcomes within an event to the total number of outcomes. In the possible world model each possible world is an outcome, and statements about possible worlds define events. The probability of a statement being true is the number of possible worlds where the statement is true divided by the total number of possible worlds. The probability of a statement A {\displaystyle A} being true about possible worlds is then,

P ( A ) = | { x : A ( x ) } | | x : | {\displaystyle P(A)={\frac {|\{x:A(x)\}|}{|x:\top |}}}

For a conditional probability.

P ( B | A ) = | { x : A ( x ) B ( x ) } | | x : A ( x ) | {\displaystyle P(B|A)={\frac {|\{x:A(x)\land B(x)\}|}{|x:A(x)|}}}

then

P ( A B ) = | { x : A ( x ) B ( x ) } | | x : | = | { x : A ( x ) B ( x ) } | | { x : A ( x ) } | | { x : A ( x ) } | | x : | = P ( A ) P ( B | A ) {\displaystyle {\begin{aligned}P(A\land B)&={\frac {|\{x:A(x)\land B(x)\}|}{|x:\top |}}\\&={\frac {|\{x:A(x)\land B(x)\}|}{|\{x:A(x)\}|}}{\frac {|\{x:A(x)\}|}{|x:\top |}}\\&=P(A)P(B|A)\end{aligned}}}

Using symmetry this equation may be written out as Bayes' law.

P ( A B ) = P ( A ) P ( B | A ) = P ( B ) P ( A | B ) {\displaystyle P(A\land B)=P(A)P(B|A)=P(B)P(A|B)}

This law describes the relationship between prior and posterior probabilities when new facts are learnt.

Written as quantities of information Bayes' Theorem becomes,

L ( A B ) = L ( A ) + L ( B | A ) = L ( B ) + L ( A | B ) {\displaystyle L(A\land B)=L(A)+L(B|A)=L(B)+L(A|B)}

Two statements A and B are said to be independent if knowing the truth of A does not change the probability of B. Mathematically this is,

P ( B ) = P ( B | A ) {\displaystyle P(B)=P(B|A)}

then Bayes' Theorem reduces to,

P ( A B ) = P ( A ) P ( B ) {\displaystyle P(A\land B)=P(A)P(B)}

The law of total of probability

For a set of mutually exclusive possibilities A i {\displaystyle A_{i}} , the sum of the posterior probabilities must be 1.

i P ( A i | B ) = 1 {\displaystyle \sum _{i}{P(A_{i}|B)}=1}

Substituting using Bayes' theorem gives the law of total probability

i P ( B | A i ) P ( A i ) = i P ( A i | B ) P ( B ) {\displaystyle \sum _{i}{P(B|A_{i})P(A_{i})}=\sum _{i}{P(A_{i}|B)P(B)}}
P ( B ) = i P ( B | A i ) P ( A i ) {\displaystyle P(B)=\sum _{i}{P(B|A_{i})P(A_{i})}}

This result is used to give the extended form of Bayes' theorem,

P ( A i | B ) = P ( B | A i ) P ( A i ) j P ( B | A j ) P ( A j ) {\displaystyle P(A_{i}|B)={\frac {P(B|A_{i})P(A_{i})}{\sum _{j}{P(B|A_{j})P(A_{j})}}}}

This is the usual form of Bayes' theorem used in practice, because it guarantees the sum of all the posterior probabilities for A i {\displaystyle A_{i}} is 1.

Alternate possibilities

For mutually exclusive possibilities, the probabilities add.

P ( A B ) = P ( A ) + P ( B ) , if  P ( A B ) = 0 {\displaystyle P(A\lor B)=P(A)+P(B),\qquad {\text{if }}P(A\land B)=0}

Using

A B = ( A ¬ ( A B ) ) ( B ¬ ( A B ) ) ( A B ) {\displaystyle A\lor B=(A\land \neg (A\land B))\lor (B\land \neg (A\land B))\lor (A\land B)}

Then the alternatives

A ¬ ( A B ) , B ¬ ( A B ) , A B {\displaystyle A\land \neg (A\land B),\quad B\land \neg (A\land B),\quad A\land B}

are all mutually exclusive. Also,

( A ¬ ( A B ) ) ( A B ) = A {\displaystyle (A\land \neg (A\land B))\lor (A\land B)=A}
P ( A ¬ ( A B ) ) + P ( A B ) = P ( A ) {\displaystyle P(A\land \neg (A\land B))+P(A\land B)=P(A)}
P ( A ¬ ( A B ) ) = P ( A ) P ( A B ) {\displaystyle P(A\land \neg (A\land B))=P(A)-P(A\land B)}

so, putting it all together,

P ( A B ) = P ( ( A ¬ ( A B ) ) ( B ¬ ( A B ) ) ( A B ) ) = P ( A ¬ ( A B ) ) + P ( B ¬ ( A B ) ) + P ( A B ) = P ( A ) P ( A B ) + P ( B ) P ( A B ) + P ( A B ) = P ( A ) + P ( B ) P ( A B ) {\displaystyle {\begin{aligned}P(A\lor B)&=P((A\land \neg (A\land B))\lor (B\land \neg (A\land B))\lor (A\land B))\\&=P(A\land \neg (A\land B))+P(B\land \neg (A\land B))+P(A\land B)\\&=P(A)-P(A\land B)+P(B)-P(A\land B)+P(A\land B)\\&=P(A)+P(B)-P(A\land B)\end{aligned}}}
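The sum rule can be checked on a small assumed sample space (one fair die, with A = "even" and B = "greater than 3"):

```python
# A sketch: checking P(A or B) = P(A) + P(B) - P(A and B) by counting
# outcomes of one fair die; the events A and B are assumed for illustration.
space = range(1, 7)

def prob(event):
    return sum(1 for w in space if event(w)) / 6

lhs = prob(lambda w: w % 2 == 0 or w > 3)      # P(A or B)
rhs = (prob(lambda w: w % 2 == 0)              # P(A)
       + prob(lambda w: w > 3)                 # + P(B)
       - prob(lambda w: w % 2 == 0 and w > 3)) # - P(A and B)
assert abs(lhs - rhs) < 1e-12
```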

Negation

As,

A ¬ A = {\displaystyle A\lor \neg A=\top }

then

P ( A ) + P ( ¬ A ) = 1 {\displaystyle P(A)+P(\neg A)=1}

Implication and conditional probability

Implication is related to conditional probability by the following equation,

A B P ( B | A ) = 1 {\displaystyle A\to B\iff P(B|A)=1}

Derivation,

A B P ( A B ) = 1 P ( A B ¬ A ) = 1 P ( A B ) + P ( ¬ A ) = 1 P ( A B ) = P ( A ) P ( A ) P ( B | A ) = P ( A ) P ( B | A ) = 1 {\displaystyle {\begin{aligned}A\to B&\iff P(A\to B)=1\\&\iff P(A\land B\lor \neg A)=1\\&\iff P(A\land B)+P(\neg A)=1\\&\iff P(A\land B)=P(A)\\&\iff P(A)\cdot P(B|A)=P(A)\\&\iff P(B|A)=1\end{aligned}}}
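On a finite sample space this relation can be checked by direct counting; the sets below are assumed for illustration:

```python
# A sketch: on a finite sample space, if A implies B (A is a subset of B)
# then P(B|A) = 1. The sample space and events are assumed toy values.
space = set(range(10))
A = {2, 4}
B = {1, 2, 4, 7}                  # A is a subset of B, i.e. A -> B

p_a = len(A) / len(space)         # P(A)
p_ab = len(A & B) / len(space)    # P(A and B)
p_b_given_a = p_ab / p_a          # P(B|A)
assert p_b_given_a == 1.0
```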

Bayesian hypothesis testing

Bayes' theorem may be used to estimate the probability of a hypothesis or theory H, given some facts F. The posterior probability of H is then

P ( H | F ) = P ( H ) P ( F | H ) P ( F ) {\displaystyle P(H|F)={\frac {P(H)P(F|H)}{P(F)}}}

or in terms of information,

P ( H | F ) = 2 ( L ( H ) + L ( F | H ) L ( F ) ) {\displaystyle P(H|F)=2^{-(L(H)+L(F|H)-L(F))}}

By assuming the hypothesis is true, a simpler representation of the statement F may be given. The length of the encoding of this simpler representation is L ( F | H ) . {\displaystyle L(F|H).}

L ( H ) + L ( F | H ) {\displaystyle L(H)+L(F|H)} represents the amount of information needed to represent the facts F, if H is true. L ( F ) {\displaystyle L(F)} is the amount of information needed to represent F without the hypothesis H. The difference is how much the representation of the facts has been compressed by assuming that H is true. This is the evidence that the hypothesis H is true.

If L ( F ) {\displaystyle L(F)} is estimated from encoding length then the probability obtained will not be between 0 and 1. The value obtained is proportional to the probability, without being a good probability estimate. The number obtained is sometimes referred to as a relative probability, being how much more probable the theory is than not holding the theory.
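A minimal sketch of this calculation, using assumed code lengths in bits:

```python
# A sketch of the relative probability of a hypothesis, using assumed
# code lengths in bits for L(H), L(F|H) and L(F).
L_H = 10            # bits to state the hypothesis H
L_F_given_H = 20    # bits to state the facts F, assuming H is true
L_F = 35            # bits to state F without H

# The facts compress by L(F) - (L(H) + L(F|H)) = 5 bits under H,
# so the relative probability is 2**5 = 32.
relative_p = 2 ** -(L_H + L_F_given_H - L_F)
assert relative_p == 32
```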

If a full set of mutually exclusive hypotheses that provide evidence is known, a proper estimate may be given for the prior probability P ( F ) {\displaystyle P(F)} .

Set of hypotheses

Probabilities may be calculated from the extended form of Bayes' theorem. Given all mutually exclusive hypotheses H i {\displaystyle H_{i}} which give evidence, such that,

L ( H i ) + L ( F | H i ) < L ( F ) {\displaystyle L(H_{i})+L(F|H_{i})<L(F)}

and also the hypothesis R, that none of the hypotheses is true, then,

P ( H i | F ) = P ( H i ) P ( F | H i ) P ( F | R ) + j P ( H j ) P ( F | H j ) P ( R | F ) = P ( F | R ) P ( F | R ) + j P ( H j ) P ( F | H j ) {\displaystyle {\begin{aligned}P(H_{i}|F)&={\frac {P(H_{i})P(F|H_{i})}{P(F|R)+\sum _{j}{P(H_{j})P(F|H_{j})}}}\\P(R|F)&={\frac {P(F|R)}{P(F|R)+\sum _{j}{P(H_{j})P(F|H_{j})}}}\end{aligned}}}

In terms of information,

P ( H i | F ) = 2 ( L ( H i ) + L ( F | H i ) ) 2 L ( F | R ) + j 2 ( L ( H j ) + L ( F | H j ) ) P ( R | F ) = 2 L ( F | R ) 2 L ( F | R ) + j 2 ( L ( H j ) + L ( F | H j ) ) {\displaystyle {\begin{aligned}P(H_{i}|F)&={\frac {2^{-(L(H_{i})+L(F|H_{i}))}}{2^{-L(F|R)}+\sum _{j}2^{-(L(H_{j})+L(F|H_{j}))}}}\\P(R|F)&={\frac {2^{-L(F|R)}}{2^{-L(F|R)}+\sum _{j}{2^{-(L(H_{j})+L(F|H_{j}))}}}}\end{aligned}}}

In most situations it is a good approximation to assume that F {\displaystyle F} is independent of R {\displaystyle R} , which means P ( F | R ) = P ( F ) {\displaystyle P(F|R)=P(F)} giving,

P ( H i | F ) 2 ( L ( H i ) + L ( F | H i ) ) 2 L ( F ) + j 2 ( L ( H j ) + L ( F | H j ) ) P ( R | F ) 2 L ( F ) 2 L ( F ) + j 2 ( L ( H j ) + L ( F | H j ) ) {\displaystyle {\begin{aligned}P(H_{i}|F)&\approx {\frac {2^{-(L(H_{i})+L(F|H_{i}))}}{2^{-L(F)}+\sum _{j}{2^{-(L(H_{j})+L(F|H_{j}))}}}}\\P(R|F)&\approx {\frac {2^{-L(F)}}{2^{-L(F)}+\sum _{j}{2^{-(L(H_{j})+L(F|H_{j}))}}}}\end{aligned}}}
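A sketch of these approximate posteriors, with assumed code lengths:

```python
# A sketch of the approximate posteriors for a set of hypotheses H_j plus
# the residual hypothesis R ("none of them"), with assumed code lengths.
L_F = 30                          # bits to encode the facts F directly
lengths = [(8, 18), (12, 15)]     # assumed (L(H_j), L(F|H_j)) pairs

weights = [2 ** -(lh + lf) for lh, lf in lengths]   # 2**-(L(H_j)+L(F|H_j))
w_random = 2 ** -L_F                                # 2**-L(F), for R
total = w_random + sum(weights)

posterior = [w / total for w in weights]            # P(H_j|F)
p_random = w_random / total                         # P(R|F)
assert abs(sum(posterior) + p_random - 1) < 1e-12
```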

Boolean inductive inference

Abductive inference starts with a set of facts F which is a statement (Boolean expression). Abductive reasoning is of the form,

A theory T implies the statement F. As the theory T is simpler than F, abduction says that there is a probability that the theory T is true, given F.

The theory T, also called an explanation of the condition F, is an answer to the ubiquitous factual "why" question. For example, the condition F is "apples fall", and the question is "Why do apples fall?". The answer is a theory T that implies that apples fall, such as Newton's law of gravitation (the F in the formula denotes the gravitational force, not the facts);

F = G m 1 m 2 r 2 {\displaystyle F=G{\frac {m_{1}m_{2}}{r^{2}}}}

Inductive inference is of the form,

All observed objects in a class C have a property P. Therefore there is a probability that all objects in a class C have a property P.

In terms of abductive inference, "all objects in a class C have a property P" is a theory that implies the observed condition, "all observed objects in a class C have a property P".

So inductive inference is a general case of abductive inference. In common usage the term inductive inference is often used to refer to both abductive and inductive inference.

Generalization and specialization

Inductive inference is related to generalization. Generalizations may be formed from statements by replacing a specific value with membership of a category, or by replacing membership of a category with membership of a broader category. In deductive logic, generalization is a powerful method of generating new theories that may be true. In inductive inference generalization generates theories that have a probability of being true.

The opposite of generalization is specialization. Specialization is used in applying a general rule to a specific case. Specializations are created from generalizations by replacing membership of a category with a specific value, or by replacing a category with a subcategory.

The Linnaean classification of living things and objects forms the basis for generalization and specialization. The ability to identify, recognize and classify is the basis for generalization. Perceiving the world as a collection of objects appears to be a key aspect of human intelligence. It is the object-oriented model, in the non-computer-science sense.

The object-oriented model is constructed from our perception. In particular, vision is based on the ability to compare two images and calculate how much information is needed to morph or map one image into another. Computer vision uses this mapping to construct 3D images from stereo image pairs.

Inductive logic programming is a means of constructing a theory that implies a condition. Plotkin's "relative least general generalization (rlgg)" approach constructs the simplest generalization consistent with the condition.
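A minimal sketch of least general generalization over terms follows (this is the lgg of terms only, not Plotkin's full relative lgg over clauses; the tuple term representation and variable-naming scheme are assumptions made here):

```python
# A sketch of the least general generalization (lgg) of two first-order
# terms: matching parts are kept, disagreeing subterms become variables.
def lgg(t1, t2, table=None):
    """Terms are strings (constants) or tuples (functor, arg1, ...)."""
    if table is None:
        table = {}
    if t1 == t2:
        return t1
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and t1[0] == t2[0] and len(t1) == len(t2)):
        return (t1[0],) + tuple(lgg(a, b, table) for a, b in zip(t1[1:], t2[1:]))
    # Disagreement: the same pair of subterms always maps to the same variable.
    if (t1, t2) not in table:
        table[(t1, t2)] = "X%d" % len(table)
    return table[(t1, t2)]

# p(apple, earth) and p(moon, earth) generalize to p(X0, earth):
g = lgg(("p", "apple", "earth"), ("p", "moon", "earth"))
assert g == ("p", "X0", "earth")
```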

Newton's use of induction

Isaac Newton used inductive arguments in constructing his law of universal gravitation. Starting with the statement,

  • The center of an apple falls towards the center of the Earth.

Generalizing by replacing "apple" with "object", and "Earth" with "object", gives, in a two-body system,

  • The center of an object falls towards the center of another object.

The theory explains all objects falling, so there is strong evidence for it. The second observation,

  • The planets appear to follow an elliptical path.

Some mathematical calculus shows that if the acceleration follows an inverse-square law, then objects will follow an elliptical path. So induction gives evidence for the inverse-square law.

Using Galileo's observation that all objects fall with the same acceleration,

F 1 = m 1 a 1 = m 1 k 1 r 2 i 1 {\displaystyle F_{1}=m_{1}a_{1}={\frac {m_{1}k_{1}}{r^{2}}}i_{1}}
F 2 = m 2 a 2 = m 2 k 2 r 2 i 2 {\displaystyle F_{2}=m_{2}a_{2}={\frac {m_{2}k_{2}}{r^{2}}}i_{2}}

where i 1 {\displaystyle i_{1}} and i 2 {\displaystyle i_{2}} are unit vectors towards the center of the other object. Then, using Newton's third law F 1 = F 2 {\displaystyle F_{1}=-F_{2}} ,

F = G m 1 m 2 r 2 {\displaystyle F=G{\frac {m_{1}m_{2}}{r^{2}}}}
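The step from the two one-body expressions to the symmetric law can be sketched as follows, treating k 1 {\displaystyle k_{1}} and k 2 {\displaystyle k_{2}} as the per-object constants above:

```latex
|F_1| = |F_2|
  \implies \frac{m_1 k_1}{r^2} = \frac{m_2 k_2}{r^2}
  \implies m_1 k_1 = m_2 k_2
  \implies \frac{k_1}{m_2} = \frac{k_2}{m_1} = G,
\qquad\text{so}\qquad
F = \frac{m_1 k_1}{r^2} = G\,\frac{m_1 m_2}{r^2}
```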

Probabilities for inductive inference

Implication determines conditional probability as,

T F P ( F | T ) = 1 {\displaystyle T\to F\iff P(F|T)=1}

So,

P ( F | T ) = 1 {\displaystyle P(F|T)=1}
L ( F | T ) = 0 {\displaystyle L(F|T)=0}

This result may be used in the probabilities given for Bayesian hypothesis testing. For a single theory, H = T and,

P ( T | F ) = P ( T ) P ( F ) {\displaystyle P(T|F)={\frac {P(T)}{P(F)}}}

or in terms of information, the relative probability is,

P ( T | F ) = 2 ( L ( T ) L ( F ) ) {\displaystyle P(T|F)=2^{-(L(T)-L(F))}}

Note that this estimate for P(T|F) is not a true probability. If L ( T ) < L ( F ) {\displaystyle L(T)<L(F)} then the theory has evidence to support it. Then for a set of theories T i = H i {\displaystyle T_{i}=H_{i}} , such that L ( T i ) < L ( F ) {\displaystyle L(T_{i})<L(F)} ,

P ( T i | F ) = P ( T i ) P ( F | R ) + j P ( T j ) {\displaystyle P(T_{i}|F)={\frac {P(T_{i})}{P(F|R)+\sum _{j}{P(T_{j})}}}}
P ( R | F ) = P ( F | R ) P ( F | R ) + j P ( T j ) {\displaystyle P(R|F)={\frac {P(F|R)}{P(F|R)+\sum _{j}{P(T_{j})}}}}

giving,

P ( T i | F ) 2 L ( T i ) 2 L ( F ) + j 2 L ( T j ) {\displaystyle P(T_{i}|F)\approx {\frac {2^{-L(T_{i})}}{2^{-L(F)}+\sum _{j}{2^{-L(T_{j})}}}}}
P ( R | F ) 2 L ( F ) 2 L ( F ) + j 2 L ( T j ) {\displaystyle P(R|F)\approx {\frac {2^{-L(F)}}{2^{-L(F)}+\sum _{j}{2^{-L(T_{j})}}}}}
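A sketch with assumed description lengths, for theories that each imply F (so that L(F|T<sub>i</sub>) = 0):

```python
# A sketch with assumed description lengths: theories T_i that each imply
# the facts F (so L(F|T_i) = 0) compete with the residual "no theory" case.
L_F = 25                  # bits to encode F directly
L_T = [10, 14, 30]        # assumed L(T_i); the 30-bit theory exceeds L(F)

supported = [l for l in L_T if l < L_F]     # only these carry evidence
total = 2 ** -L_F + sum(2 ** -l for l in supported)

posterior = [2 ** -l / total for l in supported]    # P(T_i|F)
p_random = 2 ** -L_F / total                        # P(R|F)
assert abs(sum(posterior) + p_random - 1) < 1e-12
```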

Derivations

Derivation of inductive probability

Make a list of all the shortest programs K i {\displaystyle K_{i}} that each produce a distinct infinite string of bits, and satisfy the relation,

T n ( R ( K i ) ) = x {\displaystyle T_{n}(R(K_{i}))=x}

where R ( K i ) {\displaystyle R(K_{i})} is the result of running the program K i {\displaystyle K_{i}} and T n {\displaystyle T_{n}} truncates the string after n bits.

The problem is to calculate the probability that the source is produced by program K i , {\displaystyle K_{i},} given that the truncated source after n bits is x. This is represented by the conditional probability,

P ( s = R ( K i ) | T n ( s ) = x ) {\displaystyle P(s=R(K_{i})|T_{n}(s)=x)}

Using the extended form of Bayes' theorem

P ( s = R ( K i ) | T n ( s ) = x ) = P ( T n ( s ) = x | s = R ( K i ) ) P ( s = R ( K i ) ) j P ( T n ( s ) = x | s = R ( K j ) ) P ( s = R ( K j ) ) . {\displaystyle P(s=R(K_{i})|T_{n}(s)=x)={\frac {P(T_{n}(s)=x|s=R(K_{i}))P(s=R(K_{i}))}{\sum _{j}P(T_{n}(s)=x|s=R(K_{j}))P(s=R(K_{j}))}}.}

The extended form relies on the law of total probability. This means that the s = R ( K i ) {\displaystyle s=R(K_{i})} must be distinct possibilities, which is given by the condition that each K i {\displaystyle K_{i}} produce a different infinite string. Also, one of the conditions s = R ( K i ) {\displaystyle s=R(K_{i})} must be true; this holds because, in the limit as n , {\displaystyle n\to \infty ,} there is always at least one program that produces T n ( s ) {\displaystyle T_{n}(s)} .

As K i {\displaystyle K_{i}} are chosen so that T n ( R ( K i ) ) = x , {\displaystyle T_{n}(R(K_{i}))=x,} then,

P ( T n ( s ) = x | s = R ( K i ) ) = 1 {\displaystyle P(T_{n}(s)=x|s=R(K_{i}))=1}

The a priori probability of the string being produced from the program, given no information about the string, is based on the size of the program,

P ( s = R ( K i ) ) = 2 I ( K i ) {\displaystyle P(s=R(K_{i}))=2^{-I(K_{i})}}

giving,

P ( s = R ( K i ) | T n ( s ) = x ) = 2 I ( K i ) j 2 I ( K j ) . {\displaystyle P(s=R(K_{i})|T_{n}(s)=x)={\frac {2^{-I(K_{i})}}{\sum _{j}2^{-I(K_{j})}}}.}

Programs whose length equals or exceeds the length of x provide no predictive power. Separating them out gives,

P ( s = R ( K i ) | T n ( s ) = x ) = 2 I ( K i ) j : I ( K j ) < n 2 I ( K j ) + j : I ( K j ) n 2 I ( K j ) . {\displaystyle P(s=R(K_{i})|T_{n}(s)=x)={\frac {2^{-I(K_{i})}}{\sum _{j:I(K_{j})<n}2^{-I(K_{j})}+\sum _{j:I(K_{j})\geqslant n}2^{-I(K_{j})}}}.}

Then identify the two probabilities as,

P ( x  has pattern ) = j : I ( K j ) < n 2 I ( K j ) {\displaystyle P(x{\text{ has pattern}})=\sum _{j:I(K_{j})<n}2^{-I(K_{j})}}
P ( x  is random ) = j : I ( K j ) n 2 I ( K j ) {\displaystyle P(x{\text{ is random}})=\sum _{j:I(K_{j})\geqslant n}2^{-I(K_{j})}}

But the prior probability that x is a random set of bits is 2 n {\displaystyle 2^{-n}} . So,

P ( s = R ( K i ) | T n ( s ) = x ) = 2 I ( K i ) 2 n + j : I ( K j ) < n 2 I ( K j ) . {\displaystyle P(s=R(K_{i})|T_{n}(s)=x)={\frac {2^{-I(K_{i})}}{2^{-n}+\sum _{j:I(K_{j})<n}2^{-I(K_{j})}}}.}

The probability that the source is random, or unpredictable is,

P ( random ( s ) | T n ( s ) = x ) = 2 n 2 n + j : I ( K j ) < n 2 I ( K j ) . {\displaystyle P(\operatorname {random} (s)|T_{n}(s)=x)={\frac {2^{-n}}{2^{-n}+\sum _{j:I(K_{j})<n}2^{-I(K_{j})}}}.}
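A toy sketch of these formulas; the prefix length and program lengths below are assumed values, standing in for actual programs with T<sub>n</sub>(R(K<sub>i</sub>)) = x:

```python
# A toy sketch of the program-based posterior: an n-bit observed prefix x,
# with assumed lengths I(K_i) for programs that reproduce x.
n = 16                        # number of observed bits
prog_lengths = [6, 9]         # assumed I(K_i), both shorter than n

pattern_mass = sum(2 ** -l for l in prog_lengths if l < n)
random_mass = 2 ** -n         # prior mass on x being a random bit string
total = pattern_mass + random_mass

p_prog = [2 ** -l / total for l in prog_lengths]    # P(s = R(K_i) | T_n(s) = x)
p_random = random_mass / total                      # P(random(s) | T_n(s) = x)
assert p_random < 0.01        # short programs dominate the random explanation
```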

A model for inductive inference

A model of how worlds are constructed is used in determining the probabilities of theories,

  • A random bit string is selected.
  • A condition is constructed from the bit string.
  • A world is constructed that is consistent with the condition.

If w is the bit string then the world is created such that R ( w ) {\displaystyle R(w)} is true. An intelligent agent has some facts about the world, represented by the bit string c, which gives the condition,

C = R ( c ) {\displaystyle C=R(c)}

For any condition x, the set of bit strings w for which R ( w ) {\displaystyle R(w)} is equivalent to x is E ( x ) {\displaystyle E(x)} .

x , E ( x ) = { w : R ( w ) x } {\displaystyle \forall x,E(x)=\{w:R(w)\equiv x\}}

A theory is a simpler condition that explains (or implies) C. The set of all such theories is called T,

T ( C ) = { t : t C } {\displaystyle T(C)=\{t:t\to C\}}

Applying Bayes' theorem

The extended form of Bayes' theorem may be applied,

P ( A i | B ) = P ( B | A i ) P ( A i ) j P ( B | A j ) P ( A j ) , {\displaystyle P(A_{i}|B)={\frac {P(B|A_{i})\,P(A_{i})}{\sum _{j}P(B|A_{j})\,P(A_{j})}},}

where,

B = E ( C ) {\displaystyle B=E(C)}
A i = E ( t ) {\displaystyle A_{i}=E(t)}

To apply Bayes' theorem the following must hold: A i {\displaystyle A_{i}} is a partition of the event space.

For T ( C ) {\displaystyle T(C)} to give a partition, no bit string n may belong to two distinct theories. To prove this, assume it can and derive a contradiction,

( N T ) ( M T ) ( N M ) ( n E ( N ) n E ( M ) ) {\displaystyle (N\in T)\land (M\in T)\land (N\neq M)\land (n\in E(N)\land n\in E(M))}
( N M ) R ( n ) N R ( n ) M {\displaystyle \implies (N\neq M)\land R(n)\equiv N\land R(n)\equiv M}
{\displaystyle \implies \bot }

Secondly, prove that T includes all outcomes consistent with the condition. As all theories consistent with C are included, R ( w ) {\displaystyle R(w)} must be in this set.

So Bayes' theorem may be applied as specified, giving,

t T ( C ) , P ( E ( t ) | E ( C ) ) = P ( E ( t ) ) P ( E ( C ) | E ( t ) ) j T ( C ) P ( E ( j ) ) P ( E ( C ) | E ( j ) ) {\displaystyle \forall t\in T(C),P(E(t)|E(C))={\frac {P(E(t))\cdot P(E(C)|E(t))}{\sum _{j\in T(C)}P(E(j))\cdot P(E(C)|E(j))}}}

Using the law relating implication and conditional probability, the definition of T ( C ) {\displaystyle T(C)} implies,

t T ( C ) , P ( E ( C ) | E ( t ) ) = 1 {\displaystyle \forall t\in T(C),P(E(C)|E(t))=1}

The probability of each theory in T is given by,

t T ( C ) , P ( E ( t ) ) = n : R ( n ) t 2 L ( n ) {\displaystyle \forall t\in T(C),P(E(t))=\sum _{n:R(n)\equiv t}2^{-L(n)}}

so,

t T ( C ) , P ( E ( t ) | E ( C ) ) = n : R ( n ) t 2 L ( n ) j T ( C ) m : R ( m ) j 2 L ( m ) {\displaystyle \forall t\in T(C),P(E(t)|E(C))={\frac {\sum _{n:R(n)\equiv t}2^{-L(n)}}{\sum _{j\in T(C)}\sum _{m:R(m)\equiv j}2^{-L(m)}}}}

Finally the probabilities of the events may be identified with the probabilities of the condition which the outcomes in the event satisfy,

t T ( C ) , P ( E ( t ) | E ( C ) ) = P ( t | C ) {\displaystyle \forall t\in T(C),P(E(t)|E(C))=P(t|C)}

giving

t T ( C ) , P ( t | C ) = n : R ( n ) t 2 L ( n ) j T ( C ) m : R ( m ) j 2 L ( m ) {\displaystyle \forall t\in T(C),P(t|C)={\frac {\sum _{n:R(n)\equiv t}2^{-L(n)}}{\sum _{j\in T(C)}\sum _{m:R(m)\equiv j}2^{-L(m)}}}}

This is the probability of the theory t after observing that the condition C holds.

Removing theories without predictive power

Theories that are less probable than the condition C have no predictive power. Separate them out giving,

t T ( C ) , P ( t | C ) = P ( E ( t ) ) ( j : j T ( C ) P ( E ( j ) ) > P ( E ( C ) ) P ( E ( j ) ) ) + ( j : j T ( C ) P ( E ( j ) ) P ( E ( C ) ) P ( E ( j ) ) ) {\displaystyle \forall t\in T(C),P(t|C)={\frac {P(E(t))}{(\sum _{j:j\in T(C)\land P(E(j))>P(E(C))}P(E(j)))+(\sum _{j:j\in T(C)\land P(E(j))\leq P(E(C))}P(E(j)))}}}

The probability of the theories without predictive power on C is the same as the probability of C. So,

P ( E ( C ) ) = j : j T ( C ) P ( E ( j ) ) P ( E ( C ) ) P ( E ( j ) ) {\displaystyle P(E(C))=\sum _{j:j\in T(C)\land P(E(j))\leq P(E(C))}P(E(j))}

So the probability is,

t T ( C ) , P ( t | C ) = P ( E ( t ) ) P ( E ( C ) ) + j : j T ( C ) P ( E ( j ) ) > P ( E ( C ) ) P ( E ( j ) ) {\displaystyle \forall t\in T(C),P(t|C)={\frac {P(E(t))}{P(E(C))+\sum _{j:j\in T(C)\land P(E(j))>P(E(C))}P(E(j))}}}

and the probability of no prediction for C, written as random ( C ) {\displaystyle \operatorname {random} (C)} ,

P ( random ( C ) | C ) = P ( E ( C ) ) P ( E ( C ) ) + j : j T ( C ) P ( E ( j ) ) > P ( E ( C ) ) P ( E ( j ) ) {\displaystyle P({\text{random}}(C)|C)={\frac {P(E(C))}{P(E(C))+\sum _{j:j\in T(C)\land P(E(j))>P(E(C))}P(E(j))}}}

The probability of a condition was given as,

t , P ( E ( t ) ) = n : R ( n ) t 2 L ( n ) {\displaystyle \forall t,P(E(t))=\sum _{n:R(n)\equiv t}2^{-L(n)}}

Bit strings for theories that are more complex than the bit string given to the agent as input have no predictive power. Their probabilities are better included in the random case. To implement this, a new definition F is given by,

t , P ( F ( t , c ) ) = n : R ( n ) t L ( n ) < L ( c ) 2 L ( n ) {\displaystyle \forall t,P(F(t,c))=\sum _{n:R(n)\equiv t\land L(n)<L(c)}2^{-L(n)}}

Using F, an improved version of the abductive probabilities is,

t T ( C ) , P ( t | C ) = P ( F ( t , c ) ) P ( F ( C , c ) ) + j : j T ( C ) P ( F ( j , c ) ) > P ( F ( C , c ) ) P ( F ( j , c ) ) {\displaystyle \forall t\in T(C),P(t|C)={\frac {P(F(t,c))}{P(F(C,c))+\sum _{j:j\in T(C)\land P(F(j,c))>P(F(C,c))}P(F(j,c))}}}
P ( random ( C ) | C ) = P ( F ( C , c ) ) P ( F ( C , c ) ) + j : j T ( C ) P ( F ( j , c ) ) > P ( F ( C , c ) ) P ( F ( j , c ) ) {\displaystyle P(\operatorname {random} (C)|C)={\frac {P(F(C,c))}{P(F(C,c))+\sum _{j:j\in T(C)\land P(F(j,c))>P(F(C,c))}P(F(j,c))}}}

