### Information Geometry and Machine Learning

I'm drawing up a list of papers which formulate machine learning algorithms as maximum entropy or minimum relative entropy solutions, or more broadly are written in the general framework of information geometry. The list isn't aiming for completeness, rather coverage. If anyone knows of any obvious omissions, I'd be grateful to hear.

I'm interested at the moment in the topic at the end of the list, Bayesian Information Geometry, which might just be what a huge number of machine learning algorithms are approximating. The idea is simple enough. Starting out from a prior distribution in the space of distributions, for any data the decision rule minimises the divergence between the true distribution and its estimate. For a given data set this is equivalent to finding the distribution at the smallest mean distance from the true distribution, the mean being taken with respect to the posterior distribution. Snoussi's paper shows how broad this framework is, with the flexibility to choose your distance function and the weight you give to your choice of prior.
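To make the "smallest mean distance" idea concrete, here is a minimal sketch on a finite alphabet. It is only an illustration under stated assumptions, not Snoussi's construction: the candidate distributions, the posterior weights and the function names are all made up for the example. With the Kullback-Leibler divergence D(p_theta || q) as the distance, the minimiser of the posterior mean divergence is the posterior mixture (the predictive distribution); with the arguments reversed, it is the normalised geometric mean.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def posterior_mean_divergence(q, models, weights):
    """Posterior-weighted average of D(p_theta || q)."""
    return sum(w * kl(p, q) for p, w in zip(models, weights))

def reversed_mean_divergence(q, models, weights):
    """Posterior-weighted average of D(q || p_theta)."""
    return sum(w * kl(q, p) for p, w in zip(models, weights))

rng = np.random.default_rng(0)
# Hypothetical setup: 4 candidate "true" distributions on a 5-letter alphabet.
models = rng.dirichlet(np.ones(5), size=4)
# Stand-in for a posterior over the 4 candidates (assumed, not derived from data).
weights = np.array([0.1, 0.2, 0.3, 0.4])

# Minimiser for D(p_theta || q): the posterior mixture.
mixture = weights @ models
best = posterior_mean_divergence(mixture, models, weights)

# Minimiser for D(q || p_theta): the normalised geometric mean.
log_geo = weights @ np.log(models)
geometric = np.exp(log_geo) / np.exp(log_geo).sum()
best_rev = reversed_mean_divergence(geometric, models, weights)
```

Swapping the divergence changes the estimate from an arithmetic to a geometric average of the candidates, which is one small instance of the flexibility in the choice of distance function.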

At a sociological level, it's interesting to speculate why information geometry has been somewhat reluctantly taken up. I'm sure a large part of this is due to there being no straightforward introduction to the subject. Someone should write an exposition of its key successes without the usual huge dollop of differential geometry in the opening section.

Something I'm also curious to know is why there's not greater use of information geometry by the Gaussian process machine learning theorists. With Gaussian processes as maximum entropy solutions, you'd have thought they'd be tailor-made for the IG treatment, perhaps even to help in the choice of covariance function.

## 12 Comments:

Hi,

I'm an algorithms person who's interested in information geometry, and I can attest to two personal reasons for struggling with this material.

The first is as you indicate: it takes a while to catch up on the differential geometry, and what makes things harder is that some of the key ideas (the idea of dual connections being one) are new to differential geometry as well, if I understand correctly. Thus, the only source for understanding these concepts is the IG material itself (Amari and Nagaoka's book does a fairly good job of this though).

The second reason is one that I hesitate to bring up, because it appears to be the same kind of objection occasionally levelled against category theory. I get the sense that much of the value of the IG approach to date has been explanatory, in that it provides a common framework for many inference problems. What I have seen less of is work that builds upon this, using ideas from differential geometry to prove non-trivial results that were not known (or seemed inaccessible) before.

I believe that this is merely a matter of time, and I do think once some key results pop out, the adoption of IG methods will move a lot faster.

A paper link: Amari wrote a very nice paper in 1995 explaining the well-known EM algorithm as a kind of primal-dual method on dual manifolds. Here is the bibref (and a link to the paper):

@article{ amari95information,

author = "Shun-ichi Amari",

title = "Information Geometry of the {EM} and {em} Algorithms for Neural Networks",

journal = "Neural Networks",

volume = "8",

number = "9",

pages = "1379--1408",

year = "1995",

url = "citeseer.ist.psu.edu/amari95information.html" }

Hello,

First off, let me say how much I enjoy reading your blog.

I wanted to add two possible reasons on why Information Geometry has not taken hold in Machine Learning:

1. The parameter space of non-trivial models (i.e. not in the exponential family) is not necessarily a manifold and can have singularities that need concepts from algebraic geometry. I'm thinking here of the work of Sumio Watanabe in Japan and Pachter and Sturmfels in the USA ('Algebraic Statistics in Computational Biology'). The implication of this is that Information Geometry as synthesized by Amari in the early 80's is incomplete for Machine Learning problems in general. Although I have no reference to Amari saying this directly, his collaboration with Watanabe implies it.

2. Once you leave the exponential family of models, you are almost guaranteed that the Fisher metric cannot be calculated in closed form. Furthermore, the dimensionality of the space under consideration can be huge even in the most trivial of departures from the exactly-solvable models. For example, a mixture of one-dimensional Gaussian models requires carrying out differential geometry calculations in a (3m-1)-dimensional space, where m is the number of mixture components (m means, m variances and m-1 independent weights).

In other words, it is very appealing to think in geometrical terms but the dimensionality of the space and the lack of closed solutions for the metric makes it very hard to actually calculate the geometry of the space, never mind obtain insights from such calculations.
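As a rough illustration of both points, the sketch below estimates the Fisher metric of a two-component Gaussian mixture (3*2 - 1 = 5 parameters) by Monte Carlo, since no closed form is available. Everything here is an assumption made for the example: the parameterisation (w, mu1, mu2, s1, s2), the function names, the finite-difference scores and the sample size. It is not a method taken from Amari or Watanabe.

```python
import numpy as np

def mixture_logpdf(x, theta):
    """Log density of a two-component 1-D Gaussian mixture, theta = (w, mu1, mu2, s1, s2)."""
    w, mu1, mu2, s1, s2 = theta

    def normal(x, mu, s):
        return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

    return np.log(w * normal(x, mu1, s1) + (1 - w) * normal(x, mu2, s2))

def sample_mixture(theta, n, rng):
    """Draw n samples: pick a component with probability w, then sample from it."""
    w, mu1, mu2, s1, s2 = theta
    first = rng.random(n) < w
    return np.where(first, rng.normal(mu1, s1, n), rng.normal(mu2, s2, n))

def fisher_monte_carlo(theta, n=20000, eps=1e-5, seed=0):
    """Estimate the Fisher matrix E[score score^T] using central-difference scores."""
    theta = np.asarray(theta, dtype=float)
    rng = np.random.default_rng(seed)
    x = sample_mixture(theta, n, rng)
    scores = np.empty((n, theta.size))
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        scores[:, i] = (mixture_logpdf(x, theta + step)
                        - mixture_logpdf(x, theta - step)) / (2 * eps)
    return scores.T @ scores / n

# A generic point of the 5-dimensional parameter space.
G = fisher_monte_carlo([0.3, -1.0, 2.0, 1.0, 0.5])

# A singular point: identical components, so the weight w has no effect on the density.
G_singular = fisher_monte_carlo([0.3, 0.0, 0.0, 1.0, 1.0])
```

At the second point the entry of the metric in the direction of w collapses to zero, so the metric degenerates there. That is the kind of singularity mentioned in point 1, and it already shows up with only two components.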

Thanks very much for the comments.

Suresh, yes it's intriguing trying to figure out why a promising program seems not to be flying. Whether it's not as powerful as you thought, or just hasn't been pushed hard enough. I'm not sure where synthetic differential geometry (SDG) is positioned here. I've always wished it well since I heard of it. What's clear is that a program needs a lot of faith and perseverance, and good expository promotion.

Hmm, SDG for IG.

Andres, thanks for the reference and comments. Oh dear, not only do we need a good grip on differential geometry, we also need to know algebraic geometry. Roll on Grothendieck.

It seems that one of those involved in developing the theory of infinite dimensional statistical manifolds, Giovanni Pistone, is also interested in algebraic geometry. I'll have to look at his talk at the 2nd International Symposium on Information Geometry and its Applications, December 12-16, 2005, Tokyo.

This isn't quite on topic, but it's about Bayesianism and machine learning, and it's pretty cool:

Eliezer Yudkowsky and others have set up something called the Singularity Institute for Artificial Intelligence. One of the main goals of this institute right now is to design a friendly AI, meaning roughly: an artificial intelligence that won't be scary if it gets smarter than us. This is quite a challenge, for all sorts of reasons: it amounts to clearly understanding and implementing "friendliness".

Anyway, Yudkowsky makes explicit use of Bayesian reasoning in his thinking about reinforcement, pleasure and pain, and the like. He raises an interesting question about machine learning for machines that can see the effects of their actions:

There is a huge amount of extant material about Bayesian learning, formation of Bayesian networks, decision making using a-priori Bayesian networks, and so on. However, a quick search (online and in MITECS) surprisingly failed to yield the idea that a failed action results in Bayesian disconfirmation of the hypothesis that linked the action to its parent goal. It's easy to find papers on Bayesian reevaluation caused by new data, but I can't find anything on Bayesian reevaluation resulting from actions, or the outcomes of failed/succeeded actions, with the attendant reinforcement effects on the decision system. Even so, my Bayesian priors are such as to find unlikely the idea that "Bayesian pride/disappointment" is unknown to cognitive science, so if anyone knows what search terms I should be looking under, please email me.

Maybe someone here can help?

On a more futuristic note, Yudkowsky introduces the concept of the Bayesian boundary for transhuman artificial intelligences:

Eventually, any Friendly AI undergoing a hard takeoff will cross the Bayesian Boundary. There comes a point when all programmer inputs can be anticipated; when the AI's understanding - the SI's understanding - embraces everything that exists within the minds of the programmers, and indeed, the minds of humanity. This is meant, not in the sense of omniscience, but in the simpler sense that the AI basically understands people in general and Friendship programmers in particular, and can track the few hundred or thousand interacting "chunks" that are the high-level description of a human mind. (If you firmly believe this to be impossible, then there is no Bayesian Boundary and you have nothing to worry about; a Friendly AI does not rationalize, and has no bias toward believing ve possesses a greater capability than ve actually does.)

Beyond this point it is not the actual programmer inputs that matter, but whether the same forces that act on the programmers are acting on the AI. To put it another way, a transhuman AI knows the programmers will say a certain thing, and thus the programmers' physical action adds no information, but that does not mean the content of the statement will be ignored. If the programmers "would have" said the statement for valid reasons, the transhuman AI will "obey" the subjunctive instruction. This is a semi-anthropomorphic way to think about it, possibly even bordering on the adversarial, but it's the basic idea. Note that such anticipation is only possible to transhumans. One human being can never know what another human being will do well enough to substitute the expectation for the reality; thus, the fact that this behavior would be annoying in a human (probably indicating failure of caring) does not indicate failure of Friendliness in AI.

Philosophers might also be interested in Yudkowsky's discussion of philosophical crises that might afflict an artificially intelligent being.

John,

I see you pair this work by Yudkowsky with the demise of the world's oceans in the 1 Aug entry of your diary. Using Bayesian language, my degree of belief that in 200 years, or 1000 years, we will have sufficiently screwed up the planet that no country will have the economic strength to support significant research in AI is far greater than my degree of belief that we will have crossed the 'Bayesian Boundary'. Let us not fantasise the helpful other into existence.

The real science of political economy, which has yet to be distinguished from the bastard science, as medicine from witchcraft, and astronomy from astrology, is that which teaches nations to desire and labour for the things that lead to life: and which teaches them to scorn and destroy the things that lead to destruction.

John Ruskin

If Cuba, a nation "changing from an industrial to an agrarian society", is a good example of an unravelling economy, we may wonder how others will cope.

Your quote from Ruskin, David, is interesting. Although off-topic, I think he was being over-confident. I don't believe we have yet clearly distinguished medicine from witchcraft, given the fact that placebo drugs, pretend operations (opening someone under anaesthetic for heart surgery, but not actually doing the surgery) and pretend psychoanalysis (letting people talk to an untrained intern rather than to a trained psychoanalyst) have been shown to have real, statistically-significant, effects.

...that placebo drugs, pretend operations ... have been shown to have real, statistically-significant, effects.

The perfect opportunity to put in a plug for my forthcoming book, now expected March 2007.

David writes:

I see you pair this work by Yudkowsky with the demise of the world's oceans in the 1 Aug entry of your diary. Using Bayesian language, my degree of belief that in 200 years, or 1000 years, we will have sufficiently screwed up the planet that no country will have the economic strength to support significant research in AI is far greater than my degree of belief that we will have crossed the 'Bayesian Boundary'.

I bet even Yudkowsky might agree that 200-1000 years of the current population levels and current-style industrial economy would destroy the environment to the point where high tech becomes impossible. But, he and his ilk don't think artificial intelligence will take that long. They believe humans will become obsolete in less than a century. And, while I have nothing like their faith in this, I don't think they're completely nuts. Ray Kurzweil has studied how changes in technology are accelerating, and makes the following extrapolations in computing power:

* We achieve one Human Brain capability (2 * 10^16 flops) for $1,000 around the year 2023.

* We achieve one Human Brain capability (2 * 10^16 flops) for one cent around the year 2037.

* We achieve one Human Race capability (2 * 10^26 flops) for $1,000 around the year 2049.

* We achieve one Human Race capability (2 * 10^26 flops) for one cent around the year 2059.
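As a back-of-envelope check on what these extrapolations assume, the sketch below converts pairs of milestones into an implied doubling time for price-performance. The steady-exponential model and the function name are assumptions made for illustration; only the years, prices and flops figures come from the list above.

```python
import math

def doubling_time(years, improvement_factor):
    """Years per doubling, assuming growth is a steady exponential over the interval."""
    return years / math.log2(improvement_factor)

# Same capability (one human brain), price falling from $1,000 to one cent.
price_drop = doubling_time(2037 - 2023, 1000 / 0.01)

# Same price ($1,000), capability rising from one brain to the whole race.
capacity_gain = doubling_time(2049 - 2023, 2e26 / 2e16)

print(f"{price_drop:.2f} and {capacity_gain:.2f} years per doubling")
```

Both figures come out at somewhat under a year per doubling, which is the sustained exponential rate that the list takes for granted.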

Of course it's tendentious to measure the "human brain capability" in flops (floating point operations per second). For one thing, what's special about the human brain is not its sheer computational power but its wonderfully evolved structure and functioning. We're already pretty close to building a petaflop machine (that's 10^15 flops), but nowhere near making it act like a human brain. Maybe information geometry will help. :-)

But Kurzweil's more robust point is this: change is accelerating so rapidly that a linear extrapolation of changes now is likely to be all wrong past a few decades. So, I like to keep an eye on the wild-eyed visionaries as well as the sober pessimists.

I guess what worries me most about AI speculation is that it, along with much academic philosophy, generally fails to take into account the richly intricate social structure that humankind has relied upon to proceed along the path of understanding, and thus how precarious our knowledge is.

David wrote:

Suresh, yes it's intriguing trying to figure out why a promising program seems not to be flying. Whether it's not as powerful as you thought, or just hasn't been pushed hard enough. I'm not sure where synthetic differential geometry (SDG) is positioned here. I've always wished it well since I heard of it.

I think if you want to help out information geometry, you should not bring synthetic differential geometry into the game. Synthetic differential geometry is basically just another formalism for doing differential geometry, based on infinitesimals instead of the usual version of calculus. Ultimately a lot of things can be made clearer using synthetic differential geometry, but there's a high startup cost, since far more mathematicians understand the usual formalism - and most of them don't want to learn another formalism.

So, we need some people to write good expository textbooks on SDG, but in the meantime, anyone with a fledgling research program involving differential geometry should stick to the usual formalism. Using SDG instead would be like trying to start up a new company that builds PCs, competing with Dell and the rest... but only making PCs that use the Dvorak keyboard. Yes, the Dvorak keyboard is better than the usual QWERTY one. But....

Mind you, I love synthetic differential geometry and may wind up using it to prove some theorems in a paper I'm writing with Urs. But, before I do so, I'll have to carefully weigh the costs against the benefits: the cost in diminished readership against the benefits in elegance. And this is for work on categorified gauge theory! Presumably people working on machine learning are even less fond of mathematical elegance for its own sake.
