Google Answers Logo
View Question
 
Q: Page Rank Definition - Proof of convergence and uniqueness ( Answered,   2 Comments )
Question  
Subject: Page Rank Definition - Proof of convergence and uniqueness
Category: Computers > Algorithms
Asked by: dvd03-ga
List Price: $50.00
Posted: 27 Jul 2004 01:42 PDT
Expires: 26 Aug 2004 01:42 PDT
Question ID: 379557
In Page & Brin's "Anatomy of a large scale hypertextual web search
engine", the page rank vector, r, is defined to be:
          r = (1-c)*u + r*D
where u is the unit vector (I'm not concerned with personalised page
rank at the minute) and D is the transition matrix for which Dij =
1/deg(i) if i outlinks to j and 0 otherwise. (I note that Page and
Brin actually provide a different, contradictory, definition in their
98 article, and that Bianchini et al., 2004, and Haveliwala, 2002,
also provide similar, though also different definitions in their
treatments of page rank.)
Using the power method, we get an iterative algorithm for calculation
of the page rank vector:
	r_n+1 = (1-c)*u + r_n*A.
I'm looking for a proof of the algorithm's convergence, and I'm also
looking for a proof that the convergence limit is unique.
Many thanks.

Request for Question Clarification by mathtalk-ga on 27 Jul 2004 07:30 PDT
Hi, dvd03-ga:

Is A (in the "power method") the same as D in the previous definition?

Assuming convergence, the uniqueness of any bounded limit (independent
of the starting value) can be related to the properties of A easily,
as the limit r must by continuity satisfy the equation:

  r = (1-c)*u + r*A

  r*(I-A) = (1-c)*u

and the limit is unique provided (I-A) is invertible.

On the other hand to prove convergence of the "power" iteration, one
needs to show something like a contraction property:

|| r_{n+1} - r_n || < z * || r_n - r_{n-1} ||

where 0 < z < 1, for sufficiently large n.  The spectrum (eigenvalues)
of A can be expected to be important in making such an analysis:

   r_{n+1} - r_n    =   ( r_n - r_{n-1} )*A

So are you trying to carry this out for A = D ?

regards, mathtalk-ga

Clarification of Question by dvd03-ga on 27 Jul 2004 07:58 PDT
Dear mathtalk,
Apologies, A is indeed equal to D.
Thank you, and looking forward to hearing from you again.

Request for Question Clarification by mathtalk-ga on 27 Jul 2004 21:04 PDT
Hi, dvd03-ga:

I believe that while the equation (for a single page rank entry) given
in Page & Brin's "Anatomy..." can be interpreted in the vector form
you give here, a wider reading of the passage motivates a different
interpretation.

To me the passage points to the formulation of page ranks in terms of
the equilibrium vector for a Markov model.

Let be M = (m_i,j) be a Markov (probability) transition matrix in which:

  m_i,j  =  Pr( going to page j next, given page i now )

In the simplified Page Rank formula, we have:

   M  =  (1-c)u'u + c A

where 0 < c < 1 is the scalar "boredom" or damping coefficient, and:

   u  =  [ 1 1 1 ... 1 ] is the "unit" row vector

and A is essentially as you've defined the transition matrix D.

Note that all row sums for A are 1, and the same is true for M.

It follows that A and M both have u' as a column "right" eigenvector
for the eigenvalue 1:

Au' = Mu' = u'

Consequently A and M must also have some row "left" eigenvector for
eigenvalue 1, though because of the possible asymmetry of these
matrices, the left eigenvector will not (in general) be u.

I interpret the "simple power method" alluded to in the paper as being
the iteration:

u, uM, uM^2, ... , uM^n, ...  --> v

suitably scaled so that vu' = 1.

The existence and uniqueness results for such an equilibrium vector
are then quite standard, as is the convergence analysis for the power
method.  Would a write-up for these results be of interest to you?

regards, mathtalk-ga

Clarification of Question by dvd03-ga on 28 Jul 2004 00:37 PDT
Dear Mathtalk,
    I'm agreeing with: I've been working on the problem also and have
also taken a Marcov line of attack. Actually, I've generalised the
problem slightly to allow for personalisation of what was formally the
unit vector - allowing each entry in the row vector to be any real >=
0. This complicates the problem somewhat as it could now be that we
haven't a transition matrix - because it might now no longer strongly
connected. I think we can get round the problem by proving that
particular entries in the row vector, for example, say the entry
corresponding to page a, converge to zero (in fact, = 0) iff there is
no b for which the personalisation vector (formally the unit vector)
entry is no zero and from which a path exists to a. We can then
restrict the matrix M, as you call it, to just those pages which do
not converge to zero. The resultant M', say, is I believe a transition
matrix.
     Must confess, though, I'm finding it quite a bit easier to work
with functions, as opposed to in matrices, and would appreciate a wee
bit of help with the matrices.
     For example, we know that M is defined to be:
         rM = crM + (1-c)p             [where p is the personalisation vector]
     How did you jump from this to your definition of M? What is u'?

Request for Question Clarification by mathtalk-ga on 28 Jul 2004 04:26 PDT
Sorry, u' is simply the column transpose of row vector u, and I missed
a scalar factor in writing M.

u'u is a rank-one square matrix containing all 1's, and (1/n)u'u is a
Markov transition matrix in which one randomly jumps from any state to
any state (even the same state) with probability 1/n (where n is the
number of states, or pages in our case).

So I need to correct my formula for M to get a transition matrix to:

M = (1-c)(1/n)u'u  +  c A

where A is the transition matrix given by "backlinks" among pages.  We
are blending the probability of transition along the links between
pages with the equiprobability of jumping to any page.

The parameter 0 < c < 1 guarantees that the entire "graph" of
states/pages is connected, and in this case the uniqueness of the
equilibrium vector can be proven.

regards, mathtalk-ga

Clarification of Question by dvd03-ga on 28 Jul 2004 05:01 PDT
Ought the scalar factor to have been 1/||r||  (1/one norm of r) - not
1/n, as you propose?

Also, could we suppose that u's entries are "strictly" positive reals
rather than simply 1s (I think I can deal with zero entries impinging
on strong connectedness)? How does this effect your definition of M?

I'm rather curious to see your unique convergence proof.

Regards dvd03

Request for Question Clarification by mathtalk-ga on 28 Jul 2004 05:26 PDT
The factor 1/n is defined in the context of matrix M, namely that A
and hence M will be n x n matrices.

It is also a feature of the power method that one scales the vector
iterations in some fashion to prevent either degeneration to zero or
unbounded growth.  So in this sense a scaling:

 r |--> (1/||r||) r

is sensible.  In fact the Anatomy article proposes that the sum of
entries in r is 1, which is consistent with the interpretation of r as
the equilibrium vector for a Markov process.

In terms of allowing more general vectors u than simply all 1's, this
is hinted at by Page & Brin as well in that article.  Indeed we could
define a one parameter family:

 M(c) = (1-c)G + c A

for any transition matrices G,A whatever, and each M(c) would again be
a transition matrix with:

 M(0) = G ; M(1) = A

One may expect that when c is close to 1 then the equilibrium vector
for M will approach an equilibrium vector for A.  This is true under
connectivity assumptions on A, but as there is no a prior reason to
expect that the Web graph would be connected (there could easily be
small "enclaves"), one advantage of introducing the "boredom" factor
is to obtain connectivity.

I believe that by playing around with the entries in G, ie. using as
you have suggested something other than simply (1/n)u'u, one can
mitigate the ranking distortions from link farms, etc.  However this
is where presumably some human recognition/art is required to choose
an improved "perturbation" G.

I'll put together an Answer that gives the existence and uniqueness
results for the equilibrium vector of a Markov model, and the
numerical convergence properties of the power iteration (and some ways
to accelerate convergence).

regards, mathtalk-ga

Clarification of Question by dvd03-ga on 28 Jul 2004 06:30 PDT
Dear Mathtalk,
	My thoughts as regards the one norm "scaling" were actually purely
definitional in origin.
         The definition given in the "anatomy" article reads: 
 
         	PR(A) = (1-c) + c(PR(T1)/C(T1)+...+PR(Tn)/C(Tn))

         Am I mistaken in thinking this yields the matrix definition:

         	r_ (n+1)= (1-c)(1...1) + r_nA

	And also,
 
    		r_(n+1) = r_n ( (1-c)(1/||r_n||)N + cA )    - where N is a
matrix comprising just 1s? (Things get a bit trickier than this with a
personalised boredom vector, do they not? What then is the
definition?)

 	Page and Brin ignore the 1/||r_n|| scalar factor as they claim
||r_n|| = 1 (in their 98 paper). In this respect, I believe they were
mistaken - see http://members.optusnet.com.au/clausen/reputation/rep-cost-attack.pdf
.

     	I look to your solution.

	Regards,
	dvd03

Clarification of Question by dvd03-ga on 28 Jul 2004 06:39 PDT
Convergence acceleration would be very useful.

Many thanks,

dvd03

Clarification of Question by dvd03-ga on 28 Jul 2004 13:33 PDT
Having thought a bit further on the varying u such that entries are
from (0,infitity) - let us call this vector p, I'm proposing that from
       PR(A) = (1-c)p_a + c(PR(T1)/C(T1)+...+PR(Tn)/C(Tn))    
       
       (where p_a is the entry in p corresponding to page A)

we get the matrix definition:
       r_(n+1)= (1-c)p + cr_nA

from which we get:
       r_(n+1) = r_n( (1-c)*(1/||r_n||)*p1' + cA )


To illustrate, consider a simple two page case, where r = (r1 r2):

(r1' r2') = c(r1 r2) m11 m12   + (1-c)(p1 p2)
                    m21 m22

          = ( c*sum ri*mi1 + (1-c)p1   c*sum ri*mi2 + (1-c)p2 )

          = (r1 r2)  (1-c)*p1/||r|| + c*m11       (1-c)*p2/||r|| + c*m12
                     (1-c)*p1/||r|| + c*m21       (1-c)*p2/||r|| + c*m22

          = r( (1-c)*(1/||r_n||)*p1' + c*A )

Your thoughts?

Request for Question Clarification by mathtalk-ga on 04 Aug 2004 05:17 PDT
Hi, dvd03-ga:

I collected some preliminary remarks on notation and general
definitions related to Markov chains in my second Comment below.  I
plan to post an Answer that, as your Subject asks, gives the proof of
convergence and uniqueness to solving:

  r M = r

by the power method, where M = (1-c)(1/n)u'u  +  c A.  From the
properties of M and how they are deduced from the matrix (1/n)u'u,
namely irreducibility and non-periodicity, it will be apparent that
many alternative "perturbation" terms would provide the same
advantages.

regards, mathtalk-ga

Request for Question Clarification by mathtalk-ga on 06 Aug 2004 07:20 PDT
Hi, dvd03-ga:

The treatments of A.A. Markov's famous "ergodic theorem" which are
found on the Web are mostly for the special case of the matrix having
a complete set of eigenvectors.  However because a probability
transition matrix is not generally symmetric, that cannot be assumed.

If you want a brief treatment of the convergence and uniqueness, I can
supply one based on the assumption of "diagonalizability" (in effect,
a complete set of eigenvectors).  Or if you want a more rigorous
treatment, I can give one based on canonical forms (eg. Shur's lemma),
but it will take longer to prepare the Answer.

regards, mathtalk-ga

Clarification of Question by dvd03-ga on 06 Aug 2004 08:31 PDT
My personal bias would be towards rigor.

Many thanks,

dvd
Answer  
Subject: Re: Page Rank Definition - Proof of convergence and uniqueness
Answered By: mathtalk-ga on 08 Aug 2004 20:00 PDT
 
Hi, dvd03-ga:

Introduction
============

From your comments it seems that you are interested in how a proof of
convergence and uniqueness for the power method applied to the Markov
chain equilibrium problem might be extended to matrices which are
nonnegative but not "stochastic" (row sums equal one).

To prove convergence and uniqueness for Page and Brin's original "Web
matrix" formulation, my approach is structured in a way that I hope
shows where assumption of a probability transition matrix is used and
what parts of the proof are independent of that.

Let me briefly recount how we made our way to a matrix eigenvalue
problem, which will be our focus for the remainder of this Answer. 
You cited Page & Brin's "Anatomy of a large scale hypertextual web
search engine" for this "row vector" equation:

  r = (1-c)u + rA

where u = [1 1 ... 1], c between 0 and 1, and A the discrete
probability transition matrix such that:

          / 
         |  1/N(i) if there exists a link on page i to page j
A(i,j) = <            where N(i) = # pages linked by page i;
         |
          \    0    otherwise.

Note that A is stochastic (row sums are equal to one):

  A u' = u'

Here we detected some inconsistencies.  Page & Brin omitted a factor
1/n in the term (1-c)*u, as subsequently I did also.  You omitted a
factor c in the term r*A.  Making up these omissions:

  r = (1-c)(1/n)u + crA

Furthermore Page & Brin specify the entries in r are nonnegative and
should sum to 1.  So we can write, using ' to denote transpose:

  ru' = 1

Substituting r u' as a factor of unity in one place:

  r = (1-c)(1/n)ru'u + crA

    = rM

where we now have defined a family of probability transition matrices
parameterized by c:

  M = (1-c)(1/n)u'u + cA

Note that M inherits from A the stochastic property, because uu' = n:

  M u' = (1-c)(1/n)u'u u' + cA u' = (1 - c + c) u' = u'

I interpret your Question as wanting a proof of convergence and
uniqueness for finding the "equilibrium" vector in the corresponding
discrete Markov chain by the power method.  For suitable initial
choice of row vector r_0, we define:

  r_{i+1} = r_i M,  i = 0,1,2,...

By induction r_k = r_0 M^k.  Iterate until some threshold of agreement is met:

  || r_{i+1} - r_i || < epsilon

Generally a scaling step is required to satisfy ru' = 1 in the power
method.  As a practical matter, unless the iterates r_k either grow or
shrink too much, this scaling can be deferred to the end:

            1
  r <--| ( --- )r
           ru'

because the equation r = rM is homogeneous in r (and not
coincidentally, the eigenvalue sought here is 1).  But special
circumstances apply here!  If the initial vector r_0 is nonnegative
and r_0 u' = 1, then with exact arithmetic it would remain true that
r_k is nonnegative and r_k u' = 1:

  r_k u' = r_0 M^k u' = r_0 u' = 1

because M u' = u'.

For simplicity we will omit scaling in our description of the power
method, admitting that it would be of some practical use due to
floating-point arithmetic errors.  It will also be necessary for
generalizing to nonnegative M that are not stochastic.

Our analysis will be in three stages.  First we will claim some facts
about the eigenvalues and eigenvectors of M and prove from these
assumptions that the power method converges to a unique solution r (up
to an arbitrary scaling factor).

Second we will prove those claims about M's eigenvalues and
eigenvectors using some results in matrix theory that are perhaps
unfamiliar to you, the Perron-Frobenius theorem in particular.

Finally we give an elementary proof of that result for the sake of completeness.

In closing we'll point out some ways suggested to accelerate convergence.

Convergence and Uniqueness
==========================

The following claims about the eigenvalues ("spectrum") of M will be
used here and justified subsequently:

1.  The maximum absolute value of an eigenvalue of M is 1.

2.  1 is a simple eigenvalue of M and has a positive eigenvector.

3.  Other eigenvalues of M are strictly less than 1 in absolute value.

The last claim we can state more concretely:

3.  The other eigenvalues of M have absolute value less than or equal to c < 1.

Our immediate goal is to show convergence of {r_k : k=0,1,2,... }.

If M were diagonalizable (equiv. had a complete basis of
eigenvectors), then we could write:

  M = S^-1 D S
  
  r_k = r_0 M^k = r_0 S^-1 D^k S

where eigenvalues of M ranked in descending order of their absolute
values, listed according to multiplicity, define diagonal matrix D:

The first eigenvalue d1 = 1 is simple (multiplicity 1), and all
subsequent eigenvalues have absolute value less than or equal to
parameter c.

D = diag(d1,d2,...,dn) where |d1| >= |d2| >= ... >= |dn|

The convergence proceeds "linearly" as the diagonal entries of D^k
beyond the first decay to 0 like O(c^k).

For practical purposes this analysis is adequate, but for rigor we
must consider a more general canonical form for M.  The Jordan
canonical form, to be specific, gives:

  M = S^-1 J S

where J is a block-diagonal upper triangular matrix:

        |              |
        |  1          0  |
        |    J_2         |
        |       .        |
        |         .      |
  J  =  |           .    |
        |  0         J_p |
        |_              _|



whose "Jordan blocks" J_m (for m = 1,..,p) are of the form:

        |               |
        | d_m  1 ...   0  |
        |  0  d_m 1 ...   |
J_m  =  |  .   0  .  ...  |  
        |  .       .      |
        |  .         . 1  |
        |  0       0  d_m |
        |_               _|

Note that diagonal entries d_m are constant in block J_m.  Because d_1
is a simple eigenvalue, the corresponding Jordan block J_1 is just a
11 entry.

[Jordan Canonical Form]
http://www.math.mcmaster.ca/wolkowic/Courses/M741/jordanc.pdf

Now J = D + N where N is nilpotent, and the blocks J_m correspond to
invariant subspaces so that the calculation of J^k is simplified. 
Thus:

  M^k = S^-1 J^k S

where:

        |               | ^ k
        |  1           0  |
        |    J_2          |
        |       .         |
        |         .       |
J^k  =  |           .     |
        |  0          J_p |
        |_               _|

If n_m is the order of block J_m, then:

  J_m  = d_m I + N_m

with N_m nilpotent of order n_m.  Then for blocks m > 1:

 (J_m)^k = (d_m I + N_m)^k
 
         = (d_m)^k I + SUM C(k,j)*(d_m)^(k-j)*(N_m)^j

where the sum is over integers 0 < j < n_m since (N_m)^n_m = 0.

The factors |(d_m)^k| =< c^k shrink geometrically as |d_m| =< c, but
C(k,j) grows like an j'th order polynomial in k.  (N_m)^j is constant
(with respect to k).

Consequently each (J_m)^k tends to zero (m > 1), but not quite at
O(c^k) due to the "algebraic" growth in terms C(k,j) that offsets it
slightly.  However for any b > c, we can show:

(J_m)^k ~ o(b^k)

because polynomials can't keep pace with an exponential.  So choose c
< b < 1 and convergence is still linear, just not quite as rapid.

So J^k converges linearly to a diagonal matrix with first entry 1 and
the remaining entries 0, and the rate of convergence is for practical
purposes like O(c^k).  The convergence of M^k and hence r_k = r_0 M^k
is therefore shown.  However we want to avoid the possibility that the
limit is zero, as would be the case if our initial choice r_0 had no
component in the direction of the (left) eigenvector of M
corresponding to eigenvalue d_1 = 1.

Fortunately a suitable initial choice that avoids this is easy to
make.  Recall from our earlier discussion that if r_0 is nonnegative
and has sum of entries 1, then these facts remain true for all r_k (in
exact arithmetic).

In particular the choice r_0 = (1/n) u always converges to a nonzero limit.

Finally the solution r we get from the convergence of {r_k} is unique, because:

r M = r

and (from Claim 2) the eigenvalue d_1 = 1 is simple.  Subject to the
constraint of nonnegative entries that sum to 1, there can be only one
such eigenvector.  (Any two such would be multiples of one another by
the simplicity of the eigenvalue, and their respective entries' sums
equality forces those multiples to be 1.)

Claims about M
==============

Now we can turn to establishing the Claims made earlier about the
spectrum of M, as its set of eigenvalues (counting multiplicity) is
often called.

Claim 1
-------

As already argued elsewhere, the 1-norm (sum of absolute values of a
vector's entries) can be used to show the modulus (absolute value) of
any eigenvalue of M is at most 1.  Throughout we use norm ||.|| to
mean the 1-norm.

Each row of M, being nonnegative and with entries summing to 1, has a
1-norm equal to 1.  By the definition of matrix multiplication, given
a row x = [x1 x2 ... xn]:

  xM = x1*(1st row of M) + ... + xn(nth row of M)
  
     =   x1 M(1,-) + x2 M(2,-) + ... + xn M(n,-)

Now appeal to the triangle inequality for norms:

  ||xM|| =< |x1|*||M(1,-)|| + ... + |xn|*||M(n,-)||
  
         =  |x1| + |x2| + ... + |xn|

         = ||x||

If d is any eigenvalue of M, and nonzero x an eigenvector corresponding to it:

  ||xM|| =< ||x||

  |d|*||x|| =< ||x||

  |d| =< 1

The last step follows from the prior inequality by dividing by ||x|| > 0.

Thus 1 is an upper bound on the eigenvalues of M, and we already know
from M u' = u' that this upper bound is attained.  (See Comments below
for equivalence of "left" and "right" eigenvalues.)

Claim 2
-------

For the result that 1 is a simple eigenvalue of M, we appeal to a
classical result of Oskar Perron:

Theorem (Perron, 1907)  The eigenvalue of largest absolute value of a
positive (square) matrix A is both simple and positive and belongs to
a positive eigenvector. All other eigenvalues are smaller in absolute
value.

Many proofs of this are known, including two by Perron.  We will
present an elementary proof of a stronger version, the
Perron-Frobenius theorem, in the next section.  For now we simply
reference:

[Some links to Proofs of the Perron-Frobenius theorem]
http://www.csudh.edu/math/gjennings/perron.html

By inspection of the definition of M:

  M = (1-c)(1/n)u'u + cA

we see that each entry is at least (1-c)/n.  So M is positive provided c < 1.

Claim 3
-------

The Perron theorem already encompasses the weak form of Claim 3, that
other eigenvalues are strictly smaller than 1 in absolute value.

The stronger version of Claim 3 identifies parameter c as an upper
bound for the remaining eigenvalues.  For this, see:

[The Second Eigenvalue of the Google Matrix - Haveliwala and Kamvar]
http://www.stanford.edu/~taherh/papers/secondeigenvalue.pdf

"Abstract. We determine analytically the modulus of the second
eigenvalue for the web hyperlink matrix used by Google for computing
PageRank. Specifically, we prove the following statement:

"For any matrix A = [cP + (1-c)E]' where P is an nn row-stochastic
matrix, E is a nonnegative nn rank-one row-stochastic matrix, and 0
=< c =< 1, the second eigenvalue of A has modulus less than or equal
to c. Furthermore, if P has at least two irreducible closed subsets,
the second eigenvalue is equal to c.

"This statement has implications for the convergence rate of the
standard PageRank algorithm as the web scales, for the stability of
PageRank to perturbations to the link structure of the web, for the
detection of Google spammers, and for the design of algorithms to
speed up PageRank."

We have a bit more to say about this paper in the final section on
acceleration techniques.

(to be continued...)

Request for Answer Clarification by dvd03-ga on 09 Aug 2004 09:33 PDT
Dear Mathtalk,
	Very much enjoying your solution. Thank you.
 	Just  a quick question with respect to the 1/n scalar you
introduced, how would things have been affected had it not been
introduced?
 	More generally, supposing we know that v_(n+1) = v*A + r converges
for some row vector r, then can we infer that v_(n+1) = v+A + c*r
converges also, where c is some positive scalar? So, could we infer
from your proof that Page's algorithm, as it was actually given,
converges also? Also, if we can, how do the stationary distributions
(limit vectors) compare with one another? Is the second simply a
scalar multiple of the first, or...?
	Many thanks

Clarification of Answer by mathtalk-ga on 09 Aug 2004 11:47 PDT
Without the 1/n factor, the "fixed point" iteration:

  r := (1-c)u + crA

is likely to converge provided 0 < c < 1.  The converged solution would satisfy:

  r(I - cA) = (1-c)u

  r = (1-c)u(I + cA + (cA)^2 + ... )

The latter matrix series converges because the spectral radius of cA
is c (see your other Question), and this is less than 1.  It gives an
"explicit" form of the inverse of (I - cA).

However this is _not_ the same as solving the "equilibrium" problem:

  r = r[(1-c)(1/n)u'u + cA]

  r(I - cA) = (1-c)(1/n)ru'u

unless ru' = n.  But:

  ru' = (1-c)u(I + cA + (cA)^2 + ...)u'

      = (1-c)u(1-c)^-1 u'

      =  1

Scaling the solution r doesn't affect whether it solves the
equilibrium condition (unless you scale it to zero!), so in general
these conditions are not compatible.

regards, mathtalk-ga

Clarification of Answer by mathtalk-ga on 09 Aug 2004 12:06 PDT
Hmmm, wait a minute.  That last calculation shows ru' = n, not 1.  So
it looks like the conditions are compatible after all.  Let me
double-check, I have a feeling of confusion, not conviction about
this...

-- mathtalk-ga

Request for Answer Clarification by dvd03-ga on 14 Aug 2004 02:01 PDT
Dear Mathtalk,
       You write: 
The latter matrix series converges because the spectral radius of cA
is c (see your other Question), and this is less than 1.

       Which "other Question" did you mean? I've subsequently asked a
question about this very fact
(http://answers.google.com/answers/threadview?id=385860), but I don't
believe I've been answered yet.

Clarification of Answer by mathtalk-ga on 14 Aug 2004 10:52 PDT
Yes, I've not forgotten that one, but actually the "other" Question I meant was:

http://answers.google.com/answers/threadview?id=384263

I'm close to polishing up the proof of Perron-Frobenius (just need to
prove two lemmas), and I wanted to finish up my Answer here before
starting the other.

However the easy part of the Answer is that the eigenvalues of cA are
just c times the eigenvalues of A (for any square matrix), so the
spectral radius (maximum absolute value of an eigenvalue) of cA is |c|
times the spectral radius of A.

For justification that the "geometric" series I + cA + (cA)^2 + ...
converges, we need to appeal to something like the argument used in
this Question to estimate the growth of powers of cA.

As you may recall, we showed using the Jordan canonical form that the
powers of cA tend to zero geometrically (which confusingly is called
linear convergence) with any constant b slightly greater than c, e.g.
like

       (c + 1)
    (  -------  )^k
          2

if we should need to be definite.

Provided only that the series converges "absolutely" (as the geometric
rate implies), then we have a matrix series identity involving an
inverse:

    (I - cA)^(-1) =  I + cA + (cA)^2 + ...

Naturally the inverse shown could exist for many values of c not
restricted to those less than 1, and in fact this is the idea behind
studying the spectrum of A through a complex function called the
resolvant (with poles at the eigenvalues of A).

regards, mathtalk-ga

Clarification of Answer by mathtalk-ga on 15 Aug 2004 08:52 PDT
This is an experiment in writing with a Unicode text editor,
which allows us the use of Greek letters, some superscripts,
and a few other familiar mathematical symbols.

Perron-Frobenius Theorem
========================

The proof below is based on one given by James M. Ortega in
Thm. 6.1.4 of this book:

[Matrix Theory: A Second Course]
http://www.riskbook.com/titles/ortega_j_(1987).htm

It's elementary in the sense of not introducing any theory
of complex functions or fixed-point results requiring a
topological foundation but does depend on such introductory
principles of real analysis as bounded sequences in finite
dimensional vector spaces having convergent subsequences.  

For the sake of consistency with our earlier analysis, the
proof has been adapted to use the 1-norm ||.|| as well as
row-vector-on-the-left multiplication.  The changes are not
substantial however.

Recall that u = [1 1 ? 1] denotes the vector of 1's.  We'll
also use a slightly nonstandard notation that for row vector
x, |x| means the entry-by-entry absolute values (giving again
a row vector of the same length).  Thus |u| = u, for example.

Previously we've defined when a probability transition matrix
is irreducible.  The generalization of this to nonnegative
square matrices is slight but worthy of careful statement.

Defn.  Nonnegative nn matrix A is reducible iff there exist
1 ? i,j ? n s.t. for all a > 0, the row i, column j entry of
A to the power a is zero:

  A(i,j) = 0
  
Otherwise (if for all i,j there exists power a > 0 such that
A(i,j) > 0), we say A is irreducible.  Note that in general
the required power a depends on the indices i,j.

For a probability transition matrix, A(i,j) > 0 means there
is a positive probability of going from state i to state j in
exactly a steps, presuming a ? 0 is a whole number.

Note that if some row (resp. some column) of A is zero, that
same row (resp. same column) of any power A is again zero.
Hence such matrices are reducible.

The proof makes repeated use of (one half of) the following:

Lemma 1: Let A be an nn nonnegative matrix, n > 1.

Then A is irreducible if and only if (I + A)?? > 0.

Proof:  We will only use the forward direction, so naturally
let's start with proving that.  Expanding (I + A)?? as a
binomial (because I commutes with A), we see that the terms
all involve nonnegative matrices.  The leading term I takes
care of all the diagonal terms being positive, so we only 
need to worry about off-diagonal entries.

Let i ? j give a row and column respectively of (I + A)??.
From the definition of irreducible we know that A(i,j) > 0
for some a > 0.  If we could show this with a < n, we'd be
done as this would give a positive contribution to the same
entry in (I + A)??.

The expansion of A(i,j) gives a sum of all products:

  A(k(0),k(1)*A(k(1),k(2))*...*A(k(a-1),k(a))

such that k(0) = i and k(a) = j.  First remove any diagonal
terms A(k,k) from the product, thereby shortening the "path"
from i to j by some possible amount.  More generally remove
any subproduct that "revisits" a previous index, e.g.

  A(k,k')A(k',k")...A(_,k)

wherever it may occur in the product.  Since i ? j we don't
remove the entire product.

Now we have product in which all the indexes are distinct,
and i ? j implies there are at most n-1 factors (since a
factors introduce a+1 indexes).  So we've shown that for
i ? j it is possible to find A(i,j) > 0 with 0 < a < n.
That fills in the off-diagonal entries of (I + A)?? with
positive values as outlined above.

For the sake of completeness let's also prove the reverse
implication.  As we've just analyzed, a positive entry in
(I + A)?? at off-diagonal position (i,j) means that:

  A(i,j) > 0

for some 0 < a < n.  It remains to argue that A(i,i) > 0
for some a > 0.  But since n > 1, we can pick j different
from i. Combine a power of A that has (i,j) entry positive
with one that has (j,i) entry positive, and the product (or
sum of the two powers) gives diagonal (i,i) entry positive.

QED

Remark: I added the qualification n > 1 partly to avoid doubt
over the trivial case A = [0].  Arguably this is irreducible
in the sense that A = [1], but we did not include such sense
in our definition.  Despite this it's easy to verify that the
conclusions of the Perron-Frobenius theorem apply in the 11
case to nonnegative A, and we'll not belabor the point.  Also
we will see from examples later that A(i,i) > 0 may require
a = n, which relates to our filling of off-diagonal entries.

We will also require at one point the following fact:

Lemma 2:  Let b(i), i = 1,?,n be complex numbers such that:

  | ? b(i) | = ? |b(i)|  (sums over i = 1,?,n)

Then there exists a unit complex number c with |c| = 1 s.t.

  b(i) = c*|b(i)| for all i = 1,?,n.

Proof:  This says equality holds in the triangle inequality
only if all the complex numbers have the same direction c.
The converse is true but not stated here, and easy to show
by direct calculation.  Our proof is by induction.

To shorten the proof a bit, let's assume without loss of
generality that all the summands b(i) are nonzero, since
for any zero summand the choice 0 = c*|0| doesn't matter.

The case n = 1 is simple.  We choose c = b(1)/|b(1)|.

The case n = 2 contains all the hard work.  Suppose:

  | b(1) + b(2) | = |b(1)| + |b(2)|

As we assume neither of b(1) or b(2) is zero, divide through
by |b(2)| and label z = b(1)/b(2):

  | z + 1 | = |z| + 1

Square both sides and use the fact that multiplying complex
numbers by their conjugates gives the absolute value squared:
          _         _
  (z + 1)(z + 1) = zz + 2|z| + 1

After clearing common terms from both sides we have:
      _
  z + z = 2|z|
  
  Re(z) = |z|

which implies Im(z) = 0, else |z| > |Re(z)|.

Therefore z = |z| and we can choose:

  c = b(1)/|b(1)| = z b(2)/|z b(2)| = b(2)/|b(2)|

Now the induction step.  Assume the proposition has been shown
for some value n ? 2 and consider the situation when new term
b(n+1) is introduced.  By the triangle inequality:

  | b(n+1) + ? b(i) | ? |b(n+1)| + | ? b(i) |
  
                      ? |b(n+1)|  +  ? |b(i)|

But our hypothesis for n+1 is that the first and last of these
expressions are equal, which forces equality in the middle:

  |b(n+1)| + | ? b(i) | = |b(n+1)|  +  ? |b(i)|
  
  | ? b(i) | = ? |b(i)|  (sums over i = 1,?,n)

So the induction hypothesis applies to the first n terms, and
there exists a complex number c such that:

  |c| = 1 and b(i) = c|b(i)| for all i = 1,?,n

Write p = ? |b(i)| (sum over i = 1,?,n), and also:

  cp = ? c|b(i)| = ? b(i)

Therefore | b(n+1) + cp | = |b(n+1)| + |cp|,

and from the case n = 2 we deduce that:

  c = cp/|cp| = b(n+1)/|b(n+1)|

Now we're done with the induction step.

QED


Theorem (Perron-Frobenius)

Let A be an irreducible nn nonnegative matrix.
  
Then the largest absolute value of an eigenvalue of A is itself
a simple eigenvalue of A with an associated positive eigenvector.

Also, if any row (resp. any column) of A is positive, then all
other eigenvalues of A are strictly smaller in absolute value
than the largest one.

Proof:  Let S = { ? > 0 : for some nonzero x ? 0, xA ? ?x }.

Since A is irreducible, no column of A is zero.  Thus uA > 0,
and by taking ? to be the smallest entry of uA, we have:

  uA ? ?u

Thus ? belongs to S, so S is nonempty.  Moreover for any ? in S,
there exists nonzero x ? 0 such that:

  xA ? ?x

Consequently ? ||x|| ? ||xA|| ? ||x|| * C  where:

  C = max { ||A(i,-)|| : 1 ? i ? n }

is the operator norm on multiplication by A induced by the 1-norm
on row vectors x.  This proves that set S is bounded.

Therefore S has a least upper bound ?.  Let {?_k} be a sequence
of values in S converging to ? with corresponding nonzero vectors
{x_k} normalized so that ||x_k|| = 1.  Then a subsequence of {x_k}
exists which converges to vector x with ||x|| = 1.  Since:

  x_k A ? ?_k x_k

it follows that:

  x A ? ? x

We claim that equality actually holds, for the proof of which we
make our first use of Lemma 1.

Consider y = x A - ? x = x (A - ?I) ? 0.  If y = 0, we are
done with our claim.

Otherwise, if y ? 0, then y (I + A)?? > 0 because any nonzero
entry in y corresponds to a positive row in (I + A)??.

If we set w = x (I + A)??, then w > 0 (again because a nonzero
entry in x meets a positive row in (I + A)??).  Thus:
  
  wA - ?w = w (A - ?I)
           = x (I + A)?? (A - ?I)
           = x (A - ?I) (I + A)??
           = y (I + A)??
           > 0

where use has been made of how A and (I + A)?? commute.

This quickly leads to a contradiction of ? being the least upper
bound of S.  Let ? > 0 be the minimum ratio of respective entries
in wA - ?w to those in w > 0, so that:

  wA - ?w ? ?w

  wA ? (? + ?)w

By this contradiction the claim that y = 0, and thus x A = ? x,
is established.  Now ||x|| = 1, so clearly ? is an eigenvalue of
A of maximum absolute value.

Also, x must have at least one positive entry, and because the
corresponding row in (I + A)?? is positive:

  x (I + A)?? > 0
  
I.e. (1 + ?) x > 0, and thus x > 0 is strictly positive.

The simplicity of the eigenvalue is a bit tedious, but bear with
me.  If the geometric multiplicity of ? were more than one, then
we'd have eigenvector z corresponding to ? linearly independent
of x.  Since ? is real (and A is real), we may take z real.

By linear independence all combinations x + ?z are eigenvectors
of A corresponding to ? for any real ?.  Perturbing x with ?z
using positive or negative scalar ? as necessary, we could obtain

   x + ?z ? 0

with at least one entry equal to zero.  But the above proof that
x > 0 applies equally to x + ?z, and we have a contradiction.

This establishes that the geometric multiplicity of ? is 1, but
we must also rule out any algebraic multiplicity greater than 1,
i.e. that ? appears in a Jordan block of size m > 1.

If the latter were the case, then there'd be vector v such that:

  v (A - ?I) = 0,  v (A - ?I) ? 0

and again the fact A and ? are real allows us to "solve" for v
to be real as well.

Since v (A - ?I) is an eigenvector associated with ?, it must
by what has been shown so far be a nonzero multiple of x.

Conversely we could write, using the reciprocal of that scalar:

  x = ?v (A - ?I) > 0

But this implies:

  ?v A > ?(?v)

and if we take |?v| to be the entry-by-entry absolute value of
?v, then by the triangle inequality applied column-wise to A:

  |?v|A > ?|?v|

which gives us a contradiction to maximality of ? in S.

So eigenvalue ? has both algebraic and geometric multiplicity
1, i.e. a simple eigenvalue.

It remains to show that no other eigenvalue has an equal modulus
or absolute value as ?.  For this we need to assume more than
just the irreducibility of A, as can be illustrated by taking A
to be any cyclic permutation matrix, so that A? = I is a minimal
polynomial.  All the eigenvalues (characteristic roots) fall on
the unit circle, so ? = 1 is not strictly greater in absolute
value than the other eigenvalues.

For our purpose we'll analyze the case that A has at least one
column of all positive entries.  The dual case of A with a row
of all positive entries then follows as previously noted in our
Comments by considering the equal "spectrum" (eigenvalues) of
transpose A'.  [See Remark below for how it is usually phrased
in connection with Markov chains, which gives slightly greater
generality.]

Suppose for the sake of contradiction a different eigenvalue ?
with |?| = ?.  Let nonzero x be an associated eigenvector:

  xA = ?x

Now ?|x| = |?x| ? |x| A by the triangle inequality once more.

Repeat the argument leading from x A ? ?x to x A = ?x
but applied to |x|, and we have that ?|x| = |x|A and so |x|
is a multiple of x, say |x| = ?x where ? > 0.

We further claim that |xA| = |x| A, since by the above:

  |xA| = |?x| = ?|x| = |x|A

Given that say the k'th column of A is positive, this means:

  | ? x(i)*a(i,k) | = ? |x(i)|*a(i,k)

where the indicated sums are over i = 1,?,n.

This equality implies by Lemma 2 there exists complex c with
|c| = 1 such that:

      x(i)*a(i,k) = c*|x(i)|*a(i,k)

Then because each a(i,k) ? 0, x = c|x| = c?x.  Since x is a
multiple of x, we have shown eigenvalue ? = ?.

QED

Remark:  In the context of Markov chains, one often asks
that the probability transition matrix A is both irreducible
and _not_ "periodic".  Periodic means that for the chance
of going from state i back to state i to be positive, the
number of steps needed is always divisible by some integer
t > 1.  This condition is closely related to the positive
row or column condition that we've used in our statement of
the Perron-Frobenius theorem.  It can be shown that if A is
irreducible and not periodic, then A > 0 for some power a
of A.  Hence our assumption of a positive row or column is
then applicable to the power of A, with the result that all
conclusions remain valid.  Likewise one can show inductively
that if any row (or column) of irreducible nonnegative nn
matrix A is positive, then A? > 0.  Ortega gives this as an
Exercise 6.1.10, calling nonnegative matrix A "primitive"
if A > 0 for some power a of A.  The difficulty we see if
A is a cyclic permutation matrix is precisely that there is
no common power a for which all the entries of A > 0, and
indeed we reach A? = I only to start over.  In these terms
we could state the half of Lemma 1 we used repeatedly as
A irreducible implies (I+A) primitive.

Acceleration methods
====================

We already had occasion to mention this paper:

[The Second Eigenvalue of the Google Matrix]
http://www.stanford.edu/~taherh/papers/secondeigenvalue.pdf

in which Taher H. Haveliwala and Sepandar D. Kamvar give an
explicit bound |?| ? c, where c is the parameter appearing
the the PageRank matrix definition, on the "second" (and
smaller) eigenvalue(s) of A.

Knowing a good estimate for the linear rate of convergence
of the power method permits Aitken extrapolation.  Authors
of the above paper are joined by two more Stanford Univ.
colleagues in exploring variations on this algorithm here:

[Extrapolation Methods for Accelerating PageRank Computations]
http://www2003.org/cdrom/papers/refereed/p270/kamvar-270-xhtml/

A different approach to accelerating convergence has been
proposed by:

[Updating PageRank with Iterative Aggregation]
(by Amy N. Langville and Carl D. Meyer)
http://www.www2004.org/proceedings/docs/2p392.pdf

"ABSTRACT  We present an algorithm for updating the PageRank
vector [1]. Due to the scale of the web, Google only updates
its famous PageRank vector on a monthly basis. However, the Web
changes much more frequently. Drastically speeding the PageRank
computation can lead to fresher, more accurate rankings of the
webpages retrieved by search engines.  It can also make the goal
of real-time personalized rankings within reach. On two small
subsets of the web, our algorithm updates PageRank using just
25% and 14%, respectively, of the time required by the original
PageRank algorithm. Our algorithm uses iterative aggregation
techniques [7, 8] to focus on the slow-converging states of the
Markov chain. The most exciting feature of this algorithm is
that it can be joined with other PageRank acceleration methods,
such as the dangling node lumpability algorithm [6], quadratic
extrapolation [4], and adaptive PageRank [3], to realize even
greater speedups (potentially a factor of 60 or more speedup
when all algorithms are combined)."

Some analysis of the convergence properties of the IAD method
is provided here:

[Convergence Analysis of an Improved Pagerank Algorithm]
(by Ilse C.F. Ipsen and Steve Kirkland)
http://www4.ncsu.edu/~ipsen/ps/tr04-02.pdf

"ABSTRACT  The iterative aggregation/disaggregation (IAD) method
is an improvement of the PageRank algorithm used by the search
engine Google to compute stationary probabilities of very large
Markov chains.

"In this paper the convergence, in exact arithmetic, of the IAD
method is analyzed. The IAD method is expressed as the power
method preconditioned by an incomplete LU factorization. This
leads to a simple derivation of the asymptotic convergence rate
of the IAD method.

"It is shown that the power method applied to the Google matrix
always converges, and that the convergence rate of the IAD method
is at least as good as that of the power method. Furthermore, by
exploiting the hyperlink structure of the web it can be shown
that the convergence rate of the IAD method applied to the Google
matrix can be made strictly faster than that of the power method."

regards, mathtalk-ga
Comments  
Subject: Re: Page Rank Definition - Proof of convergence and uniqueness
From: mathtalk-ga on 29 Jul 2004 04:02 PDT
 
One version of the Page & Brin paper here:

[Anatomy of a Large-Scale Hypertextual Web Search Engine]
http://www-db.stanford.edu/~backrub/google.html

gives this definition of PageRank:

"We assume page A has pages T1...Tn which point to it (i.e., are
citations). The parameter d is a damping factor which can be set
between 0 and 1. We usually set d to 0.85. There are more details
about d in the next section. Also C(A) is defined as the number of
links going out of page A. The PageRank of a page A is given as
follows:

  PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

"Note that the PageRanks form a probability distribution over web
pages, so the sum of all web pages' PageRanks will be one."

The parameter d here is equivalent to the parameter c used in the
formulation that dvd03-ga asks about.  There is clearly a factor
missing from the term (1-d) because if we consider the limiting case d
= 0, then PR(A) = 1 for all pages.  This contradicts the claim about
having a probability distribution, ie. the sum of all PageRanks being
1.

I think from the context it's clear that a factor 1/N (where N is the
total number of all pages) was intended:

  PR(A) = (1-d)/N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Then PR(A) = 1/N in the limiting case d = 0, which is a probability
distribution.  Taking 1-d > 0 assures (among other things) that every
page has a positive probability of being reached from every other
page, actually in a single step.  This in turn implies that the Markov
"chain" which includes the (1-d)/N perturbations is "irreducible", and
in turn contributes to the uniqueness of its equilibrium vector (whose
components are PR(A) or r_n, in the notation used above by dvd03-ga).

[Markov Chains in Discrete Time]
http://www.staff.city.ac.uk/r.j.gerrard/courses/2dsm/dsm03_4.htm

"A Markov chain is called irreducible if, starting from any one of the
states, it is possible to get to any other state (not necessarily in
one jump)."

regards, mathtalk-ga
Subject: Re: Page Rank Definition - Proof of convergence and uniqueness
From: mathtalk-ga on 04 Aug 2004 05:07 PDT
 
Some preliminary discussion of terminology and notation:

An n by n matrix M is a Markov transition matrix iff:

a) 0 <= M(i,j) <= 1  for each entry (row i, column j)

b) 1 = SUM M(i,j) OVER j = 1,..,n  for each row i

Notice that the definition is not "symmetric" with respect to the
roles of rows and columns; (b) asks only that row sums are unity.  So
there is some opportunity for mischief here in getting the definition
backward, and in some sense the literature on Markov chain models
often does seem "left-handed" in comparison to the usual approach for
solving systems of simultaneous linear equations (where unknowns are
usually a column in the matrix formulation).

We identify the entries of M as conditional probabilities of
transition among n "states".  That is:

M(i,j) = Pr( going to state j, given state i )

This interpretation makes clear the "meaning" of (b), ie. given state
i, the transitions to states j are mutually exclusive and exhaustive
possibilities.  It also shows that we cannot expect M to be a
symmetric matrix in most cases.

The name for these discrete models is taken from Russian mathematician
A.A. Markov, who developed their theory roughly one hundred years ago,
long before computers would make application of them practical for
large sets of states.  Today we consider such an application with one
state for every page on the Web!

A Markov chain model consists of an initial "state", which can
actually be a distribution of probabilities on all n states, expressed
as a row vector:

u_0 = [ a1, a2, . . . , an ]

together with a transition matrix M.  We then define by induction:

u_{i+1} = u_i M  for i = 0,1,2,...

Thus u_k = u_0 M^k, and one often thinks of these as discrete time
steps, eg. as the history of a system evolving under a "stationary"
action (so called because M is constant with respect to time).

This iterative multiplication of u_0 by M is precisely the power
method for finding a dominant eigenvalue and eigenvector of M, albeit
with the multiplication being done from the right and forming rows
rather than columns as is typically chosen in matrix linear algebra.

Some "math facts" about eigenvalues are in order here.  Usually we
define an eigenvalue for matrix M by requiring a non-zero column
vector v' (I'll use ' to denote transpose) such that:

M v' = c v'

where c is a scalar.  The eigenvalue c of M is also then a root of the
characteristic equation for M:

det( xI - M ) = 0

for unknown x, and because determinant is a scalar function preserved
by transposing its argument, M and M' have the same characteristic
equation and the same eigenvalues.

Stated another way:

  there exists non-zero vector u such that u M = c u
  
    <==>  there exists non-zero vector v such that M v' = c v'

But while "left" and "right" eigenvalues are the same for general
(square) matrices, it is not so with "left" and "right" eigenvectors:

[Left Eigenvector - mathworld]
http://mathworld.wolfram.com/LeftEigenvector.html

As it happens, we already know a right eigenvector of M for c = 1, namely by b:

M v' = v' when v = [ 1, 1, . . . , 1 ]

So we can be sure that for Markov transition matrices there is a
vector u such that:

u M = u

Such a vector is called an equilibrium in this context, if the entries
are nonnegative and scaled so that they sum to 1 (so that they form a
discrete probability distribution on the states).  The interpretation
is that such a "blend" of states is stable with respect to the Markov
chain's history.

One can of course "solve" the homogeneous problem u (M - I) = 0 to
find such vectors directly.  But for large matrices it is attractive
to try iterative approximations like the power method that are less
"expensive".

regards, mathtalk-ga

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.
Search Google Answers for
Google Answers  


Google Home - Answers FAQ - Terms of Service - Privacy Policy