Q: Page Rank Definition - Proof of convergence and uniqueness (Answered, 2 Comments)
 Question
Subject: Page Rank Definition - Proof of convergence and uniqueness Category: Computers > Algorithms Asked by: dvd03-ga List Price: $50.00 Posted: 27 Jul 2004 01:42 PDT Expires: 26 Aug 2004 01:42 PDT Question ID: 379557
```In Page & Brin's "Anatomy of a large scale hypertextual web search engine", the page rank vector, r, is defined to be: r = (1-c)*u + r*D where u is the unit vector (I'm not concerned with personalised page rank at the minute) and D is the transition matrix for which D_ij = 1/deg(i) if i outlinks to j and 0 otherwise. (I note that Page and Brin actually provide a different, contradictory, definition in their 98 article, and that Bianchini et al., 2004, and Haveliwala, 2002, also provide similar, though also different, definitions in their treatments of page rank.) Using the power method, we get an iterative algorithm for calculation of the page rank vector: r_{n+1} = (1-c)*u + r_n*A. I'm looking for a proof of the algorithm's convergence, and I'm also looking for a proof that the convergence limit is unique. Many thanks.``` Request for Question Clarification by mathtalk-ga on 27 Jul 2004 07:30 PDT ```Hi, dvd03-ga: Is A (in the "power method") the same as D in the previous definition? Assuming convergence, the uniqueness of any bounded limit (independent of the starting value) can be related to the properties of A easily, as the limit r must by continuity satisfy the equation: r = (1-c)*u + r*A r*(I-A) = (1-c)*u and the limit is unique provided (I-A) is invertible. On the other hand to prove convergence of the "power" iteration, one needs to show something like a contraction property: || r_{n+1} - r_n || < z * || r_n - r_{n-1} || where 0 < z < 1, for sufficiently large n. The spectrum (eigenvalues) of A can be expected to be important in making such an analysis: r_{n+1} - r_n = ( r_n - r_{n-1} )*A So are you trying to carry this out for A = D ? regards, mathtalk-ga``` Clarification of Question by dvd03-ga on 27 Jul 2004 07:58 PDT ```Dear mathtalk, Apologies, A is indeed equal to D. Thank you, and looking forward to hearing from you again.``` Request for Question Clarification by mathtalk-ga on 27 Jul 2004 21:04 PDT ```Hi, dvd03-ga: I believe that while the equation (for a single page rank entry) given in Page & Brin's "Anatomy..." can be interpreted in the vector form you give here, a wider reading of the passage motivates a different interpretation. To me the passage points to the formulation of page ranks in terms of the equilibrium vector for a Markov model. Let M = (m_i,j) be a Markov (probability) transition matrix in which: m_i,j = Pr( going to page j next, given page i now ) In the simplified Page Rank formula, we have: M = (1-c)u'u + c A where 0 < c < 1 is the scalar "boredom" or damping coefficient, and: u = [ 1 1 1 ... 1 ] is the "unit" row vector and A is essentially as you've defined the transition matrix D. Note that all row sums for A are 1, and the same is true for M. It follows that A and M both have u' as a column "right" eigenvector for the eigenvalue 1: Au' = Mu' = u' Consequently A and M must also have some row "left" eigenvector for eigenvalue 1, though because of the possible asymmetry of these matrices, the left eigenvector will not (in general) be u. I interpret the "simple power method" alluded to in the paper as being the iteration: u, uM, uM^2, ... , uM^n, ... --> v suitably scaled so that vu' = 1. The existence and uniqueness results for such an equilibrium vector are then quite standard, as is the convergence analysis for the power method. Would a write-up for these results be of interest to you?
regards, mathtalk-ga``` Clarification of Question by dvd03-ga on 28 Jul 2004 00:37 PDT ```Dear Mathtalk, I'm in agreement: I've been working on the problem too and have also taken a Markov line of attack. Actually, I've generalised the problem slightly to allow for personalisation of what was formerly the unit vector - allowing each entry in the row vector to be any real >= 0. This complicates the problem somewhat as it could now be that we haven't a transition matrix - because the graph might no longer be strongly connected. I think we can get round the problem by proving that particular entries in the row vector, for example, say the entry corresponding to page a, converge to zero (in fact, = 0) iff there is no b for which the personalisation vector (formerly the unit vector) entry is non-zero and from which a path exists to a. We can then restrict the matrix M, as you call it, to just those pages which do not converge to zero. The resultant M', say, is I believe a transition matrix. Must confess, though, I'm finding it quite a bit easier to work with functions, as opposed to matrices, and would appreciate a wee bit of help with the matrices. For example, we know that r is defined by: r = crM + (1-c)p [where p is the personalisation vector] How did you jump from this to your definition of M? What is u'?``` Request for Question Clarification by mathtalk-ga on 28 Jul 2004 04:26 PDT ```Sorry, u' is simply the column transpose of row vector u, and I missed a scalar factor in writing M. u'u is a rank-one square matrix containing all 1's, and (1/n)u'u is a Markov transition matrix in which one randomly jumps from any state to any state (even the same state) with probability 1/n (where n is the number of states, or pages in our case). So I need to correct my formula for M to get a transition matrix to: M = (1-c)(1/n)u'u + c A where A is the transition matrix given by "backlinks" among pages. We are blending the probability of transition along the links between pages with the equiprobability of jumping to any page. The parameter 0 < c < 1 guarantees that the entire "graph" of states/pages is connected, and in this case the uniqueness of the equilibrium vector can be proven. regards, mathtalk-ga``` Clarification of Question by dvd03-ga on 28 Jul 2004 05:01 PDT ```Ought the scalar factor to have been 1/||r|| (1/one norm of r) - not 1/n, as you propose? Also, could we suppose that u's entries are "strictly" positive reals rather than simply 1s (I think I can deal with zero entries impinging on strong connectedness)? How does this affect your definition of M? I'm rather curious to see your unique convergence proof. Regards dvd03``` Request for Question Clarification by mathtalk-ga on 28 Jul 2004 05:26 PDT ```The factor 1/n is defined in the context of matrix M, namely that A and hence M will be n x n matrices. It is also a feature of the power method that one scales the vector iterations in some fashion to prevent either degeneration to zero or unbounded growth. So in this sense a scaling: r |--> (1/||r||) r is sensible. In fact the Anatomy article proposes that the sum of entries in r is 1, which is consistent with the interpretation of r as the equilibrium vector for a Markov process. In terms of allowing more general vectors u than simply all 1's, this is hinted at by Page & Brin as well in that article.
Indeed we could define a one parameter family: M(c) = (1-c)G + c A for any transition matrices G,A whatever, and each M(c) would again be a transition matrix with: M(0) = G ; M(1) = A One may expect that when c is close to 1 then the equilibrium vector for M will approach an equilibrium vector for A. This is true under connectivity assumptions on A, but as there is no a priori reason to expect that the Web graph would be connected (there could easily be small "enclaves"), one advantage of introducing the "boredom" factor is to obtain connectivity. I believe that by playing around with the entries in G, ie. using as you have suggested something other than simply (1/n)u'u, one can mitigate the ranking distortions from link farms, etc. However this is where presumably some human recognition/art is required to choose an improved "perturbation" G. I'll put together an Answer that gives the existence and uniqueness results for the equilibrium vector of a Markov model, and the numerical convergence properties of the power iteration (and some ways to accelerate convergence). regards, mathtalk-ga``` Clarification of Question by dvd03-ga on 28 Jul 2004 06:30 PDT ```Dear Mathtalk, My thoughts as regards the one norm "scaling" were actually purely definitional in origin. The definition given in the "anatomy" article reads: PR(A) = (1-c) + c(PR(T1)/C(T1)+...+PR(Tn)/C(Tn)) Am I mistaken in thinking this yields the matrix definition: r_(n+1) = (1-c)(1...1) + r_nA And also, r_(n+1) = r_n ( (1-c)(1/||r_n||)N + cA ) - where N is a matrix comprising just 1s? (Things get a bit trickier than this with a personalised boredom vector, do they not? What then is the definition?) Page and Brin ignore the 1/||r_n|| scalar factor as they claim ||r_n|| = 1 (in their 98 paper). In this respect, I believe they were mistaken - see http://members.optusnet.com.au/clausen/reputation/rep-cost-attack.pdf . I look to your solution. Regards, dvd03``` Clarification of Question by dvd03-ga on 28 Jul 2004 06:39 PDT ```Convergence acceleration would be very useful. Many thanks, dvd03``` Clarification of Question by dvd03-ga on 28 Jul 2004 13:33 PDT ```Having thought a bit further on the varying u such that entries are from (0,infinity) - let us call this vector p - I'm proposing that from PR(A) = (1-c)p_a + c(PR(T1)/C(T1)+...+PR(Tn)/C(Tn)) (where p_a is the entry in p corresponding to page A) we get the matrix definition: r_(n+1) = (1-c)p + c*r_n*A from which we get: r_(n+1) = r_n( (1-c)*(1/||r_n||)*1'p + cA ) - here 1' is the column vector of 1s, so 1'p is the rank-one matrix whose rows are all p. To illustrate, consider a simple two page case, where r = (r1 r2):

(r1' r2') = c(r1 r2) [ m11  m12 ] + (1-c)(p1 p2)
                     [ m21  m22 ]

          = ( c*sum ri*mi1 + (1-c)p1    c*sum ri*mi2 + (1-c)p2 )

          = (r1 r2) [ (1-c)*p1/||r|| + c*m11    (1-c)*p2/||r|| + c*m12 ]
                    [ (1-c)*p1/||r|| + c*m21    (1-c)*p2/||r|| + c*m22 ]

          = r( (1-c)*(1/||r_n||)*1'p + c*A )

Your thoughts?``` Request for Question Clarification by mathtalk-ga on 04 Aug 2004 05:17 PDT ```Hi, dvd03-ga: I collected some preliminary remarks on notation and general definitions related to Markov chains in my second Comment below. I plan to post an Answer that, as your Subject asks, gives the proof of convergence and uniqueness to solving: r M = r by the power method, where M = (1-c)(1/n)u'u + c A. From the properties of M and how they are deduced from the matrix (1/n)u'u, namely irreducibility and non-periodicity, it will be apparent that many alternative "perturbation" terms would provide the same advantages.
regards, mathtalk-ga``` Request for Question Clarification by mathtalk-ga on 06 Aug 2004 07:20 PDT ```Hi, dvd03-ga: The treatments of A.A. Markov's famous "ergodic theorem" which are found on the Web are mostly for the special case of the matrix having a complete set of eigenvectors. However because a probability transition matrix is not generally symmetric, that cannot be assumed. If you want a brief treatment of the convergence and uniqueness, I can supply one based on the assumption of "diagonalizability" (in effect, a complete set of eigenvectors). Or if you want a more rigorous treatment, I can give one based on canonical forms (eg. Schur's lemma), but it will take longer to prepare the Answer. regards, mathtalk-ga``` Clarification of Question by dvd03-ga on 06 Aug 2004 08:31 PDT ```My personal bias would be towards rigor. Many thanks, dvd```
```Hi, dvd03-ga:

Introduction
============

From your comments it seems that you are interested in how a proof of convergence and uniqueness for the power method applied to the Markov chain equilibrium problem might be extended to matrices which are nonnegative but not "stochastic" (row sums equal one). To prove convergence and uniqueness for Page and Brin's original "Web matrix" formulation, my approach is structured in a way that I hope shows where the assumption of a probability transition matrix is used and what parts of the proof are independent of that. Let me briefly recount how we made our way to a matrix eigenvalue problem, which will be our focus for the remainder of this Answer. You cited Page & Brin's "Anatomy of a large scale hypertextual web search engine" for this "row vector" equation: r = (1-c)u + rA where u = [1 1 ... 1], c between 0 and 1, and A the discrete probability transition matrix such that:

             / 1/N(i)   if there exists a link on page i to page j,
    A(i,j) = <          where N(i) = # pages linked by page i;
             \ 0        otherwise.

Note that A is stochastic (row sums are equal to one): A u' = u' Here we detected some inconsistencies. Page & Brin omitted a factor 1/n in the term (1-c)*u, as subsequently I did also. You omitted a factor c in the term r*A. Making up these omissions: r = (1-c)(1/n)u + crA Furthermore Page & Brin specify the entries in r are nonnegative and should sum to 1. So we can write, using ' to denote transpose: ru' = 1 Substituting r u' as a factor of unity in one place: r = (1-c)(1/n)ru'u + crA = rM where we now have defined a family of probability transition matrices parameterized by c: M = (1-c)(1/n)u'u + cA Note that M inherits from A the stochastic property, because uu' = n: M u' = (1-c)(1/n)u'u u' + cA u' = (1 - c + c) u' = u' I interpret your Question as wanting a proof of convergence and uniqueness for finding the "equilibrium" vector in the corresponding discrete Markov chain by the power method. For suitable initial choice of row vector r_0, we define: r_{i+1} = r_i M, i = 0,1,2,... By induction r_k = r_0 M^k. Iterate until some threshold of agreement is met: || r_{i+1} - r_i || < epsilon Generally a scaling step is required to satisfy ru' = 1 in the power method. As a practical matter, unless the iterates r_k either grow or shrink too much, this scaling can be deferred to the end: r <-- ( 1/(ru') ) r because the equation r = rM is homogeneous in r (and not coincidentally, the eigenvalue sought here is 1). But special circumstances apply here! If the initial vector r_0 is nonnegative and r_0 u' = 1, then with exact arithmetic it would remain true that r_k is nonnegative and r_k u' = 1: r_k u' = r_0 M^k u' = r_0 u' = 1 because M u' = u'. For simplicity we will omit scaling in our description of the power method, admitting that it would be of some practical use due to floating-point arithmetic errors. It will also be necessary for generalizing to nonnegative M that are not stochastic. Our analysis will be in three stages. First we will claim some facts about the eigenvalues and eigenvectors of M and prove from these assumptions that the power method converges to a unique solution r (up to an arbitrary scaling factor). Second we will prove those claims about M's eigenvalues and eigenvectors using some results in matrix theory that are perhaps unfamiliar to you, the Perron-Frobenius theorem in particular. Finally we give an elementary proof of that result for the sake of completeness. In closing we'll point out some ways suggested to accelerate convergence.
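For concreteness, the iteration just described can be sketched numerically as follows. This is an illustration only, assuming numpy; the 3-page graph, function name and tolerance are invented for the example, and a real implementation would exploit the sparsity of A:

    import numpy as np

    def pagerank_power(A, c=0.85, tol=1e-12, max_iter=1000):
        # Power iteration for r = r M with M = (1-c)(1/n)u'u + c A.
        # A is a row-stochastic link matrix; returns the equilibrium
        # row vector r, nonnegative with entries summing to 1.
        n = A.shape[0]
        r = np.full(n, 1.0 / n)                  # r_0 = (1/n)u
        for _ in range(max_iter):
            # r M = (1-c)(1/n)(r u')u + c r A, and r u' = 1 is preserved
            # exactly, so the rank-one term reduces to the scalar (1-c)/n.
            r_next = (1.0 - c) / n + c * (r @ A)
            if np.abs(r_next - r).sum() < tol:   # 1-norm stopping test
                return r_next
            r = r_next
        return r

    # Tiny example: page 0 links to 1 and 2, page 1 to 2, page 2 to 0.
    A = np.array([[0.0, 0.5, 0.5],
                  [0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0]])
    r = pagerank_power(A)
    print(r, r.sum())    # positive entries summing to 1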
Convergence and Uniqueness
==========================

The following claims about the eigenvalues ("spectrum") of M will be used here and justified subsequently:

1. The maximum absolute value of an eigenvalue of M is 1.
2. 1 is a simple eigenvalue of M and has a positive eigenvector.
3. Other eigenvalues of M are strictly less than 1 in absolute value.

The last claim we can state more concretely:

3. The other eigenvalues of M have absolute value less than or equal to c < 1.

Our immediate goal is to show convergence of {r_k : k=0,1,2,... }. If M were diagonalizable (equiv. had a complete basis of eigenvectors), then we could write: M = S^-1 D S r_k = r_0 M^k = r_0 S^-1 D^k S where the eigenvalues of M, ranked in descending order of their absolute values and listed according to multiplicity, define the diagonal matrix: D = diag(d1,d2,...,dn) where |d1| >= |d2| >= ... >= |dn| The first eigenvalue d1 = 1 is simple (multiplicity 1), and all subsequent eigenvalues have absolute value less than or equal to parameter c. The convergence proceeds "linearly" as the diagonal entries of D^k beyond the first decay to 0 like O(c^k). For practical purposes this analysis is adequate, but for rigor we must consider a more general canonical form for M. The Jordan canonical form, to be specific, gives: M = S^-1 J S where J is a block-diagonal upper triangular matrix:

        |¯ 1              0 ¯|
        |    J_2             |
        |        .           |
    J = |          .         |
        |            .       |
        |_ 0            J_p _|

whose "Jordan blocks" J_m (for m = 1,..,p) are of the form:

          |¯ d_m  1            0 ¯|
          |  0   d_m  1  ...     |
    J_m = |       0   .  ...     |
          |            .   .     |
          |              .   1   |
          |_ 0           0  d_m _|

Note that diagonal entries d_m are constant in block J_m. Because d_1 is a simple eigenvalue, the corresponding Jordan block J_1 is just a 1×1 entry. [Jordan Canonical Form] http://www.math.mcmaster.ca/wolkowic/Courses/M741/jordanc.pdf Now J = D + N where N is nilpotent, and the blocks J_m correspond to invariant subspaces so that the calculation of J^k is simplified. Thus: M^k = S^-1 J^k S where:

          |¯ 1              0 ¯| ^ k
          |    J_2             |
          |        .           |
    J^k = |          .         |
          |            .       |
          |_ 0            J_p _|

If n_m is the order of block J_m, then: J_m = d_m I + N_m with N_m nilpotent of order n_m. Then for blocks m > 1: (J_m)^k = (d_m I + N_m)^k = (d_m)^k I + SUM C(k,j)*(d_m)^(k-j)*(N_m)^j where the sum is over integers 0 < j < n_m since (N_m)^n_m = 0. The factors |(d_m)^k| =< c^k shrink geometrically as |d_m| =< c, but C(k,j) grows like a j'th order polynomial in k. (N_m)^j is constant (with respect to k). Consequently each (J_m)^k tends to zero (m > 1), but not quite at O(c^k) due to the "algebraic" growth in terms C(k,j) that offsets it slightly. However for any b > c, we can show: (J_m)^k ~ o(b^k) because polynomials can't keep pace with an exponential. So choose c < b < 1 and convergence is still linear, just not quite as rapid. So J^k converges linearly to a diagonal matrix with first entry 1 and the remaining entries 0, and the rate of convergence is for practical purposes like O(c^k). The convergence of M^k and hence r_k = r_0 M^k is therefore shown. However we want to avoid the possibility that the limit is zero, as would be the case if our initial choice r_0 had no component in the direction of the (left) eigenvector of M corresponding to eigenvalue d_1 = 1. Fortunately a suitable initial choice that avoids this is easy to make. Recall from our earlier discussion that if r_0 is nonnegative and has sum of entries 1, then these facts remain true for all r_k (in exact arithmetic). In particular the choice r_0 = (1/n) u always converges to a nonzero limit.
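The O(c^k) rate is easy to observe numerically: successive differences satisfy r_{k+1} - r_k = (r_k - r_{k-1})(cA), because the differences sum to zero and so the rank-one term of M drops out; since A is stochastic, each step shrinks the 1-norm by a factor of at most c. A small check, reusing the invented 3-page example from the earlier sketch (illustrative only; numpy assumed):

    import numpy as np

    A = np.array([[0.0, 0.5, 0.5],
                  [0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0]])
    n, c = A.shape[0], 0.85

    r = np.full(n, 1.0 / n)
    prev = None
    for k in range(25):
        r_next = (1.0 - c) / n + c * (r @ A)
        diff = np.abs(r_next - r).sum()   # || r_{k+1} - r_k || in the 1-norm
        if prev:
            print(k, diff / prev)         # ratios stay at or below c = 0.85
        prev, r = diff, r_next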
Finally the solution r we get from the convergence of {r_k} is unique, because: r M = r and (from Claim 2) the eigenvalue d_1 = 1 is simple. Subject to the constraint of nonnegative entries that sum to 1, there can be only one such eigenvector. (By simplicity of the eigenvalue, any two such eigenvectors are scalar multiples of one another, and requiring both sets of entries to sum to 1 forces that multiple to be 1.)

Claims about M
==============

Now we can turn to establishing the Claims made earlier about the spectrum of M, as its set of eigenvalues (counting multiplicity) is often called.

Claim 1
-------

As already argued elsewhere, the 1-norm (sum of absolute values of a vector's entries) can be used to show the modulus (absolute value) of any eigenvalue of M is at most 1. Throughout we use norm ||.|| to mean the 1-norm. Each row of M, being nonnegative and with entries summing to 1, has a 1-norm equal to 1. By the definition of matrix multiplication, given a row x = [x1 x2 ... xn]: xM = x1*(1st row of M) + ... + xn*(nth row of M) = x1 M(1,-) + x2 M(2,-) + ... + xn M(n,-) Now appeal to the triangle inequality for norms: ||xM|| =< |x1|*||M(1,-)|| + ... + |xn|*||M(n,-)|| = |x1| + |x2| + ... + |xn| = ||x|| If d is any eigenvalue of M, and nonzero x an eigenvector corresponding to it: ||xM|| =< ||x|| |d|*||x|| =< ||x|| |d| =< 1 The last step follows from the prior inequality by dividing by ||x|| > 0. Thus 1 is an upper bound on the absolute values of the eigenvalues of M, and we already know from M u' = u' that this upper bound is attained. (See Comments below for the equivalence of "left" and "right" eigenvalues.)

Claim 2
-------

For the result that 1 is a simple eigenvalue of M, we appeal to a classical result of Oskar Perron: Theorem (Perron, 1907) The eigenvalue of largest absolute value of a positive (square) matrix A is both simple and positive and belongs to a positive eigenvector. All other eigenvalues are smaller in absolute value. Many proofs of this are known, including two by Perron. We will present an elementary proof of a stronger version, the Perron-Frobenius theorem, in the next section. For now we simply reference: [Some links to Proofs of the Perron-Frobenius theorem] http://www.csudh.edu/math/gjennings/perron.html By inspection of the definition of M: M = (1-c)(1/n)u'u + cA we see that each entry is at least (1-c)/n. So M is positive provided c < 1.

Claim 3
-------

The Perron theorem already encompasses the weak form of Claim 3, that other eigenvalues are strictly smaller than 1 in absolute value. The stronger version of Claim 3 identifies parameter c as an upper bound for the remaining eigenvalues. For this, see: [The Second Eigenvalue of the Google Matrix - Haveliwala and Kamvar] http://www.stanford.edu/~taherh/papers/secondeigenvalue.pdf "Abstract. We determine analytically the modulus of the second eigenvalue for the web hyperlink matrix used by Google for computing PageRank. Specifically, we prove the following statement: "For any matrix A = [cP + (1-c)E]' where P is an n×n row-stochastic matrix, E is a nonnegative n×n rank-one row-stochastic matrix, and 0 =< c =< 1, the second eigenvalue of A has modulus less than or equal to c. Furthermore, if P has at least two irreducible closed subsets, the second eigenvalue is equal to c.
"This statement has implications for the convergence rate of the standard PageRank algorithm as the web scales, for the stability of PageRank to perturbations to the link structure of the web, for the detection of Google spammers, and for the design of algorithms to speed up PageRank." We have a bit more to say about this paper in the final section on acceleration techniques. (to be continued...)``` Request for Answer Clarification by dvd03-ga on 09 Aug 2004 09:33 PDT ```Dear Mathtalk, Very much enjoying your solution. Thank you. Just a quick question with respect to the 1/n scalar you introduced, how would things have been affected had it not been introduced? More generally, supposing we know that v_(n+1) = v*A + r converges for some row vector r, then can we infer that v_(n+1) = v+A + c*r converges also, where c is some positive scalar? So, could we infer from your proof that Page's algorithm, as it was actually given, converges also? Also, if we can, how do the stationary distributions (limit vectors) compare with one another? Is the second simply a scalar multiple of the first, or...? Many thanks``` Clarification of Answer by mathtalk-ga on 09 Aug 2004 11:47 PDT ```Without the 1/n factor, the "fixed point" iteration: r := (1-c)u + crA is likely to converge provided 0 < c < 1. The converged solution would satisfy: r(I - cA) = (1-c)u r = (1-c)u(I + cA + (cA)^2 + ... ) The latter matrix series converges because the spectral radius of cA is c (see your other Question), and this is less than 1. It gives an "explicit" form of the inverse of (I - cA). However this is _not_ the same as solving the "equilibrium" problem: r = r[(1-c)(1/n)u'u + cA] r(I - cA) = (1-c)(1/n)ru'u unless ru' = n. But: ru' = (1-c)u(I + cA + (cA)^2 + ...)u' = (1-c)u(1-c)^-1 u' = 1 Scaling the solution r doesn't affect whether it solves the equilibrium condition (unless you scale it to zero!), so in general these conditions are not compatible. regards, mathtalk-ga``` Clarification of Answer by mathtalk-ga on 09 Aug 2004 12:06 PDT ```Hmmm, wait a minute. That last calculation shows ru' = n, not 1. So it looks like the conditions are compatible after all. Let me double-check, I have a feeling of confusion, not conviction about this... -- mathtalk-ga``` Request for Answer Clarification by dvd03-ga on 14 Aug 2004 02:01 PDT ```Dear Mathtalk, You write: The latter matrix series converges because the spectral radius of cA is c (see your other Question), and this is less than 1. Which "other Question" did you mean? I've subsequently asked a question about this very fact (http://answers.google.com/answers/threadview?id=385860), but I don't believe I've been answered yet.``` Clarification of Answer by mathtalk-ga on 14 Aug 2004 10:52 PDT ```Yes, I've not forgotten that one, but actually the "other" Question I meant was: http://answers.google.com/answers/threadview?id=384263 I'm close to polishing up the proof of Perron-Frobenius (just need to prove two lemmas), and I wanted to finish up my Answer here before starting the other. However the easy part of the Answer is that the eigenvalues of cA are just c times the eigenvalues of A (for any square matrix), so the spectral radius (maximum absolute value of an eigenvalue) of cA is |c| times the spectral radius of A. For justification that the "geometric" series I + cA + (cA)^2 + ... converges, we need to appeal to something like the argument used in this Question to estimate the growth of powers of cA. 
As you may recall, we showed using the Jordan canonical form that the powers of cA tend to zero geometrically (which confusingly is called linear convergence) with any constant b slightly greater than c, e.g. like ((c + 1)/2)^k if we should need to be definite. Provided only that the series converges "absolutely" (as the geometric rate implies), then we have a matrix series identity involving an inverse: (I - cA)^(-1) = I + cA + (cA)^2 + ... Naturally the inverse shown could exist for many values of c not restricted to those less than 1, and in fact this is the idea behind studying the spectrum of A through a complex function called the resolvent (with poles at the eigenvalues of A). regards, mathtalk-ga``` Clarification of Answer by mathtalk-ga on 15 Aug 2004 08:52 PDT ```This is an experiment in writing with a Unicode text editor, which allows us the use of Greek letters, some superscripts, and a few other familiar mathematical symbols.

Perron-Frobenius Theorem
========================

The proof below is based on one given by James M. Ortega in Thm. 6.1.4 of this book: [Matrix Theory: A Second Course] http://www.riskbook.com/titles/ortega_j_(1987).htm It's elementary in the sense of not introducing any theory of complex functions or fixed-point results requiring a topological foundation, but does depend on such introductory principles of real analysis as bounded sequences in finite dimensional vector spaces having convergent subsequences. For the sake of consistency with our earlier analysis, the proof has been adapted to use the 1-norm ||.|| as well as row-vector-on-the-left multiplication. The changes are not substantial however. Recall that u = [1 1 … 1] denotes the vector of 1's. We'll also use a slightly nonstandard notation that for row vector x, |x| means the entry-by-entry absolute values (giving again a row vector of the same length). Thus |u| = u, for example. Previously we've defined when a probability transition matrix is irreducible. The generalization of this to nonnegative square matrices is slight but worthy of careful statement. Defn. Nonnegative n×n matrix A is reducible iff there exist 1 ≤ i,j ≤ n s.t. for all a > 0, the row i, column j entry of A to the power a is zero: Aª(i,j) = 0 Otherwise (if for all i,j there exists power a > 0 such that Aª(i,j) > 0), we say A is irreducible. Note that in general the required power a depends on the indices i,j. For a probability transition matrix, Aª(i,j) > 0 means there is a positive probability of going from state i to state j in exactly a steps, presuming a ≥ 0 is a whole number. Note that if some row (resp. some column) of A is zero, that same row (resp. same column) of any power Aª is again zero. Hence such matrices are reducible. The proof makes repeated use of (one half of) the following: Lemma 1: Let A be an n×n nonnegative matrix, n > 1. Then A is irreducible if and only if (I + A)ⁿ⁻¹ > 0. Proof: We will only use the forward direction, so naturally let's start with proving that. Expanding (I + A)ⁿ⁻¹ as a binomial (because I commutes with A), we see that the terms all involve nonnegative matrices. The leading term I takes care of all the diagonal terms being positive, so we only need to worry about off-diagonal entries. Let i ≠ j give a row and column respectively of (I + A)ⁿ⁻¹. From the definition of irreducible we know that Aª(i,j) > 0 for some a > 0. If we could show this with a < n, we'd be done as this would give a positive contribution to the same entry in (I + A)ⁿ⁻¹.
The expansion of Aª(i,j) gives a sum of all products: A(k(0),k(1))*A(k(1),k(2))*...*A(k(a-1),k(a)) such that k(0) = i and k(a) = j. First remove any diagonal terms A(k,k) from the product, thereby shortening the "path" from i to j by some possible amount. More generally remove any subproduct that "revisits" a previous index, e.g. A(k,k')A(k',k")...A(_,k) wherever it may occur in the product. Since i ≠ j we don't remove the entire product. Now we have a product in which all the indices are distinct, and i ≠ j implies there are at most n-1 factors (since a factors introduce a+1 indices). So we've shown that for i ≠ j it is possible to find Aª(i,j) > 0 with 0 < a < n. That fills in the off-diagonal entries of (I + A)ⁿ⁻¹ with positive values as outlined above. For the sake of completeness let's also prove the reverse implication. As we've just analyzed, a positive entry in (I + A)ⁿ⁻¹ at off-diagonal position (i,j) means that: Aª(i,j) > 0 for some 0 < a < n. It remains to argue that Aª(i,i) > 0 for some a > 0. But since n > 1, we can pick j different from i. Combine a power of A that has (i,j) entry positive with one that has (j,i) entry positive, and the product (or sum of the two powers) gives diagonal (i,i) entry positive. QED Remark: I added the qualification n > 1 partly to avoid doubt over the trivial case A = [0]. Arguably this is irreducible in the sense that A⁰ = [1], but we did not include such sense in our definition. Despite this it's easy to verify that the conclusions of the Perron-Frobenius theorem apply in the 1×1 case to nonnegative A, and we'll not belabor the point. Also we will see from examples later that Aª(i,i) > 0 may require a = n, which relates to our filling of off-diagonal entries. We will also require at one point the following fact: Lemma 2: Let b(i), i = 1,…,n be complex numbers such that: | Σ b(i) | = Σ |b(i)| (sums over i = 1,…,n) Then there exists a unit complex number c with |c| = 1 s.t. b(i) = c*|b(i)| for all i = 1,…,n. Proof: This says equality holds in the triangle inequality only if all the complex numbers have the same direction c. The converse is true but not stated here, and easy to show by direct calculation. Our proof is by induction. To shorten the proof a bit, let's assume without loss of generality that all the summands b(i) are nonzero, since for any zero summand the choice 0 = c*|0| doesn't matter. The case n = 1 is simple. We choose c = b(1)/|b(1)|. The case n = 2 contains all the hard work. Suppose: | b(1) + b(2) | = |b(1)| + |b(2)| As we assume neither of b(1) or b(2) is zero, divide through by |b(2)| and label z = b(1)/b(2): | z + 1 | = |z| + 1 Square both sides and use the fact that multiplying a complex number by its conjugate gives the absolute value squared: (z̄ + 1)(z + 1) = z̄z + 2|z| + 1 After clearing common terms from both sides we have: z + z̄ = 2|z| Re(z) = |z| which implies Im(z) = 0, else |z| > |Re(z)|. Therefore z = |z| and we can choose: c = b(1)/|b(1)| = z b(2)/|z b(2)| = b(2)/|b(2)| Now the induction step. Assume the proposition has been shown for some value n ≥ 2 and consider the situation when new term b(n+1) is introduced. By the triangle inequality: | b(n+1) + Σ b(i) | ≤ |b(n+1)| + | Σ b(i) | ≤ |b(n+1)| + Σ |b(i)| But our hypothesis for n+1 is that the first and last of these expressions are equal, which forces equality in the middle: |b(n+1)| + | Σ b(i) | = |b(n+1)| + Σ |b(i)| | Σ b(i) | = Σ |b(i)| (sums over i = 1,…,n)
So the induction hypothesis applies to the first n terms, and there exists a complex number c such that: |c| = 1 and b(i) = c|b(i)| for all i = 1,…,n Write p = Σ |b(i)| (sum over i = 1,…,n), and also: cp = Σ c|b(i)| = Σ b(i) Therefore | b(n+1) + cp | = |b(n+1)| + |cp|, and from the case n = 2 we deduce that: c = cp/|cp| = b(n+1)/|b(n+1)| Now we're done with the induction step. QED Theorem (Perron-Frobenius) Let A be an irreducible n×n nonnegative matrix. Then the largest absolute value of an eigenvalue of A is itself a simple eigenvalue of A with an associated positive eigenvector. Also, if any row (resp. any column) of A is positive, then all other eigenvalues of A are strictly smaller in absolute value than the largest one. Proof: Let S = { λ > 0 : for some nonzero x ≥ 0, xA ≥ λx }. Since A is irreducible, no column of A is zero. Thus uA > 0, and by taking λ to be the smallest entry of uA, we have: uA ≥ λu Thus λ belongs to S, so S is nonempty. Moreover for any λ in S, there exists nonzero x ≥ 0 such that: xA ≥ λx Consequently λ ||x|| ≤ ||xA|| ≤ ||x|| * C where: C = max { ||A(i,-)|| : 1 ≤ i ≤ n } is the operator norm on multiplication by A induced by the 1-norm on row vectors x. This proves that set S is bounded. Therefore S has a least upper bound λº. Let {λ_k} be a sequence of values in S converging to λº with corresponding nonzero vectors {x_k} normalized so that ||x_k|| = 1. Then a subsequence of {x_k} exists which converges to vector xº with ||xº|| = 1. Since: x_k A ≥ λ_k x_k it follows that: xº A ≥ λº xº We claim that equality actually holds, for the proof of which we make our first use of Lemma 1. Consider y = xº A - λº xº = xº (A - λºI) ≥ 0. If y = 0, we are done with our claim. Otherwise, if y ≠ 0, then y (I + A)ⁿ⁻¹ > 0 because any nonzero entry in y corresponds to a positive row in (I + A)ⁿ⁻¹. If we set w = xº (I + A)ⁿ⁻¹, then w > 0 (again because a nonzero entry in xº meets a positive row in (I + A)ⁿ⁻¹). Thus: wA - λºw = w (A - λºI) = xº (I + A)ⁿ⁻¹ (A - λºI) = xº (A - λºI) (I + A)ⁿ⁻¹ = y (I + A)ⁿ⁻¹ > 0 where use has been made of how A and (I + A)ⁿ⁻¹ commute. This quickly leads to a contradiction of λº being the least upper bound of S. Let ε > 0 be the minimum ratio of respective entries in wA - λºw to those in w > 0, so that: wA - λºw ≥ εw wA ≥ (λº + ε)w By this contradiction the claim that y = 0, and thus xº A = λº xº, is established. Now ||xº|| = 1, so clearly λº is an eigenvalue of A of maximum absolute value. Also, xº must have at least one positive entry, and because the corresponding row in (I + A)ⁿ⁻¹ is positive: xº (I + A)ⁿ⁻¹ > 0 I.e. (1 + λº)ⁿ⁻¹ xº > 0, and thus xº > 0 is strictly positive. The simplicity of the eigenvalue is a bit tedious, but bear with me. If the geometric multiplicity of λº were more than one, then we'd have an eigenvector z corresponding to λº linearly independent of xº. Since λº is real (and A is real), we may take z real. By linear independence all combinations xº + αz are eigenvectors of A corresponding to λº for any real α. Perturbing xº with αz, using positive or negative scalar α as necessary, we could obtain xº + αz ≥ 0 with at least one entry equal to zero. But the above proof that xº > 0 applies equally to xº + αz, and we have a contradiction. This establishes that the geometric multiplicity of λº is 1, but we must also rule out any algebraic multiplicity greater than 1, i.e. that λº appears in a Jordan block of size m > 1.
If the latter were the case, then there'd be a vector v such that: v (A - λºI)² = 0, v (A - λºI) ≠ 0 and again the fact A and λº are real allows us to "solve" for v to be real as well. Since v (A - λºI) is an eigenvector associated with λº, it must by what has been shown so far be a nonzero multiple of xº. Conversely we could write, using the reciprocal β of that scalar: xº = βv (A - λºI) > 0 But this implies: (βv)A > λº(βv) and if we take |βv| to be the entry-by-entry absolute value of βv, then by the triangle inequality applied column-wise to A: |βv|A > λº|βv| which gives us a contradiction to maximality of λº in S. So eigenvalue λº has both algebraic and geometric multiplicity 1, i.e. a simple eigenvalue. It remains to show that no other eigenvalue has modulus (absolute value) equal to λº. For this we need to assume more than just the irreducibility of A, as can be illustrated by taking A to be any cyclic permutation matrix, so that Aⁿ = I gives the minimal polynomial. All the eigenvalues (characteristic roots) fall on the unit circle, so λº = 1 is not strictly greater in absolute value than the other eigenvalues. For our purpose we'll analyze the case that A has at least one column of all positive entries. The dual case of A with a row of all positive entries then follows as previously noted in our Comments by considering the equal "spectrum" (eigenvalues) of transpose A'. [See Remark below for how it is usually phrased in connection with Markov chains, which gives slightly greater generality.] Suppose for the sake of contradiction a different eigenvalue λ with |λ| = λº. Let nonzero x be an associated eigenvector: xA = λx Now λº|x| = |λx| ≤ |x| A by the triangle inequality once more. Repeat the argument leading from xº A ≥ λº xº to xº A = λº xº but applied to |x|, and we have that λº|x| = |x|A and so |x| is a multiple of xº, say |x| = σxº where σ > 0. We further claim that |xA| = |x| A, since by the above: |xA| = |λx| = λº|x| = |x|A Given that say the k'th column of A is positive, this means: | Σ x(i)*a(i,k) | = Σ |x(i)|*a(i,k) where the indicated sums are over i = 1,…,n. This equality implies by Lemma 2 there exists complex c with |c| = 1 such that: x(i)*a(i,k) = c*|x(i)|*a(i,k) Then because each a(i,k) > 0, x = c|x| = cσxº. Since x is a multiple of xº, we have shown eigenvalue λ = λº. QED Remark: In the context of Markov chains, one often asks that the probability transition matrix A is both irreducible and _not_ "periodic". Periodic means that for the chance of going from state i back to state i to be positive, the number of steps needed is always divisible by some integer t > 1. This condition is closely related to the positive row or column condition that we've used in our statement of the Perron-Frobenius theorem. It can be shown that if A is irreducible and not periodic, then Aª > 0 for some power a of A. Hence our assumption of a positive row or column is then applicable to that power of A, with the result that all conclusions remain valid. Likewise one can show inductively that if any row (or column) of an irreducible nonnegative n×n matrix A is positive, then Aⁿ > 0. Ortega gives this as Exercise 6.1.10, calling nonnegative matrix A "primitive" if Aª > 0 for some power a of A. The difficulty we see if A is a cyclic permutation matrix is precisely that there is no common power a for which all the entries of Aª > 0, and indeed we reach Aⁿ = I only to start over. In these terms we could state the half of Lemma 1 we used repeatedly as: A irreducible implies (I+A) primitive.
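These conclusions are easy to check numerically on a small random positive matrix — a sanity check only, not part of the proof (numpy is assumed, and the matrix is invented for the example):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((5, 5)) + 0.1            # strictly positive, hence irreducible

    vals, vecs = np.linalg.eig(A.T)         # left eigenvectors of A
    i = int(np.argmax(np.abs(vals)))
    lam = vals[i]
    x = vecs[:, i].real
    x = x / x.sum()                         # fix the sign and the scale

    print(abs(lam.imag) < 1e-12)                          # Perron root is real
    print(np.all(np.abs(np.delete(vals, i)) < abs(lam)))  # strictly dominant
    print(np.all(x > 0))                                  # positive eigenvector
    print(np.abs(x @ A - lam.real * x).max() < 1e-12)     # x A = λº x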
Acceleration methods
====================

We already had occasion to mention this paper: [The Second Eigenvalue of the Google Matrix] http://www.stanford.edu/~taherh/papers/secondeigenvalue.pdf in which Taher H. Haveliwala and Sepandar D. Kamvar give an explicit bound |λ| ≤ c, where c is the parameter appearing in the PageRank matrix definition, on the "second" (and smaller) eigenvalue(s) of A. Knowing a good estimate for the linear rate of convergence of the power method permits Aitken extrapolation. Authors of the above paper are joined by two more Stanford Univ. colleagues in exploring variations on this algorithm here: [Extrapolation Methods for Accelerating PageRank Computations] http://www2003.org/cdrom/papers/refereed/p270/kamvar-270-xhtml/ A different approach to accelerating convergence has been proposed by: [Updating PageRank with Iterative Aggregation] (by Amy N. Langville and Carl D. Meyer) http://www.www2004.org/proceedings/docs/2p392.pdf "ABSTRACT We present an algorithm for updating the PageRank vector [1]. Due to the scale of the web, Google only updates its famous PageRank vector on a monthly basis. However, the Web changes much more frequently. Drastically speeding the PageRank computation can lead to fresher, more accurate rankings of the webpages retrieved by search engines. It can also make the goal of real-time personalized rankings within reach. On two small subsets of the web, our algorithm updates PageRank using just 25% and 14%, respectively, of the time required by the original PageRank algorithm. Our algorithm uses iterative aggregation techniques [7, 8] to focus on the slow-converging states of the Markov chain. The most exciting feature of this algorithm is that it can be joined with other PageRank acceleration methods, such as the dangling node lumpability algorithm [6], quadratic extrapolation [4], and adaptive PageRank [3], to realize even greater speedups (potentially a factor of 60 or more speedup when all algorithms are combined)." Some analysis of the convergence properties of the IAD method is provided here: [Convergence Analysis of an Improved Pagerank Algorithm] (by Ilse C.F. Ipsen and Steve Kirkland) http://www4.ncsu.edu/~ipsen/ps/tr04-02.pdf "ABSTRACT The iterative aggregation/disaggregation (IAD) method is an improvement of the PageRank algorithm used by the search engine Google to compute stationary probabilities of very large Markov chains. "In this paper the convergence, in exact arithmetic, of the IAD method is analyzed. The IAD method is expressed as the power method preconditioned by an incomplete LU factorization. This leads to a simple derivation of the asymptotic convergence rate of the IAD method. "It is shown that the power method applied to the Google matrix always converges, and that the convergence rate of the IAD method is at least as good as that of the power method. Furthermore, by exploiting the hyperlink structure of the web it can be shown that the convergence rate of the IAD method applied to the Google matrix can be made strictly faster than that of the power method." regards, mathtalk-ga```
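To make the Aitken extrapolation mentioned in the Answer concrete, here is an illustrative sketch of a componentwise Aitken delta-squared step — a simplified stand-in for the quadratic extrapolation of Kamvar et al., not the algorithm of the cited papers; numpy and the small graph are assumptions of the example. Aitken presumes each entry converges geometrically with one dominant ratio, so its benefit varies (e.g. with complex subdominant eigenvalues), and it is safest used as a restart for further power iterations.

```
import numpy as np

def aitken_step(x0, x1, x2):
    # Componentwise Aitken delta-squared: from three successive power
    # iterates, estimate the limit of each geometrically converging entry.
    d1, d2 = x1 - x0, x2 - x1
    denom = d2 - d1
    out = x2.copy()
    ok = np.abs(denom) > 1e-15          # skip entries that already converged
    out[ok] = x2[ok] - d2[ok] ** 2 / denom[ok]
    return out / out.sum()              # re-normalize so entries sum to 1

A = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
n, c = 3, 0.85

step = lambda r: (1 - c) / n + c * (r @ A)      # one power iteration
r0 = np.full(n, 1.0 / n)
r1 = step(r0)
r2 = step(r1)
r_extrap = aitken_step(r0, r1, r2)
print(r_extrap)             # extrapolated estimate of the equilibrium vector
print(step(r_extrap))       # continue the power iteration from the estimate
```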
 ```One version of the Page & Brin paper here: [Anatomy of a Large-Scale Hypertextual Web Search Engine] http://www-db.stanford.edu/~backrub/google.html gives this definition of PageRank: "We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) "Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one." The parameter d here is equivalent to the parameter c used in the formulation that dvd03-ga asks about. There is clearly a factor missing from the term (1-d) because if we consider the limiting case d = 0, then PR(A) = 1 for all pages. This contradicts the claim about having a probability distribution, ie. the sum of all PageRanks being 1. I think from the context it's clear that a factor 1/N (where N is the total number of all pages) was intended: PR(A) = (1-d)/N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) Then PR(A) = 1/N in the limiting case d = 0, which is a probability distribution. Taking 1-d > 0 assures (among other things) that every page has a positive probability of being reached from every other page, actually in a single step. This in turn implies that the Markov "chain" which includes the (1-d)/N perturbations is "irreducible", and in turn contributes to the uniqueness of its equilibrium vector (whose components are PR(A) or r_n, in the notation used above by dvd03-ga). [Markov Chains in Discrete Time] http://www.staff.city.ac.uk/r.j.gerrard/courses/2dsm/dsm03_4.htm "A Markov chain is called irreducible if, starting from any one of the states, it is possible to get to any other state (not necessarily in one jump)." regards, mathtalk-ga```
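The missing factor can also be checked numerically — an illustrative sketch, with numpy and the 3-page link matrix invented for the check: the fixed point of r = (1-d)u + d*rA has entries summing to N (matching the ru' = n observation in the Answer clarifications above), while the corrected (1-d)/N term yields a sum of 1.

```
import numpy as np

A = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
N, d = 3, 0.85

# Fixed point of r = (1-d)u + d rA, i.e. r (I - dA) = (1-d)u
r_bare = (1 - d) * np.ones(N) @ np.linalg.inv(np.eye(N) - d * A)
print(r_bare.sum())    # = N (here 3.0), not 1

# Fixed point of the corrected r = (1-d)/N u + d rA
r_fixed = (1 - d) / N * np.ones(N) @ np.linalg.inv(np.eye(N) - d * A)
print(r_fixed.sum())   # = 1.0, a probability distribution
```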
```Some preliminary discussion of terminology and notation: An n by n matrix M is a Markov transition matrix iff:

a) 0 <= M(i,j) <= 1 for each entry (row i, column j)
b) 1 = SUM M(i,j) OVER j = 1,..,n for each row i

Notice that the definition is not "symmetric" with respect to the roles of rows and columns; (b) asks only that row sums are unity. So there is some opportunity for mischief here in getting the definition backward, and in some sense the literature on Markov chain models often does seem "left-handed" in comparison to the usual approach for solving systems of simultaneous linear equations (where unknowns are usually a column in the matrix formulation). We identify the entries of M as conditional probabilities of transition among n "states". That is: M(i,j) = Pr( going to state j, given state i ) This interpretation makes clear the "meaning" of (b), ie. given state i, the transitions to states j are mutually exclusive and exhaustive possibilities. It also shows that we cannot expect M to be a symmetric matrix in most cases. The name for these discrete models is taken from Russian mathematician A.A. Markov, who developed their theory roughly one hundred years ago, long before computers would make application of them practical for large sets of states. Today we consider such an application with one state for every page on the Web! A Markov chain model consists of an initial "state", which can actually be a distribution of probabilities on all n states, expressed as a row vector: u_0 = [ a1, a2, . . . , an ] together with a transition matrix M. We then define by induction: u_{i+1} = u_i M for i = 0,1,2,... Thus u_k = u_0 M^k, and one often thinks of these as discrete time steps, eg. as the history of a system evolving under a "stationary" action (so called because M is constant with respect to time). This iterative multiplication of u_0 by M is precisely the power method for finding a dominant eigenvalue and eigenvector of M, albeit with the multiplication being done from the right and forming rows rather than columns as is typically chosen in matrix linear algebra. Some "math facts" about eigenvalues are in order here. Usually we define an eigenvalue for matrix M by requiring a non-zero column vector v' (I'll use ' to denote transpose) such that: M v' = c v' where c is a scalar. The eigenvalue c of M is also then a root of the characteristic equation for M: det( xI - M ) = 0 for unknown x, and because the determinant is a scalar function preserved by transposing its argument, M and M' have the same characteristic equation and the same eigenvalues. Stated another way: there exists non-zero vector u such that u M = c u <==> there exists non-zero vector v such that M v' = c v' But while "left" and "right" eigenvalues are the same for general (square) matrices, it is not so with "left" and "right" eigenvectors: [Left Eigenvector - mathworld] http://mathworld.wolfram.com/LeftEigenvector.html As it happens, we already know a right eigenvector of M for c = 1, namely by (b): M v' = v' when v = [ 1, 1, . . . , 1 ] So we can be sure that for Markov transition matrices there is a vector u such that: u M = u Such a vector is called an equilibrium in this context, if the entries are nonnegative and scaled so that they sum to 1 (so that they form a discrete probability distribution on the states). The interpretation is that such a "blend" of states is stable with respect to the Markov chain's history. One can of course "solve" the homogeneous problem u (M - I) = 0 to find such vectors directly.
But for large matrices it is attractive to try iterative approximations like the power method that are less "expensive". regards, mathtalk-ga```
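For small n the direct approach mentioned above — solving u (M - I) = 0 together with the normalization sum(u) = 1 — can be sketched as follows (illustrative only; numpy is assumed, and at Web scale one would use sparse iterative methods instead):

```
import numpy as np

def equilibrium_direct(M):
    # Solve u M = u with sum(u) = 1 as one nonsingular linear system:
    # transpose to (M' - I) u' = 0, then replace one redundant equation
    # (the rows of M' - I sum to the zero vector, since M u' = u')
    # by the normalization sum(u) = 1.
    n = M.shape[0]
    B = M.T - np.eye(n)
    B[-1, :] = 1.0
    rhs = np.zeros(n)
    rhs[-1] = 1.0
    return np.linalg.solve(B, rhs)

# Blend a small link matrix with the uniform jump, as in the Answer above.
A = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
n, c = 3, 0.85
M = (1 - c) / n * np.ones((n, n)) + c * A

u = equilibrium_direct(M)
print(u, np.abs(u @ M - u).max())   # equilibrium vector; residual ~ 0
```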