We also assume that the power is positive, or else we need to perform an additional division to obtain x^(-y)=1/x^y.
If x!=0 is known to a relative precision epsilon, then x^y has the relative precision epsilon*y. This means a loss of precision if Abs(y)>1 and an improvement of precision otherwise.
The algorithm is based on the following trick: if n is even, say n=2*k, then x^n=x^k^2; and if n is odd, n=2*k+1, then x^n=x*x^k^2. Thus we can reduce the calculation of x^n to the calculation of x^k with k<=n/2, using at most two long multiplications. This reduction is one step of the algorithm; at each step n is reduced to at most half. This algorithm stops when n becomes 1, which happens after m steps where m is the number of bits in n. So the total number of long multiplications is at most 2*m=(2*Ln(n))/Ln(2). More precisely, it is equal to m plus the number of nonzero bits in the binary representation of n. On the average, we shall have 3/2*Ln(n)/Ln(2) long multiplications. The computational cost of the algorithm is therefore O(M(P)*Ln(n)). This should be compared with e.g. the cost of the best method for Ln(x) which is O(P*M(P)).
The outlined procedure is most easily implemented using recursive calls. The depth of recursion is of order Ln(n) and should be manageable for most real-life applications. The Yacas code would look like this:
10# power(_x,1)<--x; 20# power(_x,n_IsEven)<-- power(x,n>>1)^2; 30# power(_x,n_IsOdd)<--x*power(x,n>>1)^2;
If we wanted to avoid recursion with its overhead, we would have to obtain the bits of the number n in reverse order. This is possible but is somewhat cumbersome unless we store the bits in an array.
It is easier to implement the non-recursive version of the squaring algorithm in a slightly different form. Suppose we obtain the bits b[i] of the number n in the usual order, so that n=b+2*b+...+b[m]*2^m. Then we can express the power x^n as
In the Yacas script form, the algorithm looks like this:
power(x_IsPositiveInteger,n_IsPositiveInteger)<-- [ Local(result, p); result:=1; p := x; While(n != 0) [ // at step k, p = x^(2^k) if (IsOdd(n)) result := result*p; p := p*p; n := n>>1; ]; result; ];
The same algorithm can be used to obtain a power of an integer modulo another integer, Mod(x^n,M), if we replace the multiplication p*p by a modular multiplication, such as p:=Mod(p*p,M). Since the remainder modulo m would be computed at each step, the results do not grow beyond M. This allows to efficiently compute even extremely large modular powers of integers.
Matrix multiplication, or, more generally, multiplication in any given ring, can be substituted into the algorithm instead of the normal multiplication. The function IntPowerNum encapsulates the computation of the n-th power of an expression using the binary squaring algorithm.
The squaring algorithm can be improved a little bit if we are willing to use recursion or to obtain the bits of n in the reverse order. (This was suggested in the exercise 4.21 in the book [von zur Gathen et al. 1999].) Let us represent the power n in base 4 instead of base 2. If q[k] are the digits of n in base 4, then we can express
We might then use the base 8 instead of 4 and obtain a further small improvement. (Using bases other than powers of 2 is less efficient.) But the small gain in speed probably does not justify the increased complexity of the algorithm.
An exceptional case is when n is a rational number with a very small numerator and denominator, for example, n=2/3. In this case it is faster to take the square of the cubic root of x. (See the section on the computation of roots below.) Then the case of negative x should be handled separately. This speedup is not implemented in Yacas.
Note that the relative precision changes when taking powers. If x is known to relative precision epsilon, i.e. x represents a real number that could be x*(1+epsilon), then x^2<=>x*(1+2*epsilon) has relative precision 2*epsilon, while Sqrt(x) has relative precision epsilon/2. So if we square a number x, we lose one significant bit of x, and when we take a square root of x, we gain one significant bit.
Note that the relative precision is improved after taking a root with n>1.
For integer N, the following steps are performed:
The intermediate results, u^2, v^2 and 2*u*v can be maintained easily too, due to the nature of the numbers involved ( v having only one bit set, and it being known which bit that is).
For floating point numbers, first the required number of decimals p after the decimal point is determined. Then the input number N is multiplied by a power of 10 until it has 2*p decimal. Then the integer square root calculation is performed, and the resulting number has p digits of precision.
Below is some Yacas script code to perform the calculation for integers.
//sqrt(1) = 1, sqrt(0) = 0 10 # BisectSqrt(0) <-- 0; 10 # BisectSqrt(1) <-- 1; 20 # BisectSqrt(N_IsPositiveInteger) <-- [ Local(l2,u,v,u2,v2,uv2,n); // Find highest set bit, l2 u := N; l2 := 0; While (u!=0) [ u:=u>>1; l2++; ]; l2--; // 1<<(l2/2) now would be a good under estimate // for the square root. 1<<(l2/2) is definitely // set in the result. Also it is the highest // set bit. l2 := l2>>1; // initialize u and u2 (u2==u^2). u := 1 << l2; u2 := u << l2; // Now for each lower bit: While( l2 != 0 ) [ l2--; // Get that bit in v, and v2 == v^2. v := 1<<l2; v2 := v<<l2; // uv2 == 2*u*v, where 2==1<<1, and // v==1<<l2, thus 2*u*v == // (1<<1)*u*(1<<l2) == u<<(l2+1) uv2 := u<<(l2 + 1); // n = (u+v)^2 = u^2 + 2*u*v + v^2 // = u2+uv2+v2 n := u2 + uv2 + v2; // if n (possible new best estimate for // sqrt(N)^2 is smaller than N, then the // bit l2 is set in the result, and // add v to u. if( n <= N ) [ u := u+v; // u <- u+v u2 := n; // u^2 <- u^2 + 2*u*v + v^2 ]; l2--; ]; u; // return result, accumulated in u. ];
The bisection algorithm uses only additions and bit shifting operations. Suppose the integer N has P decimal digits, then it has n=P*Ln(10)/Ln(2) bits. For each bit, the number of additions is about 4. Since the cost of an addition is linear in the number of bits, the total complexity of the bisection method is roughly 4*n^2=O(P^2).
In most implementations of arbitrary-precision arithmetic, the time to perform a long division is several times that of a long multiplication. Therefore it makes sense to use a method that avoids divisions. One variant of Newton's method is to solve the equation 1/r^2=x. The solution of this equation r=1/Sqrt(x) is the limit of the iteration
As usual with Newton's method, all errors are automatically corrected, so the working precision can be gradually increased until the last iteration. The full precision of P digits is used only at the last iteration; the last-but-one iteration uses P/2 digits and so on.
An optimization trick is to combine the multiplication by x with the last iteration. Then computations can be organized in a special way to avoid the last full-precision multiplication. (This is described in [Karp et al. 1997] where the same trick is also applied to Newton's iteration for division.)
The idea is the following: let r be the P-digit approximation to 1/Sqrt(x) at the beginning of the last iteration. (In this notation, 2*P is the precision of the final result, so x is also known to about 2*P digits.) The unmodified procedure would have run as follows:
Now consider Newton's iteration for s<=>Sqrt(x),
Consider the cost of the last iteration of this combined method. First, we compute s=x*r, but since we only need P correct digits of s, we can use only P digits of x, so this costs us M(P). Then we compute s^2*x which, as before, costs M(P)+M(2*P), and then we compute r*(1-s^2*x) which costs only M(P). The total cost is therefore 3*M(P)+M(2*P), so we have traded one multiplication with 2*P digits for one multiplication with P digits. Since the time of the last iteration dominates the total computing time, this is a significant cost savings. For example, if the multiplication is quadratic, M(P)=O(P^2), then this saves about 30% of total execution time; for linear multiplication, the savings is about 16.67%.
These optimizations do not change the asymptotic complexity of the method, although they do reduce the constant in front of O().
Suppose we need to find Sqrt(x). Choose an integer n such that 1/4<x':=4^(-n)*x<=1. The value of n is easily found from bit counting: if b is the bit count of x, then
To compute Sqrt(x'), we use Newton's method with the initial value x' obtained by interpolation of the function Sqrt(x) on the interval [1/4, 1]. A suitable interpolation function might be taken as simply (2*x+1)/3 or more precisely
This may save a few iterations, at the small expense of evaluating the interpolation function once at the beginning. However, in computing with high precision the initial iterations are very fast and this argument reduction does not give a significant speed gain. But the gain may be important at low precisions, and this technique is sometimes used in microprocessors.
Since we only need the integer part of the root, it is enough to use integer division in the Halley iteration. The sequence x[k] will monotonically approximate the number n^(1/s) from below if we start from an initial guess that is less than the exact value. (We start from below so that we have to deal with smaller integers rather than with larger integers.) If n=p^s, then after enough iterations the floating-point value of x[k] would be slightly less than p; our value is the integer part of x[k]. Therefore, at each step we check whether 1+x[k] is a solution of x^s=n, in which case we are done; and we also check whether (1+x[k])^s>n, in which case the integer part of the root is x[k]. To speed up the Halley iteration in the worst case when s^s>n, it is combined with bisection. The root bracket interval x1<x<x2 is maintained and the next iteration x[k+1] is assigned to the midpoint of the interval if Halley's formula does not give sufficiently rapid convergence. The initial root bracket interval can be taken as x, 2*x.
If s is very large ( s^s>n), the convergence of both Newton's and Halley's iterations is almost linear until the final few iterations. Therefore it is faster to evaluate the floating-point power for large b using the exponential and the logarithm.
The trick of combining the last iteration with the final multiplication by x can be also used with all higher-order schemes.
Consider the cost of one iteration of n-th order. Let the initial precision of r be P; then the final precision is k*P and we use up to n*P digits of x. First we compute y:=1-r^2*x to P*(n-1) digits, this costs M(P) for r^2 and then M(P*n) for r^2*x. The value of y is of order 10^(-P) and it has P*(n-1) digits, so we only need to use that many digits to multiply it by r, and r*y now costs us M(P*(n-1)). To compute y^k (here 2<=k<=n-1), we need M(P*(n-k)) digits of y; since we need all consecutive powers of y, it is best to compute the powers one after another, lowering the precision on the way. The cost of computing r*y^k*y after having computed r*y^k is therefore M(P*(n-k-1)). The total cost of the iteration comes to
From the general considerations in the previous chapter (see the section on Newton's method) it follows that the optimal order is n=2 and that higher-order schemes are slower in this case.
Newton's method (2) is best for all other cases: large precision and/or roots other than square roots.
Logarithms of complex numbers can be reduced to elementary functions of real numbers, for example:
The basic algorithm consists of (integer-) dividing x by b repeatedly until x becomes 0 and counting the necessary number of divisions. If x has P digits and b and P are small numbers, then division is linear in P and the total number of divisions is O(P). Therefore this algorithm costs O(P^2) operations.
A speed-up for large x is achieved by first comparing x with b, then with b^2, b^4, etc., without performing any divisions. We perform n such steps until the factor b^2^n is larger than x. At this point, x is divided by the previous power of b and the remaining value is iteratively compared with and divided by successively smaller powers of b. The number of squarings needed to compute b^2^n is logarithmic in P. However, the last few of these multiplications are long multiplications with numbers of length P/4, P/2, P digits. These multiplications take the time O(M(P)). Then we need to perform another long division and a series of progressively shorter divisions. The total cost is still O(M(P)). For large P, the cost of multiplication M(P) is less than O(P^2) and therefore this method is preferable.
There is one special case, the binary (base 2) logarithm. Since the internal representation of floating-point numbers is usually in binary, the integer part of the binary logarithm can be usually implemented as a constant-time operation.
The logarithm satisfies Ln(1/x)= -Ln(x). Therefore we need to consider only x>1, or alternatively, only 0<x<1.
Note that the relative precision for x translates into absolute precision for Ln(x). This is because Ln(x*(1+epsilon))<=>Ln(x)+epsilon for small epsilon. Therefore, the relative precision of the result is at best epsilon/Ln(x). So to obtain P decimal digits of Ln(x), we need to know P-Ln(Abs(Ln(x)))/Ln(10) digits of x. This is better than the relative precision of x if x>e but worse if x<=>1.
If x>1, then we can compute -Ln(1/x) instead of Ln(x). However, the series converges very slowly if x is close to 0 or to 2.
Here is an estimate of the necessary number of terms to achieve a (relative) precision of P decimal digits when computing Ln(1+x) for small real x. Suppose that x is of order 10^(-N), where N>=1. The error after keeping n terms is not greater than the first discarded term, x^(n+1)/(n+1). The magnitude of the sum is approximately x, so the relative error is x^n/(n+1) and this should be smaller than 10^(-P). We obtain a sufficient condition n>P/N.
All calculations need to be performed with P digits of precision. The "rectangular" scheme for evaluating n terms of the Taylor series needs about 2*Sqrt(n) long multiplications. Therefore the cost of this calculation is 2*Sqrt(P/N)*M(P).
When P is very large (so that a fast multiplication can be used) and x is a small rational number, then the binary splitting technique can be used to compute the Taylor series. In this case the cost is O(M(P)*Ln(P)).
Note that we need to know P+N digits of 1+x to be able to extract P digits of Ln(1+x). The N extra digits will be lost when we subtract 1 from 1+x.
One way is to take several square roots, reducing x to x^2^(-k) until x becomes close to 1. Then we can compute Ln(x^2^(-k)) using the Taylor series and use the identity Ln(x)=2^k*Ln(x^2^(-k)).
The number of times to take the square root can be chosen to minimize the total computational cost. Each square root operation takes the time equivalent to a fixed number c of long multiplications. (According to the estimate of [Brent 1975], c<=>13/2.) Suppose x is initially of order 10^L where L>0. Then we can take the square root k times and reduce x to about 1.33. Here we can take k<=>Ln(L)/Ln(2)+3. After that, we can take the square root k times and reduce x to 1+10^(-N) with N>=1. For this we need k<=>1+N*Ln(10)/Ln(2) square roots. The cost of all square roots is c*(k+k) long multiplications. Now we can use the Taylor series and obtain Ln(x^2^(-k-k)) in 2*Sqrt(P/N) multiplications. We can choose N to minimize the total cost for a given L.
The initial value for x can be found by bit counting on the number a. If m is the "bit count" of a, i.e. m is an integer such that 1/2<=a*2^(-m)<1, then the first approximation to Ln(a) is m*Ln(2). (Here we can use a very rough approximation to Ln(2), for example, 2/3.)
The initial value found in this fashion will be correct to about one bit. The number of digits triples at each Halley iteration, so the result will have about 3*k correct bits after k iterations (this disregards round-off error). Therefore the required number of iterations for P decimal digits is 1/Ln(3)*Ln(P*Ln(2)/Ln(10)).
This method is currently faster than other methods (with internal math) and so it is implemented in the routine Internal'LnNum.
This method can be generalized to higher orders. Let y:=1-a*Exp(-x), where x is a good approximation to Ln(a) so y is small. Then Ln(a)=x+Ln(1-y) and we can expand in y to obtain
The optimal number of terms to take depends on the speed of the implementation of Exp(x).
The required number of AGM iterations is approximately 2*Ln(P)/Ln(2). For smaller values of x (but x>1), one can either raise x to a large integer power r and then compute 1/r*Ln(x^r) (this is quick only if x is itself an integer or a rational), or multiply x by a large integer power of 2 and compute Ln(2^s*x)-s*Ln(2) (this is better for floating-point x). Here the required powers are
If x<1, then (-Ln(1/x)) is computed.
Finally, there is a special case when x is very close to 1, where the Taylor series converges quickly but the AGM algorithm requires to multiply x by a large power of 2 and then subtract two almost equal numbers, leading to a great waste of precision. Suppose 1<x<1+10^(-M), where M is large (say of order P). The Taylor series for Ln(1+epsilon) needs about N= -P*Ln(10)/Ln(epsilon)=P/M terms. If we evaluate the Taylor series using the rectangular scheme, we need 2*Sqrt(N) multiplications and Sqrt(N) units of storage. On the other hand, the main slow operation for the AGM sequence is the geometric mean Sqrt(a*b). If Sqrt(a*b) takes an equivalent of c multiplications (Brent's estimate is c=13/2 but it may be more in practice), then the AGM sequence requires 2*c*Ln(P)/Ln(2) multiplications. Therefore the Taylor series method is more efficient for
For larger x>1+10^(-M), the AGM method is more efficient. It is necessary to increase the working precision to P+M*Ln(2)/Ln(10) but this does not decrease the asymptotic speed of the algorithm. To compute Ln(x) with P digits of precision for any x, only O(Ln(P)) long multiplications are required.
The simplest version is this: for integer m, we have the identity Ln(x)=m+Ln(x*e^(-m)). Assuming that e:=Exp(1) is precomputed, we can find the smallest integer m for which x<=e^m by computing the integer powers of e and comparing with x. (If x is large, we do not really have to go through all integer m: instead we can estimate m by bit counting on x and start from e^m.) Once we found m, we can use the Taylor series on 1-delta:=x*e^(-m) since we have found the smallest possible m, so 0<=delta<1-1/e.
A refinement of this method requires to precompute b=Exp(2^(-k)) for some fixed integer k>=1. (This can be done efficiently using the squaring trick for the exponentials.) First we find the smallest power m of b which is above x. To do this, we compute successive powers of b and find the first integer m such that x<=b^m=Exp(m*2^(-k)). When we find such m, we define 1-delta:=x*b^(-m) and then delta will be small, because 0<delta<1-1/b<=>2^(-k) (the latter approximation is good if k is large). We compute Ln(1-delta) using the Taylor series and finally find Ln(x)=m*2^k+Ln(1-delta).
For smaller delta, the Taylor series of Ln(1-delta) is more efficient. Therefore, we have a trade-off between having to perform more multiplications to find m, and having a faster convergence of the Taylor series.
This series converges for all z such that Re(a+z)>0 if a>0. The convergence rate is, however, the same as for the original Taylor series. In other words, it converges slowly unless z/(2*a+z) is small. The parameter a can be chosen to optimize the convergence; however, Ln(a) should be either precomputed or easily computable for this method to be efficient.
For instance, if x>1, we can choose a=2^k for an integer k>=1, such that 2^(k-1)<=x<2^k=a. (In other words, k is the bit count of x.) In that case, we represent x=a-z and we find that the expansion parameter z/(2*a-z)<1/3. So a certain rate of convergence is guaranteed, and it is enough to take a fixed number of terms, about P*Ln(10)/Ln(3), to obtain P decimal digits of Ln(x) for any x. (We should also precompute Ln(2) for this scheme to work.)
If 0<x<1, we can compute -Ln(1/x).
This method works robustly but is slower than the Taylor series with some kind of argument reduction. With the "rectangular" method of summation, the total cost is O(Sqrt(P)*M(P)).
The method shall compute Ln(1+x) for real x such that Abs(x)<1/2. For other x, some sort of argument reduction needs to be applied. (So this method is a replacement for the Taylor series that is asymptotically faster at very high precision.)
The main idea is to use the property
More formally, we can write the method as a loop over k, starting with k=1 and stopping when 2^(-k)<10^(-P) is below the required precision. At the beginning of the loop we have y=0, z=x, k=1 and Abs(z)<1/2. The loop invariants are (1+z)*Exp(y) which is always equal to the original number 1+x, and the condition Abs(z)<2^(-k). If we construct this loop, then it is clear that at the end of the loop 1+z will become 1 to required precision and therefore y will be equal to Ln(1+x).
The body of the loop consists of the following steps:
The total number of steps in the loop is at most Ln(P*Ln(10)/Ln(2))/Ln(2). Each step requires O(M(P)*Ln(P)) operations because the exponential Exp(-f) is taken at a rational arguments f and can be computed using the binary splitting technique. (Toward the end of the loop, the number of significant digits of f grows, but the number of digits we need to obtain is decreased. At the last iteration, f contains about half of the digits of x but computing Exp(-f) requires only one term of the Taylor series.) Therefore the total cost is O(M(P)*Ln(P)^2).
Essentially the same method can be used to evaluate a complex logarithm, Ln(a+I*b). It is slower but the asymptotic cost is the same.
This method does not seem to provide a computational advantage compared with the other methods.
First, we need to divide x by a certain power of 2 to reduce x to y in the interval 1<=y<2. We can use the bit count m=BitCount(x) to find an integer m such that 1/2<=x*2^(-m)<1 and take y=x*2^(1-m). Then Ln(x)/Ln(2)=Ln(y)/Ln(2)+m-1.
Now we shall find the bits in the binary representation of Ln(y)/Ln(2), one by one. Given a real y such that 1<=y<2, the value Ln(y)/Ln(2) is between 0 and 1. Now,
The process is finished either when the required number of bits of Ln(y)/Ln(2) is found, or when the precision of the argument is exhausted, whichever occurs first. Note that each iteration requires a long multiplication (squaring) of a number, and each squaring loses 1 bit of relative precision, so after k iterations the number of precise bits of y would be P-k. Therefore we cannot have more iterations than P (the number of precise bits in the original value of x). The total cost is O(P*M(P)).
The squaring at each iteration needs to be performed not with all digits, but with the number of precise digits left in the current value of y. This does not reduce the asymptotic complexity; it remains O(P*M(P)).
Comparing this method with the Taylor series, we find that the only advantage of this method is simplicity. The Taylor series requires about P terms, with one long multiplication and one short division per term, while the bisection method does not need any short divisions. However, the rectangular method of Taylor summation cuts the time down to O(Sqrt(P)) long multiplications, at a cost of some storage and bookkeeping overhead. Therefore, the bisection method may give an advantage only at very low precisions. (This is why it is sometimes used in microprocessors.) The similar method for the exponential function requires a square root at every iteration and is never competitive with the Taylor series.
The exponential function satisfies Exp(-x)=1/Exp(x). Therefore we need to consider only x>0.
Note that the absolute precision for x translates into relative precision for Exp(x). This is because Exp(x+epsilon)<=>Exp(x)*(1+epsilon) for small epsilon. Therefore, to obtain P decimal digits of Exp(x) we need to know x with absolute precision of at least 10^(-P), that is, we need to know P+Ln(Abs(x))/Ln(10) digits of x. Thus, the relative precision becomes worse after taking the exponential if x>1 but improves if x is very small.
If x is sufficiently small, e.g. Abs(x)<10^(-M) and M>Ln(P)/Ln(10), then it is enough to take about P/M terms in the Taylor series. If x is of order 1, one needs about P*Ln(10)/Ln(P) terms.
If x=p/q is a small rational number, and if a fast multiplication is available, then the binary splitting technique should be used to evaluate the Taylor series. The computational cost of that is O(M(P*Ln(P))*Ln(P)).
A modification of the squaring reduction allows to significantly reduce the round-off error [Brent 1978]. Instead of Exp(x)=Exp(x/2)^2, we use the identity
Newton's method gives the iteration
A cubically convergent formula is obtained if we replace Ln(x)=a by an equivalent equation
This cubically convergent iteration seems to follow from a good equivalent equation that we guessed. But it turns out that it can be generalized to higher orders. Let y:=a-Ln(x) where x is an approximation to Exp(a); if it is a good approximation, then y is small. Then Exp(a)=x*Exp(y). Expanding in y, we obtain
The optimal number of terms to take depends on the speed of the implementation of Ln(x).
A refinement of this method is to subtract not only the integer part of x, but also the first few binary digits. We fix an integer k>=1 and precompute b:=Exp(2^(-k)). Then we find the integer m such that 0<=x-m*2^(-k)<2^(-k). (The rational number m*2^(-k) contains the integer part of x and the first k bits of x after the binary point.) Then we compute Exp(x-m*2^(-k)) using the Taylor series and Exp(m*2^(-k))=b^m by the integer powering algorithm from the precomputed value of b.
The parameter k should be chosen to minimize the computational effort.
Take the binary decomposition of x of the following form,
The cost of this method is O(M(P*Ln(P))*Ln(P)) operations.
Essentially the same method can be used to compute the complex exponential, Exp(a+I*b). This is slower but the asymptotic cost is the same.
This method does not seem to provide a computational advantage compared with the other methods.
Efficient iterative algorithms for computing pi with arbitrary precision have been recently developed by Brent, Salamin, Borwein and others. However, limitations of the current multiple-precision implementation in Yacas (compiled with the "internal" math option) make these advanced algorithms run slower because they require many more arbitrary-precision multiplications at each iteration.
The file examples/pi.ys implements several different algorithms that duplicate the functionality of Internal'Pi(). See [Gourdon et al. 2001] for more details of computations of pi and generalizations of Newton-Raphson iteration.
Since pi is a solution of Sin(x)=0, one may start sufficiently close, e.g. at x0=3.14159265 and iterate x'=x-Tan(x). In fact it is faster to iterate x'=x+Sin(x) which solves a different equation for pi. PiMethod0() is the straightforward implementation of the latter iteration. A significant speed improvement is achieved by doing calculations at each iteration only with the precision of the root that we expect to get from that iteration. Any imprecision introduced by round-off will be automatically corrected at the next iteration.
If at some iteration x=pi+epsilon for small epsilon, then from the Taylor expansion of Sin(x) it follows that the value x' at the next iteration will differ from pi by O(epsilon^3). Therefore, the number of correct digits triples at each iteration. If we know the number of correct digits of pi in the initial approximation, we can decide in advance how many iterations to compute and what precision to use at each iteration.
The final speed-up in PiMethod0() is to avoid computing at unnecessarily high precision. This may happen if, for example, we need to evaluate 200 digits of pi starting with 20 correct digits. After 2 iterations we would be calculating with 180 digits; the next iteration would have given us 540 digits but we only need 200, so the third iteration would be wasteful. This can be avoided by first computing pi to just over 1/3 of the required precision, i.e. to 67 digits, and then executing the last iteration at full 200 digits. There is still a wasteful step when we would go from 60 digits to 67, but much less time would be wasted than in the calculation with 200 digits of precision.
Newton's method is based on approximating the function f(x) by a straight line. One can achieve better approximation and therefore faster convergence to the root if one approximates the function with a polynomial curve of higher order. The routine PiMethod1() uses the iteration
Both PiMethod0() and PiMethod1() require a computation of Sin(x) at every iteration. An industrial-strength arbitrary precision library such as gmp can multiply numbers much faster than it can evaluate a trigonometric function. Therefore, it would be good to have a method which does not require trigonometrics. PiMethod2() is a simple attempt to remedy the problem. It computes the Taylor series for ArcTan(x),
While(Not enough precision) [
At each iteration, the variable pi will have twice as many correct digits as it had at the previous iteration.
To obtain the value of Pi with P decimal digits, one needs to take
If this series is evaluated using Horner's scheme (the routine PiChudnovsky), then about Ln(n)/Ln(10) extra digits are needed to compensate for round-off error while adding n terms. This method does not require any long multiplications and costs O(P^2) operations.
A potentially much faster way to evaluate this series at high precision is by using the binary splitting technique. This would give the asymptotic cost O(M(P*Ln(P))*Ln(P)).
Tangent is computed by dividing Sin(x)/Cos(x) or from Sin(x) using the identity
For Cos(x), the bisection identity can be used more efficiently if it is written as
For Sin(x), the trisection identity is
The optimal number of bisections or trisections should be estimated to reduce the total computational cost. The resulting number will depend on the magnitude of the argument x, on the required precision P, and on the speed of the available multiplication M(P).
Alternatively, ArcSin(x) may be found from the Taylor series and inverted to obtain Sin(x).
This method seems to be of marginal value since efficient direct methods for Cos(x), Sin(x) are available.
By the identity ArcCos(x):=Pi/2-ArcSin(x), the inverse cosine is reduced to the inverse sine. Newton's method for ArcSin(x) consists of solving the equation Sin(y)=x for y. Implementation is similar to the calculation of pi in PiMethod0().
For x close to 1, Newton's method for ArcSin(x) converges very slowly. An identity
Inverse tangent can also be related to inverse sine by
Alternatively, the Taylor series can be used for the inverse sine:
An everywhere convergent continued fraction can be used for the tangent:
Hyperbolic and inverse hyperbolic functions are reduced to exponentials and logarithms: Cosh(x)=1/2*(Exp(x)+Exp(-x)), Sinh(x)=1/2*(Exp(x)-Exp(-x)), Tanh(x)=Sinh(x)/Cosh(x),
The convergence of the continued fraction expansion of ArcTan(x) is indeed better than convergence of the Taylor series. Namely, the Taylor series converges only for Abs(x)<1 while the continued fraction converges for all x. However, the speed of its convergence is not uniform in x; the larger the value of x, the slower the convergence. The necessary number of terms of the continued fraction is in any case proportional to the required number of digits of precision, but the constant of proportionality depends on x.
This can be understood by the following argument. The difference between two partial continued fractions that differ only by one extra last term can be estimated as
If we compare the rate of convergence of the continued fraction for ArcTan(x) with the Taylor series
The "double factorial" n!! :=n*(n-2)*(n-4)*... is also useful for some calculations. For convenience, one defines 0! :=1, 0!! :=1, and (-1)!! :=1; with these definitions, the recurrence relations
There are two tasks related to the factorial: the exact integer calculation and an approximate calculation to some floating-point precision. Factorial of n has approximately n*Ln(n)/Ln(10) decimal digits, so an exact calculation is practical only for relatively small n. In the current implementation, exact factorials for n>65535 are not computed but print an error message advising the user to avoid exact computations of large factorials. For example, Internal'LnGammaNum(n+1) is able to compute Ln(n!) for very large n to any desired floating-point precision.
A second method uses a binary tree arrangement of the numbers 1, 2, ..., n similar to the recursive sorting routine ("merge-sort"). If we denote by a *** b the "partial factorial" product a*(a+1)*...(b-1)*b, then the tree-factorial algorithm consists of replacing n! by 1***n and recursively evaluating (1***m)*((m+1)***n) for some integer m near n/2. The partial factorials of nearby numbers such as m***(m+2) are evaluated explicitly. The binary tree algorithm requires one multiplication of P/2 digit integers at the last step, two P/4 digit multiplications at the last-but-one step and so on. There are O(Ln(n)) total steps of the recursion. If the cost of multiplication is M(P)=P^(1+a)*Ln(P)^b, then one can show that the total cost of the binary tree algorithm is O(M(P)) if a>0 and O(M(P)*Ln(n)) if a=0 (which is the best asymptotic multiplication algorithm).
Therefore, the tree method wins over the simple method if the cost of multiplication is lower than quadratic.
The tree method can also be used to compute "double factorials" ( n!!). This is faster than to use the identities
Binomial coefficients Bin(n,m) are found by first selecting the smaller of m, n-m and using the identity Bin(n,m)=Bin(n,n-m). Then a partial factorial is used to compute
In principle, one could choose any (non-negative) weight function rho(x) and any interval [a, b] and construct the corresponding family of orthogonal polynomials q _n(x). For example, take q _0=1, then take q _1=x+c with unknown c and find such c that q _0 and q _1 satisfy the orthogonality condition; this requires solving a linear equation. Then we can similarly find two unknown coefficients of q _2 and so on. (This is called the Gramm-Schmidt orthogonalization procedure.)
But of course not all weight functions rho(x) and not all intervals [a, b] are equally interesting. There are several "classical" families of orthogonal polynomials that have been of use to theoretical and applied science. The "classical" polynomials are always solutions of a simple second-order differential equation and are always a specific case of some hypergeometric function.
The construction of "classical" polynomials can be described by the following scheme. The function rho(x) must satisfy the so-called Pearson's equation,
If the function rho(x) and the interval [a, b] are chosen in this way, then the corresponding orthogonal polynomials q _n(x) are solutions of the differential equation
Finally, there is a formula for the generating function of the polynomials,
The classical families of (normalized) orthogonal polynomials are obtained in this framework with the following definitions:
The Rodrigues formula or the generating function are not efficient ways to calculate the polynomials. A better way is to use linear recurrence relations connecting q[n+1] with q[n] and q[n-1]. (These recurrence relations can also be written out in full generality through alpha(x) and beta(x) but we shall save the space.)
There are three computational tasks related to orthogonal polynomials:
In the next section we shall give some formulae that allow to calculate particular polynomials more efficiently.
There is a way to implement this method without recursion. The idea is to build the sequence of numbers n, n, ... that are needed to compute OrthoT(n,x).
For example, to compute OrthoT(19,x) using the second recurrence relation, we need OrthoT(10,x) and OrthoT(9,x). We can write this chain symbolically as 19<>c(9,10). For OrthoT(10,x) we need only OrthoT(5,x). This we can write as 10<>c(5). Similarly we find: 9<>c(4,5). Therefore, we can find both OrthoT(9,x) and OrthoT(10,x) if we know OrthoT(4,x) and OrthoT(5,x). Eventually we find the following chain of pairs:
There are about 2*Ln(n)/Ln(2) elements in the chain that leads to the number n. We can generate this chain in a straightforward way by examining the bits in the binary representation of n. Therefore, we find that this method requires no storage and time logarithmic in n. A recursive routine would also take logarithmic time but require logarithmic storage space.
Note that using these recurrence relations we do not obtain any individual coefficients of the Chebyshev polynomials. This method does not seem very useful for symbolic calculations (with symbolic x), because the resulting expressions are rather complicated combinations of nested products. It is difficult to expand such an expression into powers of x or manipulate it in any other way, except compute a numerical value. However, these fast recurrences are numerically unstable, so numerical values need to be evaluated with extended working precision. Currently this method is not used in Yacas, despite its speed.
An alternative method for very large n would be to use the identities
Coefficients for Legendre, Hermite, Laguerre, Chebyshev polynomials can be obtained by explicit formulae. This is faster than using recurrences if we need the entire polynomial symbolically, but still slower than the recurrences for numerical calculations.
In all formulae for the coefficients, there is no need to compute factorials every time: the next coefficient can be obtained from the previous one by a few short multiplications and divisions. Therefore this computation costs O(n^2) short operations.
Suppose a family of functions q _n(x), n=0, 1, ... satisfies known recurrence relations of the form
The procedure goes as follows [Luke 1975]. First, for convenience, we define q[-1]:=0 and the coefficient A _1(x) so that q=A*q. This allows us to use the above recurrence relation formally also at n=1. Then, we take the array of coefficients f[n] and define a backward recurrence relation
The book [Luke 1975] warns that the recurrence relation for X[n] is not always numerically stable.
Note that in the book there seems to be some confusion as to how the coefficient A is defined. (It is not defined explicitly there.) Our final formula differs from the formula in [Luke 1975] for this reason.
The Clenshaw-Smith procedure is analogous to the Horner scheme of calculating polynomials. This procedure can also be generalized for linear recurrence relations having more than two terms. The functions q _0(x), q _1(x), A _n(x), and B _n(x) do not actually have to be polynomials for this to work.