Matthew Roop
Deanna Tenorio
English 1120-007
12/1/2023
Integer Factorization Approaches
The Integer Factoring Problem
Integer factorization is the process of decomposing an integer into the product of prime numbers. For example, the integer factorization of 15 is 5x3, where both 5 & 3 are prime numbers. The integer factorization problem is important in the study of mathematics and has been around since ancient times. It arises from the fact that factoring large numbers is computationally expensive and time consuming. (Rivest, R. L., et al.) As the number to be factored gets larger and larger, it's possible factors increase. It gets to a point where there are so many possible factors that to factor the number will take a large amount of time. Essentially, there are too many factors to check. Currently there is no known feasible efficient algorithm to solve the integer factorization problem, although there are many algorithms that have improved the computationally efficient of predicting the prime factorization of numbers.
Why is the Integer Factoring Problem Important?
The Integer Factorization Problem is important because the strength of Modern-day encryption relies on the difficulty of factoring large integers. (Rivest, R. L., et al.) If the integer factorization problem was solved it would have a major impact on the entire world. It will lead to private or sensitive data or information to be able to be easily decrypted. Privacy on the internet would be effectively gone. Any email that was ever sent will be able to be decrypted. Messages transmitted on the internet would be compromised. Transactions will be insecure, leading to major financial disruptions. It would likely also have a major impact on Physics and Computer Science. The growth of the quantum computing field is fueled because researchers believe that integer factorization can be efficiently done on quantum computers. If the Integer factorization problem was solved the growth of the field might decrease. The Integer Factorization Problem being solved might also affect another problem in theoretical computer science called the “P versus NP” Problem. (Fortnow, Lance.) It would give evidence that P may equal NP, or in other words that for every problem whose solution can be quickly checked, can also be solved quickly. For example, integer factorization’s solution can be quickly checked by simply multiplying the factors together, to find the original number. Currently solving the integer factorization is slow, if it can be shown to be efficient, then that gives evidence that P may equal NP. Many researchers believe that P ≠ NP, but it still unproven. If P = NP that would have a huge impact on the field of computer science as that opens the stage for major optimizations of all algorithms. It would prove that everything that can easily be verified can also be easily solved. Meaning that there really is no upper bound for optimization which would be a really good thing.
Algorithms
Below multiple algorithms will be examined for factoring semiprimes.
Trial Division
Factoring by trial division is the easiest to understand and is the most straight forward. To find the factors that make up a number all the prime integers from 2 up to the sqrt of the number are checked to find the factors. When a factor is found it then divides the integer to be factored and it is then stored as a factor. The integer to be factored is updated and the process is repeated until no more factors can be found. For example: The first step in factoring the composite 21 is to find the sqrt(21) which is 4.5. You then test the prime integers 2 to 4. 2 doesn’t divide 21 so you then test 3. 3 does divide 21 once, so you then store 3 as a factor then the new number to be factored becomes 7. sqrt(7) = 2.6 so you only need to check if 2 divides 7 which it doesn’t so 7 is then stores a factor of 21. The factorization of 21 is 3x7. The biggest advantage of trial division is the ease of implementation and understanding. It is a simple algorithm to understand especially for beginners. Because it is a simple algorithm it is inefficient, and the time complexity is O(sqrt(N)). Say it takes 1 second to factor a 10-digit number, for each additional digit it takes about 3 times longer. For a 30-digit integer it would take about 317 years to factor which is inefficient. Just for Comparison the number of digits in RSA-2048 is 617, making it practically impossible to factor.
Pollard’s Rho Algorithm
Pollard’s rho algorithm works by Floyd’s Cycle Detection Algorithm & the birthday paradox. (Pollard, J.M.)
Floyd’s cycle detection
Imagine two runners are running on a track. One, (the hare) moves twice as fast as the other (the tortoise). Eventually will meet up because of the difference in speeds. This is the idea behind Floyd’s cycle detection. When there is a cycle, they are guaranteed to meet inside the circle.
Birthday paradox
Imagine a room with 23 people in it. What are the chances that 2 or more people have the same birthday? 50%, seemingly high considering that there are 365 days. It has to do with the combinatorics nature of the problem. Because there is only a set number of factors to choose, the more you choose, the more accurate the prediction or likelihood that you find a factor becomes high.
How it works
Regarding Pollard’s Rho Algorithm, Floyd’s cycle detection generates a pseudo random sequence of integers which will eventually repeat. Because of the random nature and a cycle or range of number can be produced, the likely hood with just a few numbers is paradoxically high even though there are a lot of numbers to check.
Quadratic Sieve
The quadratic sieve algorithm is the second fastest classical factoring algorithm and is the fastest algorithm for numbers smaller than 100 digits. The first step of the algorithm is to collect data about the number that is stored in memory, and can be used to find a number that congruent to a square (mod N) meaning that when diving the number by N its remainder gives a number that is squared, which means that that number minus the square divides n or shares a factor with N. (Carl Pomerance) for example you get 16 congruent to 4 mod 20 you then rearrange the formula to get 16 – 4 congruent to 0 mod 20 which means that 16-4 contains a factor in common with 20. You can expand 16-4 to (4^2 – 2^2) then by differences of squares you get: (4-2)(4+2) so that means that 20 shares a 2 or 6. 2 is the factor of 20. This algorithm is a relatively fast method for solving for factors but requires a large amount of data to be stored in memory. The time complexity is O(N log N log log N)
General Number Field Sieve
This algorithm is complex and is like the Quadratic sieve method. It is the fastest known algorithm for number greater than 100 digits long. The algorithm searches for numbers that have small prime factors first which leads to its efficiency. (Arjen K.) The algorithm is quite complex & memory intensive but can reliably factor numbers in the fastest possible time.
Shor’s Algorithm
This algorithm is probably one of the most well-known algorithms, as it claims to be able to factor any semiprime in polynomial time, which is astronomically fast compared to the classical methods. (Shor, Peter W.) Sounds great, right? Well Shor’s algorithm requires the use of a large quantum computer, something that just isn’t possible right now. The algorithm works by the fact that if you take any number a and raise it to a number r eventually as you iterate through 0<=r<=n that that number is congruent to 1 (mod N). In other words, N shares a factor with a^r -1 which can be rewritten as a difference of squares (a^r/2 + 1)(a^r/2 - 1). Meaning that (a^r/2 - 1) shares a factor with n, thereby factoring the number. The quantum computer uses a special technique to solve for r, that is complex to understand, but is what makes it so special. Every heard the saying that Quantum computers will break the internet, this is why.
Introduction of Machine Learning
Machine Learning (ML) is a hot topic in Computer Science & in other sciences in general. ML is being applied to many different fields and is showing promising results in speeding up computational times of various tasks. Can ML be applied to number theory & the factoring of prime numbers? Various researchers have investigated this question, and it can indeed be done. (Murat, Et al.) One big concern though is the amount of data needed to train the model and accuracy of the model. The problem is that as more digits are added, the data requirements will grow exponentially, which is a bad thing because it will be extremely computationally expensive to make and train the model.
ML Methods
Direct Method
The idea of directly training the ML model, is to have the semiprime as a binary string as input to ML model and have the ML model output the predicted prime factorization as a binary string. This method is the method described in the paper “Integer Prime Factorization with Deep Learning” Murat et Al. It can factor semiprimes, but it doesn’t seem to be scalable enough to be useful to factor RSA semiprimes, because of the large amount of data needed to train the model.
Transfer Learning
Due to the large amount of data needed to train larger model, and how the requirements may grow exponentially, a method of reducing the model’s data requirements can be useful in making future ML model. Transfer learning seems to be a promising method to do that. The idea is to use an estimation from one ML model as an input to another ML model. Hopefully a pattern can be recognized with the first model and allow for the model to extrapolate to a more complex problem, while using less data to extrapolate. (Zhuang, Fuzhen, et al.) A possible way to implement this is to implement a semiprime factoring ML model for n digits and feed that into a ML model that predicts n + 1 digits.
Stacking Ensemble
A ML model to predict the prime composition of a semiprime doesn’t need to have the highest accuracy alone because an ensemble of ML model can be made, and it is relatively easy to check if any one of them is a factor of N. A super accurate model isn’t needed, it can be combined with many other models and only one needs to be correct. In essence, the ML model only needs to be better than random guess, or other classical algorithmic guesses if it is trying to be optimal. Because of the idea that only one needs to be correct and you can have many ml models with various degrees of accuracy and different train hyperparameters and feed those predictions into a ML governor model that based of the many models as input can use those best guesses to predict the actual semi prime. This process of combining model and having a governor model is called a Stacking ensemble and may be an effective tool to improve the accuracy of the ml models, meaning that you can get away with less data. (Anurag Bantu.)
Stacking Ensemble w/ tests
Another idea to improve the accuracy of the ML model’s final prediction is to run tests on the various ML estimates then feed the estimates and the test result to the ML model that is trained on the ML result and the test to choose the best guess. This may further increase the accuracy of the ml prediction of semi primes.
Preprocessing
Inputting binary string as input to the ML model may not be the most efficient way to input the data. It may make it harder for the ML model to grasp the number and to do math on it. A better approach may be to preprocess the input data so as to make the data into something that gives more insight into the properties of the number.
Simulating Shor’s Algorithm
Another idea to improve ML’s capabilities to predict primes number is to use ML to find r in Shor’s algorithm so that it can be solved on a classical computer rather than a quantum one. This may be computationally expensive to train the ML model because it must raise a number up to a high exponent and may not work as well as envisioned on a classical computer.
Using ML w/ Classical Simulation of Quantum Computing
Another approach along the lines of trying to make Shor’s algorithm more usable in classical computing is to use ML to speed up or increase the computational efficiency of simulating quantum computers using classical methods. This would be a project of its own and will likely not be feasible.
Training ML models to do other classical algorithms.
Another Idea is to train ML model on the other classical models of solving the prime factorization of N. This may or may not be useful to increase the computational efficiency of the ml model but maybe be useful in making more complex ML models that may unlock better and more efficient methods to factor large semiprimes.
Conclusion
In conclusion, The Integer Factoring Problem continues to be a big unsolved problem in the field of math and computer science. It has been pondered about for centuries and continues to baffle the minds of great scientists. Better approaches to factoring large semiprimes have been found but not maximally efficient to break RSA encryption, except for Shor’s Algorithm but it unfortunately relies on the use of a large quantum computer which are currently researcher dreams, and not reality. Because the fact that the problem is still wide open and doesn’t seem to be going away soon, it is important to explore other ideas and to test their feasibility, just to see how well they can perform. ML can be such an idea that may be able to better optimize the factoring of semiprimes, although it doesn’t seem that it has been extensively investigated at the current moment.
Works Cited
Anurag Bantu. “Stacking Ensemble Learning - Beginner’s Guide.” Kaggle.com,
www.kaggle.com/code/anuragbantu/stacking-ensemble-learning-beginner-s-guide. Accessed 27 Nov. 2023.
Arjen K. Lenstra and H. W. Lenstra, Jr. (eds.). "The development of the number field sieve". Lecture Notes in Math.
(1993) 1554. Springer-Verlag.
Carl Pomerance, Analysis and Comparison of Some Integer Factoring Algorithms, in Computational Methods in
Number Theory, Part I, H.W. Lenstra, Jr. and R. Tijdeman, eds., Math. Centre Tract 154, Amsterdam, 1982, pp
89-139.
Fortnow, Lance. “Fifty Years of P vs. NP and the Possibility of the Impossible.” Cacm.acm.org,
Murat, Begaiym, Shirali Kadyrov, & Ryszhan Tabarek. " Integer Prime Factorization with Deep Learning." Advances
in Interdisciplinary Sciences [Online], 2.1 (2021): 1-5. Web. 23 Nov. 2023
Pollard, J.M. A Monte Carlo method for factorization. BIT 15, 331–334 (1975). https://doi.org/10.1007/BF0193366
Rivest, R. L., et al. ‘A Method for Obtaining Digital Signatures and Public-Key Cryptosystems’. Commune. ACM,
vol. 21, no. 2, Association for Computing Machinery, Feb. 1978, pp. 120–126,
https://doi.org10.1145/359340.359342.
Shor, Peter W.. "Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a Quantum
Computer". SIAM Journal on Computing 26. 5(1997): 1484–1509.
Zhuang, Fuzhen, et al. “A Comprehensive Survey on Transfer Learning.” Proceedings of the IEEE, vol. 109, no. 1,
Jan. 2021, pp. 43–76, www.arxiv.org/pdf/1911.02685.pdf, https://doi.org/10.1109/jproc.2020.3004555.