A math formula condenses a mathematician's abstract reasoning, and typesetting its symbols cleanly in Microsoft .docx files, or in PDF files converted from them, takes some effort. Unfortunately, Microsoft's equation format is not standard LaTeX, so LLMs usually do not recognize it.
The GitHub repository https://github.com/VikParuchuri/marker hosts an open-source tool called “Marker” that converts PDF files into LLM-ready Markdown (“.md”) files, including their math equations. In addition to installing torch, you need to run:
pip install marker-pdf
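The following command then converts a PDF file into a Markdown file. Note that --max_pages limits how many pages are processed, so omit it to convert the whole document: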
marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --max_pages 10 --langs English
On a CPU-only machine (no GPU or VRAM), the shorter command below is all you need, although the conversion takes longer:
marker_single /path/to/file.pdf /path/to/output/folder
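To convert a whole folder of PDFs, the two commands above can be wrapped in a small script. The following is a minimal sketch, not part of Marker itself: the helper names `build_marker_command` and `convert_folder` are hypothetical, and the optional flags are copied verbatim from the GPU command above, so they may need adjusting for other Marker versions.

```python
import subprocess
from pathlib import Path


def build_marker_command(pdf_path, output_dir, use_gpu_options=False):
    """Build the marker_single argument list for one PDF.

    The optional flags are copied from the GPU command shown above;
    they are assumptions that may not hold for every Marker version.
    """
    command = ["marker_single", str(pdf_path), str(output_dir)]
    if use_gpu_options:
        command += ["--batch_multiplier", "2", "--max_pages", "10", "--langs", "English"]
    return command


def convert_folder(pdf_dir, output_dir, use_gpu_options=False, dry_run=True):
    """Build one command per PDF in pdf_dir; run them unless dry_run is True."""
    commands = [
        build_marker_command(pdf, output_dir, use_gpu_options)
        for pdf in sorted(Path(pdf_dir).glob("*.pdf"))
    ]
    if not dry_run:
        for command in commands:
            # Requires marker-pdf to be installed and on the PATH.
            subprocess.run(command, check=True)
    return commands
```

Calling `convert_folder("/path/to/pdfs", "/path/to/output/folder")` with the default `dry_run=True` only returns the commands, which makes it easy to inspect them before starting a long conversion.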
Llama Parser and Marker each have strengths and weaknesses when converting PDF files, especially those containing complex mathematical equations, into an LLM-ready format. The comparison below focuses on that use case:
2.2.1 Llama Parser:
Strengths:
Accuracy in Text Extraction: Llama Parser is designed for extracting text and metadata from documents with a high degree of accuracy, making it suitable for processing complex documents like academic papers or textbooks.
Mathematical Equations: It has better support for handling LaTeX and other mathematical notations within PDFs, which can be crucial for preserving the integrity of mathematical content.
Customization: Llama Parser offers more options for customization, allowing you to fine-tune how it processes different types of content within a PDF.
Weaknesses:
Complexity: The tool might be more complex to set up and use, especially if you need to customize it for specific types of documents.
Speed: It might be slower in processing large documents due to its focus on accuracy and detail.
2.2.2 Marker:
Strengths:
User-Friendly: Marker is generally easier to use and more straightforward, making it accessible for quick conversions without much setup.
Speed: Marker tends to be faster in processing documents, which can be beneficial if you’re dealing with a large volume of PDFs.
General Content: It handles general text content well, making it a good option for simpler documents.
Weaknesses:
Mathematical Equations: Marker might struggle with accurately parsing complex mathematical equations or converting them into a format that is easily readable by an LLM. Equations could be misinterpreted or lose their formatting.
Less Control: Marker provides fewer customization options, so it might not handle specific or complex content types as well as Llama Parser.
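Whichever tool is used, one rough way to spot-check whether equations survived conversion is to count the math spans in the resulting Markdown and compare the totals with the source document. The sketch below assumes the output marks inline math with `$...$` and display math with `$$...$$`; the function name `count_math_spans` is hypothetical, and ordinary dollar signs (such as prices) would be miscounted as math.

```python
import re

# Display math is matched first ($$...$$, possibly spanning several lines).
DISPLAY_MATH = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
# Inline math is a single-dollar pair on one line, not adjacent to another "$".
INLINE_MATH = re.compile(r"(?<!\$)\$(?!\$)(.+?)(?<!\$)\$(?!\$)")


def count_math_spans(markdown_text):
    """Count display and inline math spans in a Markdown string."""
    display = DISPLAY_MATH.findall(markdown_text)
    # Remove display math so its delimiters are not counted as inline math.
    remainder = DISPLAY_MATH.sub(" ", markdown_text)
    inline = INLINE_MATH.findall(remainder)
    return {"display": len(display), "inline": len(inline)}
```

A count far below the number of equations in the original PDF suggests that formulas were dropped or flattened into plain text during conversion.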
For Complex Mathematical Equations: Llama Parser is likely the better option due to its superior handling of LaTeX and mathematical notations. It can ensure that the mathematical content is accurately represented in the LLM-ready format, preserving the equations in a way that the model can process effectively.
For General Text Conversion: Marker might be sufficient if your PDFs are less complex or if speed and ease of use are higher priorities.
If you need the best possible accuracy for mathematical content, you might also consider a two-step approach:
Convert Equations to LaTeX: Use a specialized tool to extract and convert the mathematical equations into LaTeX format from the PDF.
Text Conversion: Then use Llama Parser or a similar tool to convert the rest of the document, and merge the extracted LaTeX equations into the final output.
This hybrid approach could give you the best of both worlds: accurate mathematical content and well-processed text.
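The merge step of this two-step approach could be sketched as follows. This is only an illustration under stated assumptions: the converted text is assumed to contain placeholders of the hypothetical form `[[EQ:n]]` where equations were removed, and the LaTeX strings are assumed to come from whichever equation-extraction tool is used. Neither the placeholder format nor the function name `merge_equations` comes from Llama Parser or Marker.

```python
import re

# Hypothetical placeholder format marking where equation n belongs.
PLACEHOLDER = re.compile(r"\[\[EQ:(\d+)\]\]")


def merge_equations(text, equations):
    """Replace [[EQ:n]] placeholders with display-math LaTeX.

    equations maps the integer n to a LaTeX string. Placeholders with no
    matching entry are left untouched so missing equations stay visible.
    """
    def substitute(match):
        key = int(match.group(1))
        if key not in equations:
            return match.group(0)
        return "$$" + equations[key] + "$$"

    # A function replacement avoids re.sub interpreting LaTeX backslashes.
    return PLACEHOLDER.sub(substitute, text)
```

For example, `merge_equations("Area: [[EQ:1]]", {1: r"A = \pi r^2"})` yields the text with the equation restored between `$$` delimiters.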