@as-cle-bert on Hugging Face: "🚀𝐍𝐞𝐰 𝐝𝐞𝐦𝐨 𝐚𝐥𝐞𝐫𝐭🚀 Convert (almost) everything to PDF with…"

as-cle-bert

posted an update Jan 20

🚀𝐍𝐞𝐰 𝐝𝐞𝐦𝐨 𝐚𝐥𝐞𝐫𝐭🚀

Convert (almost) everything to PDF with 𝐏𝐝𝐟𝐈𝐭𝐃𝐨𝐰𝐧, now on Spaces! 👉 as-cle-bert/pdfitdown

You can also install it locally:

python3 -m pip install pdfitdown

Don't forget to star it on GitHub, if you find it useful! 👉 https://www.github.com/AstraBert/PdfItDown

JLouisBiz

Jan 20

I left a few comments on GitHub as gnusupport. My question is: does it extract text the way a human reads it, or the way it is stored digitally?

Here is one book in particular that I would like to convert to text for learning purposes:
https://www.startyourowngoldmine.com/files/books/sampling/Sampling-Series-No-1-3.pdf

I know it is already a PDF, but I do not know how to extract the text to make a new, clean PDF.

Getting this kind of nonsense from pdftotext is not good:


4

Arizona State Bureau of Mines

31dMS HUM MDVS Nl lDd 5/ 3b>± SIHL

Kb

s

1

I<0

1

I know that pdfitdown is for making PDFs, not extracting from them, but maybe you know a way to extract the text from a PDF conceptually?

as-cle-bert

Jan 20

Hi!

I generally use LangChain + PyPDF; here is a code snippet:

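from langchain_community.document_loaders import PyPDFLoader

def preprocess(pdf: str) -> list[str]:
    """
    Uses LangChain's PyPDFLoader to extract the text of each page.
    """
    loader = PyPDFLoader(pdf)  # needs the pypdf package installed
    documents = loader.load()  # one Document per page
    for doc in documents:
        print(doc.page_content)
    # Return the page texts so the declared return type is honoured
    return [doc.page_content for doc in documents]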

This should give a more solid result :)

PS: LangChain is distributed under the MIT license; see their GitHub (https://github.com/langchain-ai/langchain)
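
If the extracted text still contains stray fragments like the ones quoted above, one possible follow-up is a small clean-up pass over each page's text. The sketch below is only a heuristic under stated assumptions: the function name clean_extracted_text and both thresholds are hypothetical, not part of LangChain, PyPDF, or pdfitdown, and they would need tuning for a real document.

```python
def clean_extracted_text(text: str, min_chars: int = 3,
                         min_letter_ratio: float = 0.6) -> str:
    """Drop lines that look like extraction noise (hypothetical heuristic)."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and very short fragments such as "s" or "Kb".
        if len(stripped) < min_chars:
            continue
        # Skip lines made up mostly of digits and symbols, such as "I<0".
        letters_and_spaces = sum(ch.isalpha() or ch.isspace() for ch in stripped)
        if letters_and_spaces / len(stripped) < min_letter_ratio:
            continue
        kept.append(stripped)
    return "\n".join(kept)


if __name__ == "__main__":
    # Sample built from some of the fragments quoted earlier in the thread.
    sample = "4\n\nArizona State Bureau of Mines\n\nKb\n\ns\n\n1\n\nI<0\n\n1"
    print(clean_extracted_text(sample))
```

This only removes short or symbol-heavy lines. A garbled line that is still mostly letters may pass the filter, and real short lines (page numbers, one-word headings) would be dropped too, so treat it as a starting point rather than a fix.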
