as-cle-bert
posted an update 5 days ago
🚀 New demo alert 🚀

Convert (almost) everything to PDF with PdfItDown, now on Spaces! 👉 as-cle-bert/pdfitdown

You can also install it locally:

python3 -m pip install pdfitdown


Don't forget to star it on GitHub, if you find it useful! 👉 https://www.github.com/AstraBert/PdfItDown

I left a few comments on GitHub as gnusupport. My question is: does it extract text the way a human would read it, or only the way it is stored digitally in the file?

In particular, here is a book that I would like to convert to text for learning purposes:
https://www.startyourowngoldmine.com/files/books/sampling/Sampling-Series-No-1-3.pdf

I know it is already a PDF, but I do not know how to extract the text to make a new, clean PDF.

With pdftotext I get this kind of nonsense, which is no good:


4

Arizona State Bureau of Mines

31dMS HUM MDVS Nl lDd 5/ 3b>± SIHL

Kb

s

1

I<0

1

I know that pdfitdown is for making PDFs, not for extracting from them, but maybe you know a way to extract the text from a PDF conceptually?

Hi!

I generally use LangChain + PyPDF. Here is a code snippet:

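from langchain_community.document_loaders import PyPDFLoader

def preprocess(pdf: str) -> list[str]:
    """
    Uses LangChain's PyPDFLoader to extract the text of each page.
    """
    loader = PyPDFLoader(pdf)  # requires the pypdf package to be installed
    documents = loader.load()  # one Document per page
    pages = [doc.page_content for doc in documents]
    for text in pages:
        print(text)
    return pages  # return the page texts so the annotated list is actually produced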

This should give a more solid result :)
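
If the goal is a clean text file to rebuild a PDF from, here is a minimal sketch of one way to do it with pypdf directly. It assumes the PDF carries a usable text layer. The helper names `clean_page_text` and `pdf_to_text_file` are made up for illustration, and the cleanup rules are just a starting point, not a definitive recipe:

```python
import re
from pathlib import Path


def clean_page_text(text: str) -> str:
    """Tidy raw extracted text (hypothetical helper, simple heuristics only)."""
    # Rejoin words split across a line break: "sam-\npling" -> "sampling".
    # Caveat: this also merges genuinely hyphenated words broken at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs, then trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    # Drop empty lines left over from the page layout.
    return "\n".join(line for line in lines if line)


def pdf_to_text_file(pdf_path: str, out_path: str) -> int:
    """Extract every page with pypdf, clean it, and write one text file.

    Returns the number of pages processed.
    """
    from pypdf import PdfReader  # third-party: python3 -m pip install pypdf

    reader = PdfReader(pdf_path)
    pages = [clean_page_text(page.extract_text() or "") for page in reader.pages]
    # Separate pages with a blank line in the output file.
    Path(out_path).write_text("\n\n".join(pages), encoding="utf-8")
    return len(pages)
```

Something like `pdf_to_text_file("Sampling-Series-No-1-3.pdf", "book.txt")` would then give a single text file that you can review by hand and convert back to a PDF. If the text layer of the file is itself garbled, this will return the same nonsense as pdftotext, since both only read what is stored in the PDF.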

PS: LangChain is distributed under an MIT license, see their GitHub (https://github.com/langchain-ai/langchain)