UNIX Socket FAQ

A forum for questions and answers about network programming on Linux and all other Unix-like systems

#1 2006-09-20 09:04 AM

surya_jc
Member
Registered: 2006-07-18
Posts: 8

Re: PDF crawling

Hi,
I'm trying to build a search engine.
As a first step, I have to crawl through all the documents I have, extract the keywords, and build the index. I have done this for text documents, XLS, and HTML files, but I have not been able to do it for PDF files.

I'm not able to read or parse the PDF files.
I tried converting the PDF files to text files, but that doesn't work either.

If anybody can help resolve this issue, I'll be very thankful.
Any other suggestions are welcome too.

Surya

#2 2006-09-20 11:54 AM

mlampkin
Administrator
From: Sol 3
Registered: 2002-06-12
Posts: 911

Re: PDF crawling

The only one I've ever used was the icePDF product, but that's a commercial product.

You might want to try this one:

http://www.pdfbox.org/

It appears to handle both reading PDFs and converting them to text.
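
For what it's worth, here is a rough sketch of the convert-then-index approach described in the question. It does not use PDFBox; it assumes the separate `pdftotext` command-line tool (shipped with Xpdf/Poppler) is installed and on the PATH. The function names `pdf_to_text`, `tokenize`, and `build_index` are made up for illustration, and the tokenizer is deliberately naive.

```python
import re
import subprocess
from collections import defaultdict


def pdf_to_text(path):
    """Extract text from a PDF by shelling out to pdftotext (assumed installed)."""
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", path, "-"],
        capture_output=True,
        check=True,  # raise CalledProcessError if the conversion fails
    )
    return result.stdout.decode("utf-8", errors="replace")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index: keyword -> sorted list of document names.

    `docs` maps a document name to its already-extracted plain text.
    """
    index = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return {word: sorted(names) for word, names in index.items()}
```

Once a PDF is reduced to text it can go through the same indexing step as the other file types, for example `build_index({"manual.pdf": pdf_to_text("manual.pdf")})`, where `manual.pdf` is a placeholder file name. Scanned PDFs that contain only images will come out empty and would need OCR instead.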

So you haven't finished that project yet? Come on, it's been months... ;-)


Michael


"The only difference between me and a madman is that I'm not mad."

Salvador Dali (1904-1989)

#3 2007-05-11 02:16 PM

PTPT
Guest

Re: PDF crawling

Will it be just as good as Google?

#4 2009-12-17 07:40 AM

Naguellegange
Guest

Re: PDF crawling

Hey there,

We're currently using the PDF Control to convert from RTF to PDF, which works fine.

I'm trying to add the ability to convert from HTML/MHT to PDF. Will both of these use the same conversion?

I've looked all through the help files and the forums but cannot find a way to actually do this with any form of output. Can anyone help?

Thanks
