Extract tabular data from PDF and sort

I have a PDF file that lists the marks for a specific exam.

I'm particularly interested in the first list, but unfortunately it has 2,112 entries, and they are not formatted consistently. I need to sort all these records by the sum of the last two columns (the Aptitude and Computer scores) to see what my rank is.

I tried copying it into MS Word and Excel, but if you try this you will see it doesn't help. After pasting it into a plain text file, I tried to format it with regexes (in Notepad++) and wrote C code to separate each field with "\t" (so that I could later paste the fields correctly into an Excel sheet), but I failed because of inconsistencies: some records span several lines, and the names do not have a fixed number of fields.

Can anyone suggest a way to copy the first list in the PDF into a spreadsheet so that the table looks exactly like the one in the original file?



3 answers






I was once tasked with creating a parser that extracted data from PDFs containing tabular and non-tabular data in a number of different encodings, with mixed RTL and LTR text. That project took a lot of effort, but for a simple English table you could probably get a parser working quickly. If you are desperate, look up the PDF specification on adobe.com and start digging.

You will also need to use pdftk.exe to decompress the file.



A link that might help: http://www.adobe.com/devnet/pdf/pdf_reference.html

This is the link I had in mind: http://www.codeproject.com/KB/cs/PDFToText.aspx



Well, I succeeded. I first copied the list into a text file and removed all the letters from it, leaving only the serial numbers and the corresponding marks, separated by spaces or tabs. Then, using "import" in an OpenOffice spreadsheet, I specified that the separators are spaces and tabs (merging consecutive separators where necessary), and bingo! I got my rank.
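For anyone who would rather script this than use a spreadsheet, here is a rough Python sketch of the same idea. It is not what I actually ran. It assumes every record contains exactly three whole numbers (serial number, Aptitude, Computer) and that the names contain no digits; `parse_records`, `rank_by_total`, `fields_per_record` and the sample rows are all made up for illustration.

```python
import re


def parse_records(text, fields_per_record=3):
    """Group every number in the pasted text into fixed-size records.

    Assumption: each record has exactly `fields_per_record` integers.
    Reading the text as one stream of numbers drops the names and also
    rejoins records that were split across several lines.
    """
    numbers = [int(token) for token in re.findall(r"\d+", text)]
    if len(numbers) % fields_per_record != 0:
        raise ValueError("numbers do not divide evenly into records")
    return [
        tuple(numbers[i:i + fields_per_record])
        for i in range(0, len(numbers), fields_per_record)
    ]


def rank_by_total(records):
    """Sort records by the sum of their last two columns, highest first."""
    return sorted(records, key=lambda rec: rec[-2] + rec[-1], reverse=True)


# Invented sample rows; the second record is split across two lines.
sample = """1 ANIL KUMAR SHARMA 41 37
2 PRIYA
DAS 45 40
3 R VENKATESH 39 44
"""

ranked = rank_by_total(parse_records(sample))
for position, rec in enumerate(ranked, start=1):
    # Tab-separated output can be pasted straight into a spreadsheet.
    print(position, rec[0], rec[-2] + rec[-1], sep="\t")
```

If the real list has more numeric columns, or decimal marks, the regex and `fields_per_record` would need adjusting, and a single missing number would shift every record after it, so the length check is only a coarse safeguard.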

But I would still like to know whether the whole table can be copied as it is, so I am keeping this question open.







