Sunlight Labs has three "Hack-a-Thons" coming up: one at Transparency Camp, one at PyCon, and one at the Web 2.0 Expo. Heck, a few of us are going to SxSW this year and we may do one there, too. Please vote on these ideas and let us know what you think we should be working on at these hack-a-thons. A note: we're naturally averse to advocacy-type ideas, and we likely won't do them. It isn't that we don't believe in democracy, though! It's because we're a c3 organization, and we're legally limited in how much advocacy we can do.

199 Votes

PDF To Text

Submitted on 02/27 6:08 p.m. by anonymous (6 comments)

A utility to take in a document such as ARRA, posted at Recovery.gov as a 13 MB PDF, and turn it into searchable, abstractable text.

6 Comments

  1. Craig Wood 02/28 7:09 p.m.

    Got a URL for this? I've done pretty great things with the Perl modules CAM::PDF and CAM::PDF::PageText, though I've had to do a few custom tweaks.

  2. Amanda 02/28 7:18 p.m.

    The "document cloud" project means to do a pretty good job of that: http://www.niemanlab.org/2008/11/propublica-seeks-1m-to-put-everyones-documents-online/ But the monster PDF of legislation itself is here: http://www.whitehouse.gov/the_press_office/ARRA_public_review/

  3. Craig Wood 03/01 8:45 a.m.

    Strangely (or not), the link only had a 1M document. I threw it through some PDF-to-text scripts and got http://www.thefederalregister.com/ARRA-text/ 4 or 5 hours with Perl and regexes identifying the bad characters, spacing, paging and general layout, and you'd have a very readable, indexable, and searchable document. I'll take some of that Knight money if they're just giving it away :-)

  4. Ian Bicking 03/02 11:24 a.m.

    Would PDF to HTML conversion (of which there are at least several hosted services) be a good starting place? These tools already handle a lot of the work with characters, exposing some layout, etc. Then scraping tools built on extracting data from HTML (of which there are several, and more could be created) could be used.

  5. Micah L. Sifry 03/12 9:03 p.m.

    I think the guys at mySociety have got some tools that do this already.
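
The cleanup pass described in comment 3 (bad characters, spacing, paging) could look roughly like the sketch below. It is written in Python rather than the Perl mentioned in the thread, and it is only an illustration: the function name clean_extracted_text and the particular artifacts it handles (ligature characters, soft hyphens, form feeds as page breaks, bare page-number lines) are assumptions about what typical PDF-to-text output contains, not a description of anyone's actual script.

```python
import re

# Assumed artifacts: typographic ligatures that extractors often leave as
# single characters. Extend this table for whatever a given PDF produces.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def clean_extracted_text(raw):
    """Return a list of cleaned page strings from raw PDF-to-text output."""
    raw = raw.replace("\r\n", "\n")
    pages = []
    # Assumption: the extractor separates pages with form feeds.
    for page in raw.split("\f"):
        for bad, good in LIGATURES.items():
            page = page.replace(bad, good)
        # Drop soft hyphens and stray control characters (tabs/newlines kept).
        page = page.replace("\u00ad", "")
        page = re.sub(r"[\x00-\x08\x0b\x0e-\x1f]", "", page)
        # Drop lines that hold nothing but a number (likely page numbers;
        # this can also remove a legitimate numeric line).
        page = re.sub(r"(?m)^[ \t]*\d+[ \t]*$", "", page)
        # Rejoin words hyphenated across a line break. This also merges
        # genuine compounds that happen to break at the hyphen.
        page = re.sub(r"(\w)-\n(\w)", r"\1\2", page)
        # Normalize spacing: collapse runs of blanks, trim around newlines,
        # and squeeze three or more newlines down to one blank line.
        page = re.sub(r"[ \t]+", " ", page)
        page = re.sub(r" ?\n ?", "\n", page)
        page = re.sub(r"\n{3,}", "\n\n", page).strip()
        if page:
            pages.append(page)
    return pages
```

Keeping the result as a list of pages, rather than one long string, leaves room for a later indexing step to report which page a search hit came from.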
