Sunlight Labs has three "Hack-a-Thons" coming up: one at Transparency Camp, one at PyCon, and one at Web 2.0 Expo. Heck, a few of us are going to SxSW this year and we may do one there, too. Please vote on these ideas and let us know what you think we should be working on at these hack-a-thons. A note: we're naturally averse to advocacy-type ideas, and we'll likely not do them. It isn't that we don't believe in democracy! It's because we're a c3 organization, and are legally prohibited in certain ways from doing very much in the form of advocacy.
a utility to take in a document such as ARRA, posted at Recovery.gov as a 13 MB PDF, and turn it into searchable, abstractable text
6 Comments
Got a URL for this? I've done pretty great things with the Perl modules CAM::PDF and CAM::PDF::PageText, though I've had to do a few custom tweaks.
The "document cloud" project means to do a pretty good job of that: http://www.niemanlab.org/2008/11/propublica-seeks-1m-to-put-everyones-documents-online/ But the monster PDF of legislation itself is here: http://www.whitehouse.gov/the_press_office/ARRA_public_review/
Strangely (or not), the link only had a 1 MB document. I threw it through some PDF-to-text scripts and got http://www.thefederalregister.com/ARRA-text/ Four or five hours with Perl and regexes identifying the bad characters, spacing, paging, and general layout, and you'd have a very readable, indexable, and searchable document. I'll take some of that Knight money if they're just giving it away :-)
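For anyone curious what that kind of regex cleanup pass might look like, here is a rough sketch. It is written in Python rather than Perl, and the function name and every rule in it are assumptions for illustration, not the scripts used above. It also assumes the converter marks page breaks with form feeds and puts page numbers on lines of their own, which may not hold for the ARRA file.

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Tidy raw PDF-to-text output (hypothetical helper, assumed rules)."""
    # Normalize line endings first so later patterns only see "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: form feeds mark page breaks; treat them as line breaks.
    text = text.replace("\f", "\n")
    # Drop control characters other than tab and newline ("bad characters").
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)
    # Assumption: a line holding only digits is a page number ("paging").
    text = re.sub(r"(?m)^[ \t]*\d+[ \t]*$\n?", "", text)
    # Rejoin words split by a hyphen at a line break. This will also join
    # genuinely hyphenated compounds that happen to wrap there.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs ("spacing").
    text = re.sub(r"[ \t]+", " ", text)
    # Strip trailing whitespace on each line.
    text = re.sub(r"(?m)[ \t]+$", "", text)
    # Collapse three or more newlines into a single paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

The order matters a little: page-number lines are removed before hyphen rejoining so that a word split across a page boundary is less likely to stay broken. A real pass would need rules tuned against the actual output.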
Would PDF to HTML conversion (of which there are at least several hosted services) be a good starting place? These tools already handle a lot of the work with characters, exposing some layout, etc. Then scraping tools built on extracting data from HTML (of which there are several, and more could be created) could be used.
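As a rough illustration of the HTML-then-scrape route, here is a minimal Python sketch using only the standard library's html.parser. It assumes the converter wraps each paragraph in a <p> tag, which not every PDF-to-HTML tool does; the class and function names are made up for this example.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buf = None  # None means we are not inside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buf = []

    def handle_data(self, data):
        # Keep text only while inside a paragraph; inline tags such as
        # <b> pass through because their text still arrives here.
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buf is not None:
            # Normalize internal whitespace left over from the layout.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._buf = None


def extract_paragraphs(html: str) -> list:
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Text outside <p> elements (navigation, page furniture) is simply ignored, which is the main appeal of going through HTML first: the converter has already exposed some of the layout.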
I think the guys at mySociety have got some tools that do this already.