Wikipedia is one of the few sites that should have an API but doesn’t. It’s a shame, considering it is one of the best sources of free, quality content. Because of this limitation, we have to resort to the prehistoric art of screen scraping. _why’s Hpricot is my favorite tool for this. It uses a fast HTML scanner written in C using Ragel, the same technology that makes Mongrel so fast. It lets you parse HTML using either CSS selectors or XPath, in a similar vein to jQuery.
The few Wikipedia clients out there only output data in Wikitext format or plain text. I wanted something that would reproduce the content with basic styling intact, so it can be republished in a similar fashion to Answers.com. Here is the code to do this:
require 'hpricot'
require 'open-uri'

items_to_remove = [
  "#contentSub",    # redirection notice
  "div.messagebox", # cleanup data
  "#siteNotice",    # site notice
  "#siteSub",       # "From Wikipedia..."
  "table.infobox",  # sidebar box
  "#jump-to-nav",   # jump-to-nav
  "div.editsection", # edit blocks
  "table.toc",      # table of contents
  "#catlinks"       # category links
]

doc = Hpricot(open('wikipedia url'))

@article = (doc/"#content").each do |content|
  # change /wiki/ links to point to full wikipedia path
  (content/:a).each do |link|
    unless link.attributes['href'].nil?
      if (link.attributes['href'][0..5] == "/wiki/")
        link.attributes['href'].sub!('/wiki/', 'http://en.wikipedia.org/wiki/')
      end
    end
  end

  # remove unnecessary content and edit links
  items_to_remove.each { |x| (content/x).remove }

  # replace links to create new entries with plain text
  (content/"a.new").each do |link|
    link.parent.insert_before Hpricot.make(link.attributes['title']), link
  end.remove
end

puts @article.inner_html
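The link-rewriting step in the script can be pulled out into a small helper, which makes the rule easier to test on its own. This is only a sketch: absolute_wiki_href and WIKI_BASE are hypothetical names that are not part of Hpricot or the original script, and it uses nothing beyond plain Ruby strings.

```ruby
# Hypothetical helper, not part of Hpricot or the original script.
# Turns a site-relative "/wiki/..." href into an absolute Wikipedia URL
# and leaves every other href (external links, anchors, nil) untouched.
WIKI_BASE = 'http://en.wikipedia.org'

def absolute_wiki_href(href, base = WIKI_BASE)
  return href if href.nil?
  href.start_with?('/wiki/') ? base + href : href
end

puts absolute_wiki_href('/wiki/Ruby')
puts absolute_wiki_href('http://example.com/')
```

Unlike the sub! call in the script, this helper returns a new string instead of modifying the attribute in place, so the result would have to be assigned back to the link’s href attribute.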
For comparison, here is a Wikipedia article scraped using this script, and its cousin at Answers.com. Here is the original Wikipedia article. There are still a few things to work out, like filtering out JavaScript and working with other languages. But I will leave that as an exercise for the reader!
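As a starting point for those two leftovers, here is a rough sketch in plain Ruby. Both strip_scripts and wikipedia_base are hypothetical helpers, not part of the script above, and the pattern that checks the language code is an assumption about what Wikipedia subdomains look like. A regular expression is not a real HTML parser, so adding a "script" selector to items_to_remove would probably be the more natural fix inside the Hpricot script itself.

```ruby
# Hypothetical helper: drops <script>...</script> blocks from an HTML string.
# This is a rough regex-based filter and will miss unusual markup.
def strip_scripts(html)
  html.gsub(%r{<script\b[^>]*>.*?</script\s*>}mi, '')
end

# Hypothetical helper: base URL for a given language edition.
# Assumes editions live at "<code>.wikipedia.org" and that codes are
# lowercase letters, optionally separated by hyphens (e.g. "fr", "zh-min-nan").
def wikipedia_base(lang = 'en')
  unless lang =~ /\A[a-z]+(-[a-z]+)*\z/
    raise ArgumentError, "unexpected language code: #{lang.inspect}"
  end
  "http://#{lang}.wikipedia.org"
end

puts strip_scripts('<p>a</p><script type="text/javascript">var x = 1;</script><p>b</p>')
puts wikipedia_base('fr')
```

The base URL returned by wikipedia_base could replace the hard-coded http://en.wikipedia.org prefix when rewriting /wiki/ links.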
If you are going to use this script, please don’t forget to give credit to Wikipedia and include a link to the original article. The scraped output already seems to include a Notes section with a link back, which isn’t on the original site. Wikipedia must be checking the user agent of the browser and adding that line when it detects an unknown browser. You still have to include a link to the GNU Free Documentation License.
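One way to make sure the credit is always there is to generate the footer yourself and append it to the scraped HTML. The sketch below is an assumption about how that could look: attribution_html, the markup it emits, and the GFDL_URL constant (pointing at the license text on gnu.org) are all illustrative choices, not something Wikipedia prescribes.

```ruby
require 'erb'
require 'uri'

# Assumed location of the license text; adjust if you link elsewhere.
GFDL_URL = 'http://www.gnu.org/copyleft/fdl.html'

# Hypothetical helper: builds a credit line linking back to the original
# article and to the GNU Free Documentation License.
def attribution_html(title, lang = 'en')
  # Wikipedia article URLs use underscores in place of spaces.
  slug = URI.encode_www_form_component(title.tr(' ', '_'))
  article_url = "http://#{lang}.wikipedia.org/wiki/#{slug}"
  safe_title = ERB::Util.html_escape(title)
  %(<p class="attribution">Source: <a href="#{article_url}">#{safe_title}</a> on Wikipedia, ) +
    %(licensed under the <a href="#{GFDL_URL}">GNU Free Documentation License</a>.</p>)
end

puts attribution_html('Ruby (programming language)')
```

The footer string can then be printed right after the article HTML.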
Update: _why makes some suggestions to improve this script, and adds a new method, swap, which eliminates the ugly end.remove syntax.
