The Greatest Resources On The Web /--evilbitz
External links in Wikipedia point to websites with information relevant to the article in which they appear; most often they are resources the author used to compile the article. I was curious to find out which 100 domain names Wikipedia articles link to most often.
I downloaded the Wikipedia backup from 11/30/2006 and wrote a PHP script to generate those statistics. I am releasing both the statistics I gathered from the backup and the source code of the PHP script that produced them (please excuse the mess, and the CRLFs), for those of you who would like to play with it a bit more.
Technical Details
The PHP script walks through the backup file and finds the wikitext of each page (Wikipedia entry), then parses the wikitext and extracts the external links. I designed a simple database to hold the results: an articles table, which holds an id and title for each article, and an external links table, which holds the domain name of each external link and the id of the article in which it appears. Once the database has been fed with the gathered data, the following query calculates the statistics: “select domain_name, count(*) from tbl_exlinks group by domain_name order by 2 desc limit 100;”
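For readers who want to experiment without the PHP script, here is a minimal Python sketch of the same idea: pull external URLs out of wikitext, store one row per link, and run the query above. It is not the original script. The URL regex, the function names, the tbl_articles table name, and the article_id column are assumptions made for illustration (only tbl_exlinks and domain_name appear in the query), and it uses an in-memory SQLite database where the original used MySQL.

```python
import re
import sqlite3
from urllib.parse import urlsplit

# Assumption: an external link is any http(s) or ftp URL in the wikitext,
# whether bracketed ([http://example.org label]) or bare.
URL_RE = re.compile(r'(?:https?|ftp)://[^\s\[\]<>|{}"]+', re.IGNORECASE)


def extract_domains(wikitext):
    """Return the host name of every external URL found in the wikitext."""
    domains = []
    for match in URL_RE.finditer(wikitext):
        try:
            host = urlsplit(match.group(0)).hostname
        except ValueError:
            continue  # skip URLs that urlsplit cannot parse
        if host:
            domains.append(host)
    return domains


def build_database(pages):
    """Load (title, wikitext) pairs into two tables, one row per link."""
    db = sqlite3.connect(":memory:")
    # Hypothetical schema; only tbl_exlinks.domain_name is known from the post.
    db.execute("CREATE TABLE tbl_articles (id INTEGER PRIMARY KEY, title TEXT)")
    db.execute("CREATE TABLE tbl_exlinks (domain_name TEXT, article_id INTEGER)")
    for article_id, (title, wikitext) in enumerate(pages, start=1):
        db.execute("INSERT INTO tbl_articles VALUES (?, ?)", (article_id, title))
        db.executemany(
            "INSERT INTO tbl_exlinks VALUES (?, ?)",
            [(domain, article_id) for domain in extract_domains(wikitext)],
        )
    return db


def top_domains(db, limit=100):
    """Run the post's statistics query and return (domain, count) rows."""
    query = (
        "SELECT domain_name, COUNT(*) FROM tbl_exlinks "
        "GROUP BY domain_name ORDER BY 2 DESC LIMIT ?"
    )
    return db.execute(query, (limit,)).fetchall()
```

Calling top_domains(build_database(pages)) on a list of (title, wikitext) pairs returns the most linked domains first. Note that hosts are counted exactly as written, so www.imdb.com and imdb.com stay separate, just as they do in the statistics.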
The PHP script creates a directory of generated .sql files, which are executed with the mysql command line tool. I also wrote a Python script to help me execute them.
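The helper script itself is not shown here, but feeding a directory of .sql files to the mysql client could look roughly like the sketch below. The function name, the sorted-by-filename order, and the assumption that the password comes from a MySQL option file such as ~/.my.cnf are all mine, not taken from the original script.

```python
import subprocess
from pathlib import Path


def run_sql_files(directory, database, user, runner=subprocess.run):
    """Pipe every .sql file in the directory into the mysql client, in name order.

    The runner parameter exists only so the function can be tried out
    without a MySQL server; by default it is subprocess.run.
    """
    executed = []
    for sql_file in sorted(Path(directory).glob("*.sql")):
        with sql_file.open("rb") as handle:
            # Equivalent to: mysql --user=USER DATABASE < file.sql
            runner(["mysql", "--user=" + user, database], stdin=handle, check=True)
        executed.append(sql_file.name)
    return executed
```

With check=True, a failing file raises an exception and stops the run, which avoids silently loading only part of the data.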
The Statistics
en.wikipedia.org has 679597 links.
www.google.com has 81208 links.
www.findagrave.com has 44549 links.
www.britannica.com has 43854 links.
babelfish.altavista.com has 37437 links.
news.bbc.co.uk has 29627 links.
www.allmusic.com has 20782 links.
www.imdb.com has 17460 links.
books.google.com has 17092 links.
www.geocities.com has 15135 links.
www.bbc.co.uk has 10355 links.
www.myspace.com has 10175 links.
www.google.co.uk has 7961 links.
www.nytimes.com has 7805 links.
tools.wikimedia.de has 7210 links.
www.amazon.com has 7185 links.
www.ncbi.nlm.nih.gov has 5848 links.
maps.google.com has 5816 links.
www.washingtonpost.com has 5494 links.
www.guardian.co.uk has 5434 links.
www.youtube.com has 5212 links.
www.opsi.gov.uk has 4993 links.
planetmath.org has 4979 links.
www.cnn.com has 4927 links.
web.archive.org has 4703 links.
www.jewishencyclopedia.com has 4622 links.
www.mindat.org has 4552 links.
www.newadvent.org has 4356 links.
www.webmineral.com has 4145 links.
www.baseball-reference.com has 3616 links.
www.cbc.ca has 3613 links.
www.pbs.org has 3587 links.
www.findarticles.com has 3414 links.
www.parl.gc.ca has 3295 links.
www.abc.net.au has 3162 links.
www.nationalregisterofhistoricplaces.com has 2875 links.
www.aafla.org has 2818 links.
www.tv.com has 2692 links.
www.rollingstone.com has 2673 links.
www.angelfire.com has 2663 links.
www.perseus.tufts.edu has 2639 links.
www.gutenberg.org has 2628 links.
www.flheritage.com has 2611 links.
news.yahoo.com has 2556 links.
www.lib.utexas.edu has 2505 links.
www.timesonline.co.uk has 2489 links.
www.history.navy.mil has 2478 links.
members.aol.com has 2472 links.
imdb.com has 2471 links.
www.flickr.com has 2471 links.
www.biographi.ca has 2412 links.
groups.yahoo.com has 2379 links.
www.nba.com has 2348 links.
sports.espn.go.com has 2322 links.
www.fallingrain.com has 2305 links.
www.nps.gov has 2271 links.
de.wikipedia.org has 2215 links.
www.globalsecurity.org has 2213 links.
www.time.com has 2162 links.
www.cricinfo.com has 2127 links.
www.telegraph.co.uk has 2110 links.
www.usatoday.com has 2107 links.
www.msnbc.msn.com has 2099 links.
www.ethnologue.com has 2060 links.
www.hockeydb.com has 2056 links.
video.google.com has 2042 links.
groups.google.com has 2006 links.
www.pantheon.org has 1993 links.
www-history.mcs.st-andrews.ac.uk has 1977 links.
www.mobygames.com has 1945 links.
www.smh.com.au has 1909 links.
bioguide.congress.gov has 1897 links.
query.nytimes.com has 1878 links.
www.census.gov has 1815 links.
www.npr.org has 1810 links.
www.bartleby.com has 1808 links.
www.submission.info has 1794 links.
www.un.org has 1781 links.
www.discogs.com has 1779 links.
www.cia.gov has 1777 links.
www.alexa.com has 1774 links.
www.microsoft.com has 1723 links.
www.theage.com.au has 1692 links.
www.forbes.com has 1686 links.
www.gamespot.com has 1669 links.
www.boston.com has 1665 links.
www.gcr1.com has 1644 links.
www.nscb.gov.ph has 1638 links.
www.defenselink.mil has 1631 links.
www.wizards.com has 1606 links.
www.navsource.org has 1599 links.
www.t-macs.com has 1582 links.
www.probertencyclopaedia.com has 1558 links.
www.uefa.com has 1552 links.
www.sfgate.com has 1542 links.
www.state.gov has 1525 links.
www.reuters.com has 1517 links.
www.archive.org has 1509 links.
adsabs.harvard.edu has 1508 links.
nces.ed.gov has 1486 links.

