To download a copy of the main Wikipedia article of every country

The R script below downloads a copy of the main Wikipedia article on every country. It saves all the text and most of the formatting, but images and links will not work offline (because that content is hosted at other URLs). This could be useful if, for example, you are traveling to a country where access to Wikipedia is blocked (or on any trip where internet access is uncertain) and you want an easy reference for basic facts and background on yet-to-be-determined countries. Running the code downloads all the pages onto your hard drive in about 30 seconds.

The steps are as follows:

0) Do not follow these instructions unless you are satisfied that the software from the R project, CRAN, and RStudio is safe to install (that is, unlikely to contain viruses).

1) Download the R statistical package from r-project.org and then install it.

2) Download RStudio (a more user-friendly graphical interface that runs on top of R) from rstudio.com and then install it.

3) Set the working directory in RStudio so you can find the files that will be downloaded. Go to: Tools -> Global Options, and then click on “Browse” to select the working directory (this is where the files will go), and then click on “Apply”.
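The working directory can also be set from the console instead of the menus. The following is a minimal sketch using base R only; the helper name use_download_dir and the folder name in the example are made up for illustration and are not part of the original steps.

```r
# Hypothetical helper: create a folder if needed and make it the working directory.
use_download_dir <- function(dir) {
  # Create the folder (and any missing parent folders); do nothing if it already exists
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  # Make it the working directory, so downloaded files are saved there
  setwd(dir)
  # Return the working directory so you can confirm where files will go
  getwd()
}

# Example call (the folder name is an assumption; pick any folder you can find later):
# use_download_dir("~/wikipedia_countries")
```

Calling getwd() at any time prints the folder R is currently saving into.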

Then paste the entirety of the following into the RStudio console (the pane on the left or lower left), press Enter, and wait through at least two extended stages while the package installs and the pages download. When everything is done, there will be a copy of the main Wikipedia article on every country in the working directory you set in step 3.


# The "#" sign starts a comment, so everything below can be pasted into the console as is; press Enter and the script should run.
#___1: R installed
#___2: RStudio installed
#___3: Set working directory in RStudio
#___4: install the package needed to make it work by pasting this into the console and pressing enter
install.packages("downloader")
#___5: load the library from the package
library(downloader)
#___6: declare all the extensions of the Wikipedia pages related to the country names
countryNames = c("Afghanistan", "Albania", "Algeria", "American_Samoa", "Andorra", "Angola", "Anguilla", "Antigua_and_Barbuda", "Argentina", "Armenia", "Aruba", "Australia", "Austria", "Åland_Islands", "Azerbaijan", "The_Bahamas", "Bahrain", "Bangladesh", "Barbados", "Belarus", "Belgium", "Belize", "Benin", "Bermuda", "Bhutan", "Bolivia", "Bosnia_and_Herzegovina", "Botswana", "Brazil", "British_Virgin_Islands", "Brunei", "Bulgaria", "Burkina_Faso", "Burundi", "Cambodia", "Cameroon", "Canada", "Cape_Verde", "Cayman_Islands", "Central_African_Republic", "Chad", "Chile", "China", "Christmas_Island", "Cocos_(Keeling)_Islands", "Colombia", "Comoros", "Republic_of_the_Congo", "Democratic_Republic_of_the_Congo", "Cook_Islands", "Costa_Rica", "Ivory_Coast", "Crimea", "Croatia", "Cuba", "Curaçao", "Cyprus", "Czech_Republic", "Denmark", "Djibouti", "Dominica", "Dominican_Republic", "East_Timor", "Ecuador", "Egypt", "El_Salvador", "Equatorial_Guinea", "Eritrea", "Estonia", "Ethiopia", "Falkland_Islands", "Faroe_Islands", "Fiji", "Finland", "France", "French_Guiana", "French_Polynesia", "Gabon", "The_Gambia", "Georgia_(country)", "Germany", "Ghana", "Gibraltar", "Greece", "Greenland", "Grenada", "Guadeloupe", "Guam", "Guatemala", "Guernsey", "Guinea", "Guinea-Bissau", "Guyana", "Haiti", "Honduras", "Hong_Kong", "Hungary", "Iceland", "India", "Indonesia", "Iran", "Iraq", "Republic_of_Ireland", "Isle_of_Man", "Israel", "Italy", "Jamaica", "Japan", "Jersey", "Jordan", "Kazakhstan", "Kenya", "Kiribati", "North_Korea", "South_Korea", "Kosovo", "Kuwait", "Kyrgyzstan", "Laos", "Latvia", "Lebanon", "Lesotho", "Liberia", "Libya", "Liechtenstein", "Lithuania", "Luxembourg", "Republic_of_Macedonia", "Madagascar", "Malawi", "Malaysia", "Maldives", "Mali", "Malta", "Marshall_Islands", "Martinique", "Mauritania", "Mauritius", "Mayotte", "Mexico", "Federated_States_of_Micronesia", "Moldova", "Monaco", "Mongolia", "Montenegro", "Montserrat", "Morocco", "Mozambique", "Myanmar", "Namibia", 
"Nauru", "Nepal", "Netherlands", "New_Caledonia", "New_Zealand", "Nicaragua", "Niger", "Nigeria", "Niue", "Norfolk_Island", "Northern_Mariana_Islands", "Norway", "Oman", "Pakistan", "Palau", "State_of_Palestine", "Panama", "Papua_New_Guinea", "Paraguay", "Peru", "Philippines", "Pitcairn_Islands", "Poland", "Portugal", "Puerto_Rico", "Qatar", "Réunion", "Romania", "Russia", "Rwanda", "Sahrawi_Arab_Democratic_Republic", "Saint_Barthélemy", "Saint_Helena,_Ascension_and_Tristan_da_Cunha", "Saint_Kitts_and_Nevis", "Collectivity_of_Saint_Martin", "Saint_Lucia", "Saint_Pierre_and_Miquelon", "Saint_Vincent_and_the_Grenadines", "Samoa", "San_Marino", "São_Tomé_and_Príncipe", "Saudi_Arabia", "Senegal", "Serbia", "Seychelles", "Sierra_Leone", "Singapore", "Sint_Maarten", "Slovakia", "Slovenia", "Solomon_Islands", "Somalia", "South_Africa", "South_Sudan", "Spain", "Sri_Lanka", "Sudan", "Suriname", "Svalbard", "Swaziland", "Sweden", "Switzerland", "Syria", "Taiwan", "Tajikistan", "Tanzania", "Thailand", "Togo", "Tokelau", "Tonga", "Trinidad_and_Tobago", "Tunisia", "Turkey", "Turkmenistan", "Turks_and_Caicos_Islands", "Tuvalu", "Uganda", "Ukraine", "United_Arab_Emirates", "United_Kingdom", "United_States", "United_States_Virgin_Islands", "Uruguay", "Uzbekistan", "Vanuatu", "Vatican_City", "Venezuela", "Vietnam", "Wallis_and_Futuna", "Yemen", "Zambia", "Zimbabwe")
#___7: The following cycles through all country name extensions and downloads the Wikipedia page of every country and saves it in the working directory
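for (i in seq_along(countryNames)) {
  # paste0() joins without a separator; paste() would insert a space into the URL and the file name
  url <- paste0("https://en.wikipedia.org/wiki/", countryNames[i])
  filename <- paste0(countryNames[i], ".html")
  download(url, filename)
}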
#___8: If a country name is misspelled, no warning is issued; the script just saves a page with basically no content. If any countries are removed from the vector, those pages are simply not downloaded.
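Because problems with individual pages can go unnoticed, the loop in step 7 could be written more defensively. The following is only a sketch under stated assumptions, not the method used above: it swaps the downloader package for base R's download.file(), percent-encodes names with accented letters using URLencode(), and collects the names that failed. The function names build_url, download_pages and small_pages, and the 10000-byte threshold, are hypothetical choices made for this example.

```r
# Build the article URL for one page name; non-ASCII letters are percent-encoded.
build_url <- function(name) {
  paste0("https://en.wikipedia.org/wiki/", utils::URLencode(enc2utf8(name)))
}

# Download each page into `dir` and return the names that could not be fetched.
download_pages <- function(names, dir = ".") {
  failed <- character(0)
  for (name in names) {
    destfile <- file.path(dir, paste0(name, ".html"))
    ok <- tryCatch({
      utils::download.file(build_url(name), destfile, mode = "wb", quiet = TRUE)
      TRUE
    }, error = function(e) FALSE, warning = function(w) FALSE)
    if (!ok) failed <- c(failed, name)
  }
  failed
}

# List saved .html files smaller than `min_bytes` (an assumed threshold),
# which may be near-empty pages from misspelled names.
small_pages <- function(dir = ".", min_bytes = 10000) {
  files <- list.files(dir, pattern = "\\.html$", full.names = TRUE)
  files[file.size(files) < min_bytes]
}

# Example (assumes countryNames from step 6 is already defined):
# failed <- download_pages(countryNames)
# print(failed)
# print(small_pages())
```

Any name that shows up in either result is worth checking against the title of the actual Wikipedia article.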

 

__________________________

P.S.: In WordPress, to insert code that can be copy+pasted without errors introduced by format conversion, use <code>(content)</code>. Otherwise, WordPress automatically changes various characters.
