Saturday, 4 June 2011

User Story: Gamification and Wikimedia Commons mass upload

Gamification is the use of game play in non-game applications.

This is a user story of the near future for Wikimedia Commons, based on a mini-brainstorm chat in the pub with Tom Morris and Shimgray immediately after the British Library English & Drama editathon. This was also in the context of the recent workshop with selected Wellcome Trust researchers on how to release mass image donations and some of the known issues and discussions on Commons for previous mass donations.

It is June 2012 and three types of mass donation over the last 6 months have led the Wikimedia Foundation, two major public bodies and two Chapter-coordinated volunteer groups to radically revise the possible application of Commons and, for the first time, to form direct partnerships with external digital libraries. All three have gamification programmes.

1. Philately and crowd-sourcing the catalogue

A large archive has donated a set of high quality scans on CD ROM of philatelic artefacts. These contain an estimated 350,000 different objects including many known to be unique to the archive. The archive does not have the resources to catalogue these artefacts fully but they are hierarchically organized into several main groups including pre-1955 postage stamps from African countries, Cinderella stamps from many countries dating from 1860 to 1968 and a wide collection of airmail stamps. They have been selected to be out of copyright but this is not guaranteed for all images due to the partial catalogue.

The volunteer teams have access to a hosted staging area for the images. New e-volunteers from the philatelic community are encouraged to apply to help, but due to possible copyright issues registration is necessary. A total of 40 participants are gradually working through the collection to apply a standard set of categorization tags using a visual front-end that seamlessly draws on the live Commons categorization structures and provides feedback to the participants on their progress, the quality of their classification work (based on sample expert checks) and the total programme progress. Where the copyright status of an image is not obvious, it is flagged for further expert review.

By April 2012 over 80,000 images have been released to the public on Commons and the same images and metadata support the archive's public catalogue (which includes the copyrighted images not released on Commons). Based on the current burn-down rate the 350,000 images will be fully catalogued by the end of July and the archive is now considering a collaborative digitization programme for a further 200,000 objects.

The wider philatelic community has already recognized Commons as a reference resource and an external organization has used the Commons API to produce a simplified front end which can be used as an integrated part of their established on-line catalogue.
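
As a rough illustration of how such an external front end might query Commons, the sketch below builds a request URL for the standard MediaWiki `categorymembers` query. The category name "Airmail stamps" and the helper name `category_files_url` are assumptions made for this example, not details of the story, and the sketch only constructs the URL rather than fetching it.

```python
from urllib.parse import urlencode

# Public MediaWiki API endpoint for Wikimedia Commons.
COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def category_files_url(category, limit=50):
    """Build an API URL that lists the files in one Commons category.

    `category` is the category name without the "Category:" prefix.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "file",      # only files, not sub-categories or pages
        "cmlimit": limit,
        "format": "json",
    }
    return f"{COMMONS_API}?{urlencode(params)}"


# Hypothetical category name, used purely for illustration.
url = category_files_url("Airmail stamps")
print(url)
```

A simplified catalogue front end could fetch this URL, read the returned JSON list of file titles and render them alongside its own records.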

2. Political history and identification

A small cultural archive has had a co-funded volunteer programme to digitize a number of political history related artefacts and release them on Commons. Though the majority of these date from within the last 70 years, they are being released on a no known copyright basis. The programme is on-going and 3,000 high quality images have been released out of an estimated 40,000.

A key issue with classification has been to identify a number of professional photographs that were donated from a 1970s-1990s magazine archive. These were mostly of interviewees for the magazine and so all had consent for release, but whether the photographs were used in the final print is unknown and the photographs have no identification or dating on them.

As they are scanned the images are uploaded the same day to Commons. Using a system of barnstars, the wider community has been encouraged to compete to classify the backlog of new images and an optional user-script has been created to help compare related images existing on Commons and automatically cross-match to TinEye prospects on the internet.

The attribution on every Commons image page links back to the donor archive's website and as a result the archive has seen a large increase in archive access requests as the images have gradually been put to use in nearly all of the 300 available language variations of Wikipedia. The archive has put forward a funding plan for a paid intern placement to support the on-going digitization and sharing programme.

3. Medical research and funware

A large research body has established an image sharing programme in partnership with four other image services including Commons (in practice the Wikimedia Foundation and the UK chapter) to provide an image clearing service for research image mass donations. Each of the five organizations has access to the image staging area, and planned mass donations must meet minimal upload criteria: known copyright status (some may be constrained to non-commercial use), EXIF data with complete originator information and associated "at the bench" description data for the majority of uploads. There is an expectation that images hosted on the clearing service will persist for at least a year at a time before being deleted.

Any or all of the five participating organizations are free to host the images. Due to the large size of some of the donations (over 10,000 images in a batch), each major donation type has required Commons community discussion and planning, though only the fully copyright free images are considered. Not all of the donations have been considered of likely general public educational value.

Separate from the main Commons site, a funware rating system (cheezburger) has been developed (which can run on mobile devices) to help rank categories with over 400 images for further decomposition. As well as helping to tag images with suitable sub-categories, game players are able to subjectively compare images for interest and quality. As a result of this work, the main Commons interface includes options for these categories to display recommended images from larger sets. This has been successful enough to be extended beyond the medical research image sets in order to gamify all large categories on Commons and has proved particularly popular with a range of "soft" categories such as "1960s" and "cats".
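
One way subjective pairwise comparisons could be turned into a ranking of recommended images is an Elo-style rating, sketched below. This is an assumption chosen for illustration; the story does not say how the funware system scores images. The function names, the starting rating of 1500 and the K-factor of 32 are likewise hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Probability that image A is preferred over image B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def record_preference(ratings, winner, loser, k=32.0, start=1500.0):
    """Update both ratings after a player prefers `winner` over `loser`."""
    rating_w = ratings.get(winner, start)
    rating_l = ratings.get(loser, start)
    # The winner gains what the loser gives up, so total rating is conserved.
    gain = k * (1.0 - expected_score(rating_w, rating_l))
    ratings[winner] = rating_w + gain
    ratings[loser] = rating_l - gain


def recommended(ratings, n=3):
    """Return the n highest-rated image names, best first."""
    return sorted(ratings, key=ratings.get, reverse=True)[:n]


# Hypothetical file names, used purely for illustration.
ratings = {}
record_preference(ratings, "cat_a.jpg", "cat_b.jpg")
record_preference(ratings, "cat_a.jpg", "cat_c.jpg")
print(recommended(ratings, n=1))
```

Each game round only needs to show two images and record one choice, which suits a mobile interface; the recommended images for a large category are then simply the top of the sorted list.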

Thursday, 2 June 2011

WMF in 2016 - a User Story

Here's a SciFi user story, with the Wikimedia Foundation in mind, to illustrate the importance of a verified backup process and archive policy. It is not a prediction of the future!

In 2016 San Francisco has a major earthquake and the servers and operational facilities for the WMF are damaged beyond repair. The emergency hot switchover to Hong Kong is delayed due to an ongoing DoS attack from Eastern European countries. The switchover eventually appears successful and data is synchronized with Hong Kong for the next 3 weeks. At the end of 3 weeks, with a massive raft of escalating complaints about images disappearing, it is realized that this is a result of local data caches expiring. The DoS attack covered the tracks of a passive data worm that only activates during back-up cycles, and the loss is irrecoverable because backups aged over 2 weeks are automatically deleted. Due to the lack of an operational archive strategy it is estimated that the majority of digital assets (including over 80 million photographs) have been permanently lost, and estimates for a 60% partial reconstruction from remaining cache snapshots and independent global archive sites run to over 2 years of work.
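
The failure in the story comes down to a retention window (2 weeks) that is shorter than the time taken to notice the corruption (3 weeks). The sketch below checks that condition. The dates, the daily backup schedule and the function names are all assumptions made for this illustration.

```python
from datetime import date, timedelta


def surviving_backups(backup_dates, today, retention_days):
    """Backups that have not yet been deleted by the retention policy."""
    cutoff = today - timedelta(days=retention_days)
    return [d for d in backup_dates if d >= cutoff]


def clean_backup_available(backup_dates, today, retention_days, corrupted_since):
    """True if at least one surviving backup predates the corruption."""
    kept = surviving_backups(backup_dates, today, retention_days)
    return any(d < corrupted_since for d in kept)


# Hypothetical timeline: corruption starts on 1 June 2016 and is noticed 21 days later.
corrupted_since = date(2016, 6, 1)
today = corrupted_since + timedelta(days=21)
daily_backups = [today - timedelta(days=i) for i in range(60)]

print(clean_backup_available(daily_backups, today, 14, corrupted_since))
print(clean_backup_available(daily_backups, today, 30, corrupted_since))
```

With a 14-day retention every surviving backup was taken after the corruption began, so the first check fails; a 30-day retention, or a separate long-term archive, would leave a clean copy to restore from.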

Wednesday, 1 June 2011

Mass uploading your Flickr photos to Wikimedia Commons

Why would I want to?
  • Wikimedia Commons is the multi-media friendly website that shares images, audio and video with Wikipedia in 270+ languages plus sister projects such as Wikiversity and Wikisource. Many people and organizations would like to have some or all of the photos they are happy to share freely on Flickr available and used on Wikipedia. A number of large institutions already use "Flickr Commons" to share images from their digital collections and if these are released into the public domain, or only limited by requiring attribution, they are candidates for easy mass upload to Wikimedia Commons.
  • Many individuals have interesting collections of images on Flickr from their hobbies or holiday trips. With a bit of work ensuring the titles are plain English (these are used as the file names on Wikimedia Commons), tidying up Flickr tags and adding a few good descriptions these would make great additions to Wikimedia Commons.
  • Once uploaded, you have the benefit of other "e-volunteers" helping by improving descriptions, adding categories, using the images on Wikipedia and elsewhere and even digitally correcting poorly exposed or old, marked photographs (if they are interesting enough!).
Do I have to do this myself?
[Image: Example standard Commons upload wizard in the middle of taking several images at once from my hard disk.]
  • If you are uploading a manageable number of images (say, fewer than 50), then the built-in upload features of Wikimedia Commons are excellent for loading a small batch of files from your hard disk, which you can then describe, categorize and suitably rename before committing to final publication. Try it out and check the general help available. For very large numbers (say, more than 1,000) you may want to raise a request at COM:BATCH, even if you feel technically capable, as without a bit of planning and discussion you might find yourself spending many tedious hours correcting license details or categorization errors which you only noticed after the upload.
  • If you are in the middle ground (a couple of hundred images which you have on a Flickr stream) then check out Flickr Mass in the case study below. It avoids any retyping of the details and tags you have on Flickr and makes some pretty good guesses for you during the process.
Case study for Flickr Mass tool
Last weekend I spent an afternoon playing around with Flickr Mass for the first time. To use the tool you must already have an account set up on Wikimedia Commons (or Wikipedia) and then register on the toolserver. After a bit of testing out the features I went on to upload around 250 of the photos that the London School of Economics Library has on its Flickr photo stream. These were perfect for uploading automatically as they:
  • are released under a free license suitable for Wikimedia Commons (i.e. no non-commercial or no-derivatives restrictions)
  • have descriptive titles, full extended descriptions in the Flickr description box and links back to the original source library digital catalogue
[Image: Set of photos (mostly of William Beveridge) loaded to the new category of LSE Commons.]
The process was intuitive and I could filter by title words, any Flickr tag or by words in the description. A handy feature was to run the tool as a simulation which shows you exactly what is going to happen before doing it for real and releasing images to Wikimedia Commons.

The images uploaded extremely well. I did not have to re-type any of the titles or descriptions as these were so nicely done on Flickr. I uploaded in batches, filtering on certain tag names (including "poster" and "portrait").

A key tip is that Flickr Mass will assume that image tags on Flickr make suitable categories and so will add them as categories if the words match an existing Wikimedia Commons category. I had to do a bit of tidying up on the images to remove surplus categories of "LSE" and in many cases "glasses" (where someone had tagged every portrait of people with glasses). Obviously if the Flickr stream is yours, it would be easy to tidy up the tags and add tags that have the same name as Wikimedia Commons categories that you would like to see added.
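
The tag-to-category behaviour described above can be pictured with a small sketch. This is not Flickr Mass's actual code: the function name, the case-insensitive matching and the `skip` list are assumptions made here to show why surplus tags such as "LSE" and "glasses" end up as categories and how filtering them beforehand would avoid the tidy-up.

```python
def categories_from_tags(tags, existing_categories, skip=()):
    """Return the Commons categories whose names match the given Flickr tags.

    Matching is case-insensitive (an assumption); tags listed in `skip`
    are ignored even when a category of that name exists.
    """
    lookup = {name.lower(): name for name in existing_categories}
    skipped = {s.lower() for s in skip}
    result = []
    for tag in tags:
        key = tag.strip().lower()
        if key in skipped:
            continue
        match = lookup.get(key)
        if match and match not in result:
            result.append(match)
    return result


# Hypothetical tags and category names, used purely for illustration.
flickr_tags = ["LSE", "glasses", "portrait", "economist"]
commons_categories = ["LSE", "Glasses", "Portrait"]

print(categories_from_tags(flickr_tags, commons_categories))
print(categories_from_tags(flickr_tags, commons_categories, skip=("LSE", "glasses")))
```

Tags with no matching category (here "economist") are simply dropped, which is why renaming your Flickr tags to match the Commons categories you want is worth doing before the upload.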

After uploading you may well find the gadgets HotCat and Cat-a-lot (you can find these under your Wikimedia Commons Preferences) quite useful as a time saver for re-categorizing images and changing where groups of images are located.

You can surf the final outcome of my upload on Wikimedia Commons. In the end I spent most of a day on this, due to a steep learning curve and the time taken afterwards fixing categories. If I were to do this again with a few hundred images, I would guess that it would realistically take about an hour if the Flickr stream were already well organized, and it would be much easier if I owned the Flickr stream and could re-organize the tags.