March 31, 2017, by Stuart Moran

Xerox Printers as a Research Tool

The Digital Research Team are currently supporting researchers in the digitisation of paper-based research data.

Dr Christian Karner in the School of Sociology and Social Policy has gathered around 5000 newspaper clippings over the last 10-15 years as part of his research. While he knows the data very well, finding the information he needs can be slow, and the whole archive could be lost if there were, for example, a fire.

We are looking to scan the data using nothing more than the Xerox printers available across the University. We’re also using a number of software tools for optical character recognition, file conversion and searching through the data. The process we’re going to follow is outlined below:

Step 1: Scan each newspaper clipping as a PDF-Image-on-text at 600 DPI (dots per inch), and send it to the home drive. The high resolution will help with the optical character recognition later on.

Step 2: Rename each PDF according to the following naming convention (YEAR-MONTH-DATE-CLIPPING#). This keeps the files in a consistent order and makes it easier to trace keyword search results back to the right clipping.

Step 3: Convert the PDFs to editable Word documents using “Nuance PDF Converter Assistant”. The optical character recognition on this is surprisingly good, although it does have trouble recognising umlauts. At this point we effectively have each newspaper clipping as a Word document.

Step 4: Extract the plain text from the Word documents using “Ant File Convertor”. This strips each Word document of all its formatting and pictures, leaving us with the text only.

Step 5: Add all 5000 plain text files to “AntConc”. This tool then allows us to search through the data for keywords, and carry out all sorts of interesting analysis.
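
To give a feel for how the naming convention in Step 2 and the keyword searching in Step 5 fit together, here is a minimal Python sketch. It is not part of our workflow and is not how AntConc works internally; it is only an illustration. The function names, the zero-padded month and day, the unpadded clipping number and the folder of .txt files are all assumptions made for the example.

```python
import re
from datetime import date
from pathlib import Path

# Assumed reading of YEAR-MONTH-DATE-CLIPPING#, e.g. "2009-03-14-2".
# Zero-padded month/day and an unpadded clipping number are assumptions.
NAME_PATTERN = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-(\d+)$")


def clipping_name(published: date, clipping_number: int) -> str:
    """Build a file stem such as '2009-03-14-2' (hypothetical helper)."""
    return f"{published:%Y-%m-%d}-{clipping_number}"


def parse_clipping_name(stem: str) -> tuple[date, int]:
    """Recover the publication date and clipping number from a file stem."""
    match = NAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"Name does not follow the convention: {stem!r}")
    year, month, day, number = map(int, match.groups())
    return date(year, month, day), number


def keyword_in_context(folder: Path, keyword: str, width: int = 30):
    """Yield (file stem, snippet) for each case-insensitive keyword hit.

    Reads every .txt file in `folder` in name order and returns up to
    `width` characters of context on either side of each hit.
    """
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    for path in sorted(folder.glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for hit in pattern.finditer(text):
            start = max(hit.start() - width, 0)
            end = min(hit.end() + width, len(text))
            # Collapse line breaks and repeated spaces in the snippet.
            yield path.stem, " ".join(text[start:end].split())
```

Because each hit comes back with the file name it was found in, a consistent name tells us straight away when the clipping was published, for example by passing the stem to `parse_clipping_name`.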

The goal of this work is to protect the integrity of the research data, to make it accessible at all times, and to provide new digital tools and opportunities for working with and analysing the data.

We hope to start scanning next week, and we look forward to sharing what we learn.

Please note, this work is being conducted in full consideration of the copyright principles of “fair dealing/usage”. Only one clipping per newspaper is being scanned, only one digital copy is being stored, and the archive is to be used only privately for non-commercial personal research.


Next blog in series: Emerging Research Opportunities from Digitised Newspaper Data

Previous blog in series: Introducing Handwriting Technologies

Stuart Moran, Digital Research Specialist for Social Sciences
