March 2018

Machine learning / Price prediction of artworks / Part 2: Data Cleaning

This is another part of a series on a real machine learning project I did at university. The intention is not to write a tutorial, but to provide hints and references on useful libraries and tools. Part 1 covered the collection of raw data through web scraping. This part 2 focuses on how to clean the collected raw data. The whole project was about price prediction of artworks from the long 19th century.

Data Cleaning

When we deal with real data, it is not like playing around with toy datasets such as iris. Our data will be messy: it will have missing values, inconsistent types, wrong information, broken encodings, etc. So what we basically have to do is:

  1. Remove HTML tags
  2. Fix encoding
  3. Convert strings to datatypes (datetime, numbers)
  4. Normalize categories
  5. Normalize numeric values
  6. Replace/wipe-out missing values

1. Remove HTML tags

When scraping websites, we often end up with HTML tags inside the scraped texts. The tags can actually help us, e.g. to recognize entity lists (<br>), but in our final data we do not want any HTML. A very nice Python library that helps you get this done is w3lib [w3lib]. Its html module contains the function

remove_tags(htmlString)
2. Fix encoding

When dealing with artwork data created all over Europe, you have artwork titles and information in different languages such as German, Spanish, French or Italian. These languages often contain accents, umlauts or other special characters. Especially the artist name is difficult to group by when it contains accents or other special characters. The Python standard library module unicodedata [unicodedata] will help you out here. Import it and type

unicodedata.normalize('NFKD', text_with_accents).encode('ascii', errors='ignore').decode('ascii')
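Wrapped in a small helper (note that encode returns bytes, so we decode back to str at the end):

```python
import unicodedata

def strip_accents(text):
    # NFKD decomposes 'í' into 'i' plus a combining accent; encoding to
    # ASCII with errors='ignore' then drops the combining character
    return (unicodedata.normalize('NFKD', text)
            .encode('ascii', errors='ignore')
            .decode('ascii'))

strip_accents('Salvador Dalí')  # 'Salvador Dali'
```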
3. Convert strings to datatypes

When you want to work with your data, it helps to have uniform date and number formats. To find numbers in your text, I would recommend simple regular expressions [regex]:

re.search(r"€\s*([0-9]+,[0-9]+)(\s*-\s*)?([0-9]+,[0-9]+)?", rawprice)

Be aware that you can encounter different number formats (',' or '.' as separator).
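To sketch how the matched groups can be turned into actual numbers — the helper below assumes ',' is a thousands separator, which held for my dataset but is not true in general:

```python
import re

def parse_price(rawprice):
    # Matches "€ 1,200" as well as a range like "€ 1,200 - 1,500"
    m = re.search(r"€\s*([\d,\.]+)\s*(?:-\s*([\d,\.]+))?", rawprice)
    if m is None:
        return None
    # Drop the thousands separator before converting to float
    low = float(m.group(1).replace(',', ''))
    high = float(m.group(2).replace(',', '')) if m.group(2) else low
    return low, high

parse_price("Estimate: € 1,200 - 1,500")  # (1200.0, 1500.0)
```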

To parse dates, there is a super cool Python library out there called dateutil [dateutil]. Its parser module can do a fuzzy search within your texts to find dates:

parser.parse(text_including_date, fuzzy=True).year
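For example, with a made-up catalogue line:

```python
from dateutil import parser

# fuzzy=True skips the tokens that are not part of a date
dt = parser.parse("Painted in Paris, 12 May 1887, oil on canvas", fuzzy=True)
dt.year  # 1887
```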
4. Normalize categories

The categorization of texts can be quite tricky. Once we have parsed our numbers and dates, we can group by those values, which is a good thing. What is still missing are the text categories. In the case of artworks, this would be, for example, the artist name. We want to be able to find the total number of sales for an artist, or their total revenue. This can only be done when all artist names are written in the same way. The problem is that names can come in a big variety of spellings. Take Salvador Dali as an example. In the dataset, you can find the following spellings:

  • Dalí, Salvador
  • dali, salvador
  • domènech, salvador dalí
  • salvador dalí
  • salvador dali

So the first idea that pops up is to compare string distances such as the Hamming distance [hamming] (note that the Hamming distance only works on strings of equal length, so in practice an edit distance like Levenshtein is the better fit). For “salvador dalí” and “salvador dali” this could really work out, but what about “domènech, salvador dalí”? For such problems, one has to be creative. In the case of artists, there exist online databases like Getty [getty] and Artnet [artnet]. These databases contain names as well as alternative spellings and nicknames of artists. If we look our names up there, we can simply normalize them to a default spelling. So we would go back to the scraping step to crawl the necessary artist names.

If you do not have the option to look up names in a database, the problem can get really challenging. The simplest (but not great) approach might be to use edit distances or named entity recognition. If you do not want to code this by hand, you can use fuzzy search libraries to make the matches.
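As a stdlib-only sketch, Python's difflib can do such fuzzy matching against a canonical list (the names and the similarity cutoff here are illustrative, not from the original project):

```python
import difflib

CANONICAL = ["salvador dali", "claude monet", "edgar degas"]

def normalize_artist(name, cutoff=0.6):
    # Return the closest canonical spelling, or the raw name unchanged
    # if nothing is similar enough
    matches = difflib.get_close_matches(name.lower(), CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else name

normalize_artist("Dali, Salvador", cutoff=0.5)  # 'salvador dali'
```

Stripping accents first (step 2) makes these matches noticeably more reliable.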

5. Normalize numeric values

This is a topic that you really have to think about deeply and that has no universal recipe, because it is domain dependent. You can consider normalizing your numeric values, e.g. with z-score normalization or min-max normalization. Each choice will have a different impact on your later machine learning, and you have to try out what works for you. A good starting point is the tutorial on machinelearningmastery [mlmastery].
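As a reminder of what the two mentioned schemes compute (a plain-Python sketch; in practice you would use a library such as scikit-learn):

```python
def min_max(values):
    # Rescales the values into the range [0, 1]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    # Centers on the mean, scales by the (population) standard deviation
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

min_max([100, 250, 400])  # [0.0, 0.5, 1.0]
```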

6. Replace/wipe-out missing values

For missing values, you generally have two options: discard the rows in your dataset that contain missing values (which can shrink your data size considerably), or replace the missing values with the mean, median, min, max, or a default value.

This topic is also very domain dependent and has a huge impact on your machine learning algorithm. Generally, I would say you should remove all entries that do not contain the value you want to predict. For the other entries, you can try running the machine learning with different imputation strategies and use the one that performs best.
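The strategy from the last paragraph could be sketched like this (pure Python; the row layout and the choice of mean imputation are illustrative):

```python
def clean_rows(rows, target, impute_keys):
    # Drop every row that is missing the prediction target
    rows = [r for r in rows if r.get(target) is not None]
    # Fill the remaining gaps with the column mean
    for key in impute_keys:
        observed = [r[key] for r in rows if r.get(key) is not None]
        mean = sum(observed) / len(observed)
        for r in rows:
            if r.get(key) is None:
                r[key] = mean
    return rows

rows = [
    {"price": 1200.0, "year": 1887},
    {"price": None,   "year": 1850},  # missing target -> dropped
    {"price": 950.0,  "year": None},  # missing year -> imputed
]
clean_rows(rows, "price", ["year"])
```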

References

[w3lib] http://w3lib.readthedocs.io/en/latest/w3lib.html

[unicodedata] https://docs.python.org/3.6/library/unicodedata.html

[regex] https://docs.python.org/3.6/library/re.html

[dateutil] https://dateutil.readthedocs.io/en/stable/

[hamming] https://en.wikipedia.org/wiki/Hamming_distance

[getty] http://www.getty.edu/

[artnet] http://www.artnet.com/

[mlmastery] https://machinelearningmastery.com/scale-machine-learning-data-scratch-python/