Data Science

Machine learning / Price prediction of artworks / Part 2: Data Cleaning

This is another part of a series on a real machine learning project I did at university. The intention is not to write a tutorial, but to provide hints and references on libraries and tools. Part 1 covered the collection of raw data through web scraping. This part 2 focuses on how to clean the collected raw data. The whole project was about price prediction of artworks from the long 19th century.

Data Cleaning

When we deal with real data, it is not like playing around with toy datasets like iris. Our data will be messy: it will have missing values, inconsistent types, wrong information, broken encodings etc. So what we basically have to do is:

  1. Remove HTML tags
  2. Fix encoding
  3. Convert strings to datatypes (datetime, numbers)
  4. Normalize categories
  5. Normalize numeric values
  6. Replace/wipe-out missing values

1. Remove HTML tags

When we are scraping websites, we often have HTML tags included in the scraped texts. The tags can even help us, e.g. to recognize entity lists (<br>). Still, in our final data we do not want to include HTML. A very nice Python library that helps you get this done is w3lib [w3lib]. It has a module html that contains a function remove_tags for exactly this.
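
As a sketch of what the tag stripping does, here is a version using only the standard library, in case you do not want the extra dependency (with w3lib the same is the one-liner `w3lib.html.remove_tags(text)`):

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text content of an HTML fragment, dropping all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # called for every run of text between tags
        self.parts.append(data)

    def get_text(self):
        return "".join(self.parts)

def strip_tags(html_text):
    stripper = TagStripper()
    stripper.feed(html_text)
    return stripper.get_text()

raw = "<p>Lithograph in colours, <b>1956</b></p>"
print(strip_tags(raw))  # Lithograph in colours, 1956
```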

2. Fix encoding

When dealing with artwork data that was created all over Europe, you have artwork titles and information in different languages like German, Spanish, French or Italian. These languages often contain accents, umlauts or other special characters. The artist name in particular is difficult to group when it contains accents or other special characters. A Python standard library module called unicodedata [unicodedata] will help you out here. Import it and type

unicodedata.normalize('NFKD', text_with_accents).encode('ascii', errors='ignore')
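
A small sketch of how this plays out on an accented artist name (the `.decode('ascii')` at the end turns the resulting bytes back into a string):

```python
import unicodedata

def remove_accents(text):
    # NFKD decomposes e.g. 'í' into 'i' plus a combining accent mark;
    # encoding to ASCII with errors='ignore' then drops the mark
    return unicodedata.normalize('NFKD', text).encode('ascii', errors='ignore').decode('ascii')

print(remove_accents('Salvador Dalí'))  # Salvador Dali
print(remove_accents('Dürer'))          # Durer
```
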

3. Convert strings to datatypes

When you want to work with your data, it is good to have uniform date and number formats. To find numbers in your text, I would recommend using simple regex [regex] expressions: re.search(r"€\s([0-9]+,[0-9]+)(\s*-\s*)*([0-9]+,[0-9]+)*", rawprice)

Be aware that you could have different number formats (, or . as separator).
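
A hedged sketch of such a price parser; the separator heuristic below is an assumption (comma as decimal, dot as thousands separator) and has to be adapted to the formats that actually occur in your data:

```python
import re

def parse_price(raw):
    """Extract the first euro amount from a raw text, or None."""
    m = re.search(r"€\s*([0-9]+(?:[.,][0-9]+)*)", raw)
    if not m:
        return None
    number = m.group(1)
    # heuristic: if a comma is present, treat it as the decimal separator
    # and any dots as thousands separators (European style, e.g. "1.500,00")
    if "," in number:
        number = number.replace(".", "").replace(",", ".")
    return float(number)

print(parse_price("Estimate: € 1.500,00 - € 2.000,00"))  # 1500.0
print(parse_price("€ 250"))                              # 250.0
```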

To parse dates, there is a super cool Python library out there called dateutil [dateutil]. Its parser can do a fuzzy search within your texts to find dates:

parser.parse(text_including_date, fuzzy=True).year
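
In context, the fuzzy parsing looks like this (the example sentence is made up):

```python
from dateutil import parser

text = "Sold at auction on 14 March 1956 in Paris"
# fuzzy=True skips the tokens that are not part of a date
sale_date = parser.parse(text, fuzzy=True)
print(sale_date.year)  # 1956
```

Note that `year` is an attribute of the returned datetime, not a method.
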

4. Normalize categories

The categorization of texts can be quite tricky. Once we have parsed our numbers and dates, we can group by those values, which is a good thing. What is still missing are the text categories. In the case of artworks, this would be for example the artist name. We want to be able to find the total number of sales for an artist, or their total revenue. This only works when all artist names are written the same way. The problem is that names can have a big variety of spellings. Take Salvador Dalí as an example. In the dataset, you can find the following spellings:

  • Dalí, Salvador
  • dali, salvador
  • domènech, salvador dalí
  • salvador dalí
  • salvador dali

So the first idea that pops up is to compare Hamming distances [hamming]. For “salvador dalí” and “salvador dali”, this could really work out, but what about “domènech, salvador dalí”? For such problems, one has to be creative. In the case of artists, there are online databases like Getty [getty] and Artnet [artnet]. These databases contain names as well as alternative spellings and nicknames of artists. If we look our names up there, we can simply normalize them to a default spelling. So we would head back to the scraping step to crawl the necessary artist names.

If you do not have the option to look up names in a database, the problem can get really challenging. The easiest (but not great) approach might be to use edit distances or named entity recognition. If you do not want to code this up by hand, you can use fuzzy search libraries to make the matches.
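
If you want to try this without an extra dependency, the standard library's difflib gives you a quick baseline (the canonical name list here is made up for illustration):

```python
import difflib

# hypothetical list of default spellings, e.g. crawled from Getty/Artnet
CANONICAL = ["salvador dali", "pablo picasso", "claude monet"]

def normalize_name(raw_name, cutoff=0.6):
    """Map a raw spelling onto the closest canonical name, or None."""
    # in practice you would also strip accents first (see step 2)
    matches = difflib.get_close_matches(raw_name.lower(), CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(normalize_name("Salvator Dali"))   # salvador dali (a typo is caught)
print(normalize_name("dali, salvador"))  # None: reordered names defeat plain edit distance
```

The second call illustrates the article's point: similarity measures alone do not handle reordered or extended name forms, which is why a lookup database is the more robust route.
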

5. Normalize numeric values

This is a topic you really have to think about deeply, and there is no universal recipe because it is domain dependent. You can apply a normalization to your numeric values, like z-score normalization or min-max normalization. Each will have a different impact on your later machine learning, and you have to try out what works for you. A good starting point is the tutorial on machinelearningmastery [mlmastery].
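
A minimal sketch of the two mentioned normalizations on plain Python lists (in a real project you would typically use numpy or scikit-learn's scalers instead):

```python
def min_max(values):
    # scales the values linearly into the range [0, 1]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    # centers the values at 0 with unit (population) standard deviation
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

prices = [1000.0, 1547.0, 5000.0]
print(min_max(prices))
print(z_score(prices))
```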

6. Replace/wipe-out missing values

For missing values, you generally have two options: discard the rows in your dataset that contain missing values (which could shrink your data size dramatically), or replace missing values by the mean, median, min, max, or a default value.

This topic is also very domain dependent and has a huge impact on your machine learning algorithm. Generally, I would say you should remove all entries that do not contain the value you want to predict. For the other entries, you can try running the machine learning with different replacement strategies and keep the best one.
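
Both options in one small sketch, using pandas (my choice here, the article does not name a library; the toy values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "price":  [1547.0, None, 820.0, None],   # the value we want to predict
    "height": [65.5,   50.0, None,  40.0],
})

# option 1: drop every row that is missing the predicted value ...
df = df.dropna(subset=["price"])

# option 2: ... and impute the remaining missing features, e.g. with the median
df["height"] = df["height"].fillna(df["height"].median())
print(df)
```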

Machine learning / Price prediction of artworks / Part 1: Scraping


I was pretty busy the last weeks, which is why I did not post anything. As I did a full machine learning project at university, I would like to share my experiences with you in a four-part series. The topic of the series will be price prediction of artworks from the so-called long 19th century. This topic is especially interesting because we are dealing with raw data, not super clean datasets that you push through a machine learning algorithm to easily get an accuracy of over 90%.

First of all, the process has four main parts: Scraping, Cleaning, Feature Analysis and Machine Learning. As there are perfectly good tutorials out there, I will not explain every step in detail, but give you references for a good start and comment on my personal experiences so that you do not run into the same mistakes.


The most basic idea to get data is to scrape websites. So the idea for this project is to scrape auction house websites like Sotheby’s or Christie’s. As this could cause legal issues, you have to be pretty sure about the terms of usage. For this project we are especially interested in information about the price of an artwork, its sale date, the artist, the material, etc.

A really good tool for scraping is called “Scrapy” [Scrapy]. It comes bundled with everything you probably need and is super fast because it parallelizes your web requests. It also deals with HTTP header configuration, direct data upload to a cloud provider, checkpointing (for stopping and resuming), structuring your scraping projects etc. A very good tutorial is at [ScrapyTutorial]. Once you have walked through it, I would recommend having a look at so-called Items [ScrapyItems]. These can separate your crawling from the necessary transformations.

My recommendation really is to scrape the data raw and do all cleaning and transformations later. That way you do not have to do the scraping again when you make a mistake, but can rework the raw (HTML) data. For real data, scraping can take up to a week or even longer.

A nice tool to find the XPath of an element within a website is the XPath Helper Wizard [XPATH]. Simply hold the shift key (while the tool is activated) and hover over the element you want to scrape. Sometimes some handwork is needed, but you get an idea.
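
To get a feeling for what such an XPath selects, here is a toy example using only the standard library (ElementTree supports just a subset of XPath; inside a Scrapy spider you would use `response.xpath(...)` on the real page). The HTML snippet is made up:

```python
import xml.etree.ElementTree as ET

# hypothetical, well-formed snippet standing in for a real lot page
snippet = "<div><p class='title'>Donna al balcone</p><p class='price'>EUR 1547</p></div>"
root = ET.fromstring(snippet)

# .// searches the whole subtree, [@class='price'] filters on the attribute
price = root.find(".//p[@class='price']").text
print(price)  # EUR 1547
```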

A row of the raw dataset could look like this:


{
  "title": "donna al balcone",
  "style": "lithograph in colours",
  "created_year": "1956",
  "size_unit": "cm",
  "height": "65,5",
  "width": "50",
  "artist_name": "massimo campigli",
  "description": "<div style=\"line-height:18px;\">\r\n\r\n<!-- written by TX_HTML32 -->\r\n<title></title>\r\n\r\n\r\n<p>\r\n<font style=\"font-family:'Arial';font-size:10pt;\"><b>Donna al balcone<br>\r\n</b>Lithograph in colours, 1956 . <br>\r\nMeloni/Tavola 161. Signed, dated and numbered 16/175. On Rives (with watermark). 59,5 : 38,7 cm (23,4 : 15,2 in). Sheet: 65,5 x 50 cm (25,7 x 19,6 in). <p>\r\n</p><p>\r\n</p><p>\r\nPrinted by Desjobert, Paris. Published by L'Œuvre gravé, Paris-Zürich<br>\r\nMinor light- and mount-staining. Margins with some scattered fox marks. Verso a strip of tape along the edges, glue to the mount. [HD]</p></font>\r\n </p></div>",
  "sale_id": "295",
  "sale_title": "Old Masters and Modern Art/ Marine Art",
  "lot_id": "350",
  "auction_house_name": "xy",
  "image_urls": ["http://xy.com400503194.jpg"],
  "currency": "EUR",
  "estimate_currency": "EUR",
  "price": "1547",
  "max_estimated_price": "1000",
  "min_estimated_price": "1000",
  "images": [{"url": "", "path": "full/95477903330c088065ba9e48596972471463370b.jpg", "checksum": "db621696c32d3f66377f3fa97128925c"}]
}


Configuration Hints

For the project, I needed some special configurations that might be good to know.

Getting the API (if there is one)

One thing I figured out was that previous projects always used HTML scraping. You should consider monitoring the requests (under Chrome Dev Tools -> Network) to find out whether there is an API you can use. This is way faster than scraping HTML code.
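
The payoff of finding an API is that the response is already structured, so there is nothing to strip or regex out. The payload and field names below are made up for illustration:

```python
import json

# hypothetical response body, as an auction house API might return it
payload = '{"lots": [{"title": "donna al balcone", "price": "1547", "currency": "EUR"}]}'

data = json.loads(payload)
for lot in data["lots"]:
    print(lot["title"], lot["price"], lot["currency"])
```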

User Agent (if your requests get blocked)

Some websites block the default Scrapy user agent. You can work around that by setting the following property in Scrapy’s settings.py:

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'

Google Cloud Upload

To upload your images directly to Google Cloud, you can use the following properties in settings.py:

# This is the configuration for google cloud
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'gs://your-gc-project-url/images'
GCS_PROJECT_ID = 'your gc project id'


SIDENOTE: You have to download your GCS API keys and do the following export before running the scraper:

export GOOGLE_APPLICATION_CREDENTIALS=google-api-keys.json

The next part will cover the cleaning step, which will help to wipe out HTML tags and do transformations on the data.