I was pretty busy the last weeks, so, that is why I did not post something. As I did a full machine learning project in university, I would like to share my experiences in a four-part series with you. The whole topic of the series will be about price prediction of artworks from the so-called long 19th century (https://en.wikipedia.org/wiki/Long_nineteenth_century). This topic is especially interesting because we are dealing with raw data. Not super cleaned data sets that you push through a machine learning algorithm and get easily an accuracy of over 90%.
First of all the process will have four main parts: Scraping, Cleaning, Feature Analysis and Machine Learning. As there are perfect tutorials outside, I will not explain every step in detail, but give you references for a good start and comment my personal experiences so that you do not run into the same mistakes.
Scraping
The most basic idea to get data is to scrape websites (https://en.wikipedia.org/wiki/Web_scraping). So the idea for this project is to scrape auction house websites like Sotheby’s or Christie’s. As this could cause legal issues (https://en.wikipedia.org/wiki/Web_scraping#Legal_issues) you have to be pretty sure about the terms of usage. For this project we are especially interested in information about the price of an artwork, it’s sale date, the artist, the material, etc.
A really good tool for scraping is called “Scrapy” [Scrapy]. It comes bundled with everything that you probably need and is super fast because it is parallelizing your web requests. It also deals with http header configuration, direct data upload to a cloud provider, checkpointing (for stopping and resuming), structuring your scraping projects etc. A very good tutorial is on Scraping is at [ScrapyTutorial]. If you walked through it, I would recommend having a look at so-called Items [ScrapyItems]. These can separate your crawling and necessary transformations.
My recommendation is really, to scrape the data raw and do all cleaning and transformation later. So you do not have to do the scraping again when you made a mistake but can rework on the raw (html) data. For real data, scraping can take up to one week or even longer.
A nice tool to find the XPATH of an element within a website is the XPath Helper Wizard [XPATH]. Simply hold the shift key (while the tool is activated) and hover over the element you want to scrape. Sometimes some handwork is needed, but you get an idea.
A row of the raw dataset could look like this:
[
…
{“title”: “donna al balcone”, “style”: “lithograph in colours”, “created_year”: “1956”, “size_unit”: “cm”, “height”: “65,5”, “width”: “50”, “artist_name”: “massimo campigli”, “description”: “<div style=\”line-height:18px;\”>\r\n\r\n<!– written by TX_HTML32 8.0.140.500 –>\r\n<title></title>\r\n\r\n\r\n<p>\r\n<font style=\”font-family:’Arial’;font-size:10pt;\”><b>Donna al balcone<br>\r\n</b>Lithograph in colours, 1956 . <br>\r\nMeloni/Tavola 161. Signed, dated and numbered 16/175. On Rives (with watermark). 59,5 : 38,7 cm (23,4 : 15,2 in). Sheet: 65,5 x 50 cm (25,7 x 19,6 in). <p>\r\n</p><p>\r\n</p><p>\r\nPrinted by Desjobert, Paris. Published by L’Œuvre gravé, Paris-Zürich<br>\r\nMinor light- and mount-staining. Margins with some scattered fox marks. Verso a strip of tape along the edges, glue to the mount. [HD]</p></font>\r\n </p></div>”, “sale_id”: “295”, “sale_title”: “Old Masters and Modern Art/ Marine Art”, “lot_id”: “350”, “auction_house_name”: “xy”, “image_urls”: [“http://xy.com400503194.jpg”], “currency”: “EUR”, “estimate_currency”: “EUR”, “price”: “1547”, “max_estimated_price”: “1000”, “min_estimated_price”: “1000”, “images”: [{“url”: “http:xy.com//400503194.jpg”, “path”: “full/95477903330c088065ba9e48596972471463370b.jpg”, “checksum”: “db621696c32d3f66377f3fa97128925c”}]},
….
Configuration Hints
For the projects, I needed some special configurations, that might be good to know.
Getting the API (if there is one)
A thing that I figured out, was that in the previous project there was always used HTML scraping. You should consider monitoring the requests (under Chrome Dev Tools -> Network), to find out whether there is an API to use. This is way faster than scraping HTML code.
User Agent (If you get not allowed)
Some websites are blocking the scrapy user agent. You can work around that by using the following property in settings.py
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
Google Cloud Upload
To upload your images directly to google cloud, you can use the following properties in settings.py
# This is the configuration for google cloud ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1} IMAGES_STORE = 'gs://your-gc-project-url/images' GCS_PROJECT_ID = 'your gc project id'
SIDENOTE: You have to download your gcs api keys and do the following export before running the scraping:
export GOOGLE_APPLICATION_CREDENTIALS=google-api-keys.json
So the next part will cover the step cleaning, that will help to wipe-out html tags and do transformations on the data.
References
[Scrapy] https://scrapy.org/
[ScrapyTutorial] https://docs.scrapy.org/en/latest/intro/tutorial.html
[ScrapyItems] https://docs.scrapy.org/en/latest/topics/items.html
[XPATH] https://chrome.google.com/webstore/detail/xpath-helper-wizard/jadhpggafkbmpdpmpgigopmodldgfcki