Showcase Project: From Data Collection to Analysis



I. Project definition

A. Code and data

Check out my Tableau dashboards! All the code related to this project can be reviewed directly on my GitHub. The resulting datasets can also be found there, under project/data/pg_exports, in JSON and CSV formats:

  • cpu_prices_dataset and gpu_prices_dataset: prices for numerous products since 2013 (around 100,000 entries in total)
  • cpu_specs_dataset and gpu_specs_dataset: technical specifications for both product categories (around 600 entries in total)
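
For illustration, the exported files can be loaded straight into Pandas. A minimal sketch, assuming the CSV exports reuse the dataset names above as file names (the exact file names and extensions may differ):

```python
import pandas as pd

# Hypothetical paths: dataset names taken from the list above, with an
# assumed .csv extension under project/data/pg_exports.
gpu_prices = pd.read_csv("project/data/pg_exports/gpu_prices_dataset.csv")
gpu_specs = pd.read_csv("project/data/pg_exports/gpu_specs_dataset.csv")

print(gpu_prices.shape)
print(gpu_prices.head())
```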

B. Goals

Starting this project, my intention was to collect data from one or several sources in order to better understand the GPU market. After considering multiple possibilities, I settled on two main targets:

  • the French ecommerce website LDLC, which I had already used in the past,
  • and the technology publisher TechPowerUp, which maintains a fairly well-kept database about electronics.

C. Organization and technical objectives

From a technical point of view, my objective was to reinforce my skills with the Pandas and Psycopg2 Python libraries, without necessarily seeking optimal code at every step. For instance, I cleaned some data with Pandas, while other tidying was done directly in PostgreSQL. Experimenting was my focus.

Below is a basic flowchart roughly showing my logic: a twofold data collection (from LDLC and TechPowerUp), followed by the construction and feeding of the database:

[Flowchart: data collection from LDLC and TechPowerUp feeding the PostgreSQL database]

II. Script explanations

A. Database connection

Everything related to the management of the electronics PostgreSQL database is handled by the db_conn.py file.
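
For reference, here is a minimal sketch of what such a module can look like, assuming Psycopg2 with hard-coded connection parameters (the real db_conn.py may organize this differently):

```python
import psycopg2

# Assumed connection settings; the actual db_conn.py may load these from
# environment variables or a configuration file instead.
DB_PARAMS = {
    "host": "localhost",
    "dbname": "electronics",
    "user": "postgres",
    "password": "postgres",
}

def get_connection():
    """Open a connection to the electronics PostgreSQL database."""
    return psycopg2.connect(**DB_PARAMS)

def run_query(sql, params=None):
    """Execute a query and return all resulting rows."""
    with get_connection() as conn, conn.cursor() as cur:
        cur.execute(sql, params)
        return cur.fetchall()
```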

B. Live benchmarking website scraping

This part of the project is handled by the tpu_scraper.py file.

1. Challenges

Unlike the live LDLC scraper, which went absolutely smoothly, TechPowerUp presented some hurdles. The website is rather vigilant about scraping and crawling activity, issuing reCAPTCHA challenges if it deems your behavior suspicious or disrespectful.

My decision was simply to slow the bot down and wait a random number of seconds between each search and navigation action. It worked pretty well that way.
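
A minimal sketch of that throttling idea, assuming the asyncio-based scraper mentioned later (the delay bounds are arbitrary examples, not the actual values from tpu_scraper.py):

```python
import asyncio
import random

async def human_pause(min_s: float = 3, max_s: float = 10) -> None:
    """Wait a random number of seconds so the scraping pattern looks less mechanical."""
    await asyncio.sleep(random.uniform(min_s, max_s))

# Intended usage between Playwright actions, e.g.:
#   await page.goto(search_url)
#   await human_pause()
#   await page.fill("input#search", product_name)  # hypothetical selector
#   await human_pause()
```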

2. Improvements

I researched a few ways to speed up the process without getting interrupted by a reCAPTCHA test:

  • mimicking human behavior with random clicking (risky, since it could hit a link and disrupt the scraper), scrolling, and typing,
  • rotating between user agents,
  • and rotating between proxies.

The first and second methods can easily be implemented in the project. I have a work-in-progress file tackling the third method, proxy_retriever.py: I did manage to get a list of proxies, but I have not found a way to use them with Playwright yet.
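
For what it is worth, Playwright does accept a proxy setting at browser launch and a custom user agent per context, so the second and third methods could be sketched roughly as follows (the user agents and proxy addresses are placeholders, not values from proxy_retriever.py):

```python
import random
from playwright.sync_api import sync_playwright

# Placeholder pools; a real run would use a larger list of realistic user
# agents and the proxies gathered by proxy_retriever.py.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]
PROXIES = ["http://203.0.113.10:8080", "http://203.0.113.11:3128"]  # hypothetical

with sync_playwright() as p:
    # Pick a random proxy at launch time and a random user agent per context.
    browser = p.chromium.launch(headless=True, proxy={"server": random.choice(PROXIES)})
    context = browser.new_context(user_agent=random.choice(USER_AGENTS))
    page = context.new_page()
    page.goto("https://www.techpowerup.com/")
    browser.close()
```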

C. Live ecommerce website scraping

This is handled by the ldlc_live_scraper.py file, which uses the Python Playwright library to retrieve current-day prices for CPUs and GPUs on the LDLC website. Pretty straightforward.
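
As an illustration of the idea only (the selectors below are guesses, and the real script also handles pagination, both product categories and database insertion):

```python
from playwright.sync_api import sync_playwright

GPU_LIST_URL = "https://www.ldlc.com/informatique/pieces-informatique/carte-graphique-interne/c4684/"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(GPU_LIST_URL)
    # The selectors below are illustrative assumptions about LDLC's current DOM.
    for item in page.query_selector_all("li.pdt-item"):
        name = item.query_selector(".title-3")
        price = item.query_selector(".price")
        if name and price:
            print(name.inner_text().strip(), "|", price.inner_text().strip())
    browser.close()
```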

D. Archived ecommerce website scraping

1. Wayback Machine URL recovery

This part is managed by the ldlc_archive_scraper.py file. Writing this script took up most of this project's time.

When one's goal is to build a time series dataset, it is mandatory to either gather data on a regular basis or find past data. The latter can be achieved with the Internet Archive's Wayback Machine. For this project, I used the Wayback Machine's CDX API to retrieve all snapshots of LDLC's CPU and GPU product price lists from the last 10 years.

Applying this API to my target URLs results in the following requests:

  • Target URL: www.ldlc.com/informatique/pieces-informatique/carte-graphique-interne/c4684/
    Wayback Machine API request: http://web.archive.org/cdx/search/cdx?url=www.ldlc.com/informatique/pieces-informatique/carte-graphique-interne/c4684/&output=json
  • Target URL: www.ldlc.com/informatique/pieces-informatique/processeur/c4300/
    Wayback Machine API request: http://web.archive.org/cdx/search/cdx?url=www.ldlc.com/informatique/pieces-informatique/processeur/c4300/&output=json

The standard output format is JSON and contains various pieces of information, the most important being the snapshot timestamps and original URLs, from which the snapshot URLs can be rebuilt. Using Python, it is easy to iterate through those snapshots and use BeautifulSoup to retrieve their data.
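
A rough sketch of that loop, assuming the requests library alongside BeautifulSoup (the real script also filters timestamps and throttles its requests):

```python
import requests
from bs4 import BeautifulSoup

CDX_URL = "http://web.archive.org/cdx/search/cdx"
TARGET = "www.ldlc.com/informatique/pieces-informatique/carte-graphique-interne/c4684/"

# The CDX API returns a JSON array whose first row lists the field names
# (urlkey, timestamp, original, ...).
rows = requests.get(CDX_URL, params={"url": TARGET, "output": "json"}).json()
header, snapshots = rows[0], rows[1:]

for snap in snapshots:
    record = dict(zip(header, snap))
    # Rebuild each snapshot URL from its timestamp and original URL.
    snapshot_url = f"http://web.archive.org/web/{record['timestamp']}/{record['original']}"
    soup = BeautifulSoup(requests.get(snapshot_url).text, "html.parser")
    # ...extract product names and prices from `soup` here...
```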

2. Building the adaptive scraper

Since the project's aim is to gather data from the last decade, the LDLC website was likely to have undergone changes over the years. In order to identify those changes, the scraper had to be designed iteratively: a first version was built to target specific selectors from the 2013 LDLC website, and whenever the scraper logged an error, it meant the Document Object Model (DOM) of the website had changed, which called for inspecting the DOM again to find the new selectors.

Going through this process a few times was necessary to refine the scraper and make it fully operational.
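
One way to structure the resulting adaptive logic is to try the selector sets of each known layout in turn. A sketch under that assumption (the selectors themselves are placeholders, not the ones actually found in the LDLC snapshots):

```python
from bs4 import BeautifulSoup

# One selector set per known LDLC layout, from oldest to newest (placeholders).
SELECTOR_SETS = [
    {"item": "table#productlist tr", "name": ".designation", "price": ".prix"},
    {"item": "li.pdt-item", "name": ".title-3", "price": ".price"},
]

def parse_snapshot(html: str) -> list[tuple[str, str]]:
    """Return (name, price) pairs using the first selector set that matches."""
    soup = BeautifulSoup(html, "html.parser")
    for sel in SELECTOR_SETS:
        items = soup.select(sel["item"])
        if items:  # this layout matches the snapshot
            return [
                (n.get_text(strip=True), p.get_text(strip=True))
                for it in items
                if (n := it.select_one(sel["name"])) and (p := it.select_one(sel["price"]))
            ]
    raise ValueError("Unknown page layout: new selectors needed")
```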

3. Optimizing the script

To speed up script execution, I used the multiprocessing module. An even better option would be to combine multiprocessing with asyncio, which I already used for the live scraping of TechPowerUp.
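
A minimal sketch of the multiprocessing part, assuming one worker function per snapshot (simplified; the real ldlc_archive_scraper.py does the actual parsing and storage):

```python
from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup

def scrape_snapshot(snapshot_url: str) -> tuple[str, int]:
    """Download one Wayback Machine snapshot and return a simple summary."""
    soup = BeautifulSoup(requests.get(snapshot_url).text, "html.parser")
    # The real worker would apply the adaptive selector logic here.
    return snapshot_url, len(soup.find_all("a"))

if __name__ == "__main__":
    snapshot_urls = []  # in practice, the URLs rebuilt from the CDX API responses
    with Pool(processes=4) as pool:  # the worker count is an arbitrary example
        results = pool.map(scrape_snapshot, snapshot_urls)
```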

III. Further improvements

This project suffers from two main shortcomings: the incompleteness of the CPU dataset and the wrong choice of approach regarding the ecommerce website.

A. CPU dataset incompleteness

Both the LDLC and TechPowerUp websites turned out to be surprisingly lacking when it comes to detailed CPU data. By the end of my data collection, the CPU dataset was therefore still missing technical information about the most recent Intel CPUs (number of cores, number of threads, TDP figures), which prevented me from:

  • efficiently analyze the power and efficiency progress made over the last years,
  • and comparing, in that regard, the Intel and AMD product lines.

The only way to circumvent this problem would be to look for other reliable sources and possibly build a new scraper.

B. Wrong ecommerce approach

What I wanted to do with this project was to build a time series showing the evolution of GPU and CPU prices over time. In that sense, going for LDLC and its archived pages stored in the Wayback Machine was a sensible goal. However, choosing this specific website made me miss my target because:

  • its prices remained extremely steady over the years, even during the COVID-19 crisis,
  • and the website was subject to stock shortages.

As a result, using LDLC as a data source did not allow me to observe the effects of supply and demand as I had intended. This was frustrating, because the market changed considerably with the cryptocurrency hype and COVID.

Consequently, the better approach would have been to monitor a platform like eBay, since such a marketplace is the very embodiment of supply and demand. I would then have built a scraper running only once a week, for instance, fetching the prices for a list of references that could be either preset or dynamic. Of course, building a dataset that way would require a lot of time!