Designing a web-scraping strategy depends on your constraints (e.g.,deadlines, uncertainties, awareness), objectives (e.g., features, speed, reproducability), and experience! In this post, I talk about my experience of trying two different strategies to collect the product description data. Trying both strategies revealed valuable insights into the strengths and weaknesses of each. I share these insights below.
First strategy: chase low hanging fruit
The minimum viable features I need for my app are simple: the product name, price, and SKU number. Everything else is a bonus! But, the bonuses looks really nice though… Have a look at the sample product page:
The product page contains many features such as the description, ABV, sugar content, bottle volume, and sale price. My greed for obtaining “all” features in case of some unforeseen reasons led me to fixate on the product page and ignore the search for any other viable (and better) solutions. It was a great learning experience that helps me better identify what is the best web scraping strategy and why.
Brute-forcing strategy:
I noticed the product page contains a link “Check Availability in All Stores” that sends you to a well organized inventory page for this bottle! (We will get back to this inventory data later.) The inventory page has a “View Details” button that sends you back to the product page.
Unlike the product page URL, the product inventory URL is structured simply as https://www.lcbo.com/en/storeinventory/?sku=32853. It looks like I can query inventory pages by selecting SKU numbers. If I have the SKU number then I can use the URL behind the “View Details” button to get the product descriptions URL. From there, I can use the techniques in Part 1 of the blog series to obtain JSON data containing product descriptions.
This technique works. The solution is a minor adaptation to the script shown in Part 1 of the blog series. I found approximately 13,000 SKU numbers have products associated with them. However, this brute-force approach took several days to perform. The reason is LCBO’s SKU number is up to 6 digits long which has 1 million possible combinations. So, only 1.3% of the requests were successful.
Pros & Cons
This strategy feels quite embarassing to share. But, its valuable to make a poorly performing but “successful” solution. It made me appreciate how thoughtful and clever web scraping can be. A few small mistakes or accounting errors can lead to restarting the entire scraping process. Even worse is to patch together two poorly executed scraping results and hope things are okay. The frustrations compel me to consider more than one solution before deciding to commit and build.
Here are my observations of the pros and cons for brute-forcing.
Pros to brute-forcing
- Easy to implement
- Easy to debug
- Can do other work while waiting
- Feels like you are working
Cons to brute-forcing
- Extremely slow and inefficient
- Vulnerable to computer and network instabilities
- Requires babysitting
- Requires two steps (product inventory -> product page)
- Very likely to have IP banned
- Suspicious and straining to servers
- Generates enormous inertia to run again
- Gives a false impression that you are working
Second strategy: “less is more”
One of the best surprises in web scraping is finding a data product with features you were not expecting. Sometimes this can be a LOT of features! It feels like opening a loot box full of data goodies! I suspect this is common practice since re-engineering data products takes careful work and risks breaking everything. So, it is less risky to just send more than minimally required for future developer.
I did not know this! As a result, I ignored LCBO’s query page because it appeared to only provided the product name, price, and ratings (see sample image below). That was a costly mistake!
It was not too costly considering that my first strategy was so slow that I could explore other strategies. Under tighter constraints or stress, I may have not done this, which is good to be aware of.
Query strategy:
Part 1: Finding the data
Using the techniques in Part 1 of the blog series, I found a relatively large JSON file is transferred when querying the LCBO product list. See the image below.
The primary variables we care about are the first and last variables, “totalCount” and “results”.
“totalCount” is the number of products resulting from my query. Since the query had no filters, there were 13,186 results products to be found.
“results” is the list of wine products on display. It contains all of the product descriptions we are interested in and more. There is a LOT of data here that we can unpack. Here are some snapshots for one product:
Part 2: Requesting the data
This is where things become a little tricky. In Part 1, we found the cURL command to request store information data. The command had a single integer as an argument corresponding to the store ID. Here, we will have several parameters due to the numerous filters available. The parameters also contain hidden limits. For example, I cannot simply request a single JSON with 13,186 products (I tried, of course). As a result, we need to procede carefully with designing a scraping strategy.
After importing the cURL command in Postman, this is what I see:
The code snippet on the right is this:
curl –location ‘https://platform.cloud.coveo.com/rest/search/v2?organizationId=lcboproductionx2kwygnc’
–header ‘User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/114.0’
–header ‘Accept: /’
–header ‘Accept-Language: en-CA,en-US;q=0.7,en;q=0.3’
–header ‘Accept-Encoding: gzip, deflate, br’
–header ‘Authorization: Bearer xx883b5583-07fb-416b-874b-77cce565d927’
–header ‘Content-Type: application/x-www-form-urlencoded; charset=UTF-8’
–header ‘Origin: https://www.lcbo.com’
–header ‘DNT: 1’
–header ‘Connection: keep-alive’
–header ‘Referer: https://www.lcbo.com/’
–header ‘Sec-Fetch-Dest: empty’
–header ‘Sec-Fetch-Mode: cors’
–header ‘Sec-Fetch-Site: cross-site’
–data ‘actionsHistory=%5B%5D&referrer=&analytics=%7B%22clientId%22%3A%22%22%2C%22documentLocation%22%3A%22https%3A%2F%2Fwww.lcbo.com%2Fen%2Fproducts%23t%3Dclp-products%26sort%3Drelevancy%26layout%3Dcard%22%2C%22documentReferrer%22%3A%22%22%2C%22pageId%22%3A%22%22%7D&isGuestUser=false&aq=%40ec_category%3D%3DProducts&searchHub=Web_Listing_EN&tab=clp-products&locale=en&firstResult=0&numberOfResults=48&excerptLength=200&enableDidYouMean=true&sortCriteria=relevancy&queryFunctions=%5B%5D&rankingFunctions=%5B%5D&groupBy=%5B%7B%22field%22%3A%22%40lcbo_bintag_wine_sweetness%22%2C%22injectionDepth%22%3A10000%2C%22computedFields%22%3A%5B%7B%22field%22%3A%22%40lcbo_bintag_wine_sweetness%22%2C%22operation%22%3A%22minimum%22%7D%2C%7B%22field%22%3A%22%40lcbo_bintag_wine_sweetness%22%2C%22operation%22%3A%22maximum%22%7D%5D%2C%22maximumNumberOfValues%22%3A1%7D%2C%7B%22field%22%3A%22%40lcbo_bintag_wine_body%22%2C%22injectionDepth%22%3A10000%2C%22computedFields%22%3A%5B%7B%22field%22%3A%22%40lcbo_bintag_wine_body%22%2C%22operation%22%3A%22minimum%22%7D%2C%7B%22field%22%3A%22%40lcbo_bintag_wine_body%22%2C%22operation%22%3A%22maximum%22%7D%5D%2C%22maximumNumberOfValues%22%3A1%7D%2C%7B%22field%22%3A%22%40lcbo_bintag_wine_flavor_intensity%22%2C%22injectionDepth%22%3A10000%2C%22computedFields%22%3A%5B%7B%22field%22%3A%22%40lcbo_bintag_wine_flavor_intensity%22%2C%22operation%22%3A%22minimum%22%7D%2C%7B%22field%22%3A%22%40lcbo_bintag_wine_flavor_intensity%22%2C%22operation%22%3A%22maximum%22%7D%5D%2C%22maximumNumberOfValues%22%3A1%7D%2C%7B%22field%22%3A%22%40lcbo_bintag_wine_acidity%22%2C%22injectionDepth%22%3A10000%2C%22computedFields%22%3A%5B%7B%22field%22%3A%22%40lcbo_bintag_wine_acidity%22%2C%22operation%22%3A%22minimum%22%7D%2C%7B%22field%22%3A%22%40lcbo_bintag_wine_acidity%22%2C%22operation%22%3A%22maximum%22%7D%5D%2C%22maximumNumberOfValues%22%3A1%7D%2C%7B%22field%22%3A%22%40lcbo_bintag_wine_tannins%22%2C%22injectionDepth%22%3A10000%2C%22computedFields%22%3A%5B%7B%22field%22%3A%22%40lcbo_bintag_wine_tannins%22%2C%22operation%22%3A%22minimum%22%7D%2C%7B%22field%22%3A%22%40lcbo_bintag_wine_tannins%22%2C%22operation%22%3A%22maximum%22%7D%5D%2C%22maximumNumberOfValues%22%3A1%7D%2C%7B%22field%22%3A%22%40ec_price%22%2C%22injectionDepth%22%3A10000%2C%22computedFields%22%3A%5B%7B%22field%22%3A%22%40ec_price%22%2C%22operation%22%3A%22minimum%22%7D%2C%7B%22field%22%3A%22%40ec_price%22%2C%22operation%22%3A%22maximum%22%7D%5D%2C%22maximumNumberOfValues%22%3A1%7D%2C%7B%22field%22%3A%22%40lcbo_total_volume%22%2C%22injectionDepth%22%3A10000%2C%22computedFields%22%3A%5B%7B%22field%22%3A%22%40lcbo_total_volume%22%2C%22operation%22%3A%22minimum%22%7D%2C%7B%22field%22%3A%22%40lcbo_total_volume%22%2C%22operation%22%3A%22maximum%22%7D%5D%2C%22maximumNumberOfValues%22%3A1%7D%2C%7B%22field%22%3A%22%40lcbo_alcohol_percent%22%2C%22injectionDepth%22%3A10000%2C%22computedFields%22%3A%5B%7B%22field%22%3A%22%40lcbo_alcohol_percent%22%2C%22operation%22%3A%22minimum%22%7D%2C%7B%22field%22%3A%22%40lcbo_alcohol_percent%22%2C%22operation%22%3A%22maximum%22%7D%5D%2C%22maximumNumberOfValues%22%3A1%7D%2C%7B%22field%22%3A%22%40lcbo_sugar_gm_per_ltr%22%2C%22injectionDepth%22%3A10000%2C%22computedFields%22%3A%5B%7B%22field%22%3A%22%40lcbo_sugar_gm_per_ltr%22%2C%22operation%22%3A%22minimum%22%7D%2C%7B%22field%22%3A%22%40lcbo_sugar_gm_per_ltr%22%2C%22operation%22%3A%22maximum%22%7D%5D%2C%22maximumNumberOfValues%22%3A1%7D%5D&facets=%5B%7B%22facetId%22%3A%22%40ec_category%22%2C%22field%22%3A%22ec_category%22%2C%22type%22%3A%22hierarchical%22%2C%22injectionDepth%22%3A1000%2C%22delimitingCharacter%22%3A%22%7C%22%2C%22filterFacetCount%22%3Atrue%2C%22basePath%22%3A%5B%22Products%22%5D%2C%22filterByBasePath%22%3Afalse%2C%22currentValues%22%3A%5B%5D%2C%22preventAutoSelect%22%3Afalse%2C%22numberOfValues%22%3A5%2C%22isFieldExpanded%22%3Afalse%7D%2C%7B%22facetId%22%3A%22%40lcbo_varietal_name%22%2C%22field%22%3A%22lcbo_varietal_name%22%2C%22type%22%3A%22specific%22%2C%22sortCriteria%22%3A%22occurrences%22%2C%22injectionDepth%22%3A1000%2C%22filterFacetCount%22%3Atrue%2C%22currentValues%22%3A%5B%5D%2C%22numberOfValues%22%3A5%2C%22freezeCurrentValues%22%3Afalse%2C%22preventAutoSelect%22%3Afalse%2C%22isFieldExpanded%22%3Afalse%7D%2C%7B%22facetId%22%3A%22%40lcbo_vqa_code%22%2C%22field%22%3A%22lcbo_vqa_code%22%2C%22type%22%3A%22specific%22%2C%22injectionDepth%22%3A1000%2C%22filterFacetCount%22%3Atrue%2C%22currentValues%22%3A%5B%5D%2C%22numberOfValues%22%3A8%2C%22freezeCurrentValues%22%3Afalse%2C%22preventAutoSelect%22%3Afalse%2C%22isFieldExpanded%22%3Afalse%7D%2C%7B%22facetId%22%3A%22%40country_of_manufacture%22%2C%22field%22%3A%22country_of_manufacture%22%2C%22type%22%3A%22specific%22%2C%22sortCriteria%22%3A%22occurrences%22%2C%22injectionDepth%22%3A1000%2C%22filterFacetCount%22%3Atrue%2C%22currentValues%22%3A%5B%5D%2C%22numberOfValues%22%3A5%2C%22freezeCurrentValues%22%3Afalse%2C%22preventAutoSelect%22%3Afalse%2C%22isFieldExpanded%22%3Afalse%7D%2C%7B%22facetId%22%3A%22%40lcbo_region_name%22%2C%22field%22%3A%22lcbo_region_name%22%2C%22type%22%3A%22specific%22%2C%22sortCriteria%22%3A%22occurrences%22%2C%22injectionDepth%22%3A1000%2C%22filterFacetCount%22%3Atrue%2C%22currentValues%22%3A%5B%5D%2C%22numberOfValues%22%3A5%2C%22freezeCurrentValues%22%3Afalse%2C%22preventAutoSelect%22%3Afalse%2C%22isFieldExpanded%22%3Afalse%7D%2C%7B%22facetId%22%3A%22%40lcbo_program%22%2C%22field%22%3A%22lcbo_program%22%2C%22type%22%3A%22specific%22%2C%22sortCriteria%22%3A%22occurrences%22%2C%22injectionDepth%22%3A1000%2C%22filterFacetCount%22%3Atrue%2C%22currentValues%22%3A%5B%5D%2C%22numberOfValues%22%3A5%2C%22freezeCurrentValues%22%3Afalse%2C%22preventAutoSelect%22%3Afalse%2C%22isFieldExpanded%22%3Afalse%7D%2C%7B%22facetId%22%3A%22%40lcbo_current_offer%22%2C%22field%22%3A%22lcbo_current_offer%22%2C%22type%22%3A%22specific%22%2C%22sortCriteria%22%3A%22occurrences%22%2C%22injectionDepth%22%3A1000%2C%22filterFacetCount%22%3Atrue%2C%22currentValues%22%3A%5B%5D%2C%22numberOfValues%22%3A8%2C%22freezeCurrentValues%22%3Afalse%2C%22preventAutoSelect%22%3Afalse%2C%22isFieldExpanded%22%3Afalse%7D%2C%7B%22facetId%22%3A%22%40stores_stock%22%2C%22field%22%3A%22stores_stock%22%2C%22type%22%3A%22specific%22%2C%22injectionDepth%22%3A1000%2C%22filterFacetCount%22%3Atrue%2C%22currentValues%22%3A%5B%5D%2C%22numberOfValues%22%3A8%2C%22freezeCurrentValues%22%3Afalse%2C%22preventAutoSelect%22%3Afalse%2C%22isFieldExpanded%22%3Afalse%7D%2C%7B%22facetId%22%3A%22%40ec_rating%22%2C%22field%22%3A%22ec_rating%22%2C%22type%22%3A%22numericalRange%22%2C%22sortCriteria%22%3A%22descending%22%2C%22injectionDepth%22%3A1000%2C%22filterFacetCount%22%3Atrue%2C%22currentValues%22%3A%5B%7B%22start%22%3A1%2C%22end%22%3A1.9%2C%22endInclusive%22%3Atrue%2C%22state%22%3A%22idle%22%7D%2C%7B%22start%22%3A2%2C%22end%22%3A2.9%2C%22endInclusive%22%3Atrue%2C%22state%22%3A%22idle%22%7D%2C%7B%22start%22%3A3%2C%22end%22%3A3.9%2C%22endInclusive%22%3Atrue%2C%22state%22%3A%22idle%22%7D%2C%7B%22start%22%3A4%2C%22end%22%3A4.9%2C%22endInclusive%22%3Atrue%2C%22state%22%3A%22idle%22%7D%2C%7B%22start%22%3A5%2C%22end%22%3A5%2C%22endInclusive%22%3Atrue%2C%22state%22%3A%22idle%22%7D%5D%2C%22numberOfValues%22%3A5%2C%22freezeCurrentValues%22%3Afalse%2C%22generateAutomaticRanges%22%3Afalse%7D%5D&facetOptions=%7B%7D&categoryFacets=%5B%5D&retrieveFirstSentences=true&timezone=America%2FNew_York&enableQuerySyntax=false&enableDuplicateFiltering=false&enableCollaborativeRating=false&debug=false&allowQueriesWithoutKeywords=true&dictionaryFieldContext=%7B%22stores_stock%22%3A%22%22%2C%22stores_inventory%22%3A%22217%22%2C%22stores_stock_combined%22%3A%22217%22%2C%22stores_low_stock_combined%22%3A%22217%22%7D’
YIKES! The big block of text in the “–data” argument is URL encoded. We can decode this using online converters such as https://meyerweb.com/eric/tools/dencoder/ and prettify with http://urlprettyprint.com/. Here’s what I find:
actionsHistory = []
referrer =
analytics = {"clientId":"","documentLocation":"https://www.lcbo.com/en/products#t=clp-products&sort=relevancy&layout=card","documentReferrer":"","pageId":""}
isGuestUser = false
aq = @ec_category==Products
searchHub = Web_Listing_EN
tab = clp-products
locale = en
firstResult = 0
numberOfResults = 48
excerptLength = 200
enableDidYouMean = true
sortCriteria = relevancy
queryFunctions = []
rankingFunctions = []
groupBy = [{"field":"@lcbo_bintag_wine_sweetness","injectionDepth":10000,"computedFields":[{"field":"@lcbo_bintag_wine_sweetness","operation":"minimum"},{"field":"@lcbo_bintag_wine_sweetness","operation":"maximum"}],"maximumNumberOfValues":1},{"field":"@lcbo_bintag_wine_body","injectionDepth":10000,"computedFields":[{"field":"@lcbo_bintag_wine_body","operation":"minimum"},{"field":"@lcbo_bintag_wine_body","operation":"maximum"}],"maximumNumberOfValues":1},{"field":"@lcbo_bintag_wine_flavor_intensity","injectionDepth":10000,"computedFields":[{"field":"@lcbo_bintag_wine_flavor_intensity","operation":"minimum"},{"field":"@lcbo_bintag_wine_flavor_intensity","operation":"maximum"}],"maximumNumberOfValues":1},{"field":"@lcbo_bintag_wine_acidity","injectionDepth":10000,"computedFields":[{"field":"@lcbo_bintag_wine_acidity","operation":"minimum"},{"field":"@lcbo_bintag_wine_acidity","operation":"maximum"}],"maximumNumberOfValues":1},{"field":"@lcbo_bintag_wine_tannins","injectionDepth":10000,"computedFields":[{"field":"@lcbo_bintag_wine_tannins","operation":"minimum"},{"field":"@lcbo_bintag_wine_tannins","operation":"maximum"}],"maximumNumberOfValues":1},{"field":"@ec_price","injectionDepth":10000,"computedFields":[{"field":"@ec_price","operation":"minimum"},{"field":"@ec_price","operation":"maximum"}],"maximumNumberOfValues":1},{"field":"@lcbo_total_volume","injectionDepth":10000,"computedFields":[{"field":"@lcbo_total_volume","operation":"minimum"},{"field":"@lcbo_total_volume","operation":"maximum"}],"maximumNumberOfValues":1},{"field":"@lcbo_alcohol_percent","injectionDepth":10000,"computedFields":[{"field":"@lcbo_alcohol_percent","operation":"minimum"},{"field":"@lcbo_alcohol_percent","operation":"maximum"}],"maximumNumberOfValues":1},{"field":"@lcbo_sugar_gm_per_ltr","injectionDepth":10000,"computedFields":[{"field":"@lcbo_sugar_gm_per_ltr","operation":"minimum"},{"field":"@lcbo_sugar_gm_per_ltr","operation":"maximum"}],"maximumNumberOfValues":1}]
facets = [{"facetId":"@ec_category","field":"ec_category","type":"hierarchical","injectionDepth":1000,"delimitingCharacter":"|","filterFacetCount":true,"basePath":["Products"],"filterByBasePath":false,"currentValues":[],"preventAutoSelect":false,"numberOfValues":5,"isFieldExpanded":false},{"facetId":"@lcbo_varietal_name","field":"lcbo_varietal_name","type":"specific","sortCriteria":"occurrences","injectionDepth":1000,"filterFacetCount":true,"currentValues":[],"numberOfValues":5,"freezeCurrentValues":false,"preventAutoSelect":false,"isFieldExpanded":false},{"facetId":"@lcbo_vqa_code","field":"lcbo_vqa_code","type":"specific","injectionDepth":1000,"filterFacetCount":true,"currentValues":[],"numberOfValues":8,"freezeCurrentValues":false,"preventAutoSelect":false,"isFieldExpanded":false},{"facetId":"@country_of_manufacture","field":"country_of_manufacture","type":"specific","sortCriteria":"occurrences","injectionDepth":1000,"filterFacetCount":true,"currentValues":[],"numberOfValues":5,"freezeCurrentValues":false,"preventAutoSelect":false,"isFieldExpanded":false},{"facetId":"@lcbo_region_name","field":"lcbo_region_name","type":"specific","sortCriteria":"occurrences","injectionDepth":1000,"filterFacetCount":true,"currentValues":[],"numberOfValues":5,"freezeCurrentValues":false,"preventAutoSelect":false,"isFieldExpanded":false},{"facetId":"@lcbo_program","field":"lcbo_program","type":"specific","sortCriteria":"occurrences","injectionDepth":1000,"filterFacetCount":true,"currentValues":[],"numberOfValues":5,"freezeCurrentValues":false,"preventAutoSelect":false,"isFieldExpanded":false},{"facetId":"@lcbo_current_offer","field":"lcbo_current_offer","type":"specific","sortCriteria":"occurrences","injectionDepth":1000,"filterFacetCount":true,"currentValues":[],"numberOfValues":8,"freezeCurrentValues":false,"preventAutoSelect":false,"isFieldExpanded":false},{"facetId":"@stores_stock","field":"stores_stock","type":"specific","injectionDepth":1000,"filterFacetCount":true,"currentValues":[],"numberOfValues":8,"freezeCurrentValues":false,"preventAutoSelect":false,"isFieldExpanded":false},{"facetId":"@ec_rating","field":"ec_rating","type":"numericalRange","sortCriteria":"descending","injectionDepth":1000,"filterFacetCount":true,"currentValues":[{"start":1,"end":1.9,"endInclusive":true,"state":"idle"},{"start":2,"end":2.9,"endInclusive":true,"state":"idle"},{"start":3,"end":3.9,"endInclusive":true,"state":"idle"},{"start":4,"end":4.9,"endInclusive":true,"state":"idle"},{"start":5,"end":5,"endInclusive":true,"state":"idle"}],"numberOfValues":5,"freezeCurrentValues":false,"generateAutomaticRanges":false}]
facetOptions = {}
categoryFacets = []
retrieveFirstSentences = true
timezone = America/New_York
enableQuerySyntax = false
enableDuplicateFiltering = false
enableCollaborativeRating = false
debug = false
allowQueriesWithoutKeywords = true
dictionaryFieldContext = {"stores_stock":"","stores_inventory":"217","stores_stock_combined":"217","stores_low_stock_combined":"217"}
We can start throwing away some arguments that we do not need such as “groupBy” and “analytics”. With some experimentation1, I was able to widdle it down to this:
locale = en
firstResult = '"$1"'
numberOfResults = '"$2"'
excerptLength = 200
sortCriteria = @ec_price '"$3"'
facets = [{"facetId":"@ec_category","field":"ec_category","type":"hierarchical","injectionDepth":1000,"delimitingCharacter":"|","filterFacetCount":true,"basePath":["Products"],"filterByBasePath":false,"currentValues":[{"value":"wine","state":"idle","children":[{"value":"'"$4"'","state":"selected","children":[],"retrieveChildren":false,"retrieveCount":0}],"retrieveChildren":false,"retrieveCount":0}],"preventAutoSelect":true,"numberOfValues":1,"isFieldExpanded":false}]
facetOptions = {"freezeFacetOrder":true}
categoryFacets = []
retrieveFirstSentences = true
timezone = America/New_York
enableQuerySyntax = false
enableDuplicateFiltering = false
enableCollaborativeRating = false
debug = false
allowQueriesWithoutKeywords = true
A few additional modifications have been made to the above.
- Added a sort criterion “ec_price” for price.
- Modified “facets” to return only wine-related products. See the end of this section for non-wine products.
- Added 4 variable placeholders in particular locations with the following meanings:
- $1: page number (starts at 0)
- $2: number of requests per page number (max 10002)
- $3: sort order; use “ascending” or “descending” text without quotation marks.
- $4: wine-category; must be URL encoded. E.g., red$20wine
When we re-encode the above, we can rebuild our cURL command:
curl --location 'https://platform.cloud.coveo.com/rest/search/v2?organizationId=lcboproductionx2kwygnc' \
'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/114.0' \
--header 'Accept: */*' \
--header 'Accept-Language: en-CA,en-US;q=0.7,en;q=0.3' \
--header 'Accept-Encoding: gzip, deflate, br' \
--header 'Authorization: Bearer xx883b5583-07fb-416b-874b-77cce565d927' \
--header 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
--header 'Origin: https://www.lcbo.com' \
--header 'DNT: 1' \
--header 'Connection: keep-alive' \
--header 'Referer: https://www.lcbo.com/' \
--header 'Sec-Fetch-Dest: empty' \
--header 'Sec-Fetch-Mode: cors' \
--header 'Sec-Fetch-Site: cross-site' \
--header 'locale=en&firstResult='"$1"'&numberOfResults='"$2"'&excerptLength=2000&sortCriteria=%40ec_price%20'"$3"'&facets=%5B%7B%22facetId%22%3A%22%40ec_category%22%2C%22field%22%3A%22ec_category%22%2C%22type%22%3A%22hierarchical%22%2C%22injectionDepth%22%3A1000%2C%22delimitingCharacter%22%3A%22%7C%22%2C%22filterFacetCount%22%3Atrue%2C%22basePath%22%3A%5B%22Products%22%5D%2C%22filterByBasePath%22%3Afalse%2C%22currentValues%22%3A%5B%7B%22value%22%3A%22wine%22%2C%22state%22%3A%22idle%22%2C%22children%22%3A%5B%7B%22value%22%3A%22'"$4"'%22%2C%22state%22%3A%22selected%22%2C%22children%22%3A%5B%5D%2C%22retrieveChildren%22%3Afalse%2C%22retrieveCount%22%3A0%7D%5D%2C%22retrieveChildren%22%3Afalse%2C%22retrieveCount%22%3A0%7D%5D%2C%22preventAutoSelect%22%3Atrue%2C%22numberOfValues%22%3A1%2C%22isFieldExpanded%22%3Afalse%7D%5D&facetOptions=%7B%22freezeFacetOrder%22%3Atrue%7D&categoryFacets=%5B%5D&retrieveFirstSentences=true&timezone=America%2FNew_York&enableQuerySyntax=false&enableDuplicateFiltering=false&enableCollaborativeRating=false&debug=false&allowQueriesWithoutKeywords=true' \
--data ''"$5"'' --output
I added the “–output” command and a 5th variable placeholder which represents the output filename. Save the script and run it in a terminal with the following commands:
chmod +x wine_request.sh
./wine_request.sh 0 1000 ascending red%20wine redwine.json
The arguments trailing the script “./wine_request.sh” are passed to the numbered arguments $1 through $5, respectively. Congratulations! We have downloaded and saved the JSON file as ‘redwine.json’. Better yet, we can download 1,000 products per query!
There are non-wine beverages as well (e.g., beer, spirits, cider, icewine). Due to the query structure, we cannot use the same script above for non-wine products. We can create a second script that is nearly identical with the same arguments.
curl --location 'https://platform.cloud.coveo.com/rest/search/v2?organizationId=lcboproductionx2kwygnc' \
'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/114.0' \
--header 'Accept: */*' \
--header 'Accept-Language: en-CA,en-US;q=0.7,en;q=0.3' \
--header 'Accept-Encoding: gzip, deflate, br' \
--header 'Authorization: Bearer xx883b5583-07fb-416b-874b-77cce565d927' \
--header 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
--header 'Origin: https://www.lcbo.com' \
--header 'DNT: 1' \
--header 'Connection: keep-alive' \
--header 'Referer: https://www.lcbo.com/' \
--header 'Sec-Fetch-Dest: empty' \
--header 'Sec-Fetch-Mode: cors' \
--header 'Sec-Fetch-Site: cross-site' \
--header 'locale=en&firstResult='"$1"'&numberOfResults='"$2"'&excerptLength=2000&sortCriteria=%40ec_price%20'"$3"'&facets=%5B%7B%22facetId%22%3A%22%40ec_category%22%2C%22field%22%3A%22ec_category%22%2C%22type%22%3A%22hierarchical%22%2C%22injectionDepth%22%3A1000%2C%22delimitingCharacter%22%3A%22%7C%22%2C%22filterFacetCount%22%3Atrue%2C%22basePath%22%3A%5B%22Products%22%5D%2C%22filterByBasePath%22%3Afalse%2C%22currentValues%22%3A%5B%7B%22value%22%3A%22'"$4"'%22%2C%22state%22%3A%22selected%22%2C%22children%22%3A%5B%5D%2C%22retrieveChildren%22%3Afalse%2C%22retrieveCount%22%3A0%7D%5D%2C%22preventAutoSelect%22%3Atrue%2C%22numberOfValues%22%3A1%2C%22isFieldExpanded%22%3Afalse%7D%5D&facetOptions=%7B%22freezeFacetOrder%22%3Atrue%7D&categoryFacets=%5B%5D&retrieveFirstSentences=true&timezone=America%2FNew_York&enableQuerySyntax=false&enableDuplicateFiltering=false&enableCollaborativeRating=false&debug=false&allowQueriesWithoutKeywords=true' \
--data ''"$5"'' --output
Part 3: Collecting all the data
Okay… so how do I get all the products?
We need to create another script that executes multiple requests. Below, I will describe pieces of a script that we will combine at the end.
A. Run a loop over each category of wine
I have a text file containing wine categories in each row. The first column is the URL encoded string to pass into the cURL command. The second column is the prefix for the filenames.
red%20wine redwine
white%20wine whitewine
sparkling%20wine sparklingwine
rose%20wine rosewine
fortified%20wine fortifiedwine
champagne champagne
sake%20%26%20rice%20wine sakericewine
specialty%20wine specialtywine
gifts%20and%20samplers giftsandsamplers
icewine icewine
The following while loop iterates over each row in the text file above. The first and second columns are represented as variables “htag” and “ftag”. These variables are into a function called “scrape_request_loop()”.
while IFS= read -r htag ftag; do
scrape_request_loop $htag $ftag
done < ./code/wine_names.txt
B. Loop over page number requests
The function “scrape_request_loop()” has multiple steps and can be seen in its entirety at the end of this subsection. I first describe the function in chunks before combining the chunks together.
We request the first 1,000 products for the specified category with the command below. You can assume the function “request_ascending()” executes the cURL command and saves the JSON file.
# $1: $htag; wine-category URL substring; e.g., red$20wine
# $2: $ftag; wine-category filename prefix; e.g., redwine
# E.g., request_ascending pageNumber numProducts htag ftag
request_ascending 0 1000 $1 $2
The JSON file contains a variable “totalCount” which is the number of products found in the category specified. This is important since it determines whether there are more than 1,000 wines in this category. The following commands read the newly made JSON file (using “jq”) and determines how many more query requests are needed. It then makes these query requests with the for loop. The for loop is skipped if “totalCount” is less than 1000.
totalCount=$(jq '.totalCount' $fname) # Read newly made JSON file with jq. Get totalCount.
countPer1000=$((totalCount/1000)) # Number of 1000-batches needed.
maxPages=$countPer1000 # Maximum number of pages (redundant for now...)
for pageNumber in $(seq 1000 1000 $(($maxPages*1000)))
do
request_ascending $pageNumber 1000 $1 $2
done
Unfortunately, the solution above does not work if “totalCount” exceeds 5,000. If you request “pageNumber” 5 and beyond (the index starts at 0), the server will return results for “pageNumber = 4”. There appears to be a hidden limit set by the server. While this is mildly annoying, the simple fix is to reverse the sort order from ascending to descending and query again. Of course, this solution cannot work immediately since there are 13,186 products and only 10,000 will be delivered. This is why I filtered by product category to reduce the maximum results. This solution will also fail for red wines since there are 9,400 product descriptions available. That will not happen any time soon and is a future Stephen problem3.
We need to change the “maxPages” definition to take the minimum between (“countPer1000”, 4).
maxPages=$(($countPer1000<4?$countPer1000:4)) # Minimum value b/w (countPer1000, 4)
After the for loop executing the function “request_ascending()” is complete, we need to append the following for loop to obtain the remaining solutions:
if [ $countPer1000 -ge 5 ]; then
for pageNumber in $(seq 0 1000 $(($countPer1000*1000-4000)))
do
request_descending $pageNumber 1000 $1 $2
done
fi
C. The entire script
The entire script is shown below. After setting permissions (i.e., “chmod +x”), the script starts at the bottom where the while loops begin.
The functions “request_ascending()” and “request_descending()” contain the command to execute the cURL scripts.
#!/bin/bash
chmod +x code/wine_request.sh
chmod +x code/nonwine_request.sh
request_ascending () {
# $1: start index
# $2: number of requests
# $3: wine-category URL substring; e.g., red$20wine
# $4: wine-category filename prefix; e.g., redwine
fname="json/products/$4.a.$1.json"
sudo $SCRAPE_SH $1 $2 ascending $3 $fname
sleep 1
}
request_descending () {
fname="json/products/$4.d.$1.json"
sudo $SCRAPE_SH $1 $2 descending $3 $fname
sleep 1
}
scrape_request_loop (){
# $1: $htag; wine-category URL substring; e.g., red$20wine
# $2: $ftag; wine-category filename prefix; e.g., redwine
request_ascending 0 1000 $1 $2
totalCount=$(jq '.totalCount' $fname)
countPer1000=$((totalCount/1000))
maxPages=$(($countPer1000<4?$countPer1000:4))
for i in $(seq 1000 1000 $(($maxPages*1000)))
do
request_ascending $i 1000 $1 $2
done
if [ $countPer1000 -ge 5 ]; then
for i in $(seq 0 1000 $(($countPer1000*1000-4000)))
do
request_descending $i 1000 $1 $2
done
fi
}
SCRAPE_SH="./code/wine_request.sh"
while IFS= read -r htag ftag; do
scrape_request_loop $htag $ftag
done < ./code/wine_names.txt
SCRAPE_SH="./code/nonwine_request.sh"
while IFS= read -r htag ftag; do
scrape_request_loop $htag $ftag
done < ./code/nonwine_names.txt
Pros & Cons
Pros
- Extremely fast (1 min!) and no babysitting
- Very cheap
- Automatable for Github Actions (a big deal)
- Gentle on servers
- Can execute more frequently for up to date price changes
- Requests existing products only (unlike the brute-force method)
Cons
- Read and manage URL-encoded variables carefully
- Time to understand cURL parameters
- Close attention to hidden limits
Data Summary
Once I have all of the product description data, I can generate a list of SKU numbers that I can query to obtain the live inventory (see Part 2 of this blog series)! There are approximately registered products: 9,400 wine and 4,100 non-wine. “Non-wine” products can include beer, liquor, and reusuable bags. Only 2,800 wines and 2,100 non-wines are purchasable as of 5/14/2023.
The files described above are located in the repository: https://github.com/royourboat/lcbo-wine-scraper. Specifically,
code/all_product_request.sh
code/wine_request.sh
code/nonwine_request.sh
code/wine_names.txt
code/nonwine_names.txt
Footnotes
Citation
@online{ro2023,
author = {Ro, Stephen},
title = {Scraping {LCBO} {Data} {(Part} 3: {Product} {Descriptions)}},
date = {2023-06-17},
url = {https://royourboat.github.io/posts/2023-06-17-lcbo-scraper/},
langid = {en}
}