Rants and Data
Developers who harvest business data from the web almost always need to parse that data once it has been scraped.
In this post I want to share my benchmarking of the two basic techniques used in web scraping to parse scraped data: Regex and XPath. Regex applies a pattern to any text (incl. html) to fetch the matching pieces of content, while XPath (similar to a CSS path) traverses the html document's DOM to select and fetch the matching nodes.

We will parse sample data with PHP on the server side, look at the complexity of each technique and compare their time cost. Let's take a simple mobile.de VW search result page. Its raw html is available at view-source:http://suchen.mobile.de/auto/volkswagen.html, though it is hardly readable.

REGEX TECHNIQUE

Suppose we want to get the title, link, and price for each item. Let's look at the page's html and, through the developer tools (F12 or Ctrl+Shift+I), find the piece of html that pertains to a single list item. For the sake of simplicity I've distilled the real html into the following snippet:

    <div class="vehicleDetails vehicleDetailsMain" data-id="201841567" data-href="http://suchen.mobile.de/auto-inserat/vw-golf-vi-style-1-6-tdi-ahk-shz-gra-klima-al-eiselfing/201841567.html?lang=en&pageNumber=1">
      <div class="topOfPageTitle">
        <a class="infoLink detailsViewLink" href="http://suchen.mobile.de/auto-inserat/vw-golf-vi-style-1-6-tdi-ahk-shz-gra-klima-al-eiselfing/201841567.html?lang=en&pageNumber=1&negativeFeatures=EXPORT" rel="nofollow" onclick="mga('send', 'event', 'car', '/en/public/ses/top of page')">Volkswagen Golf VI " STYLE" 1.6 TDI AHK, SHZ, GRA, Klima AL</a>
      </div>
      <div class="topAdDesc">
        <span class="commercial">Top Ad</span>
      </div>
      <div class="imageWrapper extraMargin">
        <div class="image">
          <span></span>
          <a href="http://suchen.mobile.de/auto-inserat/vw-golf-vi-style-1-6-tdi-ahk-shz-gra-klima-al-eiselfing/201841567.html?lang=en&pageNumber=1&action=topOfPage&negativeFeatures=EXPORT" onclick="mga('send', 'event', 'car', '/en/public/ses/top of page')">
            <img alt="Vehicle picture" width="150" src="http://i.ebayimg.com/00/s/NDgwWDY0MA==/z/wh4AAOSwuMFUZ3~J/$_18.JPG" />
          </a>
        </div>
        <div class="stackBoxOne"></div>
        <div class="stackBoxTwo"></div>
        <div class="video"><span>Video</span></div>
      </div>
      <div class="description">
        <div class="list">
          <span title="Saloon / Used vehicle German edition New HU Manual gearbox 77 kW (105 PS), Diesel">Saloon / Used vehicle German edition New HU...</span>
        </div>
        <div class="fuelConsumption">
          <div>Fuel consumption combined:<br/>ca 4.7 l/100 km **</div>
          <div>CO<sub>2</sub>-Emissions combined:<br/>ca 123 g/km **</div>
        </div>
        <div class="dealerdata">
          <div class="address"> Autohaus Gartner GmbH & Co.KG,83549 Eiselfing </div>
          <div class="phoneNumber"> Phone +49 (0)8071 92030 </div>
        </div>
        <div class="rightSideColumns">
          <div class="pricePrimaryCountryOfSale priceGross">11,400 EUR</div>
          <div class="priceSecondaryCountryOfSale priceNet"></div>
          <div class="pricePrimaryCountryOfOrigin"></div>

Here is the PHP code that applies a regex for each of the item attributes:

    <?php
    $html = <<<EOD
    <!DOCTYPE html>
    …
    EOD;

    // Links: captured from the data-href attribute of each item container
    echo '<h2>All links </h2>';
    $patternHref = '/data-id="\d+" data-href="([^"]+?)"/';
    preg_match_all($patternHref, $html, $links);
    $i=1;
    foreach($links[1] as $link)
        echo '<br>', $i++, '. ', $link;

    // Titles: the text between the opening and closing anchor tags
    $patternTitle = '~<a [^>]*?>(.*?)<\/a>~';
    preg_match_all($patternTitle, $html, $titles);
    $i=1;
    echo '<h2>All titles </h2>';
    foreach($titles[1] as $title)
        echo '<br>', $i++, '. ', $title;

    // Prices: the text inside the price div
    $patternPrice = '~<div >([^>]*?)<\/div>~';
    preg_match_all($patternPrice, $html, $prices);
    $i=1;
    echo '<h2>All prices </h2>';
    foreach($prices[1] as $price)
        echo '<br>', $i++, '. ', $price;

The main problem with Regex parsing is inconsistency. Of the 3 parsed sets below, the last one (Prices) includes the prices of 3 extra items that come from an ad inserted into the result page. While the values of the first 2 sets might be matched to each other by position, the values of the third set can hardly be related to particular items.

Titles
1. Volkswagen Golf VI " STYLE" 1.6 TDI AHK, SHZ, GRA, Klima AL
2. Volkswagen Phaeton 3.0 V6 TDI DPF 4MOTION Automatik Massage
3. Volkswagen Golf 6 ez 09
4. Volkswagen VW Lupo 1.4i Princeton Klima Servo TUV 11/...
5. Volkswagen Caddy Trendline SOCCER
6. Volkswagen Vw Lupo 1.0 mpi mit TUV
7. Volkswagen Golf 1.6 TDI DPF Team
8. Volkswagen Touareg 2.5 R5 TDI
9. Volkswagen Polo 1.2 Trendline Navi
10. Volkswagen Polo 1.2 Trendline Navi
11. Volkswagen T5 FACELIFT EDITION 25 MEGA OPTIK 18ZOLL 1A
12. Volkswagen Golf 1.8 Automatik Rolling Stones Collection
13. Volkswagen Vw passat 1.6 74 kw tuv 02.15
14. Volkswagen Volkswagen Polo
15. Volkswagen Gepflegter VW Passat 3B Kombi / Variant 1....
16. Volkswagen vw passat 1,8 alu sehr sauber
17. Volkswagen Polo 1.2 Sitzheizung
18. Volkswagen Volkswagen Golf 2.0 GTI DSG
19. Volkswagen Polo 1.2 9N *SITZHEIZUNG*EURO4*
20. Volkswagen Golf
21. Volkswagen Touran 1.9 TDI DPF Trendline

Links
1. http://suchen.mobile.de/auto-inserat/vw-golf-vi-style-1-6-tdi-ahk-shz-gra-klima-al-eiselfing/201841567.html
2. http://suchen.mobile.de/auto-inserat/vw-phaeton-3-0-v6-tdi-dpf-4motion-automatik-massage-dortmund/204056406.html
3. http://suchen.mobile.de/auto-inserat/vw-golf-6-ez-09-altenburg/204056397.html
4. http://suchen.mobile.de/auto-inserat/vw-lupo-vw-lupo-1-4i-princeton-klima-servo-t%C3%BCv-11-sprockh%C3%B6vel/204056373.html
5. http://suchen.mobile.de/auto-inserat/vw-caddy-trendline-soccer-wolfhagen/204056347.html
6. http://suchen.mobile.de/auto-inserat/vw-lupo-vw-lupo-1-0-mpi-mit-t%C3%BCv-ludwigsburg/204056335.html
7. http://suchen.mobile.de/auto-inserat/vw-golf-1-6-tdi-dpf-team-sch%C3%B6ffengrund/204056327.html
8. http://suchen.mobile.de/auto-inserat/vw-touareg-2-5-r5-tdi-wardenburg-bei-old/201546204.html
9. http://suchen.mobile.de/auto-inserat/vw-polo-1-2-trendline-navi-nordhorn/204056286.html
10. http://suchen.mobile.de/auto-inserat/vw-polo-1-2-trendline-navi-nordhorn/204056287.html
11. http://suchen.mobile.de/auto-inserat/vw-t5-multivan-t5-facelift-edition-25-mega-optik-18zoll-1a-bermatingen-n%C3%A4he-b/203784734.html
12. http://suchen.mobile.de/auto-inserat/vw-golf-1-8-automatik-rolling-stones-collection-berlin/204056239.html
13. http://suchen.mobile.de/auto-inserat/vw-passat-vw-passat-1-6-74-kw-t%C3%BCv-02-15-coburg/204056232.html
14. http://suchen.mobile.de/auto-inserat/vw-polo-volkswagen-polo-m%C3%BChlheim-am-main/204056220.html
15. http://suchen.mobile.de/auto-inserat/vw-passat-gepflegter-vw-passat-3b-kombi-variant-1-stuhr/204056215.html
16. http://suchen.mobile.de/auto-inserat/vw-passat-vw-passat-1-8-alu-sehr-sauber-vohwinkel/204056199.html
17. http://suchen.mobile.de/auto-inserat/vw-polo-1-2-sitzheizung-lahnstein/203891580.html
18. http://suchen.mobile.de/auto-inserat/vw-golf-volkswagen-golf-2-0-gti-dsg-mosbach/204056185.html
19. http://suchen.mobile.de/auto-inserat/vw-polo-1-2-9n-sitzheizung-euro4-butzbach/204056145.html
20. http://suchen.mobile.de/auto-inserat/vw-golf-quadrat-ichendorf/204056142.html
21. http://suchen.mobile.de/auto-inserat/vw-touran-1-9-tdi-dpf-trendline-dortmund/204056138.html

Prices
1. 13,959 EUR
2. 11,400 EUR
3. 8,000 EUR
4. 2,200 EUR
5. 11,750 EUR
6. 11,780 EUR (Gross)*
7. 11,890 EUR
8. 19,990 EUR (Gross)*
9. 750 EUR
10. 11,550 EUR
11. 5,999 EUR
12. 4,800 EUR
13. 4,800 EUR
14. 14,490 EUR
15. 999 EUR
16. 700 EUR
17. 1,099 EUR
18. 1,800 EUR
19. 700 EUR
20. 9,900 EUR (Gross)*
21. 16,499 EUR
22. 2,990 EUR
23. 450 EUR
24. 8,280 EUR

I benchmarked the code over 5000 iterations, and the whole run took over 9 seconds. The time cost of a single item parse is fair: about 0.002 seconds. As for complexity, even as an experienced regex composer it took me about 1.5-2 hours to analyze the scraped html, write the patterns and test them.

XPath TECHNIQUE

Now we come to the XPath technique. It is applied to XML/XHTML docs, so we first need to brush up the raw html.
Preparing html as a strict XML

For parsing raw data, XPath is a far more advanced technique than Regex. Strictly speaking, XPath applies to XML/XHTML docs, so raw html needs two preparatory steps: (1) remove the characters and tags that break the parser, and (2) load the cleaned text into a DOM document.
See the following code that performs the two above mentioned steps:

    <?php
    $html = <<<EOD
    <!DOCTYPE html>
    …
    EOD;

    // here we remove unwanted html chars and tags
    $html = str_replace('&nbsp;', ' ', $html);
    $html = str_replace('<br/>', ' ', $html);
    $html = str_replace('<noindex>', '', $html);
    $html = str_replace('</noindex>', '', $html);
    $html = str_replace('noindex', '', $html);

    // we suppress libxml internal errors
    libxml_use_internal_errors(true);

    // Getting the text into the DOM Document for further parse
    $DOM = new DOMDocument('1.0', 'UTF-8');
    $DOM->loadHTML('<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />' . $html);

    // Initiating DOM XPath Document for parse
    $xpath = new DOMXPath($DOM);
    $xpath->registerNamespace("php", "http://php.net/xpath");
    $xpath->registerPHPFunctions();

Traversing thru XPath

Let's go to the developer tools to find the XPaths for the item attributes.

The title XPath notation is:

    //div[@]/a[@class="infoLink detailsViewLink"]/text()

The XPath notation for the href attribute of the 'a' tag is:

    //div[@]/a[@class="infoLink detailsViewLink"]/@href

The XPath notation for the price nodes is:

    //*[@id="parkAndCompareVehicle"]/div[1]/div[2]/div/div[2]/div[5]/div[1]/div/div[2]/text()

The thing is that an XPath selects all the nodes that match a given notation. So with one shot we get all the nodes and are able to store them. The best way to fetch them in a structured way is to iterate over them.
See the following function, which does that:

    <?php
    function get_cars_from_xpath_object($xpath)
    {
        $prices = $xpath->query('//*[@id="parkAndCompareVehicle"]/div[1]/div[2]/div/div[2]/div[5]/div[1]/div/div[2]/text()');
        $titles = $xpath->query('//div[@]/a[@class="infoLink detailsViewLink"]/text()');
        $hrefs = $xpath->query('//div[@]/a[@class="infoLink detailsViewLink"]/@href');

        $cars = array();
        // Stop at the shortest node list so item($i) never returns null
        $count = min($prices->length, $titles->length, $hrefs->length);
        for ($i = 0; $i < $count; $i++) {
            $cars[] = array(
                'title' => $titles->item($i)->nodeValue,
                'href'  => $hrefs->item($i)->nodeValue,
                'price' => $prices->item($i)->nodeValue
            );
        }
        return $cars;
    }

If we want to benchmark XPath parsing, we need to iterate over the whole process: (1) removing the broken pieces from the raw html, (2) turning it into a DOM structure and (3) traversing it by XPath notation to fetch the nodes. The 500 iterations, which include removing unwanted html content and converting to a DOM structure, took 10.5 seconds, about 0.02 seconds for each single item info parse. The result is rather unexpected, but we have to include html preparation and DOM building in each iteration, because in real web scraping a scraper goes over paginated data and has to repeat these procedures for every page.

Forming and testing the XPath notations is no more complex than forming and testing the Regex patterns. So the XPath technique is the most common way of getting scraped data into business directories, because it almost always returns results that are related to one another, and thus reliable, since they come from a structured document.

Conclusion

The Regex and XPath benchmarking has shown the `XPath technique` to be superior to the `Regex technique` for parsing scraped data: although it was about ten times slower per parse (0.02 vs 0.002 seconds), it is more precise and follows the document structure. In practice it is XPath rather than Regex that is widely used for parsing scraped data. The one caveat is that, for XPath to work, the html content must be mostly well structured so that the XML library is able to read and understand it.
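The point about related results can be made concrete by querying each attribute relative to its own listing node, instead of running page-wide queries and pairing the results by position. The following is a minimal sketch and not the code benchmarked in this post: the sample markup, the example.com URLs, the adBox class, the function name get_cars_by_item, and the assumption that each listing's title link and price sit inside its vehicleDetails div are all mine.

```php
<?php
// Hypothetical sample: two listings shaped like the distilled snippet,
// plus a stray price that belongs to no listing (e.g. an inserted ad).
$html = <<<HTML
<div class="vehicleDetails vehicleDetailsMain" data-id="1">
  <a class="infoLink detailsViewLink" href="http://example.com/1.html">Golf VI</a>
  <div class="pricePrimaryCountryOfSale priceGross">11,400 EUR</div>
</div>
<div class="adBox"><div class="pricePrimaryCountryOfSale priceGross">13,959 EUR</div></div>
<div class="vehicleDetails vehicleDetailsMain" data-id="2">
  <a class="infoLink detailsViewLink" href="http://example.com/2.html">Polo 1.2</a>
  <div class="pricePrimaryCountryOfSale priceGross">5,999 EUR</div>
</div>
HTML;

function get_cars_by_item(string $html): array
{
    // Collect libxml parse errors internally instead of emitting warnings
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />' . $html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    $cars = [];
    // One iteration per listing container (assumed class: vehicleDetails)
    foreach ($xpath->query('//div[contains(@class, "vehicleDetails")]') as $item) {
        // The leading "." makes each query relative to the current listing
        $link  = $xpath->query('.//a[@class="infoLink detailsViewLink"]', $item)->item(0);
        $price = $xpath->query('.//div[contains(@class, "priceGross")]', $item)->item(0);
        if ($link === null) {
            continue; // skip containers without a title link
        }
        $cars[] = [
            'title' => trim($link->textContent),
            'href'  => $link->getAttribute('href'),
            'price' => $price !== null ? trim($price->textContent) : null,
        ];
    }
    return $cars;
}

print_r(get_cars_by_item($html));
```

Because the price is looked up inside each listing, the price in the hypothetical ad box is never paired with a car, which is the mismatch the three separate Regex sets could not avoid.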
Average time for single item parse (seconds)
Regex    0.002
XPath    0.02
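These averages are simply total elapsed time divided by the number of iterations (over 9 seconds / 5000 for Regex, 10.5 seconds / 500 for XPath). A rough sketch of such a timing harness follows. The sample markup, the helper name average_seconds and the iteration count are assumptions for illustration, and its timings on a tiny sample will not match the figures in this post.

```php
<?php
// Hypothetical stand-in for one scraped page; a real page is far larger.
$html = str_repeat(
    '<div class="vehicleDetails"><a class="infoLink detailsViewLink" href="http://example.com/x.html">Golf</a></div>',
    20
);

// Runs $task $iterations times and returns the average seconds per run.
function average_seconds(callable $task, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $task();
    }
    return (microtime(true) - $start) / $iterations;
}

$regexTask = function () use ($html) {
    preg_match_all('~<a [^>]*?>(.*?)</a>~', $html, $titles);
    return $titles[1];
};

$xpathTask = function () use ($html) {
    // DOM building is part of the timed work, as in the XPath benchmark
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);
    $titles = [];
    foreach ($xpath->query('//a[@class="infoLink detailsViewLink"]/text()') as $node) {
        $titles[] = $node->nodeValue;
    }
    return $titles;
};

printf("Regex: %.6f s per page\n", average_seconds($regexTask, 500));
printf("XPath: %.6f s per page\n", average_seconds($xpathTask, 500));
```

Keeping the DOM construction inside the timed closure mirrors the choice made above: a scraper has to rebuild the DOM for every page, so leaving it out would flatter XPath.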
Author: Data Geek, Growth Hacker.