XPath vs Regex for parsing scraped html content

3/17/2015

Developers who collect business data from the web almost always need to parse that data once it has been harvested.

In this post I want to share my benchmark of the two basic techniques used in web scraping to parse the scraped data: Regex* and XPath.

*Regex is a pattern applied to any text (including html) to fetch the matching pieces of content, while XPath (similar to a CSS path) traverses the DOM of an html document to select and fetch the matching nodes.
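To make the difference concrete, here is a minimal sketch that extracts the same titles both ways. The two-item html fragment and its class name are made up for illustration and are not taken from the page analyzed below.

```php
<?php
// Hypothetical two-item fragment, made up for illustration.
$html = '<ul><li class="car"><a href="/a.html">Golf</a></li>'
      . '<li class="car"><a href="/b.html">Polo</a></li></ul>';

// Regex: a pattern applied to the raw text.
preg_match_all('~<a href="([^"]+)">(.*?)</a>~', $html, $m);
$regexTitles = $m[2];

// XPath: a path evaluated against the parsed DOM tree.
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$xpathTitles = array();
foreach ($xpath->query('//li[@class="car"]/a/text()') as $node) {
    $xpathTitles[] = $node->nodeValue;
}

print_r($regexTitles);
print_r($xpathTitles);
```

Both arrays hold the same two titles here; the techniques start to differ once the markup gets less regular.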

We will parse sample data with PHP on the server side, see how complex each technique is, and compare their time cost.

Let’s take a simple mobile.de VW search result page. Its raw html is here - view-source::http://suchen.mobile.de/auto/volkswagen.html, though hardly readable.

REGEX TECHNIQUE

Suppose we want to get title, link, and price for each item.

Let’s look at the page’s html and through developer tools (F12 or Ctrl+Shift+I) find an html content piece pertaining to a single list item. For the sake of simplicity I’ve distilled the real html into the following snippet:

<div class="vehicleDetails vehicleDetailsMain" data-id="201841567" data-href="http://suchen.mobile.de/auto-inserat/vw-golf-vi-style-1-6-tdi-ahk-shz-gra-klima-al-eiselfing/201841567.html?lang=en&pageNumber=1">

<div class="topOfPageTitle">      <a class="infoLink detailsViewLink" href="http://suchen.mobile.de/auto-inserat/vw-golf-vi-style-1-6-tdi-ahk-shz-gra-klima-al-eiselfing/201841567.html?lang=en&pageNumber=1&negativeFeatures=EXPORT" rel="nofollow" onclick="mga('send', 'event', 'car', '/en/public/ses/top of page')">Volkswagen Golf VI " STYLE" 1.6 TDI AHK, SHZ, GRA, Klima AL</a>      </div>      <div class="topAdDesc">      <span class="commercial">Top Ad</span>      </div>      <div class="imageWrapper extraMargin">      <div class="image">      <span></span>      <a href="http://suchen.mobile.de/auto-inserat/vw-golf-vi-style-1-6-tdi-ahk-shz-gra-klima-al-eiselfing/201841567.html?lang=en&pageNumber=1&action=topOfPage&negativeFeatures=EXPORT" onclick="mga('send', 'event', 'car', '/en/public/ses/top of page')">      <img alt="Vehicle picture" width="150" src="http://i.ebayimg.com/00/s/NDgwWDY0MA==/z/wh4AAOSwuMFUZ3~J/$_18.JPG" />      </a>      </div>      <div class="stackBoxOne"></div>      <div class="stackBoxTwo"></div>      <div class="video"><span>Video</span></div>      </div>      <div class="description">      <div class="list">      <span title="Saloon / Used vehicle German edition New HU Manual gearbox 77 kW (105 PS), Diesel">Saloon / Used vehicle German edition New HU...</span>      </div>      <div class="fuelConsumption">      <div>Fuel consumption combined:<br/>ca 4.7 l/100 km **</div>      <div>CO<sub>2</sub>-Emissions combined:<br/>ca 123 g/km **</div>      </div>      <div class="dealerdata">      <div class="address">      Autohaus Gartner GmbH & Co.KG,83549 Eiselfing      </div>      <div class="phoneNumber">      Phone +49 (0)8071 92030      </div>      </div> <div class="rightSideColumns">   <div class="pricePrimaryCountryOfSale priceGross">11,400 EUR</div>                              <div class="priceSecondaryCountryOfSale priceNet"></div> <div class="pricePrimaryCountryOfOrigin"></div>

Here goes the PHP code to apply regex for each of the item attributes:

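<?php
// A nowdoc (quoted EOD) keeps "$" sequences in the html, such as $_18.JPG,
// from being interpolated as PHP variables.
$html = <<<'EOD'
<!DOCTYPE html>
…
EOD;

// Links: the data-href attribute that follows data-id on each item container.
echo '<h2>All links </h2>';
$patternHref = '/data-id="\d+" data-href="([^"]+?)"/';
preg_match_all($patternHref, $html, $links);
$i = 1;
foreach ($links[1] as $link) {
    echo '<br>', $i++, '. ', $link;
}

// Titles: the inner text of the anchors.
$patternTitle = '~<a [^>]*?>(.*?)<\/a>~';
preg_match_all($patternTitle, $html, $titles);
$i = 1;
echo '<h2>All titles </h2>';
foreach ($titles[1] as $title) {
    echo '<br>', $i++, '. ', $title;
}

// Prices: the text of the price div. The class attribute was lost from this
// pattern; the value used here is taken from the snippet above and is an assumption.
$patternPrice = '~<div class="pricePrimaryCountryOfSale priceGross">([^>]*?)<\/div>~';
preg_match_all($patternPrice, $html, $prices);
$i = 1;
echo '<h2>All prices </h2>';
foreach ($prices[1] as $price) {
    echo '<br>', $i++, '. ', $price;
}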

The main problem with Regex parsing is inconsistency. In the three parsed sets below, the last one (Prices) holds 24 values against 21 titles and 21 links: it includes the prices of 3 more items that come from an ad inserted into the result page. While the values of the first two sets can probably be matched to each other by position, the values of the third set can hardly be related to particular items.

Titles

1. Volkswagen Golf VI " STYLE" 1.6 TDI AHK, SHZ, GRA, Klima AL

2. Volkswagen Phaeton 3.0 V6 TDI DPF 4MOTION Automatik Massage

3. Volkswagen Golf 6 ez 09

4. Volkswagen VW Lupo 1.4i Princeton Klima Servo TUV 11/...

5. Volkswagen Caddy Trendline SOCCER

6. Volkswagen Vw Lupo 1.0 mpi mit TUV

7. Volkswagen Golf 1.6 TDI DPF Team

8. Volkswagen Touareg 2.5 R5 TDI

9. Volkswagen Polo 1.2 Trendline Navi

10. Volkswagen Polo 1.2 Trendline Navi

11. Volkswagen T5 FACELIFT EDITION 25 MEGA OPTIK 18ZOLL 1A

12. Volkswagen Golf 1.8 Automatik Rolling Stones Collection

13. Volkswagen Vw passat 1.6 74 kw tuv 02.15

14. Volkswagen Volkswagen Polo

15. Volkswagen Gepflegter VW Passat 3B Kombi / Variant 1....

16. Volkswagen vw passat 1,8 alu sehr sauber

17. Volkswagen Polo 1.2 Sitzheizung

18. Volkswagen Volkswagen Golf 2.0 GTI DSG

19. Volkswagen Polo 1.2 9N *SITZHEIZUNG*EURO4*

20. Volkswagen Golf

21. Volkswagen Touran 1.9 TDI DPF Trendline

Links

1. http://suchen.mobile.de/auto-inserat/vw-golf-vi-style-1-6-tdi-ahk-shz-gra-klima-al-eiselfing/201841567.html

2. http://suchen.mobile.de/auto-inserat/vw-phaeton-3-0-v6-tdi-dpf-4motion-automatik-massage-dortmund/204056406.html

3. http://suchen.mobile.de/auto-inserat/vw-golf-6-ez-09-altenburg/204056397.html

4. http://suchen.mobile.de/auto-inserat/vw-lupo-vw-lupo-1-4i-princeton-klima-servo-t%C3%BCv-11-sprockh%C3%B6vel/204056373.html

5. http://suchen.mobile.de/auto-inserat/vw-caddy-trendline-soccer-wolfhagen/204056347.html

6. http://suchen.mobile.de/auto-inserat/vw-lupo-vw-lupo-1-0-mpi-mit-t%C3%BCv-ludwigsburg/204056335.html

7. http://suchen.mobile.de/auto-inserat/vw-golf-1-6-tdi-dpf-team-sch%C3%B6ffengrund/204056327.html

8. http://suchen.mobile.de/auto-inserat/vw-touareg-2-5-r5-tdi-wardenburg-bei-old/201546204.html

9. http://suchen.mobile.de/auto-inserat/vw-polo-1-2-trendline-navi-nordhorn/204056286.html

10. http://suchen.mobile.de/auto-inserat/vw-polo-1-2-trendline-navi-nordhorn/204056287.html

11. http://suchen.mobile.de/auto-inserat/vw-t5-multivan-t5-facelift-edition-25-mega-optik-18zoll-1a-bermatingen-n%C3%A4he-b/203784734.html

12. http://suchen.mobile.de/auto-inserat/vw-golf-1-8-automatik-rolling-stones-collection-berlin/204056239.html

13. http://suchen.mobile.de/auto-inserat/vw-passat-vw-passat-1-6-74-kw-t%C3%BCv-02-15-coburg/204056232.html

14. http://suchen.mobile.de/auto-inserat/vw-polo-volkswagen-polo-m%C3%BChlheim-am-main/204056220.html

15. http://suchen.mobile.de/auto-inserat/vw-passat-gepflegter-vw-passat-3b-kombi-variant-1-stuhr/204056215.html

16. http://suchen.mobile.de/auto-inserat/vw-passat-vw-passat-1-8-alu-sehr-sauber-vohwinkel/204056199.html

17. http://suchen.mobile.de/auto-inserat/vw-polo-1-2-sitzheizung-lahnstein/203891580.html

18. http://suchen.mobile.de/auto-inserat/vw-golf-volkswagen-golf-2-0-gti-dsg-mosbach/204056185.html

19. http://suchen.mobile.de/auto-inserat/vw-polo-1-2-9n-sitzheizung-euro4-butzbach/204056145.html

20. http://suchen.mobile.de/auto-inserat/vw-golf-quadrat-ichendorf/204056142.html

21. http://suchen.mobile.de/auto-inserat/vw-touran-1-9-tdi-dpf-trendline-dortmund/204056138.html

Prices

1. 13,959 EUR

2. 11,400 EUR

3. 8,000 EUR

4. 2,200 EUR

5. 11,750 EUR

6. 11,780 EUR (Gross)*

7. 11,890 EUR

8. 19,990 EUR (Gross)*

9. 750 EUR

10. 11,550 EUR

11. 5,999 EUR

12. 4,800 EUR

13. 4,800 EUR

14. 14,490 EUR

15. 999 EUR

16. 700 EUR

17. 1,099 EUR

18. 1,800 EUR

19. 700 EUR

20. 9,900 EUR (Gross)*

21. 16,499 EUR

22. 2,990 EUR

23. 450 EUR

24. 8,280 EUR
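The mismatch between the lists above can be reproduced on a tiny example. The sketch below uses a made-up fragment (the class names `item`, `ad`, `t` and `price` are hypothetical) in which an ad block also carries a price, so the independent regex passes return lists of different length.

```php
<?php
// Hypothetical fragment: two listings plus an ad block that also carries a price.
$html = '<div class="item"><a class="t" href="/1.html">Golf</a><div class="price">8,000 EUR</div></div>'
      . '<div class="ad"><div class="price">450 EUR</div></div>'
      . '<div class="item"><a class="t" href="/2.html">Polo</a><div class="price">2,200 EUR</div></div>';

preg_match_all('~<a class="t"[^>]*>(.*?)</a>~', $html, $titles);
preg_match_all('~<div class="price">([^<]*)</div>~', $html, $prices);

// Each pass runs on its own, so index $i no longer refers to the same
// listing in both arrays once the ad price has slipped in.
printf("%d titles, %d prices\n", count($titles[1]), count($prices[1]));
foreach ($titles[1] as $i => $title) {
    echo $title, ' => ', $prices[1][$i], "\n";
}
```

Pairing by position attaches the ad's 450 EUR to the second listing, which is the kind of drift seen in the Prices set.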

I’ve benchmarked the code for 5000 iterations, and the whole run took over 9 seconds. The time cost of a single item parse is fair: about 0.002 seconds (9 seconds / 5000 iterations).

As for complexity: as an experienced regex composer, I spent about 1.5-2 hours analyzing the scraped html and composing and testing the patterns.
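The post does not show its timing code, so the following is only a sketch of how such a measurement could be set up. The `benchmark` helper, the one-line stand-in html and the `price` class are assumptions for illustration.

```php
<?php
// Hypothetical helper: runs $fn $iterations times and returns the total seconds.
function benchmark(callable $fn, $iterations) {
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

$html = '<div class="price">11,400 EUR</div>'; // stand-in for the scraped page
$iterations = 5000;

$total = benchmark(function () use ($html) {
    preg_match_all('~<div class="price">([^<]*)</div>~', $html, $prices);
}, $iterations);

printf("total: %.4f s, per iteration: %.6f s\n", $total, $total / $iterations);
```

The absolute numbers depend on the machine and on the size of the page, so only the ratio between two techniques measured the same way is meaningful.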

XPath TECHNIQUE

Now we come to the XPath technique. It is applied to XML/XHTML docs, so we first need to clean up the raw html.

Preparing html as a strict XML

For parsing raw data, the xPath technique is far more advanced than Regex. Strictly speaking, xPath is applied to XML/XHTML docs, so here are the steps to use it with raw html:

remove the broken pieces from raw html content
make html content a DOM structure (XML document)

See the following code that performs the two above mentioned steps:

<?php
// A nowdoc (quoted EOD) keeps "$" sequences in the html from being interpolated.
$html = <<<'EOD'
<!DOCTYPE html>
…
EOD;

// here we remove unwanted html chars and tags
$html = str_replace('&nbsp;', ' ', $html);
$html = str_replace('<br/>', ' ', $html);
$html = str_replace('<noindex>', '', $html);
$html = str_replace('</noindex>', '', $html);
$html = str_replace('noindex', '', $html);

// we collect libxml parse errors internally instead of emitting PHP warnings
libxml_use_internal_errors(true);

// Getting the text into the DOM Document for further parse
$DOM = new DOMDocument('1.0', 'UTF-8');
// The prepended meta tag makes libxml read the markup as UTF-8
$DOM->loadHTML('<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />' . $html);

// Initiating DOM XPath Document for parse
$xpath = new DOMXPath($DOM);
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPHPFunctions();
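Suppressing libxml errors does not discard them: they can still be inspected after loading. The sketch below uses a made-up malformed fragment; which messages are reported, if any, depends on the libxml2 version, so treat the output as illustrative.

```php
<?php
libxml_use_internal_errors(true);

// Hypothetical malformed fragment: a non-standard tag and unclosed elements.
$dom = new DOMDocument();
$dom->loadHTML('<div><noindex>text</noindex><p>price');

// Any parse errors were collected here instead of being emitted as PHP warnings.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
libxml_clear_errors();

// The parser still produced a tree that XPath can query.
$xpath = new DOMXPath($dom);
$paragraphs = $xpath->query('//p')->length;
echo $paragraphs, "\n";
```

Logging these messages during development can show which pieces of the raw html are worth stripping before the DOM is built.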

Traversing thru xPath

Let's go to the developer tools for finding xPaths for item attributes.

The title xPath notation is:

//div[@class]/a[@class="infoLink detailsViewLink"]/text()

The href attribute of ‘a’ tag xPath notation is:

//div[@class]/a[@class="infoLink detailsViewLink"]/@href

The price nodes xPath notation is:

//*[@id="parkAndCompareVehicle"]/div[1]/div[2]/div/div[2]/div[5]/div[1]/div/div[2]/text()

The thing is that an xPath expression selects all the nodes that match the given notation. So with one shot we get all the nodes and are able to store them. The best way to fetch them in a structured way is to iterate over them. See the following function, which does that:

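<?php
// Collects title, link and price for every car found in the parsed document.
function get_cars_from_xpath_object($xpath) {
    $prices = $xpath->query('//*[@id="parkAndCompareVehicle"]/div[1]/div[2]/div/div[2]/div[5]/div[1]/div/div[2]/text()');
    // The parent div is matched by the mere presence of a class attribute.
    $titles = $xpath->query('//div[@class]/a[@class="infoLink detailsViewLink"]/text()');
    $hrefs = $xpath->query('//div[@class]/a[@class="infoLink detailsViewLink"]/@href');
    $cars = array();
    // Stop at the shortest list so that item($i) is never null.
    $count = min($prices->length, $titles->length, $hrefs->length);
    for ($i = 0; $i < $count; $i++) {
        $cars[] = array(
            'title' => $titles->item($i)->nodeValue,
            'href' => $hrefs->item($i)->nodeValue,
            'price' => $prices->item($i)->nodeValue,
        );
    }
    return $cars;
}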
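A possible variation, not the approach used in this post: instead of running three document-wide queries and pairing the results by index, select each item container first and evaluate relative paths against it. The markup and class names below (`item`, `ad`, `t`, `price`) are hypothetical.

```php
<?php
// Hypothetical markup: two listings and an ad block that also carries a price.
$html = '<div class="item"><a class="t" href="/1.html">Golf</a><div class="price">8,000 EUR</div></div>'
      . '<div class="ad"><div class="price">450 EUR</div></div>'
      . '<div class="item"><a class="t" href="/2.html">Polo</a><div class="price">2,200 EUR</div></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$cars = array();
foreach ($xpath->query('//div[@class="item"]') as $item) {
    // Relative paths (leading ".") are evaluated against the current item only.
    $cars[] = array(
        'title' => $xpath->evaluate('string(./a[@class="t"])', $item),
        'href'  => $xpath->evaluate('string(./a[@class="t"]/@href)', $item),
        'price' => $xpath->evaluate('string(./div[@class="price"])', $item),
    );
}
print_r($cars);
```

Because every field is read from the same container node, the ad's price never gets attached to a listing.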

If we want to benchmark xPath parsing we need to iterate over the whole process of (1) removing the broken pieces from raw html, (2) making it DOM structure and (3) traversing by xPath notation to fetch nodes.

The 500 iterations, each including the removal of unwanted html content and the conversion to a DOM structure, took 10.5 seconds: about 0.02 seconds for each single item parse.

The result is rather unexpected, but we have to include the html preparation and DOM building in each iteration: in real web scraping a scraper goes over paginated data, and for each page it needs to perform the above mentioned procedures.

Composing and testing the xPath notations is no more complex than composing and testing the Regex patterns. The xPath technique is the most common way of getting scraped data into business directories, because it almost always returns results that are related to each other and thus reliable, as they come from a structured document.

Conclusion

The Regex and xPath comparison has shown the xPath technique to be superior to the Regex technique for parsing scraped data: it is more precise and follows the document structure, even though it was about ten times slower per parse in this benchmark. Actually, it is XPath rather than Regex that is widely used for parsing scraped data. The only caveat is that with xPath the html content must be well structured for the most part, so that the XML library is able to read and understand it.

Average time for single item parse (seconds)

Regex 0.002
XPath 0.02
