Jonathan Thralow

Rants and Data

July 10th, 2015

7/10/2015


Top 10 Online Marketing Trends of 2015. 

6/10/2015


 



1. The Rise of Content Marketing
According to the B2B Content Marketing Benchmarks report, 93% of B2B marketers said they used content marketing in 2014, and 42% said they considered their strategy effective (up from 36% the previous year).  Marketers know that customers must trust you and your product knowledge before they will buy.  One of the most effective ways to earn that trust is to create content that speaks to your customers in a way they find interesting and nurturing.
2. Marketing Analytics Will Boom
Big data is making its way to businesses big and small.  We have come to the point where information technology is the responsibility of every individual, not just the IT staff.  Today’s marketers have powerful data mining, structuring and visualization tools at their fingertips.
3. Usability Testing Advances
The pricing of high-tech tools, like heatmapping and personalized testing, has reached the point where it makes sense for small and medium-sized businesses to use tools beyond Google Analytics to get a feel for how their website is performing.  Crazyegg.com has a low-priced heatmapping tool and usertesting.com offers limited free use of its network of website reviewers.
4. More Videos Than Ever
The volume of Americans consuming information from YouTube is growing quickly.  The following numbers come from YouTube.com https://www.youtube.com/yt/press/statistics.html
- YouTube has more than 1 billion users
- Every day people watch hundreds of millions of hours on YouTube and generate billions of views
- The number of hours people are watching on YouTube each month is up 50% year over year
- 300 hours of video are uploaded to YouTube every minute
- ~60% of a creator’s views comes from outside their home country
- YouTube is localized in 75 countries and available in 61 languages
- Half of YouTube views are on mobile devices
- Mobile revenue on YouTube is up over 100% y/y
5. Getting Personalized
A trend that started a couple of years ago, and is continuing to pick up momentum, is the use of Customer Relationship Management software to organize and segment your customers and your leads.  Nimble businesses are using tools like Pardot and Hubspot to systematically communicate with their leads and customers based on self-defined rules.  For a review of the top 10 CRM services click here http://venturebeat.com/2014/02/11/top-10-crm-services/
6. The Explosion of the “Explainer Video”
An “Explainer Video” is a video that demonstrates what a company does and its mission statement in 1 minute or less.  These videos usually live on the homepage of a website or the about us page.  The number of potential customers who visit a webpage and look for a video to learn about the company is growing quickly.  Many people don’t even read the text on a website until after they watch the “Explainer Video”.  YouTube has taught people that the internet is not just static pages, but a place to be entertained with music and video.  This has forced websites to quickly adopt the use of “Explainer Videos”.  Many sites have demonstrated conversion rates of first-time visitors up to three times greater when they display an explainer video than when they do not.
7. Social Media Marketing Spend Will Increase
Social Media has become the largest entity on the Internet as far as content creation and time spent.  74% of all Internet users connect via social media sites.  For more stats about social media follow this link: http://www.pewinternet.org/fact-sheets/social-networking-fact-sheet/  

With the ability to target new customers and retarget old leads, marketing spend on Social Media is exploding and will continue to grow throughout the year.


8. The boundaries between Social Media, SEO, & content marketing will blur.
Google became successful by utilizing a system of ranking important sites and it did so by creating something called “Page Rank”.  This system looked at how many sites linked to a page and in essence gave that page its vote.  Google used this data to learn what sites were the authority on a specific topic.  Today, Google understands that 74% of users and content now live on social outlets, and it has been working feverishly to restructure its ranking system to include social indicators such as Facebook likes, shares and Twitter mentions to help rank its listings and give its customers the best user experience.  This trend will continue and will most likely take over as the most important ranking indicator, bypassing the long-term standard “Page Rank”.
9. More Money Allocated in Online Ads

Internet users and customers have become more sensitive to online ads.  They can easily skim over them or click on them as needed.  The average Internet user is starting to make some very interesting trends clearer to the online marketer.  Internet users click on the natural SEO links to a greater extent when they are researching a topic, but use the paid links when they are ready to buy.  This behavior usually gives the user the best experience and gives the online marketer a much greater reason to bid up their paid links.  According to the Moz.com article, PPC links convert 2x as much as SEO links, and I have found this number has grown since the article was published. https://moz.com/ugc/true-or-false-organic-traffic-converts-better-than-ppc



10. Mobile Optimization will become more important than ever
Google is already penalizing sites for not being mobile friendly and this trend will continue.  While most people still do not place orders with their mobile device, they do spend a lot of time researching and deciding which companies they plan to build relationships with in the future.  A well-constructed mobile site will build the trust needed for a long-term relationship.

XPath vs Regex for parsing scraped html content

3/17/2015


 
For developers working on getting business data from the web, there is almost always a need to parse the data once it has been harvested.

In this post I want to share my benchmark of the two basic techniques used in web scraping to parse the scraped data: Regex* and XPath.

*Regex works as a pattern applied to any text (incl. html) to fetch the matching pieces of content, while XPath (similar to a CSS path) traverses the DOM of the html document to select and fetch the matching nodes.
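To make the difference concrete, here is a minimal sketch of both approaches on a tiny two-item sample. The markup and the class names `item` and `infoLink` are made up for illustration and are not the real mobile.de markup.

```php
<?php
// Hypothetical two-item sample; class names are illustrative only.
$sample = '<div class="item"><a class="infoLink" href="/car/1">Golf</a></div>'
        . '<div class="item"><a class="infoLink" href="/car/2">Polo</a></div>';

// Regex: treat the markup as plain text and capture what sits between the tags.
preg_match_all('~<a class="infoLink"[^>]*>(.*?)</a>~s', $sample, $m);
$regexTitles = $m[1];

// XPath: build a DOM first, then select nodes by structure and attributes.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($sample);
$xpath = new DOMXPath($dom);

$xpathTitles = [];
foreach ($xpath->query('//div[@class="item"]/a[@class="infoLink"]') as $node) {
    $xpathTitles[] = $node->nodeValue;
}

print_r($regexTitles);
print_r($xpathTitles);
```

On clean markup like this both approaches return the same titles; the differences show up once the page gets messy.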





We will parse sample data server-side with PHP, look at how complex each technique is to apply, and compare the time each one takes.

Let’s take a simple mobile.de VW search result page. Its raw html is here - view-source:http://suchen.mobile.de/auto/volkswagen.html, though hardly readable.

REGEX TECHNIQUE



Suppose we want to get title, link, and price for each item.

Let’s look at the page’s html and through developer tools (F12 or Ctrl+Shift+I) find an html content piece pertaining to a single list item. For the sake of simplicity I’ve distilled the real html into the following snippet:





 <div class="vehicleDetails vehicleDetailsMain" data-id="201841567" data-href="http://suchen.mobile.de/auto-inserat/vw-golf-vi-style-1-6-tdi-ahk-shz-gra-klima-al-eiselfing/201841567.html?lang=en&pageNumber=1">          

<div class="topOfPageTitle">      <a class="infoLink detailsViewLink" href="http://suchen.mobile.de/auto-inserat/vw-golf-vi-style-1-6-tdi-ahk-shz-gra-klima-al-eiselfing/201841567.html?lang=en&pageNumber=1&negativeFeatures=EXPORT" rel="nofollow" onclick="mga('send', 'event', 'car', '/en/public/ses/top of page')">Volkswagen Golf VI &quot; STYLE&quot; 1.6 TDI AHK, SHZ, GRA, Klima AL</a>      </div>      <div class="topAdDesc">      <span class="commercial">Top Ad</span>      </div>      <div class="imageWrapper extraMargin">      <div class="image">      <span></span>      <a href="http://suchen.mobile.de/auto-inserat/vw-golf-vi-style-1-6-tdi-ahk-shz-gra-klima-al-eiselfing/201841567.html?lang=en&pageNumber=1&action=topOfPage&negativeFeatures=EXPORT" onclick="mga('send', 'event', 'car', '/en/public/ses/top of page')">      <img alt="Vehicle picture" width="150" src="http://i.ebayimg.com/00/s/NDgwWDY0MA==/z/wh4AAOSwuMFUZ3~J/$_18.JPG" />      </a>      </div>      <div class="stackBoxOne"></div>      <div class="stackBoxTwo"></div>      <div class="video"><span>Video</span></div>      </div>      <div class="description">      <div class="list">      <span title="Saloon / Used vehicle German edition New HU Manual gearbox 77&nbsp;kW (105&nbsp;PS), Diesel">Saloon / Used vehicle German edition New HU...</span>      </div>      <div class="fuelConsumption">      <div>Fuel consumption combined:<br/>ca 4.7 l/100 km **</div>      <div>CO<sub>2</sub>-Emissions combined:<br/>ca 123 g/km **</div>      </div>      <div class="dealerdata">      <div class="address">      Autohaus Gartner GmbH &amp; Co.KG,83549 Eiselfing      </div>      <div class="phoneNumber">      Phone&nbsp;+49 (0)8071 92030      </div>      </div> <div class="rightSideColumns">   <div class="pricePrimaryCountryOfSale priceGross">11,400 EUR</div>                              <div class="priceSecondaryCountryOfSale priceNet"></div> <div class="pricePrimaryCountryOfOrigin"></div>                                    
  





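Here goes the PHP code to apply regex for each of the item attributes:

<?php

// Nowdoc (quoted EOD) keeps "$" sequences in the scraped html from being read as PHP variables.
$html = <<<'EOD'
<!DOCTYPE html>
…
EOD;

// Links: the data-href attribute of each listing container.
echo '<h2>All links </h2>';
$patternHref = '/data-id="\d+" data-href="([^"]+?)"/';
preg_match_all($patternHref, $html, $links);
$i = 1;
foreach ($links[1] as $link) {
    echo '<br>', $i++, '. ', $link;
}

// Titles: the text of the anchor carrying the classes shown in the snippet above.
$patternTitle = '~<a class="infoLink detailsViewLink"[^>]*?>(.*?)<\/a>~';
preg_match_all($patternTitle, $html, $titles);
$i = 1;
echo '<h2>All titles </h2>';
foreach ($titles[1] as $title) {
    echo '<br>', $i++, '. ', $title;
}

// Prices: the text of the price div, matched by the class prefix shown in the snippet above.
$patternPrice = '~<div class="pricePrimaryCountryOfSale[^"]*">([^>]*?)<\/div>~';
preg_match_all($patternPrice, $html, $prices);
$i = 1;
echo '<h2>All prices </h2>';
foreach ($prices[1] as $price) {
    echo '<br>', $i++, '. ', $price;
}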



The main problem with Regex parsing is inconsistency. Compare the 3 parsed sets below: the last one (Prices) holds 24 values against 21 titles and 21 links, because it also includes the prices of 3 more items from an ad inserted in the result page. While the values of the first 2 sets (columns) can be related by position, the values of the third set hardly relate to the particular items.


Titles

1. Volkswagen Golf VI " STYLE" 1.6 TDI AHK, SHZ, GRA, Klima AL

2. Volkswagen Phaeton 3.0 V6 TDI DPF 4MOTION Automatik Massage

3. Volkswagen Golf 6 ez 09

4. Volkswagen VW Lupo 1.4i Princeton Klima Servo TUV 11/...

5. Volkswagen Caddy Trendline SOCCER

6. Volkswagen Vw Lupo 1.0 mpi mit TUV

7. Volkswagen Golf 1.6 TDI DPF Team

8. Volkswagen Touareg 2.5 R5 TDI

9. Volkswagen Polo 1.2 Trendline Navi

10. Volkswagen Polo 1.2 Trendline Navi

11. Volkswagen T5 FACELIFT EDITION 25 MEGA OPTIK 18ZOLL 1A

12. Volkswagen Golf 1.8 Automatik Rolling Stones Collection

13. Volkswagen Vw passat 1.6 74 kw tuv 02.15

14. Volkswagen Volkswagen Polo

15. Volkswagen Gepflegter VW Passat 3B Kombi / Variant 1....

16. Volkswagen vw passat 1,8 alu sehr sauber

17. Volkswagen Polo 1.2 Sitzheizung

18. Volkswagen Volkswagen Golf 2.0 GTI DSG

19. Volkswagen Polo 1.2 9N *SITZHEIZUNG*EURO4*

20. Volkswagen Golf

21. Volkswagen Touran 1.9 TDI DPF Trendline


Links

1. http://suchen.mobile.de/auto-inserat/vw-golf-vi-style-1-6-tdi-ahk-shz-gra-klima-al-eiselfing/201841567.html

2. http://suchen.mobile.de/auto-inserat/vw-phaeton-3-0-v6-tdi-dpf-4motion-automatik-massage-dortmund/204056406.html

3. http://suchen.mobile.de/auto-inserat/vw-golf-6-ez-09-altenburg/204056397.html

4. http://suchen.mobile.de/auto-inserat/vw-lupo-vw-lupo-1-4i-princeton-klima-servo-t%C3%BCv-11-sprockh%C3%B6vel/204056373.html

5. http://suchen.mobile.de/auto-inserat/vw-caddy-trendline-soccer-wolfhagen/204056347.html

6. http://suchen.mobile.de/auto-inserat/vw-lupo-vw-lupo-1-0-mpi-mit-t%C3%BCv-ludwigsburg/204056335.html

7. http://suchen.mobile.de/auto-inserat/vw-golf-1-6-tdi-dpf-team-sch%C3%B6ffengrund/204056327.html

8. http://suchen.mobile.de/auto-inserat/vw-touareg-2-5-r5-tdi-wardenburg-bei-old/201546204.html

9. http://suchen.mobile.de/auto-inserat/vw-polo-1-2-trendline-navi-nordhorn/204056286.html

10. http://suchen.mobile.de/auto-inserat/vw-polo-1-2-trendline-navi-nordhorn/204056287.html

11. http://suchen.mobile.de/auto-inserat/vw-t5-multivan-t5-facelift-edition-25-mega-optik-18zoll-1a-bermatingen-n%C3%A4he-b/203784734.html

12. http://suchen.mobile.de/auto-inserat/vw-golf-1-8-automatik-rolling-stones-collection-berlin/204056239.html

13. http://suchen.mobile.de/auto-inserat/vw-passat-vw-passat-1-6-74-kw-t%C3%BCv-02-15-coburg/204056232.html

14. http://suchen.mobile.de/auto-inserat/vw-polo-volkswagen-polo-m%C3%BChlheim-am-main/204056220.html

15. http://suchen.mobile.de/auto-inserat/vw-passat-gepflegter-vw-passat-3b-kombi-variant-1-stuhr/204056215.html

16. http://suchen.mobile.de/auto-inserat/vw-passat-vw-passat-1-8-alu-sehr-sauber-vohwinkel/204056199.html

17. http://suchen.mobile.de/auto-inserat/vw-polo-1-2-sitzheizung-lahnstein/203891580.html

18. http://suchen.mobile.de/auto-inserat/vw-golf-volkswagen-golf-2-0-gti-dsg-mosbach/204056185.html

19. http://suchen.mobile.de/auto-inserat/vw-polo-1-2-9n-sitzheizung-euro4-butzbach/204056145.html

20. http://suchen.mobile.de/auto-inserat/vw-golf-quadrat-ichendorf/204056142.html

21. http://suchen.mobile.de/auto-inserat/vw-touran-1-9-tdi-dpf-trendline-dortmund/204056138.html

Prices

1. 13,959 EUR

2. 11,400 EUR

3. 8,000 EUR

4. 2,200 EUR

5. 11,750 EUR

6. 11,780 EUR (Gross)*

7. 11,890 EUR

8. 19,990 EUR (Gross)*

9. 750 EUR

10. 11,550 EUR

11. 5,999 EUR

12. 4,800 EUR

13. 4,800 EUR

14. 14,490 EUR

15. 999 EUR

16. 700 EUR

17. 1,099 EUR

18. 1,800 EUR

19. 700 EUR

20. 9,900 EUR (Gross)*

21. 16,499 EUR

22. 2,990 EUR

23. 450 EUR

24. 8,280 EUR
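
One way to avoid this kind of mismatch (24 prices against 21 titles and links) is to capture all fields of a listing with a single pattern, so they cannot drift apart. The sketch below is only an illustration: the markup and the class names `vehicleDetails`, `infoLink`, `ad` and `price` are simplified stand-ins, and a pattern for the real page would need more care.

```php
<?php
// Hypothetical markup: two listings plus one ad block that only carries a price.
$html = '<div class="vehicleDetails" data-id="1" data-href="/car/1">'
      . '<a class="infoLink" href="/car/1">Golf</a><div class="price">11,400 EUR</div></div>'
      . '<div class="ad"><div class="price">13,959 EUR</div></div>'
      . '<div class="vehicleDetails" data-id="2" data-href="/car/2">'
      . '<a class="infoLink" href="/car/2">Polo</a><div class="price">8,000 EUR</div></div>';

// Independent patterns: the ad price slips into the price list.
preg_match_all('~<a class="infoLink"[^>]*>(.*?)</a>~s', $html, $t);
preg_match_all('~<div class="price">([^<]*)</div>~', $html, $p);
$titles = $t[1];
$prices = $p[1];
if (count($titles) !== count($prices)) {
    echo 'Mismatch: ', count($titles), ' titles vs ', count($prices), " prices\n";
}

// One pattern per listing: link, title and price are captured together.
$itemPattern = '~data-href="([^"]+)">\s*<a class="infoLink"[^>]*>(.*?)</a>'
             . '\s*<div class="price">([^<]*)</div>~s';
preg_match_all($itemPattern, $html, $items, PREG_SET_ORDER);

$cars = [];
foreach ($items as $item) {
    $cars[] = ['href' => $item[1], 'title' => $item[2], 'price' => $item[3]];
}
print_r($cars);
```

The price of this approach is a longer and more fragile pattern, which is part of why the regex route took me so long to get right.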



I’ve benchmarked the code over 5000 iterations; the whole run took a little over 9 seconds. The time consumption for a single parse is fair: about 0.002 second.
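
For reference, a timing loop of this kind can be sketched as follows. The helper name `average_seconds` and the stand-in html are my assumptions for illustration; this is not the exact harness behind the numbers above.

```php
<?php
/**
 * Hypothetical helper: runs $task $iterations times and
 * returns the average number of seconds per run.
 */
function average_seconds(callable $task, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $task();
    }
    return (microtime(true) - $start) / $iterations;
}

// Stand-in for the scraped page; a real benchmark would use the full result page.
$html = str_repeat('<div data-id="1" data-href="http://example.com/car/1.html"></div>', 20);

$perRun = average_seconds(function () use ($html) {
    preg_match_all('/data-id="\d+" data-href="([^"]+?)"/', $html, $links);
}, 5000);

printf("Average per parse: %.6f s\n", $perRun);
```

Absolute numbers depend heavily on the machine and the page size, so only the relative difference between techniques is worth comparing.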

As for complexity: even as an experienced regex composer, I needed about 1.5-2 hours to analyze the scraped html, write the patterns and test them.

XPath TECHNIQUE

Now we come to the XPath technique. XPath is applied to XML/XHTML docs, so we first need to brush up the raw html.


Preparing html as a strict XML

For parsing raw data, the XPath technique is far more advanced than Regex. Strictly speaking, XPath is applied to XML/XHTML docs, so here are the steps to use it with raw html:

  • remove the broken pieces from raw html content

  • make html content a DOM structure (XML document)


See the following code that performs the two above mentioned steps:


<?php

// Nowdoc (quoted EOD) keeps "$" sequences in the scraped html from being read as PHP variables.
$html = <<<'EOD'
<!DOCTYPE html>
…
EOD;

// Here we remove unwanted html chars and tags.
$html = str_replace('&nbsp;', ' ', $html);
$html = str_replace('<br/>', ' ', $html);
$html = str_replace('<noindex>', '', $html);
$html = str_replace('</noindex>', '', $html);
$html = str_replace('noindex', '', $html);

// Collect libxml parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Load the text into a DOM document for further parsing.
// The meta tag prepended below tells libxml to read the content as UTF-8.
$DOM = new DOMDocument('1.0', 'UTF-8');
$DOM->loadHTML('<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />' . $html);

// Create the XPath object used to query the DOM.
$xpath = new DOMXPath($DOM);

// Optional: make PHP functions callable inside XPath expressions via the php: namespace.
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPHPFunctions();
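
Suppressing the libxml errors keeps the output clean, but it can still be useful to look at what the parser complained about. The following is a small sketch on a deliberately malformed sample of my own; which errors get reported depends on the libxml version installed.

```php
<?php
// Deliberately malformed sample: a non-standard tag and an unclosed <div>.
$html = '<div class="item"><noindex><a href="/car/1">Golf</a></noindex>';

// Collect parser errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Inspect whatever the parser recorded, then empty the error buffer.
foreach (libxml_get_errors() as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
libxml_clear_errors();

// The DOM is still usable for XPath queries.
$xpath = new DOMXPath($dom);
$links = $xpath->query('//a/@href');
echo 'Links found: ', $links->length, "\n";
```

Clearing the error buffer matters in a long-running scraper, because otherwise the collected errors pile up across pages.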





Traversing through XPath

Let's go to the developer tools to find the XPaths for the item attributes.

The title XPath notation is:

//div[@class]/a[@class="infoLink detailsViewLink"]/text()

The href attribute of the ‘a’ tag XPath notation is:

//div[@class]/a[@class="infoLink detailsViewLink"]/@href

The price nodes XPath notation is:

//*[@id="parkAndCompareVehicle"]/div[1]/div[2]/div/div[2]/div[5]/div[1]/div/div[2]/text()





XPath selects all the nodes that match a given notation, so with one query we get all the matching nodes and are able to store them. The best way to fetch them in a structured way is to iterate over them. See the following function, which does that:

<?php

function get_cars_from_xpath_object($xpath) {
    $prices = $xpath->query('//*[@id="parkAndCompareVehicle"]/div[1]/div[2]/div/div[2]/div[5]/div[1]/div/div[2]/text()');
    $titles = $xpath->query('//div[@class]/a[@class="infoLink detailsViewLink"]/text()');
    $hrefs  = $xpath->query('//div[@class]/a[@class="infoLink detailsViewLink"]/@href');

    // The three node lists are matched up by position, so stop at the shortest one.
    $count = min($prices->length, $titles->length, $hrefs->length);

    $cars = array();
    for ($i = 0; $i < $count; $i++) {
        $cars[] = array(
            'title' => $titles->item($i)->nodeValue,
            'href'  => $hrefs->item($i)->nodeValue,
            'price' => $prices->item($i)->nodeValue,
        );
    }

    return $cars;
}
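
The function above pairs the nodes by position. An alternative is to select each listing node first and run relative queries inside it, so a listing with a missing field cannot shift the others. This is only a sketch: the markup and the class names `vehicleDetails`, `infoLink` and `price` are simplified assumptions, not the real page structure.

```php
<?php
// Hypothetical markup: the second listing has no price on purpose.
$html = '<div class="vehicleDetails"><a class="infoLink" href="/car/1">Golf</a>'
      . '<div class="price">11,400 EUR</div></div>'
      . '<div class="vehicleDetails"><a class="infoLink" href="/car/2">Polo</a></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$cars = [];
foreach ($xpath->query('//div[@class="vehicleDetails"]') as $item) {
    // The leading "." makes each query relative to the current listing node.
    $title = $xpath->query('.//a[@class="infoLink"]', $item)->item(0);
    $price = $xpath->query('.//div[@class="price"]', $item)->item(0);

    $cars[] = [
        'title' => $title ? trim($title->nodeValue) : null,
        'href'  => $title ? $title->getAttribute('href') : null,
        'price' => $price ? trim($price->nodeValue) : null,
    ];
}
print_r($cars);
```

Here the second car simply gets a null price instead of borrowing the price of a neighbouring item.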





If we want to benchmark xPath parsing we need to iterate over the whole process of (1) removing the broken pieces from raw html, (2) making it DOM structure and (3) traversing by xPath notation to fetch nodes.
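
The three steps can be wrapped into one function and timed as a unit, roughly like this. The function name `parse_page`, the sample html and the class name `infoLink` are my assumptions for illustration; the real benchmark ran against the full result page.

```php
<?php
// Hypothetical stand-in for one scraped page.
$rawHtml = '<div><a class="infoLink" href="/car/1">Golf&nbsp;VI</a><br/></div>';

function parse_page(string $rawHtml): int
{
    // (1) remove the pieces we do not want in the document
    $html = str_replace(['&nbsp;', '<br/>'], ' ', $rawHtml);

    // (2) build the DOM structure
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />' . $html);
    libxml_clear_errors();

    // (3) traverse by XPath notation and count the matched nodes
    $xpath = new DOMXPath($dom);
    return $xpath->query('//a[@class="infoLink"]')->length;
}

$iterations = 500;
$start = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $found = parse_page($rawHtml);
}
$elapsed = microtime(true) - $start;

printf("%d iterations: %.3f s total, %.6f s per page\n", $iterations, $elapsed, $elapsed / $iterations);
```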





For 500 iterations, each including the removal of unwanted html content and the conversion to a DOM structure, it took 10.5 seconds, about 0.02 seconds for each single parse.

The result is rather unexpected: XPath came out about ten times slower than Regex. Still, we have to include the html preparation and DOM building in each iteration, because in real web scraping a scraper goes over paginated data and needs to repeat these steps for every page.





Composing and testing the XPath notations is no more complex than composing and testing the Regex patterns. The XPath technique is the most common way of getting scraped data into business directories, because on a structured document it almost always returns results that are related to each other and thus reliable.

Conclusion

The Regex and XPath benchmarking has shown the XPath technique to be superior to the Regex technique for parsing scraped data in that it is more precise and structure related, even though Regex was about ten times faster per parse in this test. In practice it is XPath rather than Regex that is widely used for scraped data parsing. The only caveat is that with XPath the html content must be for the most part well structured, so that the XML library is able to read and understand it.



Average time for single item parse (seconds)

Regex 0.002
XPath 0.02















    Author

    Data Geek, Growth Hacker.
