When Littledata first started working with benchmark data we found the biggest barrier to accuracy was self-reporting on industry sectors. Here’s how we built a better feature to categorise customer websites.

Google Analytics has offered benchmarks for many years, but with limited usefulness since the industry sector field for the website is often inaccurate. The problem is that GA is typically set up by a developer or agency without knowledge or care about the company’s line of business – or understanding of what that industry sector is used for.

To fix this problem Littledata needed a way to categorise websites which didn’t rely on our users selecting from a drop-down list.

The first iteration: IBM Watson NLP and a basic taxonomy

Our first iteration of this feature used a pre-trained model from IBM Watson’s set of natural language APIs. It was simple: we sent the URL, and back came a category according to the Interactive Advertising Bureau (IAB) taxonomy.

After running this across thousands of websites we quickly realised the limitations:

  1. It failed with non-English websites
  2. It failed when the website homepage was heavy with images rather than text
  3. It failed when the website was rendered via JavaScript

Since our customer base is growing most strongly outside the UK, and those customers tend to have graphical product lists on their homepage and use the latest JavaScript frameworks (such as React), the failure rate was above 50% and rising.

So we prioritised a second iteration.

The second iteration: Extraction, translation and public APIs

The success criteria were that the second iteration could categorise 8 sites which the first iteration had failed on, and should go on to be 80% accurate.

We also wanted to use mainly public APIs, to avoid maintaining code libraries, so we broke the detection process into 3 steps:

  1. Extracting meaningful text from the website
  2. Translating that text into English
  3. Categorising the English text to an IAB category and subcategory
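The three steps above can be sketched as a small pipeline where each step is a swappable function. This is only an illustration of the structure, not our production code: `categorise_site` and the three callables are hypothetical names, and in practice each callable would wrap a real service (a headless browser, a translation API, a categorisation API).

```python
from typing import Callable, Optional, Tuple

# (category, subcategory) pair; this shape is an assumption for illustration
Category = Tuple[str, str]


def categorise_site(
    url: str,
    extract_text: Callable[[str], str],
    translate: Callable[[str], str],
    categorise: Callable[[str], Optional[Category]],
) -> Optional[Category]:
    """Run the three steps in order; return None if no usable text is found."""
    text = extract_text(url)        # step 1: meaningful text from the site
    if not text.strip():
        return None                 # nothing to translate or categorise
    english = translate(text)       # step 2: translate into English
    return categorise(english)      # step 3: map to category and subcategory


# Usage with stand-in functions (no network calls are made here)
result = categorise_site(
    "https://example.com",
    extract_text=lambda url: "Canapés et meubles design",
    translate=lambda text: "Sofas and designer furniture",
    categorise=lambda text: ("Home & Garden", "Furniture"),
)
```

Keeping each step behind a plain function makes it easy to swap one provider for another without touching the rest of the flow.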

The Watson API seemed to perform well when given sufficient formatted text, at minimal cost per use, so we kept this for step 3.

For step 2, the obvious choice was Google Translate API. The magic of this API is that it can detect the language of origin (with a minimum of ~4 words) and then provide the English translation.

That left us focussing the development time on step 1 – extracting meaningful text. Initially we looked for a public API, and found the Aylien article extraction API. However, when we tested it on our sample sites, it suffered from the same flaws as the IBM Watson processing: it was unable to handle highly graphical sites, or those with JavaScript rendering.

To give us more control of the text extraction, we then opted to use a PhantomJS browser on our server. Phantom provides a standard function to extract the HTML and text from the rendered page, but at the expense of being somewhat memory intensive.

Putting the first few thousand characters of the website text into translation and then categorisation produced better results, but still suffered from false positives – for example, if the text contained legalese about data privacy it got categorised as technical or legal.
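One way to limit that kind of false positive is to drop cookie and privacy notices before truncating the text. Here is a minimal sketch, assuming the rendered page text arrives as a plain string with one block per line. The marker words, the 3,000-character limit and the function name `clean_body_text` are all assumptions for illustration, not the values we use.

```python
import re

# Hypothetical markers for legal boilerplate; a real list would need tuning per language
BOILERPLATE_MARKERS = ("cookie", "privacy policy", "gdpr")

# "First few thousand characters": the exact limit here is an assumption
MAX_CHARS = 3000


def clean_body_text(raw: str, max_chars: int = MAX_CHARS) -> str:
    """Collapse whitespace, drop lines that look like legal notices, then truncate."""
    kept = []
    for line in raw.splitlines():
        line = re.sub(r"\s+", " ", line).strip()
        if not line:
            continue  # skip empty lines
        if any(marker in line.lower() for marker in BOILERPLATE_MARKERS):
            continue  # skip cookie or privacy notices
        kept.append(line)
    return " ".join(kept)[:max_chars]


sample = (
    "Canapés  design\n\n"
    "Nous utilisons des cookies pour améliorer votre expérience.\n"
    "Livraison   gratuite"
)
cleaned = clean_body_text(sample)
```

A keyword filter like this is crude and will miss notices that avoid the marker words, so it is best treated as a first pass rather than a complete fix.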

We then looked at categorising the page title and meta description, which any SEO-savvy site would stuff with industry language. The problem here is that the text can be short, and mainly filled with brand names.
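Pulling the title and meta description out of the rendered HTML needs nothing beyond a standard HTML parser. The sketch below uses Python's built-in `html.parser` and assumes the rendered HTML is already available as a string. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TitleMetaParser(HTMLParser):
    """Collects the <title> text and the meta description from an HTML document."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.description = ""

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attr.get("name") or "").lower() == "description":
            self.description = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title_and_description(html: str) -> str:
    """Return the title and description joined into one string for categorisation."""
    parser = TitleMetaParser()
    parser.feed(html)
    parser.close()
    parts = [parser.title.strip(), parser.description]
    return ". ".join(part for part in parts if part)


page = (
    "<html><head><title> Canapés et meubles </title>"
    '<meta name="description" content="Mobilier design">'
    "</head><body>...</body></html>"
)
summary = extract_title_and_description(page)
```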

After struggling for a day we hit upon the magic formula: categorising both the page title and the page body text, and looking for consistent categorisation across the two. By using two text sources from the same page we more than doubled the accuracy, and it worked for all but one of our ‘difficult’ websites. This hold-out site – joone.fr – has almost no mention of its main product (diapers, or nappies), which makes it uniquely hard to categorise.
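The consistency check itself can be very small. Below is a sketch that assumes each text source yields a ranked list of (category, subcategory) pairs. That data shape, the function name `consistent_category`, and the fallback to top-level agreement are assumptions for illustration; the rule described above is simply that the two sources must agree.

```python
from typing import List, Optional, Tuple

# (category, subcategory) pair; an assumed shape for the categoriser's output
Category = Tuple[str, str]


def consistent_category(
    title_cats: List[Category], body_cats: List[Category]
) -> Optional[Category]:
    """Return a category both sources agree on, or None if they disagree."""
    body_set = set(body_cats)
    # Prefer an exact (category, subcategory) match, in title ranking order
    for cat in title_cats:
        if cat in body_set:
            return cat
    # Assumed fallback: accept agreement on the top-level category only
    body_top = {top for top, _ in body_cats}
    for top, _ in title_cats:
        if top in body_top:
            return (top, "")
    return None


match = consistent_category(
    [("Home & Garden", "Furniture"), ("Shopping", "")],
    [("Law", "Privacy"), ("Home & Garden", "Furniture")],
)
```

Here the body text alone would have ranked a legal category first, but only the furniture category appears in both lists, so that is the one returned.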

Categorising industry sector for MADE.com

So, to put all the new steps together, here’s how it works for the French-language site of our long-term enterprise client MADE.com.

Step 1: Render the page in PhantomJS and extract the page title and description

Step 2: Extract the page body text, remove any cookie policy and format

Step 3: Translate both text strings in Google Translate

Step 4: Categorise both translated strings, then compare the categorisation of the title with that of the page body text

Step 5: If the two sources match, store the category

I’m pleased that a few weeks after launching the new website classifier we have found it to be 95% accurate.

Benchmarking is a core part of our feature set, informing everything that we do here at Littledata. From Shopify store benchmarks to general web performance data, the improved accuracy and deeper industry sector data is helping our customers get actionable insights to improve their ecommerce performance.

If you’re interested in using our categorisation API, please contact us for a pilot. And note that Littledata is also recruiting developers, so if you like solving these kinds of challenges, think about coming to join us!

Edward

Founder of Littledata, Technical Lead and Product Consultant. He has broad experience helping companies with business strategy, product development and technology management.
