When Littledata first started working with benchmark data we found the biggest barrier to accuracy was self-reporting on industry sectors. Here’s how we built a better feature to categorise customer websites.
Google Analytics has offered benchmarks for many years, but their usefulness is limited because the industry sector field for the website is often inaccurate. The problem is that GA is typically set up by a developer or agency who neither knows much about the company’s line of business nor understands what the industry sector field is used for.
To fix this problem Littledata needed a way to categorise websites which didn’t rely on our users selecting from a drop-down list.
The first iteration: IBM Watson NLP and a basic taxonomy
Our first iteration of this feature used a pre-trained model from IBM Watson’s set of natural language APIs. It was simple: we sent the URL, and back came a category according to the Interactive Advertising Bureau (IAB) taxonomy.
After running this across thousands of websites we quickly realised the limitations:
- It failed with non-English websites
- It failed when the website homepage was heavy with images rather than text
So we prioritised a second iteration.
The second iteration: Extraction, translation and public APIs
The success criteria were that the second iteration could categorise 8 sites which the first iteration had failed on, and that it should go on to be 80% accurate.
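To make a target like this measurable, accuracy can be computed against a hand-labelled set of sites. The snippet below is only an illustrative sketch: the function name and the dictionary shapes (site mapped to category) are assumptions, not part of our production code.

```python
from typing import Dict


def accuracy(predicted: Dict[str, str], expected: Dict[str, str]) -> float:
    """Fraction of hand-labelled sites whose predicted category matches the label.

    Both arguments map a site (hypothetical key, e.g. its domain) to a category.
    Sites with no prediction count as wrong.
    """
    if not expected:
        raise ValueError("expected must contain at least one labelled site")
    correct = sum(
        1 for site, label in expected.items() if predicted.get(site) == label
    )
    return correct / len(expected)
```

Running the same function separately over the list of difficult sites and over a larger random sample keeps the two criteria distinct.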
We also wanted to use mainly public APIs, to avoid maintaining code libraries, so we broke the detection process into 3 steps:
- Extracting meaningful text from the website
- Translating that text into English
- Categorising the English text to an IAB category and subcategory
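The three steps above compose into a single function. Here is a minimal Python sketch of that shape, not our actual implementation: `categorise_site` and the three injected callables are hypothetical names, and in practice each callable would wrap a browser or an external API.

```python
from typing import Callable, Tuple

# Hypothetical signatures, assumed for illustration only.
Extractor = Callable[[str], str]                 # URL -> text found on the page
Translator = Callable[[str], str]                # text in any language -> English
Categoriser = Callable[[str], Tuple[str, str]]   # English text -> (category, subcategory)


def categorise_site(
    url: str,
    extract: Extractor,
    translate: Translator,
    categorise: Categoriser,
) -> Tuple[str, str]:
    """Run the three steps in order and return (category, subcategory)."""
    text = extract(url)           # step 1: extract meaningful text
    english = translate(text)     # step 2: translate it into English
    return categorise(english)    # step 3: map it to an IAB category


# Usage with stand-in functions in place of the real services:
result = categorise_site(
    "https://example.com",
    extract=lambda url: "meubles et canapés",
    translate=lambda text: "furniture and sofas",
    categorise=lambda text: ("Home & Garden", "Furniture"),
)
```

Passing the steps in as functions means any one of them can be swapped for a different provider without touching the other two.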
The Watson API seemed to perform well when given sufficient formatted text, at minimal cost per use, so we kept this for step 3.
For step 2, the obvious choice was Google Translate API. The magic of this API is that it can detect the language of origin (with a minimum of ~4 words) and then provide the English translation.
To give us more control of the text extraction, we then opted to use a PhantomJS browser on our server. Phantom provides a standard function to extract the HTML and text from the rendered page, but at the expense of being somewhat memory intensive.
Putting the first few thousand characters of the website text into translation and then categorisation produced better results, but still suffered from false positives – for example, if the text contained legalese about data privacy, the site got categorised as technical or legal.
We then looked at categorising the page title and meta description, which any SEO-savvy site would stuff with industry language. The problem here is that the text can be short, and mainly filled with brand names.
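For illustration, here is one way to pull the title and meta description out of a page’s HTML using only Python’s standard library. This is a sketch under assumptions: it works on an HTML string rather than a page rendered in a browser, and the class and function names are made up for this example.

```python
from html.parser import HTMLParser
from typing import Tuple


class TitleAndDescriptionParser(HTMLParser):
    """Collects the <title> text and the content of <meta name="description">."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""
        self.description = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # Attribute values can be None, so fall back to an empty string.
            if (attr.get("name") or "").lower() == "description":
                self.description = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def title_and_description(html: str) -> Tuple[str, str]:
    """Return (title, meta description); either may be an empty string."""
    parser = TitleAndDescriptionParser()
    parser.feed(html)
    return parser.title.strip(), parser.description
```

Both strings can come back empty or very short, which is exactly the weakness described above.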
After struggling for a day we hit upon the magic formula: categorising both the page title and the page body text, and looking for consistent categorisation across the two. By using two text sources from the same page we more than doubled the accuracy, and it worked for all but one of our ‘difficult’ websites. This hold-out site – joone.fr – has almost no mention of its main product (diapers, or nappies), which makes it uniquely hard to categorise.
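The agreement check itself can be very small. The sketch below assumes each categorisation call returns a ranked list of (category, subcategory) pairs, which is an assumption about the data shape rather than a description of any particular API, and `consistent_category` is a hypothetical name.

```python
from typing import List, Optional, Tuple

Label = Tuple[str, str]  # (category, subcategory); shape assumed for illustration


def consistent_category(
    title_labels: List[Label], body_labels: List[Label]
) -> Optional[Label]:
    """Return the highest-ranked title label that the body text also produced.

    Returns None when the two sources share no label, in which case the site
    is left uncategorised rather than guessed.
    """
    body_set = set(body_labels)
    for label in title_labels:
        if label in body_set:
            return label
    return None
```

Requiring agreement filters out labels that only one source supports, such as a legal category triggered by privacy text in the body, at the cost of leaving some sites without a category.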
So to put all the new steps together, here’s how it works for our long-term enterprise client MADE.com’s French-language site.
Step 1: Render the page in PhantomJS and extract the page title and description
Step 3: Translate both text strings in Google Translate
Step 4: Compare the categorisations of the title vs page body text
Step 5: If the two sources match, store the category
I’m pleased that a few weeks after launching the new website classifier we have found it to be 95% accurate.
Benchmarking is a core part of our feature set, informing everything that we do here at Littledata. From Shopify store benchmarks to general web performance data, the improved accuracy and deeper industry sector data are helping our customers get actionable insights to improve their ecommerce performance.
If you’re interested in using our categorisation API, please contact us for a pilot. And note that Littledata is also recruiting developers, so if you like solving these kinds of challenges, think about coming to join us!