Search engines leverage structured data to determine what entities are on your web page. They can also do this using other techniques such as natural language processing (NLP) and machine learning.
This article will introduce you to various tools that will help you identify entities on a web page. These tools include:
There are several Chrome plugins that are extremely helpful in understanding (and actually seeing) what structured data is on a web page. Illustrated below are the Chrome extensions I use, listed under the keyword used to locate them via Chrome Web Store search.
Here are links to each extension:
There are many advantages to utilizing these plugins. For one, they give you a good feel for who is using what markup on their websites as you surf the web on a daily basis. When you see those little microdata and structured markup icons appear in your browser, you need only click to instantly see what kinds of markup and metadata are on a page.
Attempting to display the amount of information provided by all these extensions in a single screenshot is not possible, so I’ve opted to include just a few examples, broken down into several screenshots.
(Note: The fact that there is too much info to display in a single screenshot is indicative of the growth of structured data on the web since I last wrote on the topic two years ago. The volume of information that is available on the average web page, by comparison to 2012, has increased by orders of magnitude.)
The three screenshots below provide a sampling of the kinds of information gleaned via the microdata extensions. While the info is fairly similar across all three extensions, it’s nice to have several tools available in case one picks up something the others have missed.
All three plugins have identified schema.org Product markup, including properties for image, name, brand, manufacturer, model, product ID, offers and description. They’ve also identified the on-page markup for reviews and ratings.
Marketers looking to implement their own structured markup might be most interested in the Microdata/JSON-LD sniffer extension (middle screenshot above), as it provides the information in a convenient HTML view.
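To give a feel for the kind of scan involved in spotting microdata, here is a minimal Python sketch that uses only the standard library. It is not how any of the extensions above are implemented. The HTML snippet, the `MicrodataSniffer` class name and the product values are all made up for illustration, and the sketch only handles the simplest cases (no JSON-LD, no RDFa, no nesting logic).

```python
from html.parser import HTMLParser

# Hypothetical product snippet, invented for this example.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Example Blender</span>
  <img itemprop="image" src="blender.jpg">
  <span itemprop="brand">ExampleBrand</span>
  <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    <meta itemprop="price" content="49.99">
  </div>
</div>
"""


class MicrodataSniffer(HTMLParser):
    """Collects itemtype declarations and itemprop name/value pairs."""

    def __init__(self):
        super().__init__()
        self.item_types = []   # schema types found, in document order
        self.properties = []   # (name, value) pairs, in document order
        self._pending = None   # itemprop still waiting for its text value

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.item_types.append(attrs["itemtype"])
        prop = attrs.get("itemprop")
        if prop is None:
            return
        # Some elements carry the value in an attribute rather than in text.
        for key in ("content", "src", "href"):
            if attrs.get(key):
                self.properties.append((prop, attrs[key]))
                return
        # A property that opens a nested item has no text value of its own.
        if "itemscope" not in attrs:
            self._pending = prop

    def handle_data(self, data):
        text = data.strip()
        if self._pending and text:
            self.properties.append((self._pending, text))
            self._pending = None


sniffer = MicrodataSniffer()
sniffer.feed(SAMPLE_HTML)
print(sniffer.item_types)
for name, value in sniffer.properties:
    print(f"{name}: {value}")
```

Running it against your own page source is a quick sanity check that the properties you meant to mark up are actually present, though the extensions remain the more complete option.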
The META SEO inspector gives an even higher-level view of page data:
As illustrated by the screenshot above, the META SEO inspector lets you see all kinds of metadata provided to search engines, ranging from old-fashioned but still utilized metadata tags to schema.org information, Facebook Open Graph, Twitter tools/cards and more.
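The same grouping idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not the inspector's actual logic: the `MetaInspector` class and the sample tags are hypothetical, Open Graph tags are assumed to use a `property` attribute starting with `og:`, and Twitter card tags a `name` starting with `twitter:`.

```python
from html.parser import HTMLParser

# Hypothetical <head> section, invented for this example.
SAMPLE_HEAD = """
<head>
  <meta name="description" content="A sample product page.">
  <meta name="keywords" content="blender, kitchen">
  <meta property="og:title" content="Example Blender">
  <meta property="og:type" content="product">
  <meta name="twitter:card" content="summary">
</head>
"""


class MetaInspector(HTMLParser):
    """Sorts <meta> tags into standard, Open Graph and Twitter groups."""

    def __init__(self):
        super().__init__()
        self.groups = {"standard": {}, "open_graph": {}, "twitter": {}}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("property") or attrs.get("name")
        content = attrs.get("content")
        if not key or content is None:
            return  # e.g. <meta charset="utf-8"> has nothing to group
        if key.startswith("og:"):
            group = "open_graph"
        elif key.startswith("twitter:"):
            group = "twitter"
        else:
            group = "standard"
        self.groups[group][key] = content


inspector = MetaInspector()
inspector.feed(SAMPLE_HEAD)
for group, tags in inspector.groups.items():
    print(group, tags)
```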
The last extension I’m going to cover here is called Green Turtle RDFa. This extension provides not only a complete listing of the subject-predicate-object triples on a web page, but also a visualization of that information. Here’s a view of the information Green Turtle has gleaned from the Walmart product page we’ve been using as an example so far:
With the right settings enabled, this tool also extracts microdata. To turn that feature on:
After installing the Green Turtle extension in your Chrome browser, go to Tools –> Extensions and find it in your extensions list. Select “Options,” then check the box to Enable Microdata.
Now that you have enabled both RDFa and microdata parsing for the Green Turtle plugin, you should be able to see much more information. Check out the new results for that same Walmart product page:
Gruff is a tool that is downloadable for free (Mac or PC) and allows you to visualize the structured data (or triples: data entities composed of subject-predicate-object) harvested from a web page. The graphic below (extracted from a recent Search Engine Land article I wrote) will give you an idea of the type of information Gruff can give you.
To use Gruff, you must first download it here. To run it locally and use the simpler installation, I would recommend downloading the 3.3 version (you will see both when you select the download option).
Once Gruff is installed, you will need to create a “New Triple-Store” under the File menu. Once completed, you can then extract web page data by going to File –> Extract Microformat/RDFa Data from Web Page and then entering the URL in the box provided. (Leave the Graph Name field blank.)
When the program has finished extracting the data, go to the Display tab and select the last option, Display Triples of One Graph. This should bring up the data visualization map (as seen above).
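If the idea of a graph built from triples feels abstract, the following Python sketch may help. It is not Gruff's data model, just one simple way to picture it: each triple is a subject-predicate-object record, and grouping the triples by subject gives you nodes with labeled edges. The triples themselves are hypothetical values I made up for a product page.

```python
from collections import defaultdict, namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hypothetical triples of the sort a product page might yield.
triples = [
    Triple("_:product", "rdf:type", "schema:Product"),
    Triple("_:product", "schema:name", "Example Blender"),
    Triple("_:product", "schema:offers", "_:offer"),
    Triple("_:offer", "rdf:type", "schema:Offer"),
    Triple("_:offer", "schema:price", "49.99"),
]


def build_graph(triples):
    """Group triples by subject: each subject is a node, each pair an edge."""
    graph = defaultdict(list)
    for t in triples:
        graph[t.subject].append((t.predicate, t.object))
    return dict(graph)


graph = build_graph(triples)
for subject, edges in graph.items():
    print(subject)
    for predicate, obj in edges:
        print(f"  --{predicate}--> {obj}")
```

Notice that the object of one triple (`_:offer`) is the subject of others. Those shared nodes are what turn a flat list of triples into the connected map a visualizer draws.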
TextRazor is an API that analyses text input to determine information about specific entities within that text. With this tool, you can “extract the Who, What, Why and How” from the text of web pages, tweets, emails, etc. To see how it works, check out their demo page and input some text.
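If you would rather call the API from code than use the demo page, a rough Python sketch might look like the one below. Treat the details as assumptions to check against TextRazor's own documentation: the endpoint URL, the `X-TextRazor-Key` header, the `extractors` parameter and the response field names (`entityId`, `matchedText`, `relevanceScore`) are written from my understanding of their REST interface. The helper names and the sample payload are hypothetical, and no network call is made here.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; confirm against the TextRazor documentation.
TEXTRAZOR_ENDPOINT = "https://api.textrazor.com/"


def build_request(text, api_key):
    """Prepare (but do not send) a POST request asking for entities."""
    body = urllib.parse.urlencode(
        {"text": text, "extractors": "entities"}
    ).encode("utf-8")
    return urllib.request.Request(
        TEXTRAZOR_ENDPOINT,
        data=body,  # supplying data makes this a POST
        headers={"X-TextRazor-Key": api_key},
    )


def summarize_entities(payload):
    """Return (entity, matched text, relevance) rows, most relevant first."""
    entities = payload.get("response", {}).get("entities", [])
    rows = [
        (e.get("entityId"), e.get("matchedText"), e.get("relevanceScore", 0.0))
        for e in entities
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hand-written stand-in for a response; not real TextRazor output.
sample_payload = {
    "response": {
        "entities": [
            {"entityId": "Search engine optimization",
             "matchedText": "SEO", "relevanceScore": 0.6},
            {"entityId": "Google",
             "matchedText": "Google", "relevanceScore": 0.9},
        ]
    }
}

for row in summarize_entities(sample_payload):
    print(row)

# To send a real request, you would need your own API key:
# with urllib.request.urlopen(build_request("Some text", "YOUR_API_KEY")) as resp:
#     payload = json.load(resp)
```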
As an example, here’s what TextRazor came up with when analyzing the first two paragraphs of one of my previous columns:
Other useful tools and APIs for named entity extraction over text include:
…as well as many, many more. (I would invite an open discussion in the comments to create a more extensive useful list.)
These tools can be fun to play with, while providing a helpful understanding of how entities and entity graphs can be derived from both structured and unstructured information sources in a web page.