A California-based startup is aiming to organize the internet’s vast swaths of unstructured data into a navigable, indexed database. Known as Diffbot, the company extracts knowledge automatically from online documents and other sources. After a year-long private pilot program, Diffbot has finally gone public as it tries to build the first comprehensive map of human knowledge by analyzing every page on the internet. The project’s origins trace back to artificial intelligence (AI) work by founder Mike Tung, who spent five years developing the tools needed to accomplish this goal.
By leveraging computer vision and natural language processing, Diffbot’s web crawler can analyze the layout and structure of practically any webpage (covering about 90 percent of the internet across roughly 20 different page types) for facts, figures, and abstract connections between objects. Notable examples include product pages on sites like Amazon, or an executive’s biography on a company webpage. According to Tung, about 30 percent of a knowledge worker’s job involves gathering data, which points to a genuine market opportunity for a horizontal graph: a database containing information on people, businesses, and things.
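As a rough illustration of what page-level extraction looks like from a developer’s side, the sketch below calls an automatic-analysis endpoint that detects the page type and returns structured fields. The endpoint path and parameter names are assumptions based on Diffbot’s published extraction APIs and may differ from the current interface; the token and URL are placeholders.

```python
import requests

DIFFBOT_TOKEN = "YOUR_API_TOKEN"  # placeholder credential

def analyze_page(url: str) -> dict:
    """Ask Diffbot to detect the page type and extract structured fields.

    The endpoint and parameters below are assumptions for illustration;
    consult Diffbot's documentation for the authoritative interface.
    """
    response = requests.get(
        "https://api.diffbot.com/v3/analyze",   # assumed endpoint
        params={"token": DIFFBOT_TOKEN, "url": url},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    result = analyze_page("https://www.example.com/some-product-page")
    # The response typically reports the detected page type (e.g. "product",
    # "article") alongside the extracted objects.
    print(result.get("type"), len(result.get("objects", [])))
```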
Data gathered by Diffbot enters a giant database called the Diffbot Knowledge Graph (DKG), which contains more than 1 trillion facts and 10 billion distinct entities, and adds nearly 130 million facts per month. The index’s main categories cover individual people (profiles, skills, job and education history, and even social media accounts), companies, locations (mapping data, addresses, business types, and zoning information), articles (every published story, with datelines and bylines, from anywhere on the web in any language), discussions (chats and social sharing), and images.
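To make the entity categories concrete, the sketch below models what a person record in such an index might contain, using a plain Python dictionary. The field names are hypothetical stand-ins mirroring the categories listed above, not Diffbot’s actual schema.

```python
# Hypothetical person entity, loosely mirroring the DKG categories described
# above. Field names are illustrative only, not Diffbot's actual schema.
person_entity = {
    "type": "Person",
    "name": "Jane Doe",  # placeholder individual
    "skills": ["machine learning", "public speaking"],
    "employments": [
        {"employer": "Example Corp", "title": "VP of Engineering"},
    ],
    "educations": [
        {"institution": "Example University", "degree": "BSc Computer Science"},
    ],
    "social_profiles": ["https://www.linkedin.com/in/janedoe"],
}

# Each record would link out to other entity types (organizations, locations,
# articles, images), which is what makes the index a graph rather than a flat list.
print(person_entity["employments"][0]["employer"])
```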
This trove of data can be accessed through API calls and queried using the Diffbot Query Language (DQL), which expresses questions as structured, well-formed queries. DKG results can be viewed as lists, maps, or tables in Diffbot’s web-based UI, or piped into third-party content management systems and analytics platforms. So far, Diffbot has attracted notable clients including Microsoft, eBay, Salesforce, and DuckDuckGo, all of which use the service to improve the quality of their search results.
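A minimal sketch of querying the Knowledge Graph programmatically might look like the following. The endpoint path, parameter names, and DQL syntax are assumptions based on Diffbot’s public documentation and may differ from the current API; the token is a placeholder.

```python
import requests

DIFFBOT_TOKEN = "YOUR_API_TOKEN"  # placeholder credential

def query_dkg(dql_query: str, size: int = 10) -> dict:
    """Run a DQL query against the Knowledge Graph endpoint (assumed URL)."""
    response = requests.get(
        "https://kg.diffbot.com/kg/v3/dql",  # assumed endpoint
        params={
            "token": DIFFBOT_TOKEN,
            "type": "query",
            "query": dql_query,
            "size": size,
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# Illustrative query: look up an organization entity by name.
results = query_dkg('type:Organization name:"Diffbot"')
for entity in results.get("data", []):
    print(entity.get("entity", {}).get("name"))
```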
How the technology works is fairly straightforward. Anyone wanting to perform a one-off search for, say, a specific shoe brand goes to Diffbot’s web dashboard, types the sneaker brand into a Google-like search bar, and hits enter. Within milliseconds, the user gets a product profile compiled from sources around the internet. To search for something like a news story, you could type in the author’s name, and the database would return every article that writer has published online, in every language it has been published in. A more focused search on an individual takes the searcher to a CV-like work history, drawing on facets from up to hundreds of biographies, articles, and publicly available profiles.
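Expressed as queries rather than dashboard searches, those two examples might look roughly like the snippets below. The entity types, field names, brand, and author are illustrative guesses at DQL syntax, not verified query strings.

```python
# Rough DQL equivalents of the dashboard searches described above.
# Entity types and field names are illustrative guesses, not verified syntax.

# One-off product lookup for a specific sneaker brand (hypothetical brand name).
product_query = 'type:Product brand:"ExampleSneakerBrand"'

# Every article attributed to a given writer, regardless of language.
article_query = 'type:Article author:"Jane Doe"'

# These strings would be passed to the Knowledge Graph endpoint, for example
# via the query_dkg helper sketched earlier.
print(product_query)
print(article_query)
```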
One of Diffbot’s unique strengths is its ability to quickly narrow searches down by entity, which makes it useful for job recruitment and similar tasks. A well-formed DQL query can gather every employee at a company, along with their job titles, skills, education history, and social media accounts. By comparison, Google’s Knowledge Graph has been criticized in the past for failing to attribute and for omitting sources, two shortcomings Tung claims Diffbot addresses simultaneously. Tung also claims Diffbot is more comprehensive and accurate than manually assembled databases. The DKG is regularly supplied with fresh information, and its machine learning algorithms are smart enough to disregard sites with a history of reporting or containing “logically inconsistent” facts.
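For instance, a query pulling every current employee of a given company, together with title, skills, and education, might look something like the sketch below. The field paths are assumptions modeled on the general shape of Diffbot’s published DQL examples, not a verified schema, and the company name is a placeholder.

```python
# Hypothetical DQL query for all current employees of a company; treat the
# field paths and the company name as illustrative assumptions.
employee_query = (
    'type:Person '
    'employments.employer.name:"Example Corp" '
    'employments.isCurrent:true'
)

# Passed to the Knowledge Graph endpoint (e.g. via the query_dkg helper above),
# the result set would carry each person's title, skills, education history,
# and linked social profiles, which a recruiter could then filter further.
print(employee_query)
```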