Anatomy of a Symbol Crawler

Web crawlers traverse the web, building a graph of URLs that can then be ranked for web search.

Would it be possible to create a crawler that will crawl concepts?

This 20-page PDF document describes an architecture for a symbol crawler. Based on a micro-service architecture, it describes how a small number of services can enable crawlers to build a graph of symbols, in which some symbols are topics and others are related to those topic symbols.

Here is a simplified view of what the crawlers do during a crawl.

  1. A crawler process pops a symbol off a symbol queue. The queue can be distributed, and so can the crawlers, working in a cluster.
  2. The crawler uses the symbol relevance micro-service to obtain symbols relevant to this input symbol. Any output symbol not already in the symbol graph is inserted into it, and links are created between the input symbol and the output symbols.
  3. The crawler continues until the symbol queue is exhausted or the crawler is halted for another reason (e.g. the time period for the crawl, say 5 days, has elapsed).

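The loop above can be sketched in a few lines of Python. This is a simplified, single-process sketch: the function and variable names are illustrative, and an in-process deque stands in for the distributed symbol queue.

```python
from collections import deque

def crawl(seed_symbols, get_relevant_symbols, max_steps=1000):
    """Simplified crawl loop: pop a symbol, fetch related symbols,
    grow the graph, and schedule newly seen symbols."""
    queue = deque(seed_symbols)          # stand-in for the distributed symbol queue
    graph = {}                           # symbol -> set of linked symbols
    steps = 0
    while queue and steps < max_steps:   # halt on exhaustion or a step budget
        symbol = queue.popleft()
        related = get_relevant_symbols(symbol)   # symbol relevance micro-service
        links = graph.setdefault(symbol, set())
        for other in related:
            links.add(other)
            if other not in graph:       # insert unseen symbols and schedule them
                graph[other] = set()
                queue.append(other)
        steps += 1
    return graph

# toy relevance function for illustration
toy = {"tiger": ["mammal", "cat"], "mammal": ["animal"], "cat": [], "animal": []}
graph = crawl(["tiger"], lambda s: toy.get(s, []))
```

A real deployment would replace the deque with the symbol queue micro-service and the lambda with a call to the symbol relevance micro-service.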
In practice, things are more complicated, largely because the crawler must deal with the fact that a word or phrase may have multiple senses, such as tiger (mammal), tiger (tank), or tiger (butterfly). The micro-service architecture, in fact, has a micro-service whose mission is to obtain the senses of words and phrases.
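As a sketch, such a sense listing micro-service could expose an interface like the following. The in-memory table and the function name are illustrative assumptions; a real service might consult a lexical resource such as Wikipedia disambiguation pages.

```python
# Hypothetical sense table; a real service would query a lexical resource.
SENSES = {
    "tiger": ["tiger (mammal)", "tiger (tank)", "tiger (butterfly)"],
}

def list_senses(phrase):
    """Return the known senses of a word or phrase.
    Phrases with no recorded ambiguity map to themselves."""
    return SENSES.get(phrase, [phrase])
```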

The core micro-services of this architecture include:

  * A symbol queue micro-service, which hands the next symbol to a crawler.
  * A symbol relevance micro-service, which, given a symbol, returns relevant symbols.
  * A sense listing micro-service, which returns the senses of a word or phrase.

The diagram below shows this micro-service architecture, and how crawler instances relate to these micro-services.

One or more crawlers operate by taking the next symbol from the symbol queue using the symbol queue micro-service. The crawler then searches for more symbols, and this search includes obtaining the appropriate senses of the word or phrase using the sense listing micro-service.

An actual implementation of this symbol crawler architecture could, and very likely would, have additional micro-services. Which micro-services are added would depend on the implementation.

Topic Scout, in particular, works by taking a training set and discovering symbols in it using machine learning. To fully scale, Topic Scout obtains such a training set through one or more queries. These queries are executed using a search engine, yielding URLs, and a set of documents is then fetched from those URLs. In effect, a micro-corpus is obtained, and Topic Scout's proprietary relevance engine then extracts symbols from that micro-corpus by machine learning.

To fully scale, a Topic Scout implementation of this symbol crawler architecture will need something novel: a query discovery micro-service. The job of the query discovery micro-service is to take a word or phrase, along with the sense of that word or phrase, and generate a query (or set of queries) that will generate a micro-corpus for this topic. Such a query discovery micro-service might be implemented by rules. More ambitiously, it might use query induction based on machine learning. An initial implementation is likely to be based on rules.

For example, take the word "tiger" with the sense "animal". Such a query discovery micro-service might use DBPedia to determine that a tiger is a mammal. Most importantly, it would then determine that the query "tiger mammal" should be used to compute a micro-corpus for this topic. This is just one example of a rule that might be used to generate queries for a topic. A rule-based query discovery service might have many such rules.
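The hypernym rule above can be sketched as follows. The lookup table stands in for a DBPedia query, and the names and fallback rule are assumptions for illustration, not part of the described architecture.

```python
# Hypothetical hypernym table; a real service would query DBPedia.
HYPERNYMS = {("tiger", "animal"): "mammal", ("tiger", "tank"): "tank"}

def discover_queries(phrase, sense):
    """Rule: pair the phrase with its hypernym to bias search toward the sense."""
    hypernym = HYPERNYMS.get((phrase, sense))
    if hypernym:
        return [f"{phrase} {hypernym}"]
    return [f"{phrase} {sense}"]         # fallback rule: append the sense label
```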

Ideally, such query discovery itself could be done using machine learning, but at present, I know of no way to do this.

Once the queries are returned from the query discovery micro-service, what happens? The Topic Scout implementation of this crawler architecture will have a web search micro-service. Each query from the query discovery micro-service is executed using the web search micro-service, producing a result set. If multiple queries are involved, each is executed in turn, and the union of the query results provides the URLs for the micro-corpus. Each document from this set of URLs is then fetched, and in this way a micro-corpus is prepared. The micro-corpus is then used by the symbol relevance micro-service to discover relevant symbols for that topic.
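The micro-corpus assembly step can be sketched as a small function. The toy search and fetch functions are illustrative stand-ins for the web search micro-service and a document fetcher.

```python
def build_micro_corpus(queries, search, fetch):
    """Run each query through the web search service, union the result URLs,
    and fetch each document to form the micro-corpus."""
    urls = set()
    for query in queries:
        urls.update(search(query))       # web search micro-service
    return {url: fetch(url) for url in sorted(urls)}

# toy search results and fetcher for illustration
results = {"tiger mammal": ["u1", "u2"], "tiger habitat": ["u2", "u3"]}
corpus = build_micro_corpus(
    ["tiger mammal", "tiger habitat"],
    lambda q: results.get(q, []),
    lambda url: f"document at {url}",
)
```

Note that taking the union of result sets deduplicates URLs that appear in more than one query's results, so each document is fetched once.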

New symbols are integrated into the symbol graph, and so crawling goes on and on... Of course, a frontier of symbols is maintained, just as a frontier of URLs is maintained for a web crawler.
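A minimal sketch of such a symbol frontier, assuming the key invariant carried over from web crawling: each symbol is scheduled at most once.

```python
from collections import deque

class SymbolFrontier:
    """Frontier of symbols, analogous to a web crawler's URL frontier."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def add(self, symbol):
        """Schedule a symbol unless it has already been seen."""
        if symbol not in self._seen:
            self._seen.add(symbol)
            self._queue.append(symbol)

    def pop(self):
        """Return the next symbol to crawl, or None when the frontier is empty."""
        return self._queue.popleft() if self._queue else None
```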

The Topic Scout implementation of this crawler architecture will, in addition to micro-services, depend heavily on queues, such as those provided by a JMS provider like ActiveMQ. The symbols to be crawled will, in fact, simply be a queue of symbols provided by ActiveMQ. Workers consume from each queue and perform their particular defined task. For example, symbols that need to be processed by a relevance engine are placed in a queue for this purpose. A battery of workers, each with its own client connection to the physical queue, pull a symbol from the queue when idle and process it. The actual output - a sequence of symbols - is then placed in another queue, waiting for a worker that handles the output, such as placing new symbols into the symbol graph.
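The worker pattern described above can be sketched in Python, with the standard library's queue module standing in for a JMS provider such as ActiveMQ; the queue names, the sentinel-based shutdown, and the toy relevance function are illustrative assumptions.

```python
# In-process stand-in for JMS queues; a real deployment would use ActiveMQ.
import queue
import threading

def relevance_worker(in_q, out_q, get_relevant_symbols):
    """Pull symbols from the input queue, run the relevance engine,
    and place the resulting symbol sequences on the output queue."""
    while True:
        symbol = in_q.get()
        if symbol is None:               # sentinel: shut the worker down
            in_q.task_done()
            break
        out_q.put((symbol, get_relevant_symbols(symbol)))
        in_q.task_done()

symbols_q, results_q = queue.Queue(), queue.Queue()
worker = threading.Thread(
    target=relevance_worker,
    args=(symbols_q, results_q, lambda s: [s + "-related"]),
)
worker.start()
symbols_q.put("tiger")
symbols_q.put(None)                      # signal shutdown after the work item
worker.join()
```

In a JMS setting the sentinel would be unnecessary; workers would simply block on the broker's queue, and a battery of such workers could run across machines.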

Of course, other implementations of this symbol crawler architecture might do things quite differently. The symbol crawler architecture is intentionally minimal, so it would even be possible for symbol crawlers with distinct implementations to work jointly on the same symbol graph. It could be limiting to think that any single crawler is "best".