Topic Scout Home

Information about Topic Scout

This web site provides access to a topic directory of over 9,000 topics. What is special here is that each topic is associated with a topic lexicon, which consists of words and phrases relevant to that topic. Try this two-minute experiment. Imagine you were given the name of a topic, such as "dating", and asked to provide a list of words and phrases especially relevant to that topic. Give yourself a minute or two, and write down the words and phrases you think are relevant to dating. After doing this, look at what Topic Scout produced: see Topic Scout topic for dating. Topic Scout produces such a list of words and phrases through data mining.

You will find the list of words and phrases Topic Scout produced quite relevant to the topic of dating. What really matters is that Topic Scout can do this for an extremely large number of topics, and produce good results very consistently. This is true of all kinds of topics, from highly technical, to business, to sports, the arts, and much more. If you are technical and are interested in data mining, take a look at Topic Scout's lexicon for data mining. In general, Topic Scout can learn from unstructured text, so the cost of creating training sets is near zero. Topic Scout does not rely on human-curated training sets.

So Topic Scout mines web data with almost no human involvement today. For example, the dating topic was produced through the input of one word: "dating".

How does Topic Scout do this? Each topic is associated with one or more search queries. These search queries are executed using a web search engine. The documents referred to by the query results are then fetched from the web. The result is a set of documents to be used for training and testing. The training set of documents is fed into the Topic Scout relevance engine. That relevance engine then discovers words and phrases relevant to that topic. It is these words and phrases that you see on the web pages of the topic directory provided by this web site.
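The steps above can be sketched in Java. This is only an illustration of the data flow, not Topic Scout's code: the class name, the `searchEngine` and `fetcher` parameters, and `maxTerms` are all hypothetical. The relevance engine in particular is replaced by a plain word count, which is a stand-in and not the algorithm Topic Scout uses.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of the query -> fetch -> mine flow; all names are assumptions. */
final class TopicPipelineSketch {

    /** Runs every stage for one topic and returns up to maxTerms of the most frequent words. */
    static List<String> buildLexicon(
            List<String> queries,
            Function<String, List<String>> searchEngine, // assumed: query -> result URLs
            Function<String, String> fetcher,            // assumed: URL -> document text
            int maxTerms) {

        // Stages 1 and 2: execute each query, then fetch every document it refers to.
        List<String> documents = new ArrayList<>();
        for (String query : queries) {
            for (String url : searchEngine.apply(query)) {
                documents.add(fetcher.apply(url));
            }
        }

        // Stage 3: stand-in for the relevance engine (NOT the real algorithm).
        // It simply counts word occurrences across the fetched documents.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String doc : documents) {
            for (String token : doc.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }

        // Sort by descending count; the sort is stable, so ties keep first-seen order.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        List<String> lexicon = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            if (lexicon.size() == maxTerms) {
                break;
            }
            lexicon.add(entry.getKey());
        }
        return lexicon;
    }
}
```

Passing the search engine and the fetcher in as functions keeps the sketch independent of any particular search API or HTTP client; in a test they can be backed by in-memory maps.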

The directory of topics at this web site, despite its size, is by no means complete. This topic directory does, however, provide evidence that very large topic-oriented taxonomies can be created, and produced for very little cost per topic. Very large here means hundreds of thousands of topics, even millions of topics.

The data on this web site was also developed for a specific purpose. Topic Scout has been used to create a system that identifies the topics of web pages, and its alpha version reached nearly 90% accuracy. It classified hundreds of thousands of web pages, with many thousands of topics to choose from. Topic Scout was remarkably good at picking just the right topic from among those thousands. Nothing prevents Topic Scout from doing classification with hundreds of thousands, or even millions, of topics to choose from.
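How the classifier scores a page is not described here. As a rough illustration only, one simple way to use topic lexicons for classification is to sum the weights of the lexicon terms that appear on a page and pick the topic with the highest total. The sketch below assumes that approach; the class name, the weighted-lexicon representation, and the scoring rule are all hypothetical and handle single-word terms only.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical lexicon-overlap classifier; the scoring rule is an assumption. */
final class LexiconClassifierSketch {

    /**
     * Returns the topic whose lexicon weights sum highest over the page's words,
     * or an empty Optional when no lexicon term occurs on the page.
     * lexicons maps topic name -> (single-word term -> weight).
     */
    static Optional<String> classify(String pageText, Map<String, Map<String, Double>> lexicons) {
        String[] tokens = pageText.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        String bestTopic = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Double>> topic : lexicons.entrySet()) {
            double score = 0.0;
            for (String token : tokens) {
                // Words that are not in this topic's lexicon contribute nothing.
                score += topic.getValue().getOrDefault(token, 0.0);
            }
            if (score > bestScore) {
                bestScore = score;
                bestTopic = topic.getKey();
            }
        }
        return Optional.ofNullable(bestTopic);
    }
}
```

This version loops over every topic for every page, which is fine for a demonstration. With many thousands of topics, an inverted index from term to topics would avoid scanning lexicons that share no words with the page.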

It is interesting to compare what data Topic Scout provides versus some other sources. Consider three sources of data on Elvis Presley.

  1. Wikipedia on Elvis Presley
  2. A graphic view based on DBPedia provided by Yago
  3. Topic Scout's lexicon on Elvis Presley

The article on Wikipedia is obviously created by human beings. The graph-based view of Elvis Presley from Yago, with its nodes and edges, was in turn created from content in DBPedia. Here is where Topic Scout is different. Topic Scout picked up data on Elvis Presley *without* relying on either Wikipedia or DBPedia. Instead, Topic Scout scanned content from the web, and did data mining on that content to find out what was particularly relevant to Elvis Presley. So Topic Scout was not relying on human-curated content, which makes its outcome all the more impressive. In finding what was relevant to Elvis Presley, Topic Scout picked up large numbers of his songs and movies, his birth date, Graceland, something that we all associate with Elvis Presley (his swivelling hips), and a lot more. If entity recognition were added to Topic Scout, it might arguably exceed the DBPedia-powered material provided by Yago in both quantity and quality. So while curated content has many advantages, content obtained strictly by data mining has advantages of its own. Most notably, Topic Scout is not constrained to the curated content provided by DBPedia.

Does Topic Scout only work for English language text? This web site only pertains to topics in the English language. Topic Scout is, however, not just about English or any specific language. Topic Scout is multi-lingual, and is not limited to Indo-European languages. It has been applied to German, French and Spanish, and should also work for Japanese and Chinese. Nor is Topic Scout limited to one- or two-word combinations as seen on this web site. Topic Scout can be reconfigured to look for longer strings of relevant words and phrases. At present, it uses one- or two-word combinations, but that limitation can be lifted easily enough.
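To make the one- or two-word limit concrete, here is a small sketch of candidate phrase generation with a configurable maximum length. It is an illustration, not Topic Scout's code: the class and method names are hypothetical, and splitting on non-letter characters is an assumption that would not segment Japanese or Chinese text, which is written without spaces between words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of candidate phrase generation; names are assumptions. */
final class NGramSketch {

    /** Returns every run of 1..maxWords consecutive words, ordered by start position. */
    static List<String> candidatePhrases(String text, int maxWords) {
        // Split on anything that is not a Unicode letter or decimal digit.
        List<String> words = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }

        List<String> phrases = new ArrayList<>();
        for (int start = 0; start < words.size(); start++) {
            // Grow the phrase one word at a time until the limit or the end of the text.
            for (int len = 1; len <= maxWords && start + len <= words.size(); len++) {
                phrases.add(String.join(" ", words.subList(start, start + len)));
            }
        }
        return phrases;
    }
}
```

Calling `candidatePhrases("Elvis Presley songs", 2)` yields the one- and two-word candidates; raising `maxWords` to 3 also yields the full three-word phrase, which is the kind of change the paragraph above describes as lifting the limit.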

How was Topic Scout implemented? Topic Scout was implemented in the Java programming language. It uses a NoSQL approach to data storage. It uses a job-chaining form of map-reduce to do its core processing. Other technologies used include, but are not limited to, Snowball (for stemming) and Lucene. Internally, it uses a pipeline architecture for data mining. It has a highly optimized runtime classifier that encodes all strings into numbers for runtime speed.
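Encoding strings into numbers is a common technique, often called dictionary encoding. The sketch below shows the general idea under the assumption that each distinct term gets the next free integer id; the class and method names are hypothetical, and the actual encoding Topic Scout uses is not described here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical dictionary encoder: maps each distinct term to a small integer id. */
final class TermDictionarySketch {
    private final Map<String, Integer> idsByTerm = new HashMap<>();
    private final List<String> termsById = new ArrayList<>();

    /** Returns the id for a term, assigning the next free id the first time it is seen. */
    int encode(String term) {
        Integer id = idsByTerm.get(term);
        if (id == null) {
            id = termsById.size();
            idsByTerm.put(term, id);
            termsById.add(term);
        }
        return id;
    }

    /** Encodes a whole token sequence into a compact int array. */
    int[] encodeAll(List<String> terms) {
        int[] ids = new int[terms.size()];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = encode(terms.get(i));
        }
        return ids;
    }

    /** Returns the term that was assigned the given id. */
    String decode(int id) {
        return termsById.get(id);
    }
}
```

Once terms are integers, comparisons and lookups at classification time work on primitive values and arrays instead of strings, which is the usual reason for this kind of encoding.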

Topic Scout uses a novel data mining algorithm that I invented while walking around a small man-made lake in Newark, California. This algorithm is not Latent Semantic Analysis, deep learning, Word2Vec, a Bayesian method, or the like. The algorithm is excellent at finding the relevant words and phrases in both broad-based topics and extremely specific topics.

This non-production version of the relevance engine does have some limitations. All of these limitations can be lifted.

If you want a white paper on Topic Scout, please send an e-mail to Stefan Gower asking for the white paper.

Also feel free to look at our blog.