Topic Scout Home
Limitations of non-production version
- The current implementation of Topic Scout actually supports n-grams as large as 5 internally, yet
the n-grams shown here are only one or two words long. Why? It is a limitation
of the data mining at present. The limitation is fairly straightforward to remove, but until it is,
the topic lexicons are restricted to symbols like united_states but not united_states_america.
- Topic Scout is currently not doing lexicon smoothing. What is lexicon smoothing?
Some topics - such as celebrities, politics, and sports - are volatile and change frequently,
so any data mining done on such topics reflects the time at which the mining was done.
For example, if there has just been a major political scandal when the topic for
/government/politics is created, its lexicon may overly reflect the specifics of that scandal.
Within even a few weeks, the scandal may be out of the newspapers and the blogosphere,
and it would be gone from a freshly mined lexicon. For such volatile topics, it is quite impossible
to get a proper sense of the topic unless it is mined multiple times at different points in time. The words and phrases
that remain in the topic lexicon across those minings are likely the true perennial words and phrases of that topic.
Because Topic Scout is not doing such lexicon smoothing, the lexicons for volatile topics may be below par.
A rough sketch of this kind of smoothing appears at the end of this page.
- Topic Scout does not at present have access to production-quality boilerplate removal software. Such
software extracts the real content from a web page and ignores extraneous content. Open source versions
of such software exist today, but they were not found to be adequate. Certainly some of the major
search companies must have very good code to do this, but as far as I know, they're not sharing it.
I did code up my own boilerplate removal software, but it was experimental and not for production.
Despite this, that software did a far better job than Boilerpipe (a popular open source package). I tried to
use Boilerpipe and gave up on it; it just didn't get the job done. The key problem with Boilerpipe
is that it has a page-centric view of the world, and that is quite inadequate. In my experience,
most of the crap on a web page is also crap found on related web pages on the same domain. This
is not always true, but it is true an awful lot of the time. That simple heuristic is enough to create boilerplate removal
software that looks for shingles in a web page that are common to "neighbor" web pages in the
same domain. If you then remove the most frequent of these re-occurring shingles, you generally
get rid of a *lot* of noise. This is actually pretty simple data mining; a sketch of the idea appears at the end of this page.
Of course, there is a cost to searching other related web pages,
but if done properly, that cost would be amortized, so the per-page cost of these computations could be
kept reasonably low. Of course, there are intra-page noise heuristics as well. Still, in my
own experience looking at a very large number of web pages, I would far prefer boilerplate removal software
that handles inter-page noise. In my opinion, inter-page noise is where most of the pain lies.
- Topic Scout's data sources were obtained from a search engine without applying a porn filter.
Generally, this hasn't been a problem, but on reflection, it would probably be better to turn the porn filter on.
- Each topic in Topic Scout is associated with one or more queries. Sometimes I made mistakes when creating
a query. For example, I once created a topic for ../product/recall and just used the query "recall". Of course,
that query conflated different senses of the word, such as product recall and memory recall. Hence
some topics in the topic directory are very likely to have such query mistakes in them.
- Since this topic tree was built, a new technique was invented that substantially improves the
lexicons for certain topics. If applied, this improvement would affect about 5% of the overall topics.
- Topic lexicons may contain noise symbols, such as click_link or send_email, especially if the lexicons are untuned.
In the case of topic lexicons used for text classification, such noise can be discovered and removed.
In a knowledge graph, such noise symbols will need to be removed by other means. The document on the symbol crawler
describes such noise symbol removal.
- It would really help to use some of the cool technology produced by Diffbot.
Diffbot provides a way to extract content from the web by using deep learning to discover the text of a web page as
it is actually displayed in a web browser. This is a fantastic technology for any service that needs
to extract data from the web, as it is very hard to get accurate and complete content from a web page when
there is so much dynamic content. Diffbot solves this problem, and solves it in a very elegant way.
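
Topic Scout does not implement lexicon smoothing, so the following Python sketch is only an illustration of the idea described above: mine the same topic at several points in time, keep the terms that keep reappearing, and average their scores. The function name smooth_lexicon, the dict-of-scores representation of a lexicon, and the min_fraction threshold are assumptions made for this sketch, not part of Topic Scout.

```python
from collections import Counter

def smooth_lexicon(lexicon_snapshots, min_fraction=0.6):
    """Keep only the terms that recur across enough independently mined
    snapshots of the same topic's lexicon.

    lexicon_snapshots: a list of dicts, one per mining run, mapping a
    term (e.g. "united_states") to its mined score in that run.
    min_fraction: a term must appear in at least this fraction of the
    runs to be considered perennial.
    """
    runs = len(lexicon_snapshots)
    appearances = Counter()
    totals = Counter()
    for snapshot in lexicon_snapshots:
        for term, score in snapshot.items():
            appearances[term] += 1
            totals[term] += score

    smoothed = {}
    for term, count in appearances.items():
        if count / runs >= min_fraction:
            # Average the score over the runs in which the term appeared.
            smoothed[term] = totals[term] / count
    return smoothed

# A scandal-specific phrase shows up in only one run and is dropped,
# while the perennial political vocabulary survives.
run1 = {"election": 3.1, "congress": 2.7, "scandal_du_jour": 4.0}
run2 = {"election": 2.9, "congress": 2.5, "senate": 1.8}
run3 = {"election": 3.3, "congress": 2.2, "senate": 2.0}
print(smooth_lexicon([run1, run2, run3]))  # scandal_du_jour is gone
```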
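My experimental boilerplate removal code is not included on this page, so this is only a rough sketch of the inter-page shingle heuristic described above: collect word shingles from "neighbor" pages on the same domain, and drop the blocks of the target page whose shingles keep recurring on those neighbors. The shingle length, the min_neighbors threshold, and the 50% cutoff below are illustrative choices, not the values used in my experiments.

```python
import re
from collections import Counter

def shingles(text, k=8):
    """Yield overlapping k-word shingles from a chunk of text."""
    words = re.findall(r"\w+", text.lower())
    for i in range(len(words) - k + 1):
        yield " ".join(words[i:i + k])

def strip_interpage_boilerplate(page_blocks, neighbor_pages, k=8, min_neighbors=2):
    """Drop blocks of a page whose shingles also occur on several neighbor
    pages from the same domain; such blocks are very likely navigation,
    footers, and other boilerplate rather than real content.

    page_blocks: the target page split into text blocks (e.g. lines or paragraphs).
    neighbor_pages: raw texts of related pages on the same domain.
    min_neighbors: a shingle seen on at least this many neighbors counts as boilerplate.
    """
    # Count, for each shingle, how many neighbor pages contain it.
    neighbor_counts = Counter()
    for neighbor in neighbor_pages:
        for sh in set(shingles(neighbor, k)):
            neighbor_counts[sh] += 1

    kept = []
    for block in page_blocks:
        block_shingles = list(shingles(block, k))
        if not block_shingles:
            kept.append(block)  # too short to judge; keep it
            continue
        boiler = sum(1 for sh in block_shingles
                     if neighbor_counts[sh] >= min_neighbors)
        # Keep the block only if fewer than half of its shingles recur on neighbors.
        if boiler / len(block_shingles) < 0.5:
            kept.append(block)
    return kept
```

The intent is that site-wide navigation and footer text, which repeats verbatim on neighbor pages, gets dropped, while the article body, which does not repeat, survives.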