As far as I can envision, there are three high-level corpora for a given site that the classifier could use. In all cases, the classifier might produce multiple strong signals about what the content of the site is (and "strong" will also need to be defined):
1. Looking at the homepage of a site and using the content/other signals there to determine the topics for the site.
2. Looking at all of the content on a site and using that content/other signals to determine the topics for the site.
3. Looking at all of the content on a site, weighting it by usage, and then using that content/other signals to determine the topics for the site.
While the first and second options might be appealing because they are simple, they will probably give a very inaccurate view of the content of many sites. I think the third would give the most accurate view of what a site is actually about, since it reflects the pages people actually visit.
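To make the difference between the second and third options concrete, here is a minimal sketch. It assumes some per-page classifier has already produced topic scores between 0 and 1 for each page, and that visit counts are available. The function name `aggregate_topics`, the `(topic_scores, visits)` input shape, and the `threshold` value standing in for "strong" are all hypothetical and not part of any proposal.

```python
from collections import defaultdict


def aggregate_topics(pages, weight_by_usage=True, threshold=0.2):
    """Combine per-page topic scores into site-level topics.

    pages: list of (topic_scores, visits) tuples, where topic_scores maps a
           topic name to a score in [0, 1] and visits is a page view count.
    weight_by_usage: True approximates the third option (usage-weighted),
           False approximates the second option (every page counts equally).
    threshold: hypothetical cutoff for what counts as a "strong" signal.
    """
    # With usage weighting each page counts in proportion to its visits;
    # without it, each page gets a weight of 1.
    weights = [visits if weight_by_usage else 1 for _, visits in pages]
    total = sum(weights)
    if total == 0:
        return {}

    site_scores = defaultdict(float)
    for (topic_scores, _), weight in zip(pages, weights):
        for topic, score in topic_scores.items():
            site_scores[topic] += score * weight / total

    # Keep only topics whose aggregated score clears the cutoff.
    return {t: s for t, s in site_scores.items() if s >= threshold}


if __name__ == "__main__":
    # Hypothetical site: many rarely visited recipe pages, one busy news page.
    pages = [({"news": 1.0}, 9000)] + [({"food": 1.0}, 10)] * 9
    print(aggregate_topics(pages, weight_by_usage=False))
    print(aggregate_topics(pages, weight_by_usage=True))
```

In the hypothetical data above, the unweighted pass is dominated by the nine recipe pages, while the usage-weighted pass is dominated by the single heavily visited news page, which is the gap between the second and third options.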
There's an ingrained assumption in this that a site would be limited to one set of topics. I'd like to challenge that assumption, as it is unlikely to work well for many sites. The most obvious classes of sites where I can see it causing problems are:
- Major news publishers. 'News', for example, would lack specificity, or even accuracy, when referring to their food & drink section.
- Larger businesses with diversified products. To increase ad relevance, it would be helpful to know whether a user who has visited an insurance company's site, for example, was looking at car insurance or life insurance.
@JamesFinlayson It looks like both of your examples would be helped by allowing sites to set a section name for categories of content (#17). A large news site could put food and drink in a separate section from general news, and a diversified company could choose how it wanted to organize its products and services for classification purposes. (The sites would not supply their own topics; they would just tell the classifier which pages should be treated as a group for classification purposes.)
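As a rough sketch of how section names could feed the classifier, the snippet below groups pages by a site-supplied section label and aggregates topics per section rather than per site. How a site would actually declare a section is not settled here, so the `(section, topic_scores, visits)` input shape, the function name `topics_by_section`, and the `threshold` value are assumptions for illustration only.

```python
from collections import defaultdict


def topics_by_section(pages, threshold=0.2):
    """Aggregate usage-weighted topic scores separately for each section.

    pages: list of (section, topic_scores, visits) tuples. The section label
           is supplied by the site (hypothetical mechanism); the topic scores
           still come from the classifier, not from the site.
    threshold: hypothetical cutoff for what counts as a "strong" signal.
    """
    visits_per_section = defaultdict(int)
    score_totals = defaultdict(lambda: defaultdict(float))

    for section, topic_scores, visits in pages:
        visits_per_section[section] += visits
        for topic, score in topic_scores.items():
            score_totals[section][topic] += score * visits

    result = {}
    for section, totals in score_totals.items():
        section_visits = visits_per_section[section]
        if section_visits == 0:
            continue  # no usage data for this section, so nothing to report
        averaged = {t: v / section_visits for t, v in totals.items()}
        result[section] = {t: s for t, s in averaged.items() if s >= threshold}
    return result


if __name__ == "__main__":
    # Hypothetical news site with a small food-and-drink section.
    pages = [
        ("news", {"politics": 1.0}, 800),
        ("food-and-drink", {"cooking": 0.9, "wine": 0.1}, 200),
    ]
    print(topics_by_section(pages))
```

With a site-wide aggregate, the food-and-drink pages in this hypothetical data would be swamped by general news traffic; grouping by section keeps a separate set of topics for each part of the site.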