Filtering Categories

From Opendium Documentation

Opendium systems have a central categorisation system that is used to control access to various types of content. When the system is examining content, such as web traffic, it attempts to categorise the content by using the criteria defined for each category. Depending on the system's configuration, the resulting categorisation may be used to block traffic or provide reports on concerning behaviour.

Note that content can be in multiple categories at the same time (e.g. Self Harm material distributed through a Social Media site), and that categories are never used to allow content. For example, it is not possible to "allow all Social Networking websites": access to content is allowed by virtue of it not being in a blocked category. (However, see also the Walled Garden.)

The system has a selection of predefined categories and we provide regular updates to the categorisation criteria. You can also create new categories as you see fit, and modify the categorisation criteria for both the predefined and user-defined categories yourself by right-clicking the category on the Filtering Categories page. When modifying the predefined categories, any criteria that you add yourself will override the criteria that we manage - you, rather than us, are ultimately in control of your network!

There are several different types of categorisation criteria that you can edit:

  • URIs - the system will examine the web addresses (URIs) that are being accessed.
  • HTTP headers - each web request and its response contain a number of headers which describe the transaction, and the system will examine these.
  • Content types - each object requested from the web has a content type which describes what kind of data it is. For example, it could identify it as a text document, or an executable file. For each web response, the system will not only look to see what type of object the web server claims it is, but also perform real time content inspection to fingerprint the object type itself.
  • Keywords - the system will examine the textual content of web traffic in real time, looking for words, sentences and complex expressions.
  • Search Keywords - the system will examine the search terms which are being used by the users.

The final determination of which categories the content belongs to is made by combining the many individual matches identified by all of the criteria that make up a category. Each match contributes a "score" - the higher the score, the more likely it is that the content belongs to that category.
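
The scoring idea can be sketched roughly as follows. This is only an illustration and not the system's actual algorithm: the simple addition of scores, the `threshold` value and every name in the code are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """One matched criterion and the score it contributes (hypothetical model)."""
    criterion: str
    score: int


def category_score(matches):
    """Assumption: the individual contributions are simply added together."""
    return sum(m.score for m in matches)


def in_category(matches, threshold=100):
    """Assumption: content is in the category once the total reaches a threshold."""
    return category_score(matches) >= threshold


matches = [
    Match("keyword: single word", 30),
    Match("keyword: longer sentence", 60),
    Match("HTTP header", 20),
]
print(category_score(matches))  # 110
print(in_category(matches))     # True
```

In this sketch no single match is enough on its own; it is the combination of several pieces of evidence that pushes the content into the category.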

Descriptions

Categorising content can be a very subjective business, and one of the main ways that we try to maintain consistency is to have good definitions about what each category is intended to be. For our predefined categories, we include this information in the category description. We strongly recommend that you read the description of the categories before using them so that you better understand them and how well each category fits into your school's safeguarding strategy.

When creating your own custom categories, you can also include a description which explains what the category is for.

Whilst it is tempting to create a simple "Global Block List" category, to which you can quickly add any inappropriate content, it is often very beneficial for the school to think about why the content is inappropriate and categorise it properly. In our experience, this makes it much easier to understand the configuration and keep it aligned with a school's ever changing requirements and safeguarding strategy.

URIs

If you want to categorise a website, you can simply add its address to the appropriate categories by right-clicking the category and then selecting Edit URIs. You can then click Add URI and define a pattern to match the address. The top of the dialogue box shows a description of what the pattern you have entered will match. You can make the pattern as specific or as non-specific as necessary.

Similarly, if a web site is being consistently miscategorised, you can add its address to that category and tick the Exclude from category box.

Please note that the system does not use "*" as a wildcard to make partial matches on a URI; please use the drop-down menus provided instead.
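
To illustrate how structured patterns (rather than "*" wildcards) and the Exclude from category box might interact, here is a minimal sketch. The `UriRule` fields, the subdomain and path-prefix options and the function names are all hypothetical; the real options are those offered by the drop-down menus in the dialogue box.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class UriRule:
    """A hypothetical URI pattern built from structured fields, not wildcards."""
    domain: str
    include_subdomains: bool = True
    path_prefix: str = "/"
    exclude: bool = False  # stands in for the "Exclude from category" tick box


def rule_matches(rule, uri):
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    domain = rule.domain.lower()
    # Match the domain itself, and optionally any host underneath it.
    host_ok = host == domain or (
        rule.include_subdomains and host.endswith("." + domain)
    )
    path = parts.path or "/"
    return host_ok and path.startswith(rule.path_prefix)


def uri_in_category(rules, uri):
    """True or False if a rule decides; None if no URI rule matches at all."""
    matched = [r for r in rules if rule_matches(r, uri)]
    if not matched:
        return None  # left to the category's other criteria
    # An exclusion always wins over an ordinary match.
    return not any(r.exclude for r in matched)


rules = [
    UriRule("example.com"),
    UriRule("example.com", path_prefix="/help/", exclude=True),
]
print(uri_in_category(rules, "https://www.example.com/games/"))  # True
print(uri_in_category(rules, "https://example.com/help/faq"))    # False
print(uri_in_category(rules, "https://example.org/"))            # None
```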

HTTP headers

When an object is fetched from a web server, both the client and the server exchange a number of headers which provide some information about the object. These headers can be scanned for certain information and used to help categorise the content.

Right-click the category and then select Edit HTTP response headers. You can click Add header to define a pattern to match a particular header - specify the HTTP header name, and a keyword to look for in that header. You can specify whether to treat the keyword as the start of a word, end of a word, whole word, or tell it to ignore word boundaries and find that text anywhere. For advanced use, it is possible to specify the keyword as a Perl Compatible Regular Expression. Finally, you can specify how much a matching header will contribute to the score - the higher the score, the more likely it is that the content belongs to that category.
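
A header rule of this kind can be pictured as a header name, a pattern and a score. The sketch below is an assumption-laden illustration: the function name, the example `Server` value and the case-insensitive keyword match are invented for the example, and Python's `re` module stands in for a PCRE engine.

```python
import re


def header_score(headers, name, pattern, score):
    """Return `score` if header `name` contains `pattern`, otherwise 0.

    Header names are compared case-insensitively, as HTTP field names
    are case-insensitive. Matching the value case-insensitively is an
    assumption of this sketch.
    """
    for key, value in headers.items():
        if key.lower() == name.lower() and re.search(pattern, value, re.IGNORECASE):
            return score
    return 0


response_headers = {
    "Content-Type": "text/html; charset=utf-8",
    "Server": "ExampleStream/2.1",  # made-up server name
}

# Whole-word match on the made-up product name.
print(header_score(response_headers, "server", r"\bExampleStream\b", 40))  # 40
# "Stream" is not a whole word here, so nothing is contributed.
print(header_score(response_headers, "server", r"\bStream\b", 40))         # 0
```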

Content types

If you want to block a certain type of file, you can tell the system that all files of that type belong to a certain category. Right-click the category and then select Edit content types. You can then click Add content type and specify the content type to match in the standard MIME content type format ("type/subtype"). If the subtype box is left blank, the rule matches all subtypes - e.g. "image/" will match all images.
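
The blank-subtype behaviour can be sketched as below. The function name and the handling of parameters such as "; charset=utf-8" are assumptions made for the illustration, not a description of the system's internals.

```python
def content_type_matches(rule_type, rule_subtype, content_type):
    """Match a MIME type against a rule; a blank subtype matches every subtype."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    essence = content_type.split(";", 1)[0].strip().lower()
    main, _, sub = essence.partition("/")
    if main != rule_type.lower():
        return False
    return rule_subtype == "" or sub == rule_subtype.lower()


print(content_type_matches("image", "", "image/png"))      # True
print(content_type_matches("image", "png", "image/gif"))   # False
print(content_type_matches("image", "", "text/html"))      # False
```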

Keywords

You can specify keywords to look for in textual content. This is a powerful feature, but one to be used with care since it is very easy to miscategorise content. A keyword can be a single word (or even part of a single word), a sentence, or a more complex expression.

Right-click the category and then select Edit keywords. You can then click Add keyword and specify the keyword to match. You can specify whether to treat the keyword as the start of a word, end of a word, whole word, or tell it to ignore word boundaries and find that text anywhere. For advanced use, it is possible to specify the keyword as a Perl Compatible Regular Expression. Finally, you can specify how much the keyword will contribute to the score if it is found - the higher the score, the more likely it is that the content belongs to that category.
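
The four word-boundary options correspond closely to where a `\b` (word boundary) assertion would sit in an equivalent regular expression. The following sketch shows that correspondence; the mode names are hypothetical, and Python's `re` module is used in place of a PCRE engine (the `\b` assertion behaves the same way for this purpose).

```python
import re


def keyword_regex(keyword, mode):
    """Build a regular expression for one of four hypothetical boundary modes."""
    escaped = re.escape(keyword)
    patterns = {
        "start_of_word": r"\b" + escaped,
        "end_of_word": escaped + r"\b",
        "whole_word": r"\b" + escaped + r"\b",
        "anywhere": escaped,
    }
    return re.compile(patterns[mode], re.IGNORECASE)


text = "Alphabet practice is better than nothing."
for mode in ("start_of_word", "end_of_word", "whole_word", "anywhere"):
    print(mode, bool(keyword_regex("bet", mode).search(text)))
# start_of_word True   (matches "better")
# end_of_word True     (matches "Alphabet")
# whole_word False
# anywhere True
```

Only the whole-word mode avoids matching "bet" inside unrelated words here, which is why short keywords matched anywhere are a common source of miscategorisation.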

As a rule of thumb, single words should not be assigned a high score as the probability of a false positive is quite high; longer sentences can usually be given a higher score more safely.

Search keywords

When a user makes a web search, the system can examine the search terms for any concerning keywords. The list of keywords checked against search terms is separate from the keywords matched against content because there is much less context available for searches. Rather than having a page of text to examine, a search contains just a handful of words, so it is much harder to select suitable words which won't result in overblocking. A keyword can be a single word (or even part of a single word), a sentence, or a more complex expression.

Right-click the category and then select Edit search keywords. You can then click Add keyword and specify the keyword to match. You can specify whether to treat the keyword as the start of a word, end of a word, whole word, or tell it to ignore word boundaries and find that text anywhere. For advanced use, it is possible to specify the keyword as a Perl Compatible Regular Expression. Finally, you can specify how much the keyword will contribute to the score if it is found - the higher the score, the more likely it is that the content belongs to that category.
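
To show how little text a search provides compared with a full page, the sketch below pulls the search terms out of a request URI. The `q` parameter name and the example address are assumptions; search engines differ in how they encode queries, and this is not a description of how the system itself extracts them.

```python
from urllib.parse import parse_qs, urlsplit


def search_terms(uri, param="q"):
    """Extract the individual words of a search query from a URI.

    The query parameter name ("q") is an assumption of this sketch.
    """
    query = parse_qs(urlsplit(uri).query)
    terms = []
    for value in query.get(param, []):
        terms.extend(value.lower().split())
    return terms


print(search_terms("https://search.example/search?q=Online+Poker+tips&page=2"))
# ['online', 'poker', 'tips']
```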

Fixing false positives

As with all heuristics, whilst we endeavour to make the categorisation system as reliable as possible, it can never be 100% accurate and content will sometimes be miscategorised.

If the problem is isolated to a single web site that is being consistently miscategorised, then some simple steps can be taken to prevent it: edit the offending category's URIs and add the web site's address with the Exclude from category box ticked. This will ensure that the web site will never be considered to be in that category. If the web site is completely trusted (e.g. your own web site) then consider adding it to the Whitelist Override to completely disable filtering.

If a lot of content is being miscategorised, you might consider lowering the sensitivity of the offending categories. It is also a good idea to check any categorisation criteria that you have added and remove any that are likely to make the categorisation system over-sensitive.