ONLINE, May 2001
Copyright © 2001 Information Today, Inc.
The role of information specialists has broadened in recent years and, for many, now includes shared responsibility for the security of the organization's intellectual property. Information security, broadly defined, includes two major areas: protection from intruders and protection from unwanted release of information. The first of these, protection from intruders, is usually the responsibility of the IT team alone and focuses on three major concerns: interruption of service, protection of IP addresses, and intrusion detection. All of these pertain to protecting an organization's intellectual property from assault from outside the organization.
There is a second, equally important area: protecting an organization from unwanted release of information to inappropriate recipients. While information specialists are frequently seen as providing the conduits for getting vital information to the right people at the right time, they are also increasingly being asked to help ensure that vital information is not released to those who have no right to it. In doing so, they protect an organization's intellectual property from deliberate or inadvertent breaches of security from within the organization.
Here are some real-world situations that call for boundary controllers. A company has decided to do research and development on a new approach to solving a common, expensive problem within its industry and needs to protect this competitive advantage. A financial services firm must ensure that outgoing email messages are in full compliance with securities laws and regulations. An organization providing technology to one team within a multinational task force needs to share necessary information with other teams, but national security dictates that not all information be shared.
The keyword boundary controller approach is technologically the inverse of the approach used in most Web-based filters intended to prevent children from accessing mature Web sites; hence its informal name, the "dirty word list." Those filters prevent Web sites whose content matches the keywords from reaching a user's computer, while boundary controllers compare outgoing messages to a keyword list. The limitations endemic to the keyword approach when it is used in censorware, which have been widely reported in the press, apply here as well. First, because language is ambiguous, a single word can express multiple meanings: some of its occurrences will reflect the sensitive sense intended by the keyword list, while others are quite innocuous. A keyword system cannot tell the difference, and therefore produces many false hits. Second, the same meaning can be expressed by different words, so the developer of the keyword list must include every possible synonym for the list to be exhaustively protective. If the keyword list does not contain all such rephrasings, offending messages, documents, or Web sites are not accurately screened.
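Both failure modes are easy to see in a minimal sketch of a keyword screen. The word list and messages below are hypothetical illustrations, not any vendor's actual rules:

```python
# Minimal sketch of a keyword ("dirty word") boundary screen.
# The blocked-word list and sample messages are hypothetical.

BLOCKED_KEYWORDS = {"specification", "tolerance", "blueprint"}

def keyword_screen(message: str) -> bool:
    """Return True if the outgoing message should be blocked."""
    words = {w.strip(".,;:!?").lower() for w in message.split()}
    return bool(words & BLOCKED_KEYWORDS)

# False hit: "specification" used in an innocuous sense.
print(keyword_screen("Please resend the job specification for the HR posting."))  # True

# Miss: the protected meaning expressed with a synonym not on the list.
print(keyword_screen("The part measures 12mm with a 0.1mm allowance."))  # False
```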
For example, a business rule that has been processed through a full NLP module will have its words morphologically analyzed, with the meanings of the roots, as well as the suffixes, stored in the representation. After this, each word will be tagged with its correct part of speech, phrasal concepts will be bracketed rather than stored as unrelated single words, the correct sense of each polysemous word will be selected, and synonymous words and phrases for each concept in the business rule will be appended to the stored representation. Entities, relations, and events will also be understood. For example, a human name will be tagged as a person; if that person also fills the syntactic role of subject of an active verb, the module will mark the person as the agent of the action described by the verb, activate the frame for the event the verb instantiates, and fill the frame's attached slots with semantic information from the sentence.
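Since these enrichment steps are described here at a high level, the following sketch only suggests, with plain data structures, the kind of annotated representation such a module might build; every class, field, and value is an illustrative assumption, not DataShield's actual data model:

```python
# Plain-data sketch of the enriched representation an NLP module might
# build for a business rule's words and event. All values are illustrative.
from dataclasses import dataclass, field

@dataclass
class Token:
    surface: str                      # word as it appears in the text
    lemma: str                        # morphological root
    pos: str                          # part-of-speech tag
    sense: str                        # disambiguated sense of the word
    synonyms: list[str] = field(default_factory=list)

@dataclass
class EventFrame:
    action: str                       # frame activated by the verb
    agent: str                        # who performs the action
    theme: str                        # what the action is performed on
    medium: str                       # channel in which it occurs

# "Junior employees ... must not describe specifications ... in outgoing emails."
tokens = [
    Token("employees", "employee", "NOUN", "employee/worker",
          synonyms=["new hire", "staff member"]),
    Token("describe", "describe", "VERB", "describe/convey-information",
          synonyms=["tell", "explain", "discuss"]),
    Token("specifications", "specification", "NOUN", "specification/technical-detail",
          synonyms=["size", "measurement"]),
]

# The verb "describe" activates an event frame whose slots are filled
# from the sentence: the syntactic subject becomes the agent.
event = EventFrame(action="describe", agent="junior_employee",
                   theme="company_product_specification",
                   medium="outgoing_email")
print(event)
```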
NLP's ability to produce a representation of business rules, as well as of outgoing messages and documents, with this degree of human-like knowledge enrichment is what makes the conceptual approach to boundary control so powerful. In the following, Part A provides the semantic representation, and Part B the logical representation, of the business rule that states: "Junior employees of the Acme Corporation must not describe specifications of company products in outgoing emails." For the sake of clarity, the logical representation does not contain all the semantic details of A.
A. <Junior_employee (new_hire; level_1_to_6)|Person> of|PREP the|ART <Acme_Corporation|Company> must|MOD not|MOD <describe (tell; explain; discuss)> <specification (size)> of|PREP <company_product|ProdName> in|PREP <outgoing_email (message; posting)>.
B. If ISA (?X, junior_employee) and ISA (?Y, Acme_product) and ISA (?Z, email) and RCPT (?Z, ?P) and LOC (?P, outside_network) and CONT (?Z, 'ASSOC (?Y, ?A) & MEAS (?A, ?B)'), then CHRC (?Z, nonreleasable).
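To suggest how such a rule might operate in practice, here is a minimal sketch that checks the conditions of the Part B rule against facts extracted from one outgoing message. A real system would unify the rule's variables (?X, ?Y, ?Z) against the message representation; this sketch simplifies to ground facts, and every identifier and value is hypothetical:

```python
# Minimal sketch of applying the Part B rule to one analyzed message.
# Every identifier and fact below is a hypothetical simplification.

def is_nonreleasable(facts: set) -> bool:
    """True when all conditions of the rule hold for this message."""
    return (
        ("ISA", "x1", "junior_employee") in facts      # ISA(?X, junior_employee)
        and ("ISA", "y1", "Acme_product") in facts     # ISA(?Y, Acme_product)
        and ("ISA", "z1", "email") in facts            # ISA(?Z, email)
        and ("RCPT", "z1", "p1") in facts              # RCPT(?Z, ?P)
        and ("LOC", "p1", "outside_network") in facts  # LOC(?P, outside_network)
        # Simplification of CONT(?Z, 'ASSOC(?Y,?A) & MEAS(?A,?B)'):
        and ("CONT", "z1", "product_measurement") in facts
    )

# Facts the NLP module might extract from one outgoing message.
message_facts = {
    ("ISA", "x1", "junior_employee"),
    ("ISA", "y1", "Acme_product"),
    ("ISA", "z1", "email"),
    ("RCPT", "z1", "p1"),
    ("LOC", "p1", "outside_network"),
    ("CONT", "z1", "product_measurement"),
}
print(is_nonreleasable(message_facts))  # True -> CHRC(?Z, nonreleasable)
```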
The two means by which conceptual boundary controllers can be implemented are the information retrieval approach and the categorization approach.
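One way to picture the information retrieval approach is to treat each business rule, expanded with its synonyms, as a standing query against which every outgoing message is scored. The sketch below illustrates that reading with a simple cosine-similarity score; the query terms, message, and threshold are all assumptions for illustration:

```python
# Hypothetical sketch of the IR approach: the rule, expanded with its
# synonyms, acts as a standing query; outgoing messages are scored by
# cosine similarity and blocked above a (hypothetical) threshold.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Query built from the rule's concepts and their synonym expansions.
rule_query = Counter(
    "junior employee describe tell explain specification size product email".split())

message = Counter(
    "the new hire will explain the product size in an email tomorrow".split())

THRESHOLD = 0.3   # hypothetical tuning parameter
score = cosine(rule_query, message)
print(round(score, 3), "blocked" if score >= THRESHOLD else "released")
```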
In the categorization approach, once the system has been trained, the resulting category vectors are used to classify each new outgoing missive as releasable, or as unreleasable because it violates a business rule.
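A minimal sketch of that idea, assuming a simple centroid-style classifier (the training messages, labels, and test message below are hypothetical, and a production system would use far richer features):

```python
# Hypothetical sketch of the categorization approach: labeled training
# messages yield one term-frequency centroid per category, and each new
# message is assigned to the nearer centroid.
import math
from collections import Counter

def vectorize(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(texts: list) -> Counter:
    total = Counter()
    for t in texts:
        total += vectorize(t)
    return total

# Hypothetical training data.
train = {
    "unreleasable": ["the widget spec is 12mm tolerance 0.1mm",
                     "product dimensions and tolerances attached"],
    "releasable":   ["lunch meeting moved to noon",
                     "quarterly newsletter draft for review"],
}
centroids = {label: centroid(texts) for label, texts in train.items()}

new_message = "please confirm the widget tolerance before shipping"
label = max(centroids, key=lambda c: cosine(centroids[c], vectorize(new_message)))
print(label)   # expected: unreleasable
```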
The test results, run on 112 messages plus attachments, showed that the conceptual approach significantly outperformed the keyword approach in terms of messages correctly blocked and messages correctly released. The commercial keyword system scored 70% precision and 38% recall, while the DataShield system scored 96% precision and 99% recall. In terms of efficiency, the commercial keyword system took 2.04 seconds and DataShield took 9.75 seconds.
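For readers who want the metrics unpacked: precision is the proportion of blocked messages that truly violated a rule, and recall is the proportion of rule-violating messages that were actually blocked. A small sketch, with hypothetical confusion counts chosen only to mirror the percentages above (the article reports final percentages, not raw counts):

```python
# Precision and recall as used above, with hypothetical counts.
def precision(true_blocks: int, false_blocks: int) -> float:
    return true_blocks / (true_blocks + false_blocks)

def recall(true_blocks: int, missed_violations: int) -> float:
    return true_blocks / (true_blocks + missed_violations)

# e.g., a system that correctly blocks 96 violating messages, wrongly
# blocks 4 innocent ones, and misses 1 violation:
print(precision(96, 4))   # 0.96
print(recall(96, 1))      # ~0.99
```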
The difference in effectiveness is significant, and while the difference in efficiency may appear dramatic, it must be weighed against the performance results, which showed that the conceptual system stopped 99% of the unreleasable messages from being sent out and wrongly stopped only 4% of the messages that should have been released. Although the conceptual system (then still a prototype) took nearly five times longer, this should be compared to trained human subjects, who in this experiment took an average of 560 seconds, or 9.33 minutes, to review and decide on the releasability of the same set of documents.
Elizabeth Liddy, Ph.D. (liddy@syr.edu) is director of the Center for Natural Language Processing, School of Information Studies, Syracuse University.