
NER and Regex Based Scans


Last updated 9 months ago

Named Entity Recognition (NER) and Regex are powerful tools for determining whether sensitive information exists in a model's response. However, these tools have edge cases that may need to be addressed depending on the needs of an organization. Two primary issues were found when deploying a NER and Regex only solution: bias and unexpected failure.

Bias

The first issue encountered when deploying a standalone NER/Regex based solution was that some people's names were not recognized as names. Edge cases exist and will probably always exist, and in the vast majority of instances the NER/Regex based approach works fine. However, especially where a person's PII is concerned, edge cases need to be handled so that the user is not unfairly punished for having a name outside the set of names that the scans detect.

To illustrate the problem, see the screenshot below. In this example the name Kay Walkingstick, a Cherokee artist, is detected by neither NER nor Regex as a person's name. The writer's inference is that this results from bias in how the scanners define what counts as a "name".

Screenshot of NER/Regex based approach failing to identify a person's name
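The coverage gap described above can be illustrated with a toy detector. This is a minimal sketch, not the scanner used in the test: the pattern, the `KNOWN_FIRST_NAMES` list, and the `find_names` function are all hypothetical. It only shows how a detector whose idea of a "name" comes from limited reference data can miss names outside that data.

```python
import re

# Hypothetical, deliberately tiny list of "known" first names. A real scanner
# would rely on a trained NER model or a far larger lexicon.
KNOWN_FIRST_NAMES = {"john", "mary", "james", "linda"}

# Two adjacent capitalized words are treated as a candidate full name.
CANDIDATE = re.compile(r"\b([A-Z][a-z]+)\s+([A-Z][A-Za-z]+)\b")


def find_names(text):
    """Return candidate full names whose first word is in the known-name list."""
    return [
        f"{first} {last}"
        for first, last in CANDIDATE.findall(text)
        if first.lower() in KNOWN_FIRST_NAMES
    ]


print(find_names("The report was written by John Doe."))
print(find_names("The painting is by Kay Walkingstick."))
```

In this sketch the second call returns an empty list even though the sentence clearly contains a person's name, because "Kay" is absent from the assumed reference list.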

Unexpected Failure

The second issue was that even names that fall within what can be inferred as the bounds of the scans' bias were sometimes not detected. In this test the model was asked the same question twice, with the only difference being the first name of the person involved. In the first instance the name was not detected, and in the second it was.
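A test of this kind can be sketched as a small probe that runs one sentence through a detector while varying only the first name. Everything here is an assumption made for illustration: `probe_first_names`, the prompt template, and `toy_detector` (a stand-in that knows a single name) are hypothetical and do not reproduce the actual scanner or the question used in the test.

```python
def probe_first_names(detector, template, first_names, surname="Doe"):
    """Run the same sentence through `detector`, varying only the first name.

    Returns {full_name: bool}, where True means the detector flagged the name.
    """
    results = {}
    for first in first_names:
        full_name = f"{first} {surname}"
        text = template.format(name=full_name)
        results[full_name] = full_name in detector(text)
    return results


def toy_detector(text):
    # Hypothetical stand-in for a NER/Regex scanner: it only knows one name.
    return ["John Doe"] if "John Doe" in text else []


report = probe_first_names(
    toy_detector,
    "What is the home address of {name}?",  # hypothetical prompt
    ["Jane", "John"],
)
missed = [name for name, found in report.items() if not found]
print(report)
print("missed:", missed)
```

Swapping `toy_detector` for a real scanner would turn the probe into a quick regression check for inconsistent detection across otherwise identical inputs.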

How PropScreen Fixes the Problems

PropScreen can address both of these problems by checking the model's response against a hashed database of sensitive information. The names of the people mentioned in this section could have been hashed and stored there. Their names would then have been detected even when the NER/Regex based approach failed to do so.
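The idea of checking a response against hashed values can be sketched as follows. This is a minimal illustration under stated assumptions, not PropScreen's actual implementation: the use of unsalted SHA-256, the normalization rule, the word-window approach, and the names `normalize`, `build_hash_store`, and `find_sensitive` are all hypothetical.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and reduce the text to letter/digit runs joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def sha256_hex(value):
    """Digest of the normalized value (assumed scheme: plain SHA-256)."""
    return hashlib.sha256(normalize(value).encode("utf-8")).hexdigest()


def build_hash_store(sensitive_values):
    """Keep only digests, never the plaintext values."""
    return {sha256_hex(v) for v in sensitive_values}


def find_sensitive(response, hash_store, max_words=4):
    """Hash every run of 1..max_words words and report runs whose digest is stored."""
    words = normalize(response).split()
    hits = []
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            phrase = " ".join(words[start:start + size])
            if sha256_hex(phrase) in hash_store:
                hits.append(phrase)
    return hits


store = build_hash_store(["Kay Walkingstick", "Jane Doe"])
print(find_sensitive("The artist is Kay Walkingstick.", store))
print(find_sensitive("Jane Doe lives nearby.", store))
print(find_sensitive("No names here.", store))
```

Because the lookup is an exact match on the normalized phrase, it does not depend on whether a scanner considers the phrase a "name". The trade-off in this sketch is that misspellings or partial names would not match.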

Screenshot: the name "Jane Doe" is not recognized as a person's name
Screenshot: the name "John Doe" is recognized as a person's name