Detecting Data Patterns in Large Files


L4 Transporter

A question came up in class today regarding detecting Data Filtering patterns in large files. Does the firewall buffer the file being inspected by the Data Filtering patterns so that it can detect all occurrences of the data pattern before sending the file out of the firewall? Or would the recipient get a truncated version of the file because the data pattern weight did not trigger a block action until later in the file?


L7 Applicator

Hello Sir,

Are you asking about the Data Patterns profile or the File Blocking profile?

Data Patterns: Data patterns define the categories of sensitive information that you may want to filter using data filtering security policies, e.g. SSNs or credit card numbers.

File Blocking: A security policy can specify a file blocking profile that blocks selected file types from being uploaded and/or downloaded, or generates an alert when the specified file types are detected. We have predefined signatures for all file types inside the PA firewall, so there is no need to buffer the whole file in order to determine the "file type".

For more info: Creating Custom Application Signatures


L4 Transporter


Thanks for your reply. I'm referring to Data Filtering profiles with a custom data pattern that looks for specific patterns in files. The question came up because we were discussing how the Alert and Block thresholds are based on the weight value of the pattern.

So, the question is:

What happens if a large file is being analyzed by the Data Filtering profile's data pattern match and the Block threshold weight is not reached until very far into the file? Will the entire file be held by the firewall until it has been completely scanned for the data pattern, or will part of the file be sent out while the rest is still being scanned?

I hope this makes sense.



L6 Presenter

As far as I know, the DLP functionality in PA never buffers the entire file. It only inspects the stream, and if something bad is found the session is killed. The same goes for the antivirus feature of Palo Alto (which is both good and bad).

The only case of buffering is WildFire, where the PA copies the stream from the dataplane into the management plane and then, once the full file has arrived, uploads the file to WildFire (either in the cloud or locally if you use WF-500 appliances). That is, when using WildFire the first client will get infected (or at least get the file), but the next client who tries to download the same file should be blocked if WildFire has identified the file as malicious and pushed out signatures for it.


For sure, the whole file is not stored on the PA. During the download, the PA increments a counter to check whether the block threshold has been reached. If it is not reached, you get the file. If it is, the PA sends a TCP reset to stop the download; because the download cannot finish, you will get an error and the file cannot be opened normally.

Hope this helps.
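The counter behaviour described above can be illustrated with a small simulation. This is only a sketch of the idea, not how PAN-OS is implemented: the function name `forward_until_blocked`, the chunking, and the SSN-style regular expression are all assumptions made for illustration, and the sketch ignores matches that straddle a chunk boundary.

```python
import re
from typing import Iterable, List, Tuple


def forward_until_blocked(
    chunks: Iterable[bytes],
    pattern: re.Pattern,
    weight: int,
    block_threshold: int,
) -> Tuple[List[bytes], bool]:
    """Forward chunks in order, adding `weight` to a running score per match.

    Returns (chunks already delivered, whether the transfer was reset).
    """
    delivered: List[bytes] = []
    score = 0
    for chunk in chunks:
        score += weight * len(pattern.findall(chunk))
        if score >= block_threshold:
            # Threshold reached: drop this chunk and stop (stands in for the reset).
            return delivered, True
        delivered.append(chunk)  # Nothing is held back while scanning continues.
    return delivered, False


SSN = re.compile(rb"\d{3}-\d{2}-\d{4}")  # hypothetical custom data pattern
chunks = [b"public ", b"SSN 123-45-6789 ", b"more ", b"SSN 987-65-4321 ", b"tail"]
delivered, blocked = forward_until_blocked(chunks, SSN, weight=1, block_threshold=2)
print(blocked, b"".join(delivered))
```

With a block threshold of 2 in this sketch, the chunk containing the first match has already been delivered by the time the second match trips the threshold; only the triggering chunk and everything after it are withheld.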


Hey jwolach,

Let's assume that you have a file that is 75% public data and 25% sensitive data. Now let's assume that the first half of the document is public data and the sensitive data is located in the last half. If the data stream is being scanned as the file is being transferred, the packets will continue to flow until PAN-OS detects the sensitive data. Once the signature match is detected, the stream is reset and the file stops being transferred. This is akin to downloading a Word document from the web, stopping the download halfway, and trying to open the document. At that point the file would be "corrupt", and your application would throw an error and prevent you from opening it.

However, if you really wanted to gain access to the downloaded data, you would theoretically have all the packets that were transferred before the reset. The sensitive information itself was blocked by PAN-OS. So in the end you would have a document that was 50% transferred, and if you did reconstruct the data to make it readable, you would have the 50% that is public data. The sensitive information was never streamed, because PAN-OS detected it and dropped the connection. So the data you want to remain private stays private.

So in the end, they may get part of the document, but all information after the signature match that triggers the block will be excluded from the document.

L4 Transporter

Hi - this is something I worked with a while back, specifically triggering on the content of documents being transferred over FTP, where the trigger phrase was a classification label.

In short, and as mentioned above, the file is inspected as a dynamic stream of data, so if the trigger string is in the 'last' part of the file being transferred, then the bulk of the data will be received by the far end in the form of a truncated file. Depending on the file format, a quick modification with a hex editor can recover the data in an easily readable format.

Also note that some files, such as DOCX files, are essentially a bunch of zipped XML files with an internal file structure (if you've never tried, just rename a DOCX file to a ZIP file and open it that way to see). This can mean that, depending on where your trigger string is (e.g. footer text is held in a different XML file from the body text), the entire body XML section can actually get through before the block, and can then be opened at the far end simply by merging it with another DOCX file, etc.

FTP is particularly prone to this, as most FTP servers don't automatically delete incomplete transfers, and as F comes after B, the body text always 'went first' in our testing.
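To make the DOCX point concrete, here is a minimal sketch using only the Python standard library. It builds a tiny ZIP that merely imitates a DOCX layout (the part names, the contents, and the "body first, footer second" ordering are assumptions for illustration), cuts it off where the footer part would start, and shows that the body part can still be read from the truncated bytes even though the archive as a whole is no longer valid.

```python
import io
import struct
import zipfile
import zlib


def build_docx_like() -> bytes:
    """Build a tiny ZIP laid out like a DOCX: body part first, footer part second."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", "<body>Body text of the report</body>")
        zf.writestr("word/footer1.xml", "<ftr>TRIGGER-LABEL</ftr>")
    return buf.getvalue()


def recover_first_member(data: bytes):
    """Read the first ZIP local file header and decompress the data that follows it."""
    fields = struct.unpack("<4s5H3I2H", data[:30])  # fixed 30-byte local header
    signature, method, name_len, extra_len = fields[0], fields[3], fields[9], fields[10]
    if signature != b"PK\x03\x04":
        raise ValueError("not a ZIP local file header")
    name = data[30:30 + name_len].decode("utf-8")
    payload = data[30 + name_len + extra_len:]
    if method == zipfile.ZIP_DEFLATED:
        payload = zlib.decompressobj(-15).decompress(payload)  # raw deflate stream
    return name, payload  # stored (uncompressed) data is returned as-is


full = build_docx_like()
with zipfile.ZipFile(io.BytesIO(full)) as zf:
    footer_offset = zf.infolist()[1].header_offset
truncated = full[:footer_offset]  # transfer reset before the footer part is sent

print(zipfile.is_zipfile(io.BytesIO(truncated)))
print(recover_first_member(truncated))
```

The truncated bytes lack the ZIP central directory, so ordinary tools refuse to open them, but each part still carries its own local header, which is why the body can be pulled out by hand.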


PS There are fixes - but that's for another day!
