Remove duplicate entries to reduce log file size for ML



L1 Bithead

Hi all,


The current issue I have is that the exported traffic logs are too large for Expedition ML to manipulate. One hour of logs is around 2 GB; two days is around 100 GB. After Machine Learning completes on the 100 GB of logs (around 80k rules), the import into the project fails. An error dialog box pops up without much detail. Are there further error logs that might shed some light on the issue? Is there a limit on the number of logs/rules that can be processed?


I tried ML on a smaller 2 GB file and it all works as expected.


We have tried to reduce the exported log size by removing duplicates in Excel. However, after copying the new file to Expedition via scp, it is no longer listed as a file available for M.Learning processing into parquet format. (Interestingly, even though no file is listed, after pressing the 'Process Files' button the file is seen and processed, and is then recognised as 'Processed by User admin'. However, no parquet file is generated: there is no directory or data in /datastore.)


Removing the duplicates reduces the log file size by around 80%.
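As an aside, the same de-duplication can be scripted rather than done in Excel, which also sidesteps Excel's row limits on exports this large. A minimal sketch in Python; the file paths are placeholders and this assumes the export is a plain CSV with a single header row:

```python
import csv

def dedupe_csv(src_path, dst_path):
    """Copy a CSV log export, keeping the header and the first
    occurrence of each data row (exact duplicate rows dropped)."""
    seen = set()
    with open(src_path, newline="") as src, \
         open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))  # header passes through unchanged
        for row in reader:
            key = tuple(row)
            if key not in seen:
                seen.add(key)
                writer.writerow(row)
```

For multi-gigabyte files the `seen` set itself can grow large, so this is only practical when the duplicate rate is high, as it is here.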


Current details:

Expedition version 1.1.7

6 vCPU




Any insights would be appreciated.






L5 Sessionator

Given your description, it is not a matter of how big the log files are, but rather how large the rule import into the project is.


My guess is that the issue is a limit on the packet size for MySQL inserts.


Check your MySQL configuration file and verify that your max_allowed_packet is large enough (try making it 4 times bigger), as well as the bulk_insert_buffer_size value.
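For illustration, the change would look something like the fragment below. The file location and the values are assumptions, not confirmed for Expedition builds: on many Linux installs the settings live in /etc/mysql/my.cnf or a file under /etc/mysql/conf.d/, and a MySQL restart is needed afterwards.

```ini
# Hypothetical my.cnf fragment -- example values, not tuned recommendations
[mysqld]
max_allowed_packet      = 256M
bulk_insert_buffer_size = 64M
```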


Let's see if this helps resolve your issue.

L5 Sessionator

Btw, removing duplicate entries from the log is unnecessary. The first pre-processing pass already optimizes that, along with other aspects.


For instance, it summarizes what you call duplicated connections and sums bytes sent, bytes received, packets, etc.

If you check the "connections.parquet" folder that is created from the CSV files (within your Temporary Data Structures Folder), you will see that those 100 GB of logs may have been reduced to a few MBs.
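Conceptually, that pre-processing pass behaves like a group-by over the connection key with the traffic counters summed, which is why explicit de-duplication gains nothing. A rough illustration in Python; the field names here are invented for the example and do not reflect the real parquet schema:

```python
from collections import defaultdict

def summarize(connections):
    """Collapse repeated (src, dst, port, app) tuples into one row
    per connection, summing the byte and packet counters."""
    totals = defaultdict(
        lambda: {"bytes_sent": 0, "bytes_received": 0, "packets": 0}
    )
    for c in connections:
        key = (c["src"], c["dst"], c["port"], c["app"])
        t = totals[key]
        t["bytes_sent"] += c["bytes_sent"]
        t["bytes_received"] += c["bytes_received"]
        t["packets"] += c["packets"]
    return dict(totals)
```

With a high duplicate rate, the summarized output is a small fraction of the raw log volume, consistent with 100 GB of CSV collapsing to a few MB of parquet.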

Thanks @dgildelaig.


I've increased both the packet and buffer sizes, but it did not help with importing the large (~80k) rule set. Also, in this state I can't remove the 'ML Enabled' tag from the rules; it returns the same error dialog box. I deleted the project to work around this.


I have now set 'ML Enabled' on only one of the rules to get a manageable number of rules.





You're right, the processing of the logs into a parquet file makes removing duplicate entries unnecessary.


