Remove duplicate entries to reduce log file size for ML

YungOng · 02-28-2019

Hi all,

The issue I have is that the exported traffic logs are too large for Expedition ML and manipulation. One hour of logs is around 2GB; two days is around 100GB. After Machine Learning on the 100GB of logs finishes (around 80k rules), importing the rules into the project fails. An error dialog box pops up without much detail. Are there further error logs that might shed some light on the issue? Is there a limit on the number of logs/rules that can be processed?

Tried ML on a smaller 2GB file and it all works as expected.

We have tried to reduce the export log size by removing duplicates in Excel. However, after copying the file to Expedition via scp, the new file is no longer listed as available for M.Learning processing to parquet format. (Interestingly, even though no file is listed, pressing the 'Process Files' button picks the file up and processes it, and it is then marked as 'Processed by User admin'. However, no parquet file is generated: there is no directory or data in /datastore.)

Removing the duplicates reduces the log file size by around 80%.

Current details.

Expedition version 1.1.7

6 vCPU

16GB RAM

200GB HDD

Any insights would be appreciated.

Thanks,

Yung

dgildelaig · 03-01-2019

Given your description, the problem is not how big the log files are, but how large the set of rules being imported into the project is.

My guess is that the issue is a limit on the packet size for MySQL inserts.

Check your file

/etc/mysql/my.cnf

and verify that your max_allowed_packet value is large enough (try making it 4 times bigger), as well as the bulk_insert_buffer_size value.
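If it helps, here is a minimal Python sketch for checking those two values and working out what "4 times bigger" comes to. It only parses text in the usual MySQL option-file layout and reads the [mysqld] section. The sample values (16M and 8M) are made up for illustration and are not Expedition defaults, and the helper names are hypothetical.

```python
import re

# Binary multipliers for the K/M/G suffixes that MySQL accepts on size values.
UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
WATCHED = ("max_allowed_packet", "bulk_insert_buffer_size")


def parse_size(value):
    """Convert a size string such as '16M' into a number of bytes."""
    match = re.fullmatch(r"(\d+)([KMG]?)", value.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised size: {value!r}")
    number, unit = match.groups()
    return int(number) * UNITS[unit]


def read_mysqld_sizes(text):
    """Return {setting: bytes} for the watched settings in the [mysqld] section."""
    sizes = {}
    section = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith((";", "!")):
            continue  # blank lines, ';' comments and '!include' directives
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip().lower()
            continue
        if section != "mysqld" or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        key = key.replace("-", "_")  # MySQL treats '-' and '_' alike in option names
        if key in WATCHED:
            sizes[key] = parse_size(value)
    return sizes


def suggest(sizes, factor=4):
    """Multiply every current size by the given factor."""
    return {name: size * factor for name, size in sizes.items()}


# Made-up sample content, for illustration only.
SAMPLE = """\
[mysqld]
max_allowed_packet = 16M
bulk-insert-buffer-size = 8M   # dashes and underscores are interchangeable

[mysqldump]
max_allowed_packet = 64M
"""

current = read_mysqld_sizes(SAMPLE)
for name, new_size in suggest(current).items():
    print(f"{name}: {current[name]} -> {new_size} bytes")
```

To run it against a real box, pass the contents of the config file to read_mysqld_sizes instead of SAMPLE. A setting that is missing from the result is simply not set in [mysqld] in that file, so the server is using its default or a value from an included file.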

Let's see if this helps resolve your issue.

dgildelaig · 03-01-2019

Btw, removing duplicate entries in the log is unnecessary. The first pre-processing pass already optimizes that, along with other aspects.

For instance, it summarizes what you call duplicated connections and sums bytes sent, bytes received, packets, etc.

If you check the folder "connections.parquet" that is created from the CSV files (within your Temporary Data Structures Folder), you will see that those 100GB of logs may have been reduced to a few MB.
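To illustrate the idea (this is not Expedition's actual code, just a sketch of the same kind of summary), here is a small Python example that collapses repeated connections into one row each and sums their counters. The column names and sample rows are hypothetical and do not match the real traffic log export format.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names, chosen for illustration only.
KEY_FIELDS = ("src", "dst", "dport", "app")
SUM_FIELDS = ("bytes_sent", "bytes_received", "packets")


def summarise(rows):
    """Collapse rows that share KEY_FIELDS into one row with summed counters."""
    totals = defaultdict(lambda: dict.fromkeys(SUM_FIELDS, 0))
    sessions = defaultdict(int)
    for row in rows:
        key = tuple(row[field] for field in KEY_FIELDS)
        for field in SUM_FIELDS:
            totals[key][field] += int(row[field])
        sessions[key] += 1
    # One output row per distinct connection, in order of first appearance.
    return [
        {**dict(zip(KEY_FIELDS, key)), **sums, "sessions": sessions[key]}
        for key, sums in totals.items()
    ]


# Made-up sample log lines.
SAMPLE = """\
src,dst,dport,app,bytes_sent,bytes_received,packets
10.0.0.5,192.0.2.10,443,ssl,1200,5400,12
10.0.0.5,192.0.2.10,443,ssl,800,3600,9
10.0.0.7,192.0.2.20,53,dns,90,210,2
10.0.0.5,192.0.2.10,443,ssl,1000,4000,10
"""

summary = summarise(csv.DictReader(io.StringIO(SAMPLE)))
for row in summary:
    print(row)
```

Here four log lines become two summary rows. Since traffic logs tend to repeat the same connections many times, this kind of grouping is why the processed data ends up so much smaller than the raw CSV.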

YungOng · 03-03-2019

Thanks @dgildelaig.

I've increased both the packet size and the buffer size, but it did not help with importing the large number of rules (~80k). Also, in this state, I can't remove the 'ML Enabled' tag from the rules; it returns the error dialog box. I deleted the project to get around this issue.

I have now set 'ML Enabled' on only one of the rules to get a manageable number of rules.

Thanks,

Yung

YungOng · 03-03-2019

You're right, processing the logs into a parquet file makes removing duplicate entries unnecessary.
