Single Incident/Playbook is killing the whole platform

Antanas — Wed, 28 Aug 2024 05:13:01 GMT

Hi,

I built a playbook that pulls some nested data (~8 MB in total), which is then used in a looped subplaybook for additional data extraction. The subplaybook is relatively simple: it uses a for-each input loop, starts with deletecontext (all=yes), and returns a couple of small parameters through Outputs. It is supposed to run ~600 iterations in total. The subplaybook also has some built-in JSON parsing and some regex conditions.

Once the subplaybook kicks in, at some point the platform slows down significantly and the console becomes unresponsive. Even the SSH shell becomes extremely slow: it took ~5 minutes to log in, and each letter I typed took a while to appear in the terminal. At some point I was able to load workers/status, which showed only a few tasks running (including the troubled one); more than half of the workers were still available. The System Diagnostics page (which also took ages to load) showed 80% CPU and 80% memory use (normally CPU is under 10% and memory under 60%). The top command on the backend also showed quite significant CPU usage. The playbook is mostly built on defaults, with nothing much customized.

Eventually SSH stopped responding. After an hour of waiting I cold rebooted the server and then closed the incident before it could continue.

What could be the cause of this behavior? I believe I have much more complex and intensive playbooks that do not cause such issues. Where should I look for clues?
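One low-cost way to collect clues during a single reproduction would be to log which processes hold the most resident memory while the playbook runs. The following is only a minimal sketch, assuming a Linux host with /proc mounted. The names parse_status and top_by_rss are hypothetical helpers, not part of XSOAR, and the script reads nothing beyond the standard Name and VmRSS fields of /proc/<pid>/status.

```python
import os


def parse_status(text):
    """Return (name, rss_kb) parsed from the text of a /proc/<pid>/status file."""
    name, rss_kb = "?", 0
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key == "Name":
            name = value.strip()
        elif key == "VmRSS":
            # The value looks like "   1234 kB"; kernel threads have no VmRSS line.
            rss_kb = int(value.split()[0])
    return name, rss_kb


def top_by_rss(proc_root="/proc", n=5):
    """Return the n largest processes as (rss_kb, pid, name), biggest first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                name, rss_kb = parse_status(fh.read())
        except OSError:
            continue  # the process exited or is not readable
        rows.append((rss_kb, int(entry), name))
    return sorted(rows, reverse=True)[:n]


if __name__ == "__main__":
    for rss_kb, pid, name in top_by_rss():
        print(f"{pid:>7} {rss_kb:>10} kB  {name}")
```

Running this every minute or so and appending the output to a file, started before the incident is reopened, might show whether the server process or a container is the one growing, even if the console and SSH later become unusable.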

How is it possible that a single task/playbook kills the whole platform? Why does it take so many resources from the host? Shouldn't resource allocation be restricted to a worker? Isn't that the point of having podman/docker?

I do not have a test environment in which to experiment with the playbook, so replicating the issue is quite costly.

I'd appreciate any ideas.

Thanks,

Antanas

Re: Single Incident/Playbook is killing the whole platform

Antanas — Sun, 01 Sep 2024 05:09:43 GMT

Funnily enough, Support suggested that I close the case I raised for this and post the question here 🙂
