Expedition ML Failed

L2 Linker

Hello!

 

Machine Learning analysis is failing. In the GUI, the error shown is "Failed". In the /tmp/error_SecRulesLearn logs I see this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 89.0 failed 1 times, most recent failure: Lost task 26.0 in stage 89.0 (TID 6323, localhost, executor driver): java.io.FileNotFoundException: /tmp/blockmgr-9a58c978-4463-4f0c-a09d-e4bcc41c3691/25/temp_shuffle_1ae06377-b4dd-4d10-a389-7e656f2e5fd5 (Too many open files)

 

Has anybody seen this error? Please help.

 

Thank you and regards,

Maja

Accepted Solutions

L5 Sessionator

Hi,

I have not experienced this issue before.

However, judging by the error description, the process seems to have too many files open, and the open-file limit (ulimit, going technical here) may have been exceeded.
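A minimal sketch for checking that limit, assuming a Linux host with Python 3 available. It reports the open-file limit (the value `ulimit -n` shows) for the process running the script. The Spark process started by Expedition may run under a different user with different limits, so treat the numbers as indicative only. The function names are illustrative.

```python
import os
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count(pid="self"):
    """Count the open file descriptors of a process via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"soft limit: {soft}, hard limit: {hard}")
    # /proc is only available on Linux, so skip the count elsewhere.
    if os.path.isdir("/proc/self/fd"):
        print(f"descriptors open in this process: {open_fd_count()}")
```

If the soft limit is low (1024 is a common default), a job that opens many temporary files in parallel can plausibly run into it.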

To reduce the number of open files, one thing you could try is to decrease the number of executors. Spark, which is what Expedition uses, lets us reduce the number of executors by setting the number of CPUs that Expedition assigns to Spark.

 

You can find this value in /home/userSpace/environmentParameters.php

Check whether the number of CPUs in use (NumCPUs) can be decreased a little.
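To illustrate why fewer CPUs might help: the `temp_shuffle_` file in the error path is a temporary Spark shuffle file, and tasks running in parallel can each hold such files open at the same time. The sketch below is a rough back-of-the-envelope estimate, not a description of Expedition's internals. It assumes that the number of concurrent tasks equals the CPU count, and `files_per_task` is a hypothetical figure chosen only for the example.

```python
def estimated_open_shuffle_files(num_cpus, files_per_task):
    """Rough upper bound: every concurrent task holds files_per_task files open."""
    return num_cpus * files_per_task


def max_cpus_within_limit(open_file_limit, files_per_task, reserved=0):
    """Largest CPU count whose estimate stays within the open-file limit.

    reserved is a hypothetical allowance for descriptors used by
    everything else in the process (logs, sockets, JARs).
    """
    if files_per_task <= 0:
        raise ValueError("files_per_task must be positive")
    usable = max(open_file_limit - reserved, 0)
    return usable // files_per_task


if __name__ == "__main__":
    # files_per_task=100 is an assumption, not a measured value.
    for cpus in (51, 40):
        print(cpus, "CPUs ->", estimated_open_shuffle_files(cpus, 100), "files")
```

Under this simplified model the estimate scales linearly with the CPU count, so lowering NumCPUs lowers the peak number of open files proportionally. Raising the open-file limit would be the other lever.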

 

I hope this helps

L2 Linker

Hello!

 

I decreased the number of CPUs from 51 to 40, and now Machine Learning is working. Thank you very much for your help.

 

Regards,

Maja
