Nodes keep stopping - how to start and keep them started?

L1 Bithead

Re: Nodes keep stopping - how to start and keep them started?

Tonight I finished building this out in my lab: CentOS 7 and the 'stable' branch of the minemeld Ansible playbook. I added the alienvault.reputation feed and, sure enough, it killed off most of the four other miners once it hit ~15-18k indicators, about 3-5 minutes after the engine started. Restarting the engine restarts all miners, but the failure mode is consistent: the alienvault.reputation miner never advances past that number (or completes polling), even though the feed has around 60k indicators (as seen on a successful poll in my Ubuntu MM). Interestingly, the feed works fine in my Ubuntu MM with just 1 GB of RAM and 1 core (and about 20 other feeds).

 

I noted that this miner uses Class: minemeld.ft.csv.CSVFT, and I wonder if it's an issue with that class on CentOS. To check, I found another miner using this class (bambenekconsulting.c2_ipmasterlist) to see whether the behavior is the same, but it completes polling successfully. It has only 437 indicators, so next I'll try a feed of the same class with more indicators, closer to the scale of alienvault. This will take some trial and error.

 

Another option I found is to try running it in Docker, where the container includes all the necessary dependencies: https://hub.docker.com/r/jtschichold/minemeld/ but I'm out of time tonight. I'll take a look at this later this week, unless you want to take a first pass @hshawn

 

Cheers

L1 Bithead

Re: Nodes keep stopping - how to start and keep them started?

I did try other miners of the same class, and they worked without incident, but were all small. So perhaps it's scale related. I created a new miner (not a clone) to temporarily mine the alienvault.reputation database from a website I control. This new miner crashed in the same way the included alienvault.reputation one did. 

 

I noted minemeld-traced.log is non-zero bytes, and upon looking in there it seems the engine crashes due to unhandled exceptions.

2018-07-11T00:43:37 (8424)storage.remove_reference WARNING: Attempt to remove non existing reference: 2e5b0b9c-c47b-44
Traceback (most recent call last):
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/gevent/greenlet.py", line 327, in run
    result = self._run(*self.args, **self.kwargs)
  File "/opt/minemeld/engine/core/minemeld/comm/amqp.py", line 561, in _ioloop
    conn.drain_events()
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/amqp/connection.py", line 323, in drain_events
    return amqp_method(channel, args)
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/amqp/connection.py", line 529, in _close
    (class_id, method_id), ConnectionError)
ConnectionForced: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
<Greenlet at 0x7fb4e5de5230: <bound method AMQP._ioloop of <minemeld.comm.amqp.AMQP object at 0x7fb4e5e021d0>>(0)> fai

 

Everything begins to fall apart in the minemeld-engine.log here (before this timestamp things appear healthy):

2018-07-11T00:38:59 (9727)mgmtbus._merge_status ERROR: old clock: 90 > 74 - dropped
2018-07-11T00:38:59 (9727)mgmtbus._merge_status ERROR: old clock: 21 > 20 - dropped
2018-07-11T00:38:59 (9727)mgmtbus._merge_status ERROR: old clock: 20 > 19 - dropped
2018-07-11T00:38:59 (9727)mgmtbus._merge_status ERROR: old clock: 22 > 20 - dropped
Traceback (most recent call last):
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/gevent/greenlet.py", line 327, in run
    result = self._run(*self.args, **self.kwargs)
  File "/opt/minemeld/engine/core/minemeld/comm/amqp.py", line 561, in _ioloop
    conn.drain_events()
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/amqp/connection.py", line 323, in drain_events
    return amqp_method(channel, args)
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/amqp/channel.py", line 241, in _close
    reply_code, reply_text, (class_id, method_id), ChannelError,
NotFound: Basic.publish: (404) NOT_FOUND - no exchange 'inboundaggregator' in vhost '/'
<Greenlet at 0x7f7939978190: <bound method AMQP._ioloop of <minemeld.comm.amqp.AMQP object at 0x7f793b0d1d10>>(11)> fa

2018-07-11T00:39:17 (9741)amqp._ioloop_failure ERROR: _ioloop_failure: exception in ioloop
Traceback (most recent call last):
  File "/opt/minemeld/engine/core/minemeld/comm/amqp.py", line 567, in _ioloop_failure
    g.get()
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/gevent/greenlet.py", line 251, in get
    raise self._exception
NotFound: Basic.publish: (404) NOT_FOUND - no exchange 'inboundaggregator' in vhost '/'
2018-07-11T00:39:17 (9741)chassis.stop INFO: chassis stop called

Comparing this to my Ubuntu MM, I noted that its logs contain errors too, but no 'clock' errors. I started poking around in the Python code to see how the clock is used, but am now out of time again. Documenting all of this here before it slips my mind.
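
To pin down where a log first goes wrong, a small script can pull out the earliest ERROR line per component. This is only a minimal sketch, not part of MineMeld: the regular expression is inferred solely from the log lines quoted above (timestamp, PID in parentheses, `component.function`, level), and the name `first_errors` is made up for illustration.

```python
import re

# Pattern inferred from the quoted lines, e.g.
# 2018-07-11T00:38:59 (9727)mgmtbus._merge_status ERROR: old clock: 90 > 74 - dropped
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}) "
    r"\((?P<pid>\d+)\)(?P<source>\S+) "
    r"(?P<level>[A-Z]+): (?P<msg>.*)$"
)


def first_errors(lines):
    """Return {source: (timestamp, message)} for the first ERROR seen per source.

    Lines that do not match the pattern (traceback bodies, blank lines)
    are ignored.
    """
    seen = {}
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if not match or match.group("level") != "ERROR":
            continue
        source = match.group("source")
        if source not in seen:
            seen[source] = (match.group("ts"), match.group("msg"))
    return seen
```

Passing an open file handle for minemeld-engine.log to `first_errors` and sorting the result by timestamp shows which component complained first, which makes a side-by-side comparison with a healthy instance easier.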

L4 Transporter

Re: Nodes keep stopping - how to start and keep them started?

I gave the Docker image a shot yesterday. It still seems to be a work in progress: it did come up, but logging into MineMeld was not working, so I shelved it and went back to poking at the CentOS setup. It sounds like you are going down some rabbit holes with this, @tyreed. Don't stare at it too long, your vision may blur ;)

L7 Applicator

Re: Nodes keep stopping - how to start and keep them started?

Hi @tyreed,

 

I have fixed this issue in the latest version of minemeld-ansible. Could you try reinstalling the instance with the latest minemeld-ansible playbook, please?

 

L4 Transporter

Re: Nodes keep stopping - how to start and keep them started?

@lmori I took the following steps this morning:

 

* refresh git repo

* re-run ansible playbook

* reboot server

* add alienvault reputation

 

Results: it looked like it was going to be happy, but after about a minute all the nodes stopped and the alienvault poll stalled at around 17k indicators (there are usually around 50k in there).

 

After removing alienvault and committing, everything is happy again (but it takes about 8 minutes to restart the engine when this happens). Is there a better way to update the system? Should I completely wipe out the previous install and redo it from zero?

 

TIA!

 

Errors in the log around the time of this event (possibly unrelated):

[2018-07-13 07:41:04 PDT] [887] [ERROR] Exception on /status/minemeld [GET]
Traceback (most recent call last):
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/flask/app.py", line 1988, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/flask/app.py", line 1641, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/flask/app.py", line 1544, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/flask/app.py", line 1639, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/flask/app.py", line 1625, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/opt/minemeld/engine/core/minemeld/flask/aaa.py", line 125, in decorated_view
    return f(*args, **kwargs)
  File "/opt/minemeld/engine/core/minemeld/flask/aaa.py", line 135, in decorated_view
    return f(*args, **kwargs)
  File "/opt/minemeld/engine/core/minemeld/flask/statusapi.py", line 171, in get_minemeld_status
    status = MMMaster.status()
  File "/opt/minemeld/engine/core/minemeld/flask/mmrpc.py", line 51, in status
    return self._send_cmd('status')
  File "/opt/minemeld/engine/core/minemeld/flask/mmrpc.py", line 41, in _send_cmd
    self._open_channel()
  File "/opt/minemeld/engine/core/minemeld/flask/mmrpc.py", line 38, in _open_channel
    self.comm.start()
  File "/opt/minemeld/engine/core/minemeld/comm/amqp.py", line 595, in start
    c = amqp.connection.Connection(**self.amqp_config)
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/amqp/connection.py", line 165, in __init__
    self.transport = self.Transport(host, connect_timeout, ssl)
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/amqp/connection.py", line 186, in Transport
    return create_transport(host, connect_timeout, ssl)
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/amqp/transport.py", line 299, in create_transport
    return TCPTransport(host, connect_timeout)
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/amqp/transport.py", line 95, in __init__
    raise socket.error(last_err)
error: [Errno 111] Connection refused
[2018-07-13 07:41:05 +0000] [887] [ERROR] Error handling request
Traceback (most recent call last):
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/gunicorn/workers/async.py", line 52, in handle
    self.handle_request(listener_name, req, client, addr)
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/gunicorn/workers/ggevent.py", line 159, in handle_request
    super(GeventWorker, self).handle_request(*args)
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/gunicorn/workers/async.py", line 105, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/flask/app.py", line 2000, in __call__
    return self.wsgi_app(environ, start_response)
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/flask/app.py", line 1996, in wsgi_app
    ctx.auto_pop(error)
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/flask/ctx.py", line 387, in auto_pop
    self.pop(exc)
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/flask/ctx.py", line 376, in pop
    app_ctx.pop(exc)
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/flask/ctx.py", line 189, in pop
    self.app.do_teardown_appcontext(exc)
  File "/opt/minemeld/engine/current/lib/python2.7/site-packages/flask/app.py", line 1898, in do_teardown_appcontext
    func(exc)
  File "/opt/minemeld/engine/core/minemeld/flask/mmrpc.py", line 207, in teardown
    g.MMMaster.stop()
  File "/opt/minemeld/engine/core/minemeld/flask/mmrpc.py", line 55, in stop
    self.comm.stop()
  File "/opt/minemeld/engine/core/minemeld/comm/amqp.py", line 685, in stop
    self.rpc_out_channel.close()
AttributeError: 'NoneType' object has no attribute 'close'
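
The `[Errno 111] Connection refused` in the first traceback means nothing was accepting TCP connections at the AMQP broker address when the API asked for status; the second traceback (`'NoneType' object has no attribute 'close'`) appears to be a follow-on from that same failed connection. A quick way to check whether the broker is listening at all is a plain TCP connect. This is a rough sketch that assumes the broker runs on 127.0.0.1 and uses RabbitMQ's default AMQP port 5672; the helper name `broker_reachable` is hypothetical and Python 3 is assumed.

```python
import socket


def broker_reachable(host="127.0.0.1", port=5672, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Only tests that something is listening; it does not speak AMQP.
    """
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers "connection refused" (errno 111 on Linux) and timeouts.
        return False
    conn.close()
    return True
```

Calling `broker_reachable()` right after the nodes stop tells you whether the broker itself went away or whether only the MineMeld side lost its connection.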

 

Update: I tried leaving alienvault out of the aggregator until it had finished polling. The alienvault node polls to completion as long as it is not connected to the aggregator; as soon as it is added, the nodes get killed.

 

 

Re: Nodes keep stopping - how to start and keep them started?

Hi,

A few days ago, I installed MineMeld with Ansible on Ubuntu 18.

I think I have a similar problem, but with urlhaus.URL.


(Attachments: five MineMeld screenshots dated 2018-10-08.)

 

Regards

L4 Transporter

Re: Nodes keep stopping - how to start and keep them started?

Might be related, but what I am seeing is that all nodes stop instead of just some of them.

 

OT, but how did you get the Ansible install to work on Ubuntu 18? Mine errors out right away.

L7 Applicator

Re: Nodes keep stopping - how to start and keep them started?

Hi @Sistemas_SanLucar,

that is a conflict with the RabbitMQ version installed by default on Ubuntu 18, which is also why we don't support Ubuntu 18 yet in the Ansible playbook. You should install an older RabbitMQ release (3.2.X).

Re: Nodes keep stopping - how to start and keep them started?



1)a) Add the repositories to the file /etc/apt/sources.list

deb http://us.archive.ubuntu.com/ubuntu/ bionic universe
deb http://minemeld-updates.panw.io/ubuntu trusty-minemeld main

1)    $ sudo apt-get update
2)    $ sudo apt-get upgrade # optional
3)    $ sudo apt-get install -y gcc git python-minimal python2.7-dev libffi-dev libssl-dev make
4)    $ wget https://bootstrap.pypa.io/get-pip.py
5)    $ sudo -H python get-pip.py
6)    $ sudo -H pip install ansible
7)    $ git clone https://github.com/PaloAltoNetworks/minemeld-ansible.git
8)    $ cd minemeld-ansible

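8)a) After step 8), we modified the configuration

I copied each Ubuntu-16.04 file to an Ubuntu-18.04 equivalent in the directory structure (Ubuntu-16.04.yml to Ubuntu-18.04.yml, and Ubuntu-16.04-post.yml to Ubuntu-18.04-post.yml); the 18.04 files didn't exist in the setup.
Current structure: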


./roles/infrastructure/vars/Ubuntu-14.04.yml
./roles/infrastructure/vars/Ubuntu-16.04.yml
./roles/infrastructure/vars/Ubuntu-18.04.yml

./roles/minemeld/vars/Ubuntu-14.04.yml
./roles/minemeld/vars/Ubuntu-16.04.yml
./roles/minemeld/vars/Ubuntu-18.04.yml

./roles/minemeld/tasks/Ubuntu-18.04-post.yml
./roles/minemeld/tasks/Ubuntu-16.04-post.yml
./roles/minemeld/tasks/Ubuntu-14.04-post.yml

8)b) As indicated at the beginning of the README.md installation manual, we modified the local.yml file to install the stable version instead of the "develop" development one.
The local.yml file then looks like this:

-----------------------------------------
- name: minemeld playbook
  hosts: 127.0.0.1
  connection: local
  become: true

  vars:
  #  minemeld_version: develop
    file_permissions: 'u=rwX,g=rwX,o=rX'
  # uncomment the following to install stable
    minemeld_version: master
    group_permissions: 'u=rwX,g=rX,o=rX'
  # remove comment to set custom repositories
  # core_repo: "https://github.com/jtschichold/minemeld-core.git"
  # prototype_repo: "https://github.com/jtschichold/minemeld-node-prototypes.git"
  # webui_repo: "https://github.com/jtschichold/minemeld-webui.git"

  roles:
  - infrastructure
  - minemeld
-------------------------------------------

9)    $ ansible-playbook -K -i 127.0.0.1, local.yml
10)    $ usermod -a -G minemeld <your user> # add your user to minemeld group, useful for development
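
The file copies described in step 8)a) can be sketched as a short script. This is only an illustration under assumptions: it expects to be pointed at the root of the cloned minemeld-ansible directory, it uses only the three paths listed above, and the function name `clone_release_files` is made up here (Python 3 assumed).

```python
import shutil
from pathlib import Path

# Relative paths taken from the listing in step 8)a); "{}" is the Ubuntu release.
TEMPLATES = [
    "roles/infrastructure/vars/Ubuntu-{}.yml",
    "roles/minemeld/vars/Ubuntu-{}.yml",
    "roles/minemeld/tasks/Ubuntu-{}-post.yml",
]


def clone_release_files(repo_root, src="16.04", dst="18.04"):
    """Copy each Ubuntu-<src> file to Ubuntu-<dst>, leaving existing targets alone.

    Returns the list of files that were created.
    """
    created = []
    for template in TEMPLATES:
        source = Path(repo_root) / template.format(src)
        target = Path(repo_root) / template.format(dst)
        if not target.exists():
            shutil.copy(str(source), str(target))
            created.append(target)
    return created
```

Run from inside the cloned repository, `clone_release_files(".")` creates the three Ubuntu-18.04 files; running it a second time is a no-op because existing targets are skipped.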


Re: Nodes keep stopping - how to start and keep them started?

How do I do it?

 

thank you
