No comms on OSX #8

Open
gerkey opened this issue Jul 28, 2016 · 5 comments

Comments

@gerkey
Contributor

gerkey commented Jul 28, 2016

I'm trying out cros on OSX 10.10. I have a branch that fixes simple compile problems and avoids passing -1 file descriptors to FD_SET() (the manpage says that this leads to undefined behavior, and while Linux tolerates it, OSX just segfaults):

master...gerkey:os_fixes2
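
For illustration only, a guard of that kind could look like the sketch below. The helper name `fd_set_checked` is hypothetical and is not taken from the branch; it simply refuses descriptors that are negative or not below `FD_SETSIZE`, which are the values for which `FD_SET()` is undefined.

```c
#include <sys/select.h>

/* Hypothetical helper (not part of cROS): add fd to the set only when it is
 * a value FD_SET() can legally handle. Returns 1 if fd was added, 0 if it
 * was skipped. */
static int fd_set_checked(int fd, fd_set *set)
{
    /* FD_SET() is undefined for fd < 0 or fd >= FD_SETSIZE. */
    if (fd < 0 || fd >= FD_SETSIZE)
        return 0;
    FD_SET(fd, set);
    return 1;
}
```

A caller would use `fd_set_checked(fd, &w_fds)` wherever it currently calls `FD_SET(fd, &w_fds)` directly, and skip the descriptor when the helper returns 0.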

So now I can build and run the examples, but nothing connects. I don't see any indication that roscore is being contacted by the nodes. I'm happy to do the debugging, but I'd love to get some guidance on where to look for the problem.

@onnivoro
Collaborator

onnivoro commented Aug 2, 2016

Hi! Thanks for getting cROS to compile on Mac OS X!
I will try your branch and give you feedback on the communication issues soon.

@onnivoro
Collaborator

onnivoro commented Sep 5, 2016

Ok, I made some trials with the following configuration:

  • standard indigo release on Ubuntu 14.04 virtual machine
  • cROS running under Mac OS X Yosemite (10.10.5 (14F1808))
  • cROS api_test.c running
  • roscore running

The problem is the select() call: it returns a timeout every time.
To check this, open cros_defs.h and increase the verbosity level by setting #define CROS_DEBUG_LEVEL 2.
I suggest commenting out all the topic/service registration calls in the api_test node. That way the node just calls an XMLRPC function in a loop to test the communication status.

@gerkey
Contributor Author

gerkey commented Nov 3, 2016

I finally had a chance to look back at this. I've tracked the problem down to an apparent difference between Linux and OSX in the behavior of select() on freshly created sockets. Here's a simple program that does the same steps as openXmlrpcClientSocket() to create a new socket, and then immediately uses select() to check for writability:

#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>

int
main(void)
{
  int fd = socket ( AF_INET, SOCK_STREAM, IPPROTO_TCP );
  if(fd == -1)
  {
    printf("socket()\n");
    return 1;
  }
  int val = 1;
  if ( setsockopt ( fd, SOL_SOCKET, SO_REUSEADDR, ( const char* ) ( &val ), sizeof ( int ) ) != 0 )
  {
    printf ( "tcpIpSocketSetReuse() : setsockopt() with SO_REUSEADDR option failed \n" );
    return 1;
  }
  if ( fcntl ( fd, F_SETFL, O_NONBLOCK ) != 0 )
  {
    printf ( "tcpIpSocketSetNonBlocking() : fcntl() failed \n" );
    return 1;
  }

  fd_set w_fds;
  FD_ZERO(&w_fds);
  FD_SET(fd, &w_fds);
  struct timeval timeout;
  timeout.tv_sec = 0;
  timeout.tv_usec = 0;
  int ret = select(fd+1, NULL, &w_fds, NULL, &timeout);
  printf("select return: %d\n", ret);
  if(ret<0)
  {
    printf("error\n");
    return 1;
  } else if(ret==0) {
    printf("timeout\n");
    return 1;
  } else {
    printf("FD_ISSET: %d\n", FD_ISSET(fd, &w_fds));
  }
  printf("success\n");
  return 0;
}

On Linux, this program gives the following output:

select return: 1
FD_ISSET: 1
success

On OSX, here's the output:

select return: 0
timeout

That is, on Linux, a freshly created socket is apparently writable, while on OSX, it is not.

This is a problem in cros because, to get things going, we need to find a writable xmlrpc_client_fd, which causes us to call doWithXmlrpcClientSocket(). Only then, inside doWithXmlrpcClientSocket(), does the socket actually get connected. On OSX the unconnected socket never shows up as writable, so the connect never happens.

I'll now try for a workaround.

@gerkey
Contributor Author

gerkey commented Nov 4, 2016

I haven't figured out the right way to fix the problem on OSX. In principle, I believe that we need to initiate the connect() call on the socket before going into the select() loop. For example, I could imagine doing something like this before the select() call:

if(!n->xmlrpc_client_proc[i].socket.connected)
  tcpIpSocketConnect(...);

If the socket is non-blocking, then it would be OK to initiate the connect() call there. But making this change would require restructuring the tcpIpSocketConnect() call.
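
As a rough, standalone sketch of that idea (not cROS code), the following starts a non-blocking connect() first and only then uses select() to wait for writability, checking SO_ERROR to see whether the connect succeeded. The helper names `start_connect`, `wait_connected`, and `demo_connect_to_loopback` are hypothetical, and the demo connects to a throwaway listener on the loopback interface rather than to roscore.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: make fd non-blocking and start connect().
 * Returns 0 if the connection completed or is in progress, -1 on error. */
static int start_connect(int fd, const struct sockaddr *addr, socklen_t len)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;
    if (connect(fd, addr, len) == 0)
        return 0;
    return (errno == EINPROGRESS) ? 0 : -1;
}

/* Hypothetical helper: wait for a pending connect() to finish.
 * Returns 1 if connected, 0 on timeout, -1 on failure. */
static int wait_connected(int fd, long timeout_ms)
{
    fd_set w_fds;
    struct timeval tv;
    int err = 0;
    socklen_t errlen = sizeof(err);
    int ret;

    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;
    FD_ZERO(&w_fds);
    FD_SET(fd, &w_fds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    ret = select(fd + 1, NULL, &w_fds, NULL, &tv);
    if (ret <= 0)
        return ret; /* 0: timeout, -1: select() error */
    /* Writable only means the connect attempt finished; SO_ERROR says how. */
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) != 0 || err != 0)
        return -1;
    return 1;
}

/* Demo: connect to a temporary listener on 127.0.0.1 and report the result
 * of wait_connected(). */
static int demo_connect_to_loopback(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int result = -1;
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    int client = socket(AF_INET, SOCK_STREAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */

    if (listener >= 0 && client >= 0 &&
        bind(listener, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
        listen(listener, 1) == 0 &&
        getsockname(listener, (struct sockaddr *)&addr, &len) == 0 &&
        start_connect(client, (struct sockaddr *)&addr, sizeof(addr)) == 0)
        result = wait_connected(client, 1000);

    if (client >= 0)
        close(client);
    if (listener >= 0)
        close(listener);
    return result;
}
```

Because connect() is issued before select(), the wait no longer depends on whether the platform reports a fresh, unconnected socket as writable.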

@onnivoro, what do you think about how to address this issue?

Also, would you consider accepting a PR to fix the basic compile, link, and run issues on OSX, without making it actually work? Or would you prefer to wait until we actually have it working? It would be a bit easier for me if the basic compilation fixes were merged earlier.

@onnivoro
Collaborator

@gerkey, you are welcome to open a pull request with the OSX compile fixes. Many thanks for your work on tracking down the select() problem. I am finally working on it and hope to have a suitable solution soon.
