
Make /tf_static use transient_local durability #160

Merged — sloretz merged 4 commits into ros2 from static_transient_local on Sep 10, 2019
Conversation

@sloretz (Contributor) commented Aug 29, 2019

This makes /tf_static use transient_local durability to get behavior like ROS 1 latching.

The static broadcaster has a queue depth of 1, so only the latest message from a broadcaster is given to late-joining listeners. This matches latching in ROS 1.
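For illustration, a minimal sketch (not code from this PR) of a publisher with these QoS settings; the node name is made up, while /tf_static and tf2_msgs::msg::TFMessage are the actual topic and message type:

#include <memory>

#include <rclcpp/rclcpp.hpp>
#include <tf2_msgs/msg/tf_message.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("static_broadcaster_sketch");

  // KEEP_LAST with depth 1 plus TRANSIENT_LOCAL: the middleware stores only the
  // most recent message and re-delivers it to late-joining subscribers.
  rclcpp::QoS qos(rclcpp::KeepLast(1));
  qos.transient_local();

  auto pub = node->create_publisher<tf2_msgs::msg::TFMessage>("/tf_static", qos);

  tf2_msgs::msg::TFMessage msg;  // a real broadcaster would fill in transforms here
  pub->publish(msg);             // published once; late joiners will still receive it

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}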

@sloretz added the enhancement (New feature or request) label on Aug 29, 2019
@sloretz self-assigned this on Aug 29, 2019
@sloretz (Contributor, Author) commented Aug 29, 2019

CI (testing all packages above tf2_ros)

  • Linux
  • Linux-aarch64
  • macOS
  • Windows

@sloretz added the in review (Waiting for review, Kanban column) label on Aug 29, 2019
@clalancette (Contributor) left a review:

Generally looks OK to me. There is one nit and one documentation thing that we should do.

Review thread on tf2_ros/include/tf2_ros/qos.hpp (outdated, resolved).

Review thread on the following diff lines:
// broadcaster.sendTransform(msg);
// ROS_INFO("Spinning until killed publishing %s to %s", msg.header.frame_id.c_str(), msg.child_frame_id.c_str());
// rclcpp::spin(node);
broadcaster.sendTransform(msg);
@clalancette (Contributor) commented on the diff above:
In general, I think this is the right change to make. However, there is some possibility of downstream breakage if existing code subscribes to the /tf_static topic directly, without using the TransformListener class and without transient_local durability. I'll request that you make a release note for Eloquent that states that we are changing the behavior of the static_transform_publisher program.
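To make the breakage concern concrete, here is a hedged sketch (assumed usage, not code from this PR) of a plain subscription to /tf_static that bypasses TransformListener; it now has to request TRANSIENT_LOCAL itself, otherwise it will not receive static transforms that were published before it started. The node name and log text are invented:

#include <memory>

#include <rclcpp/rclcpp.hpp>
#include <tf2_msgs/msg/tf_message.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("raw_tf_static_listener");

  rclcpp::QoS qos(rclcpp::KeepLast(100));
  qos.transient_local();  // required to receive transforms published before this node started

  auto logger = node->get_logger();
  auto sub = node->create_subscription<tf2_msgs::msg::TFMessage>(
    "/tf_static", qos,
    [logger](tf2_msgs::msg::TFMessage::SharedPtr msg) {
      RCLCPP_INFO(logger, "received %zu static transforms", msg->transforms.size());
    });
  (void)sub;  // keep the subscription alive while spinning

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}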

@sloretz (Contributor, Author) commented Sep 4, 2019

Should StaticQoS be divided into different QoS for the Broadcaster and Listener?

The current state of the PR has both the Broadcaster and the Listener use a depth of 100, which I think means the Listener would receive the last 100 messages from the broadcaster. In ROS 1 only the very latest message was latched, which would be like the Broadcaster having a depth of 1 in ROS 2.

Is it more desirable for the Broadcaster to only offer the latest transform message to late joining subscribers, or for the Broadcaster to be able to send multiple transform messages?

EDIT:

And if it's desirable for the Static Broadcaster to offer multiple messages, why stop at 100? Why not use KEEP_ALL?
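For reference, the options being weighed map roughly onto the following rclcpp history settings (assumed usage for illustration, not code from this PR); for /tf_static each would additionally be marked transient_local:

#include <rclcpp/qos.hpp>

int main()
{
  rclcpp::QoS latest_only{rclcpp::KeepLast(1)};     // latch only the newest message, closest to ROS 1
  rclcpp::QoS last_hundred{rclcpp::KeepLast(100)};  // keep up to 100 messages for late joiners
  rclcpp::QoS keep_everything{rclcpp::KeepAll()};   // unbounded history (the KEEP_ALL option above)

  // Each of these would also request transient_local durability for /tf_static.
  latest_only.transient_local();
  last_hundred.transient_local();
  keep_everything.transient_local();
  return 0;
}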

@clalancette (Contributor) commented:
Should StaticQoS be divided into different QoS for the Broadcaster and Listener?

Hm, this is a good point. Given that this is tf_static, I would think that things published here wouldn't change, and thus a depth of 1 would be appropriate here on both sides. But maybe I'm missing something that happens on tf_static.

@sloretz (Contributor, Author) commented Sep 5, 2019

Given that this is tf_static, I would think that things published here wouldn't change, and thus a depth of 1 would be appropriate here on both sides.

IIUC if the listener's depth is 1 and there are 2 static transform broadcasters, the listener might miss one of the messages if they come in at the same time.

I divided the QoS classes into [Dynamic|Static]BroadcasterQoS and [Dynamic|Static]ListenerQoS. The static broadcaster's depth is 1 to be closer to the behavior in ROS 1.
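A hedged sketch of what the static halves of that split could look like; the class names come from the comment above, but the actual definitions in tf2_ros/include/tf2_ros/qos.hpp may differ:

#include <rclcpp/qos.hpp>

namespace tf2_ros
{
class StaticBroadcasterQoS : public rclcpp::QoS
{
public:
  // Depth 1 plus transient_local: only the latest message is latched, as in ROS 1.
  StaticBroadcasterQoS() : rclcpp::QoS(1) { transient_local(); }
};

class StaticListenerQoS : public rclcpp::QoS
{
public:
  // A larger depth so a listener can keep messages from several static broadcasters.
  StaticListenerQoS() : rclcpp::QoS(100) { transient_local(); }
};
}  // namespace tf2_ros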

@clalancette (Contributor) commented:

IIUC if the listener's depth is 1 and there are 2 static transform broadcasters, the listener might miss one of the messages if they come in at the same time.

To test this, I set up two transient_local publishers with a single transient_local subscriber. I start up the two publishers, then the subscriber, to see what happens. With both Fast-RTPS and CycloneDDS, the subscriber gets both messages, even with a depth of 1. With OpenSplice, I see the behavior you describe, where I only see one of the two. The question is whether this is a bug in OpenSplice, or whether this is something that isn't guaranteed by RTPS. I think we'll need to figure out the answer to that question before making a decision here.

I divided the QoS classes into [Dynamic|Static]BroadcasterQoS and [Dynamic|Static]ListenerQoS. The static broadcaster's depth is 1 to be closer to the behavior in ROS 1.

Yeah, I think this is a good idea regardless.

@sloretz (Contributor, Author) commented Sep 5, 2019

To test this, I set up two transient_local publishers with a single transient_local subscriber. I start up the two publishers, then the subscriber, to see what happens. With both Fast-RTPS and CycloneDDS, the subscriber gets both messages, even with a depth of 1. With OpenSplice, I see the behavior you describe, where I only see one of the two.

I thought a delay between the subscription being created and the subscription taking messages from the middleware would make a difference, but even with that I see the subscription get all messages with a depth of 1 using Fast-RTPS.

spub.py
import sys

import rclpy
import rclpy.node
import rclpy.qos
import std_msgs.msg

rclpy.init()

node = rclpy.node.Node('spub' + sys.argv[1])

qos = rclpy.qos.QoSProfile(
    depth=1,
    durability=rclpy.qos.DurabilityPolicy.TRANSIENT_LOCAL,
    history=rclpy.qos.HistoryPolicy.KEEP_LAST,
)

pub = node.create_publisher(std_msgs.msg.String, "/tf", qos)

pub.publish(std_msgs.msg.String(data='Hello' + sys.argv[1]))

rclpy.spin(node)
rclpy.shutdown()
ssub.py
import sys
import time

import rclpy
import rclpy.node
import rclpy.qos
import std_msgs.msg

rclpy.init()

node = rclpy.node.Node('ssub')

qos = rclpy.qos.QoSProfile(
    depth=1,
    durability=rclpy.qos.DurabilityPolicy.TRANSIENT_LOCAL,
    history=rclpy.qos.HistoryPolicy.KEEP_LAST,
)

def rx_msg(msg):
    node.get_logger().info(repr(msg))

sub = node.create_subscription(std_msgs.msg.String, "/tf", rx_msg, qos)

# Delay between subscriber matching and spinning
time.sleep(10)

rclpy.spin(node)
rclpy.shutdown()
$ python3 spub.py Alice & python3 spub.py Bob & python3 spub.py Trudy & python3 ssub.py 
[1] 29016
[2] 29017
[3] 29018
[INFO] [ssub]: std_msgs.msg.String(data='HelloTrudy')
[INFO] [ssub]: std_msgs.msg.String(data='HelloBob')
[INFO] [ssub]: std_msgs.msg.String(data='HelloAlice')

The question is whether this is a bug in Opensplice, or whether this is something that isn't guaranteed by RTPS. I think we'll need to figure out the answer to that question before making a decision here.

Regardless, a depth of 1 on the publisher seems fine. It looks like there's another DDS QoS setting for RESOURCE_LIMITS with a field max_samples. Maybe the DataReader is "taking" samples from the DataWriter, but then storing up to max_samples locally, and if so maybe rmw_opensplice is setting max_samples to depth while the other two rmw implementations are not?

@wjwwood any ideas why all messages are received with depth of 1 on the subscriber using Fast-RTPS and CycloneDDS?

@wjwwood (Member) left a review:

lgtm in general.

And if it's desirable for the Static Broadcaster to offer multiple messages, why stop at 100? Why not use KEEP_ALL?

KEEP_ALL isn't very nice for real-time systems, so if we can safely get away with a known history depth as the default, I think that would be preferable.

The question is whether this is a bug in Opensplice, or whether this is something that isn't guaranteed by RTPS.

In my opinion, it's just a race condition. If you send two messages which are not keyed, then they're fighting for the same spot in a single "instance" with a history depth of 1. So unless you take the first out before the second arrives, I would expect them to replace one another. I understand that you're trying to eliminate this by sleeping in the subscription, but ultimately I don't know when latched messages are received or sent or anything.

@wjwwood any ideas why all messages are received with depth of 1 on the subscriber using Fast-RTPS and CycloneDDS?

No, but that might be a good question to ask of both the Fast-RTPS team (@richiprosima?) and perhaps @eboasson?

class RCLCPP_PUBLIC DynamicListenerQoS : public rclcpp::QoS
{
public:
  DynamicListenerQoS() : rclcpp::QoS(100) {}
@wjwwood (Member) commented on this excerpt:

For these, I'd recommend offering a single argument for the depth and default it to 100, that way users can customize it if desired when creating the transformer. You could argue they should instead use a rclcpp::QoS directly instead, but having an intermediate class lets us customize other things in the future, e.g. adding transient local in the Static versions.

I'd also recommend having a base class and using that as the type of the argument to the static/dynamic broadcaster constructor. That way it's harder for users to use the wrong kind of QoS. They can still of course change it however they want, but we can document which settings we feel are safe to change in that case, e.g. depth but not durability.
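One possible shape of the depth-argument part of this suggestion (assumed, not necessarily the merged code), keeping the default of 100 from the excerpt above:

#include <cstddef>

#include <rclcpp/qos.hpp>

namespace tf2_ros
{
class DynamicListenerQoS : public rclcpp::QoS
{
public:
  // The history depth is customizable but defaults to 100, so behavior is unchanged by default.
  explicit DynamicListenerQoS(std::size_t depth = 100) : rclcpp::QoS(depth) {}
};
}  // namespace tf2_ros

A shared base class, as suggested, could then serve as the constructor parameter type, making it harder to pass the wrong kind of QoS by accident.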

@sloretz (Contributor, Author) replied:

I added depth with default values in e32710b.

I didn't implement the base class recommendation. I'm assuming users who want to change the QoS settings beyond history depth are already aware of QoS compatibility.

Commits (each Signed-off-by: Shane Loretz <[email protected]>), including: Static Broadcaster QoS depth is 1
@sloretz force-pushed the static_transient_local branch from 43ddb87 to e32710b on September 6, 2019
@eboasson commented Sep 7, 2019

@sloretz @wjwwood this was too intriguing not to have a good look at what's going on here. A QoS setting of KEEP_LAST with depth 1 indeed means there is a single slot for each key value — and so only a single slot if there are no key fields. Later arriving[*] data is supposed to simply push out whatever happens to be present in the history.

OpenSplice and Cyclone DDS both correctly implement this. With the 10s sleep, the historical data is always pushed into the reader history before the take occurs and it only prints whichever message it received last (which can be any of the three). Without the sleep, it depends on timing — when discovery takes place, when the data is actually published, when the data makes it to subscriber process, how quickly it takes that data when it arrives ... That run you mentioned with Cyclone DDS where the subscriber showed all three was simply a matter of lucky timing. It doesn't work out like that for me in any case.

I couldn't resist digging into what is going on with FastRTPS. It turns out that the first sample it receives is accepted as expected. When the second and third arrive, it finds the history depth has been reached, tries to drop an older sample from the same writer (which obviously fails, because this is from a different writer) and refuses to store the sample.

That in turn triggers a sequence of requests for retransmission of the samples that haven't made it into the history. It is very obvious in wireshark. Once the reader takes the sample that is present, the first sample to arrive after that one is accepted, &c.

That process generates a lot of traffic: a sloppily conducted experiment suggests that for this particular test it results in a 10x increase in packets over what the correct behaviour would produce. Not to speak of the other effects on the system.

[*] That's with the "by reception timestamp" destination order setting; there is also a "by source timestamp" setting that takes the source time stamp into account and retains the one with the latest time stamp instead, but that doesn't affect this particular experiment.

@sloretz (Contributor, Author) commented Sep 10, 2019

CI (Testing all packages above tf2_ros)

@dirk-thomas (Member) commented:

@eboasson it would be great if you could report this issue in the FastRTPS repo.

@clalancette (Contributor) commented:

@eboasson Thanks for the detailed digging and explanation of what is going on. With that information, then, the naive thing would be to make the static listener have KEEP_ALL. As @wjwwood suggested, though, that isn't nice for real-time effects, so I think the compromise of having 100 for the static listener makes sense. Thus, I'll approve this PR.

@sloretz (Contributor, Author) commented Sep 10, 2019

CI looks good, merging

@sloretz merged commit 094d363 into ros2 on Sep 10, 2019
The delete-merged-branch bot deleted the static_transient_local branch on September 10, 2019 20:20
@eboasson commented:

@eboasson it would be great if you could report this issue in the FastRTPS repo.

Done

@dirk-thomas (Member) commented; his comment is quoted in full in the reply below.

@sloretz (Contributor, Author) commented Sep 11, 2019

@sloretz This patch might be responsible for the following nightly build failure

08:08:45 transform_listener.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) public: __cdecl tf2_ros::DynamicListenerQoS::DynamicListenerQoS(unsigned __int64)" (__imp_??0DynamicListenerQoS@tf2_ros@@QEAA@_K@Z) referenced in function "public: __cdecl tf2_ros::TransformListener::TransformListener(class tf2::BufferCore &,bool)" (??0TransformListener@tf2_ros@@QEAA@AEAVBufferCore@tf2@@_N@Z) [C:\J\workspace\nightly_win_deb\ws\build\tf2_ros\tf2_ros.vcxproj]
08:08:45 transform_listener.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) public: __cdecl tf2_ros::StaticListenerQoS::StaticListenerQoS(unsigned __int64)" (__imp_??0StaticListenerQoS@tf2_ros@@QEAA@_K@Z) referenced in function "public: __cdecl tf2_ros::TransformListener::TransformListener(class tf2::BufferCore &,bool)" (??0TransformListener@tf2_ros@@QEAA@AEAVBufferCore@tf2@@_N@Z) [C:\J\workspace\nightly_win_deb\ws\build\tf2_ros\tf2_ros.vcxproj]
08:08:45 C:\J\workspace\nightly_win_deb\ws\build\tf2_ros\Debug\tf2_ros.dll : fatal error LNK1120: 2 unresolved externals [C:\J\workspace\nightly_win_deb\ws\build\tf2_ros\tf2_ros.vcxproj]

@dirk-thomas Looks like it. I'm surprised this didn't show up in normal Windows CI. Will make a PR adding visibility macros.
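A hedged sketch of the kind of visibility-macro fix meant here (assumed, not the actual follow-up patch): the QoS classes would be marked with a macro defined by tf2_ros itself, so that their symbols are exported from the tf2_ros DLL on Windows and resolve for code such as transform_listener.cpp. The header name and the TF2_ROS_PUBLIC macro are assumptions:

#include <cstddef>

#include <rclcpp/qos.hpp>
#include <tf2_ros/visibility_control.h>  // assumed to provide TF2_ROS_PUBLIC

namespace tf2_ros
{
// With the package's own export macro, the symbols are exported when building
// tf2_ros and imported by users of the DLL on Windows.
class TF2_ROS_PUBLIC DynamicListenerQoS : public rclcpp::QoS
{
public:
  explicit DynamicListenerQoS(std::size_t depth = 100) : rclcpp::QoS(depth) {}
};
}  // namespace tf2_ros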

sloretz added a commit to ros2/ros2_documentation that referenced this pull request on Nov 15, 2019:

* Note /tf_static uses transient_local in Eloquent

Follow up from ros2/geometry2#160

* Warn about QoS incompatibility
Labels: enhancement (New feature or request), in review (Waiting for review, Kanban column)
5 participants