HPCC-32248 Add tracing to rowservice #19314

Merged (1 commit) on Jan 14, 2025

Conversation

jpmcmu
Contributor

@jpmcmu jpmcmu commented Nov 25, 2024

  • Added opentelemetry tracing to rowservice

Signed-off-by: James McMullan [email protected]

Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-32248

Jirabot Action Result:
Workflow Transition To: Merge Pending
Updated PR

@@ -162,6 +162,69 @@ static ISecureSocket *createSecureSocket(ISocket *sock, bool disableClientCertVe
}
#endif

//------------------------------------------------------------------------------
Contributor Author

@jpmcmu jpmcmu Nov 25, 2024

ActiveSpanScope is very similar to the ThreadedSpanScope described by Gavin here: https://hpccsystems.atlassian.net/jira/software/c/projects/HPCC/issues/HPCC-32982. I liked the name ActiveSpanScope because I believe the class has utility outside of multithreaded contexts, i.e. time slicing. Would it be worthwhile to move this out of dafilesrv into jtrace?

Member

Yes. These utility changes should be in jlib, ideally as separate PRs/requests. It would be worth merging your other PR, then having a PR that implements an agreed solution to HPCC-32982, and then rebasing this PR on it.
I'm open to discussing what the different classes should be called.
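
For readers unfamiliar with the pattern, a scope class of this kind might look roughly like the sketch below. It assumes the helper records the currently active span on construction and restores the previous one on destruction. `Span`, `activeSpan`, and `activeSpanName` are hypothetical stand-ins, and the class in this PR (or any eventual jlib version) may differ.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a tracing span; not the jlib span interface.
struct Span
{
    std::string name;
};

// The span considered "active" on the current thread (assumed design).
static thread_local Span* activeSpan = nullptr;

// Makes a span active for the lifetime of the scope object, then restores
// whichever span was active before. Nesting therefore unwinds correctly.
class ActiveSpanScope
{
public:
    explicit ActiveSpanScope(Span* span) : previous(activeSpan) { activeSpan = span; }
    ~ActiveSpanScope() { activeSpan = previous; }

    ActiveSpanScope(const ActiveSpanScope&) = delete;
    ActiveSpanScope& operator=(const ActiveSpanScope&) = delete;

private:
    Span* previous;
};

// Returns the name of the active span, or an empty string when none is active.
std::string activeSpanName()
{
    return activeSpan ? activeSpan->name : std::string();
}
```

Because the scope only swaps a thread-local pointer, the same object works when one thread time-slices between several requests, which is the non-threaded use case mentioned above.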

@@ -366,13 +366,31 @@ version: 1.0
detail: 100
)!!";

IPropertyTree * loadConfigurationWithGlobalDefault(const char * defaultYaml, Owned<IPropertyTree>& globalConfig, const char * * argv, const char * componentTag, const char * envPrefix)
Contributor Author

This function is similar to work Jake has done in HPCC-32991. It might be worthwhile to retarget to master and call the overloaded doLoadConfiguration instead?

Member

Yes. This should reuse the code from HPCC-32991. If this change is wanted in 9.8, we could consider cherry-picking the changes in HPCC-32991 back to 9.8.

std::string traceParent = fullTraceContext ? fullTraceContext : "";
traceParent = traceParent.substr(0,traceParent.find_last_of("-"));

if (!traceParent.empty() && requestTraceParent != traceParent)
Contributor Author

Note: I am checking whether the traceParent has changed every time process is called here because the client side may use multiple spans during the lifetime of a single CRemoteRequest. See the screenshots below for an example.
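
A minimal sketch of the check described in this note, assuming a request object that remembers the last trace parent it saw. `RequestSketch` and `spansOpened` are hypothetical; a real implementation would create a server span where the counter is incremented.

```cpp
#include <cassert>
#include <string>

// Hypothetical request object that outlives several client-side spans.
class RequestSketch
{
public:
    // Called once per client batch. fullTraceContext may be null when the
    // client sends no trace information.
    void process(const char* fullTraceContext)
    {
        // Compare only the part before the last '-' (drops the trailing
        // sampling group), mirroring the snippet under review.
        std::string traceParent = fullTraceContext ? fullTraceContext : "";
        traceParent = traceParent.substr(0, traceParent.find_last_of("-"));

        if (!traceParent.empty() && requestTraceParent != traceParent)
        {
            // The client moved on to a new span: remember it and open a
            // matching server-side span (represented here by a counter).
            requestTraceParent = traceParent;
            ++spansOpened;
        }
    }

    int spansOpened = 0;
    std::string requestTraceParent;
};
```

Calling `process` repeatedly with the same parent opens one span; a different span-id in the incoming header opens another, and a null header leaves the current span in place.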

@jpmcmu
Contributor Author

jpmcmu commented Nov 25, 2024

Goal:
The goal of this PR is to add initial tracing support to the row service in dafilesrv, which will improve debuggability for downstream row service clients as well as reducing the time the platform team spends debugging issues.

Current Tracing Limitations:
Support is limited for intercepting errors and adding them to the tracing spans and for adding annotations and/or statistics to spans, and there are no internal spans tracking work within the row service. These limitations are intentional to keep the initial PR as simple as possible, and will be addressed in future PRs.

Exported Tracing example:
Note that during the read the client side creates more than one span over the lifetime of a connection to the row service. The row service tracing supports this and correctly handles the batching the client side is doing.
[Screenshot: exported tracing example, 2024-11-25 1:39:44 PM]

Member

@ghalliday ghalliday left a comment

@jpmcmu looks good. A few minor comments. It would be good to rationalise the helper span scope classes so they cover all the options.

const char* fullTraceContext = requestTree->queryProp("_trace/traceparent");

// We only want to compare the trace-id & span-id, so remove the last "sampling" group
std::string traceParent = fullTraceContext ? fullTraceContext : "";
Member

minor: You can use strrchr on the const char * to avoid cloning the string. Alternatively use std::string_view and assign to a new string.

fs/dafsserver/dafsserver.cpp
Owned<IProperties> traceHeaders = createProperties();
traceHeaders->setProp("traceparent", fullTraceContext);

std::string requestSpanName;
Member

minor efficiency: use a const char * and avoid a string being cloned.

if (traceParent != nullptr)
{
Owned<IProperties> traceHeaders = createProperties();
traceHeaders->setProp("traceparent", traceParent);
Member

Should this also have the sampling suffix removed?

Member

@jakesmith jakesmith left a comment

@jpmcmu - please see comments.

fs/dafilesrv/dafilesrv.cpp
#endif

// NB: bare-metal dafilesrv does not have a component specific xml, extracting relevant global configuration instead
Owned<IPropertyTree> config = loadConfigurationWithGlobalDefault(defaultYaml, extractedGlobalConfig, argv, "dafilesrv", "DAFILESRV");
Member

Instead of adding a new function or adding to global, could you add 'tracing' to the component config?
e.g.:

#ifndef _CONTAINERIZED
    Owned<IPropertyTree> env = getHPCCEnvironment();
    IPropertyTree* tracing = env->getPropTree("Software/tracing");
    if (tracing)
        config->setPropTree("tracing", tracing);
#endif

(and combine with #else // __CONTAINERIZED block below)

@jpmcmu jpmcmu changed the base branch from candidate-9.8.x to master January 8, 2025 14:20
@jpmcmu jpmcmu requested review from jakesmith and ghalliday January 8, 2025 14:23

- if (!traceParent.empty() && requestTraceParent != traceParent)
+ if (strlen(traceParent) != 0 && requestTraceParent != traceParent)
Member

trivial: isEmptyString is more efficient (since it only checks the 1st character)
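
As a rough illustration of why the check is cheaper, an emptiness test only needs to look at the first character, whereas strlen walks the whole string. The function below is a sketch written for this note under that assumption; `isEmptyStringSketch` is a hypothetical name and is not copied from jlib.

```cpp
#include <cassert>
#include <cstring>

// True when the pointer is null or the string has no characters.
// Reads at most one character, regardless of string length.
inline bool isEmptyStringSketch(const char* text)
{
    return !text || !*text;
}

// The strlen-based form from the diff, for comparison. It scans to the
// terminator and must not be given a null pointer.
inline bool isEmptyViaStrlen(const char* text)
{
    return std::strlen(text) == 0;
}
```

The first form also tolerates a null pointer, which matters when the value comes from a property lookup that may return null.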


- if (!traceParent.empty() && requestTraceParent != traceParent)
+ if (strlen(traceParent) != 0 && requestTraceParent != traceParent)
Member

I know I made the comment "minor: You can use strrchr on the const char * to avoid cloning the string.".

I think that comment should have been more explicit: you should still be creating a temporary string from the result. The modified code is not equivalent to the previous code, so it is probably better as it was.

Previously the code set traceParent to the text before the last '-';
now it sets traceParent to the last '-' and the text that follows it.
Also, the comparison will implicitly convert it to a std::string anyway. It would be better if that was explicit.

Contributor Author

Oh yeah, I am not sure what I was thinking here..
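
To make the difference concrete, here is a small sketch contrasting what strrchr returns with the prefix the original std::string code computed. The function names are hypothetical and the traceparent value used with them is illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// What using the strrchr result directly yields: the last '-' and everything
// after it (the suffix), which is not what the comparison needs.
std::string suffixFromLastDash(const char* traceContext)
{
    const char* lastDash = traceContext ? std::strrchr(traceContext, '-') : nullptr;
    return lastDash ? std::string(lastDash) : std::string();
}

// The intended value: the text before the last '-', built as an explicit
// temporary std::string from the original pointer and the prefix length.
std::string prefixBeforeLastDash(const char* traceContext)
{
    if (!traceContext)
        return std::string();
    const char* lastDash = std::strrchr(traceContext, '-');
    if (!lastDash)
        return std::string(traceContext);
    return std::string(traceContext, static_cast<std::size_t>(lastDash - traceContext));
}
```

The second function still allocates one string, as the review comment notes it must, but it avoids copying the full input before trimming it.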

const char* traceParent = requestTree->queryProp("_trace/traceparent");
if (traceParent != nullptr)
{
Owned<IProperties> traceHeaders = createProperties();
traceHeaders->setProp("traceparent", traceParent);

- versionSpan = queryTraceManager().createServerSpan("VersionRequest", traceHeaders);
+ versionSpan.set(queryTraceManager().createServerSpan("VersionRequest", traceHeaders));
Member

Should be setown, otherwise I think the object will leak.

Member

@jakesmith jakesmith left a comment

@jpmcmu - couple of comments.

int main(int argc, const char* argv[])
{
InitModuleObjects();

EnableSEHtoExceptionMapping();
#ifndef __64BIT__

Member

trivial/formatting. extra newline added incidentally (and one removed on line 369, but that looks more deliberate)


const char* componentTag = "dafilesrv";
Owned<IPropertyTree> componentDefault;
if (defaultYaml)
Member

why conditional? defaultYaml is always defined

{
Owned<IPropertyTree> defaultConfig = createPTreeFromYAMLString(defaultYaml, 0, ptr_ignoreWhiteSpace, nullptr);
componentDefault.set(defaultConfig->queryPropTree(componentTag));
if (!componentDefault)
Member

This should never be true: the const defaultYaml explicitly defines "dafilesrv", so I don't think there's any point in checking (the code currently leads me to think it may be missing).

throw makeStringExceptionV(99, "Default configuration does not contain the tag %s", componentTag);
}
else
componentDefault.setown(createPTree(componentTag));
Member

I think the above would be better simplified to:

    const char* componentTag = "dafilesrv";
    Owned<IPropertyTree> defaultConfig = createPTreeFromYAMLString(defaultYaml, 0, ptr_ignoreWhiteSpace, nullptr);
    Owned<IPropertyTree> componentDefault = defaultConfig->getPropTree(componentTag);

or I might be tempted to encapsulate this pattern into another loadConfiguration variant (that calls on to loadConfiguration(IPropertyTree * defaultConfig, IPropertyTree * globalConfig, ...)).

Member

@ghalliday ghalliday left a comment

@jpmcmu a couple of comments/issues.

@jpmcmu jpmcmu requested review from ghalliday and jakesmith January 10, 2025 13:52
@jpmcmu
Contributor Author

jpmcmu commented Jan 10, 2025

@jakesmith @ghalliday Thanks for the reviews, I have implemented those changes

Member

@jakesmith jakesmith left a comment

@jpmcmu - 1 follow up issue.

- else
- componentDefault.setown(createPTree(componentTag));
+ Owned<IPropertyTree> defaultConfig = createPTreeFromYAMLString(defaultYaml, 0, ptr_ignoreWhiteSpace, nullptr);
+ Owned<IPropertyTree> componentDefault(defaultConfig->queryPropTree(componentTag));
Member

Negative leak: the Owned will release a link it never took. Needs to be a getPropTree.

Contributor Author

@jpmcmu jpmcmu Jan 10, 2025

OK, right, I need to pay more attention to query/get and set/setown; it is slightly unintuitive to me for some reason. So, is the following correct: get* methods increase the reference count and query* methods do not; the Owned constructor does NOT increase the reference count; set increases the reference count while setown does not.

Member

Correct, and Linked would also increase the link count (and both Owned and Linked will decrease it).
So:

Linked<IPropertyTree> componentDefault(defaultConfig->queryPropTree(componentTag));

would also fix the -ve leak, i.e. Linked increasing the link count on ctor, and decreasing it on dtor.
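
The rules agreed in this thread can be checked with a toy model. The sketch below uses simplified stand-ins written for this note (`Counted`, `Parent`, `OwnedSketch`, `LinkedSketch`); they are NOT the jlib classes and only mirror the link-count behaviour described above.

```cpp
#include <cassert>

// Toy reference-counted object. A real implementation would delete itself
// when the count reaches zero; here the count is only observed.
struct Counted
{
    int links = 1;                 // one link held by the owning Parent
    void Link() { ++links; }
    void Release() { --links; }
};

// Exposes a child in both styles discussed above.
struct Parent
{
    Counted child;
    Counted* queryChild() { return &child; }               // borrowed: no link added
    Counted* getChild() { child.Link(); return &child; }   // caller receives a link
};

// Owned-like holder: the constructor and setown() take over an existing
// link, set() adds one, and the destructor always releases one.
template <class T>
class OwnedSketch
{
public:
    OwnedSketch() = default;
    explicit OwnedSketch(T* p) : ptr(p) {}
    ~OwnedSketch() { if (ptr) ptr->Release(); }
    OwnedSketch(const OwnedSketch&) = delete;
    OwnedSketch& operator=(const OwnedSketch&) = delete;

    void set(T* p) { if (p) p->Link(); reset(p); }
    void setown(T* p) { reset(p); }

private:
    void reset(T* p) { if (ptr) ptr->Release(); ptr = p; }
    T* ptr = nullptr;
};

// Linked-like holder: links on construction, releases on destruction.
template <class T>
class LinkedSketch
{
public:
    explicit LinkedSketch(T* p) : ptr(p) { if (ptr) ptr->Link(); }
    ~LinkedSketch() { if (ptr) ptr->Release(); }
    LinkedSketch(const LinkedSketch&) = delete;
    LinkedSketch& operator=(const LinkedSketch&) = delete;

private:
    T* ptr;
};
```

In this model, constructing `OwnedSketch` from `queryChild()` drops the count below where it started (the negative leak), while pairing `set()` with a pointer that already carries a link, as `getChild()` returns, leaves one link unreleased (an ordinary leak, as in the setown comment earlier in the review).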

@AttilaVamos
Contributor

@jpmcmu It seems this PR timed out on DFUAccessTest (unit test). Can you try to run that in your local environment?

/opt/HPCCSystems/bin/unittests DFUAccessTest

Usually it takes seconds to finish.

@AttilaVamos
Contributor

Some stack trace:

Thread 1 (Thread 0x7f5934987300 (LWP 83703)):
#0  0x00007f59367dc301 in poll () from /lib64/libc.so.6
#1  0x00007f59396c5e70 in CSocket::wait_read(unsigned int) () from /opt/HPCCSystems/lib/libjlib.so
#2  0x00007f59396c96c2 in CSocket::readtms(void*, unsigned int, unsigned int, unsigned int&, unsigned int, bool) () from /opt/HPCCSystems/lib/libjlib.so
#3  0x00007f593c51bf17 in receiveDaFsBufferSize(ISocket*, unsigned int, CTimeMon*) () from /opt/HPCCSystems/lib/libdafsclient.so
#4  0x00007f593c51c0f6 in receiveDaFsBuffer(ISocket*, MemoryBuffer&, unsigned int, unsigned int) () from /opt/HPCCSystems/lib/libdafsclient.so
#5  0x00007f593c51cefe in CRemoteBase::sendRemoteCommand(MemoryBuffer&, MemoryBuffer&, bool, bool, bool) () from /opt/HPCCSystems/lib/libdafsclient.so
#6  0x00007f5933326e36 in dafsstream::CDFUPartFlatWriter::finalize() () from /opt/HPCCSystems/lib/libdafsstream.so
#7  0x00007f593332197b in dafsstream::CDaFileSrvClientBase::beforeDispose() () from /opt/HPCCSystems/lib/libdafsstream.so
#8  0x00007f5933322ac8 in non-virtual thunk to dafsstream::CDFUPartWriterBase::Release() const () from /opt/HPCCSystems/lib/libdafsstream.so
#9  0x00007f5933132e43 in ?? () from /opt/HPCCSystems/lib/libwsdfuaccess.so
#10 0x00007f593f157fa2 in CppUnit::TestCaseMethodFunctor::operator()() const () from /opt/HPCCSystems/lib/libcppunit-1.15.so.1
#11 0x00007f593f14df0e in CppUnit::DefaultProtector::protect(CppUnit::Functor const&, CppUnit::ProtectorContext const&) ()
   from /opt/HPCCSystems/lib/libcppunit-1.15.so.1
#12 0x00007f593f155110 in CppUnit::ProtectorChain::protect(CppUnit::Functor const&, CppUnit::ProtectorContext const&) () from /opt/HPCCSystems/lib/libcppunit-1.15.so.1
#13 0x00007f593f15f794 in CppUnit::TestResult::protect(CppUnit::Functor const&, CppUnit::Test*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /opt/HPCCSystems/lib/libcppunit-1.15.so.1
#14 0x00007f593f157c59 in CppUnit::TestCase::run(CppUnit::TestResult*) () from /opt/HPCCSystems/lib/libcppunit-1.15.so.1
#15 0x00007f593f158233 in CppUnit::TestComposite::doRunChildTests(CppUnit::TestResult*) () from /opt/HPCCSystems/lib/libcppunit-1.15.so.1
#16 0x00007f593f1582b7 in CppUnit::TestComposite::run(CppUnit::TestResult*) () from /opt/HPCCSystems/lib/libcppunit-1.15.so.1
#17 0x00007f593f15f6f2 in CppUnit::TestResult::runTest(CppUnit::Test*) () from /opt/HPCCSystems/lib/libcppunit-1.15.so.1
#18 0x00007f593f161f9d in CppUnit::TestRunner::run(CppUnit::TestResult&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /opt/HPCCSystems/lib/libcppunit-1.15.so.1
#19 0x00007f593f163ff2 in CppUnit::TextTestRunner::run(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, bool, bool) ()
   from /opt/HPCCSystems/lib/libcppunit-1.15.so.1
#20 0x000000000044d62d in ?? ()
#21 0x00007f59366f2d85 in __libc_start_main () from /lib64/libc.so.6
#22 0x000000000045011e in ?? ()

@jpmcmu
Contributor Author

jpmcmu commented Jan 14, 2025

@AttilaVamos I have fixed the issue causing the tests to stall. @ghalliday please review.

Member

@ghalliday ghalliday left a comment

@jpmcmu Looks good. Please squash.

- Added opentelemetry tracing to rowservice

Signed-off-by: James McMullan [email protected]
@jpmcmu
Contributor Author

jpmcmu commented Jan 14, 2025

@ghalliday squashed

@ghalliday ghalliday merged commit 47f5633 into hpcc-systems:master Jan 14, 2025
47 of 48 checks passed

Jirabot Action Result:
Added fix version: 9.10.0
Workflow Transition: 'Resolve issue'
