Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HPCC-32874 Add WsLogAcces Health Report Method #19307

Open
wants to merge 4 commits into
base: candidate-9.8.x
Choose a base branch
from

Conversation

rpastrana
Copy link
Member

@rpastrana rpastrana commented Nov 20, 2024

HPCC-32874 Add WsLogAcces Health Report Method

  • Adds healthreport method to logaccess interface
  • Implements healthreport method for all logaccess plugins
  • Adds general rest request support in grafana client

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

Copy link

Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-32874

Jirabot Action Result:
Workflow Transition To: Merge Pending
Updated PR

@rpastrana rpastrana changed the title HPCC-32874 HPCC-32874 Add WsLogAcces Health Report Method Nov 20, 2024
@rpastrana
Copy link
Member Author

Sample Grafana/Loki output:
image

Add WsLogAcces Health Report Method

- Adds healthreport method to logaccess interface
- Implements healthreport method for all logaccess plugins

Signed-off-by: Rodrigo Pastrana <[email protected]>
@rpastrana rpastrana force-pushed the HPCC-32874-wslogaccesshealth branch from 9224ed6 to f77d122 Compare December 13, 2024 17:45
@rpastrana
Copy link
Member Author

rpastrana commented Dec 13, 2024

image
image
image

@jeclrsg I added this method to help developers debug config/connection logaccess issues, but thought ECLWatch might be able to make use of it as well. As of now, the different plugins (Elastic Stack, Azure, loki) generate their own reports with unique content related to the particular plugin, but we could enforce certain entries if you think ECLWatch could use them. Let me know what you think.

Copy link
Contributor

@asselitx asselitx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The biggest item in my mind is a preference for a common structured output. Second are questions about encoding output for security- I'm in that mindset now, but not sure what would be best in this case.

bool IncludeSampleQuery(true);
};

ESPResponse GetHealthReportResponse
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should have [exceptions_inline] also


ESPResponse GetHealthReportResponse
{
string Report;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd much rather see a structured response that has some common subset of useful values for ECLWatch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, commented below for further discussion

bool success = true;
if (!queryRemoteLogAccessor())
{
report.append("\"Error\": \"LogAccess plugin not available, review logAccess configuration!\"");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also be part of a structured response.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In agreement. what is the ESP standard for these types of messages? (ie, what's the expected structure format)

resp.setReport(report.str());

return success;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing final newline

@@ -1675,6 +1710,106 @@ struct LogQueryResultDetails
unsigned int totalReceived;
unsigned int totalAvailable;
};
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing before you finalize you'll remove these comments and those in WsLogAccessService.cpp.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some, but kept the structure around as a mechanism to track structured responses for the ESP

{
StringBuffer configJSON;
toJSON(m_pluginCfg, configJSON, 0, JSON_Format|JSON_HideRootArrayObject);
report.appendf("\"ConfigurationTree\": %s", configJSON.str()); //json encode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is supposed to be more JSON appended to the report, not a string that contains the JSON configuration, correct? If it is the latter then it probably should be encoded.

{
StringBuffer configJSON;
toJSON(m_pluginCfg, configJSON, 0);
report.appendf("\"ConfigurationTree\": %s", configJSON.str()); //json encode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here as in ElasticStackLogAccess.cpp about JSON vs JSON in a string which needs encoding.

StringBuffer queryString, queryIndex;
populateKQLQueryString(queryString, queryIndex, queryOptions);

StringBuffer encodedValue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the only string that needs to be encoded, or are there other cases where user input could be part of any value returned in the report? Would it be a good idea to encode all strings returned?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, within the sample query section, yes. But point taken, I'll look around and see if anything else could/should be encoded.

@@ -107,4 +109,5 @@ class GrafanaLogAccessCurlClient : public CInterfaceOf<IRemoteLogAccess>
virtual IRemoteLogAccessStream * getLogReader(const LogAccessConditions & options, LogAccessLogFormat format) override;
virtual IRemoteLogAccessStream * getLogReader(const LogAccessConditions & options, LogAccessLogFormat format, unsigned int pageSize) override;
virtual bool supportsResultPaging() const override { return false;}
virtual bool healthReport(StringBuffer & report, LogAccessHealthReportOptions options) override;
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing final newline

report.append("\"ConnectionInfo\": { ");
report.appendf("\"ConnectionString\": \"%s\"", m_grafanaConnectionStr.str());
report.appendf(", \"UserName\": \"%s\"", m_grafanaUserName.str());
report.appendf(", \"PasswordProvided\": %s", isEmptyString(m_grafanaPassword.str()) ? "true" : "false");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the only thing I noticed that Terrence hadn't mentioned already. Are these backwards? Seems like if isEmptyString(...) then PasswordProvided should be set to false.

@rpastrana
Copy link
Member Author

rpastrana commented Dec 17, 2024

Thanks for the review guys (@asselitx / @jeclrsg ) I'll work those suggestions in today.

I do want to discuss the suggestion to provide further structure, which I fully agree with, and I'm glad it was brought up. Since the method is not meant for client consumption (it's really just for developer/operations debugging) it doesn't
The problem we're faced with is that each of the plugins report unique values/structures/concepts and I found myself trying to force non-overlapping data into common sounding elements/structures and it didn't feel right.

Here's my thought, if ECLWatch commits to consuming some common values, we'll go through the effort of structuring the report. Something like:

Report
        Configuration
              ConfigTree <JSON configuration tree>
              Message <JSON connection related message(s)>    
        Connection
                Status OK | Failed
                Message <JSON connection related message(s)>
        Internals
                Plugin <JSON string containing any internal info the plugin chooses to report for debugging>
                Server <JSON string containing any internal info the server chose to report>
        SampleQuery
                 Query  <JSON string describing the sample query>
                 QueryString  ActualQueryStringSubmitted
                 Results
                       Count CountofRecordsReported
                       Logs  <JSON string raw results from server>

It feels to me like ECLWatch would only benefit from the Report/Conection/Status. Thoughts?

@jeclrsg
Copy link
Contributor

jeclrsg commented Dec 17, 2024

@rpastrana That makes sense to me. I was looking at some of the docs yesterday, and one of them (either Grafana or Kibana) made it sound like they had an api status URI that only returned a 200 (no actual response body) if everything is ok. But I'd agree that ECL Watch probably only needs to get a status flag, and if the engine isn't available some message as to why.

@@ -11,7 +11,7 @@ global:
id: "1"
name: "Loki"
namespace:
name: "hpcc"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to revert this change

@asselitx
Copy link
Contributor

A "status" and "message" field makes sense to me too- really whatever it is that the UI would need. Then if more details are needed in the future we could extend the interface. I think it still makes sense to return discrete fields that don't require further parsing and have a neat SOAP representation even though ECLWatch uses (exclusively?) JSON.

If there is also a need for the full raw "health report" for debugging/troubleshooting exposed through ESP I'd not be opposed to that as a separate method. Trying to make a common representation for that probably doesn't make sense and may even make it less useful.

@rpastrana
Copy link
Member Author

A "status" and "message" field makes sense to me too- really whatever it is that the UI would need. Then if more details are needed in the future we could extend the interface. I think it still makes sense to return discrete fields that don't require further parsing and have a neat SOAP representation even though ECLWatch uses (exclusively?) JSON.

If there is also a need for the full raw "health report" for debugging/troubleshooting exposed through ESP I'd not be opposed to that as a separate method. Trying to make a common representation for that probably doesn't make sense and may even make it less useful.

@asselitx I think your comment makes a lot of sense. I'm wondering though if this single method could/should serve the debug report based on a flag. IE by default it only reports well established "status" and "message", but also optionally provides the debug report as a separate text field (in JSON format).

- Adds ESP respons structure
- Introduces top level gree/yellow/red status concept
- Introduces jlog structs to help status reporting
- Utilizes JSON appending/encoding functionality
- Various other minor changes

Signed-off-by: Rodrigo Pastrana <[email protected]>
bool IncludeSampleQuery(false);
};

ESPenum LogAccessStatusCode : int
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeclrsg this service can either provide a green/yellow/red type status, or more specific status codes as in the enum below. Let me know what you'd prefer.

@rpastrana
Copy link
Member Author

@asselitx updated the ESP interface and some internal jlog interfaces to match the suggestions from code review. I only updated the Azure client, and will update the other clients once we finalize the ESP interface.

@rpastrana
Copy link
Member Author

Sample report from Azure client:
image

@rpastrana
Copy link
Member Author

Sample report from Azure client: image

@jeclrsg take a look at the latest proposed format. I'm not sure about the color based status/code (green|yellow|green), but the alternative is for the clients to send back a numeric value representing one of many possible statuses which you would then interpret as success, success with warnings, failure. Let me know what you think

@jeclrsg
Copy link
Contributor

jeclrsg commented Jan 7, 2025

@rpastrana Yeah, I would vote for status values being explicit, rather than subjective like the colors. I'd actually thought (as a health check type of api) status would be a boolean for if the engine is currently available & if not the message would communicate why.

@rpastrana
Copy link
Member Author

@rpastrana Yeah, I would vote for status values being explicit, rather than subjective like the colors. I'd actually thought (as a health check type of api) status would be a boolean for if the engine is currently available & if not the message would communicate why.

I see the health status as at least 3 distinct states: 1 healthy (configured correctly, timely connection, timely response, reasonable response content) 2 Warning (configured correctly, connected (could be lengthy connection), received response (could be sluggish), no logs returned (could signify no logs available) 3 Error (configuration error? connection error? invalid or no response?)

I can convert the color code to a 3 entry enumeration representing those states and ensure a message is provided with details. thoughts?

- Changed status codes to success/warning/fail

Signed-off-by: Rodrigo Pastrana <[email protected]>
@rpastrana
Copy link
Member Author

Updated status enumeration values:
image
@jeclrsg

Copy link
Contributor

@asselitx asselitx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Rodrigo this update looks great. I like the approach to have the default behavior return just a Status or maybe Code, with options to include the full raw dumps of info. That gives a good balance between needing a simple check and more in-depth debugging details.

Use of the JSON append functions makes the report-building more readable which is nice.

I'm assuming you've just fleshed out the Azure implementation for now and would do similar for the others once you know to use the Status vs. Code approach.

My gut says that the Codes may be more useful, but I'd defer to Jeremy's preference. If we stick with the stoplight Status, I'd suggest commenting somewhere generally what the colors mean or what sorts of problems would put you in yellow vs. red.

@rpastrana rpastrana force-pushed the HPCC-32874-wslogaccesshealth branch from 50ec547 to 523ea18 Compare January 16, 2025 13:51
@rpastrana
Copy link
Member Author

rpastrana commented Jan 16, 2025

@jeclrsg Sample Warning (minor update to Status/Messages entry, caller can expect a JSON array of message:

{
    "message": []
}
{
    "message": [
        "Exception fetching Grafana health (-1) - GrafanaLogAccessCurlClient Error (6): 'Could not resolve host: myloki4hpcclogs-grafana.default.svc.cluster.local'",
        "Exception fetching available Datasources (-1) - GrafanaLogAccessCurlClient Error (6): 'Could not resolve host: myloki4hpcclogs-grafana.default.svc.cluster.local'",
        "Exception fetching available labels (-1) - GrafanaLogAccessCurlClient Error (6): 'Could not resolve host: myloki4hpcclogs-grafana.default.svc.cluster.local'",
        "Exception while executing sample query (-1) - GrafanaLogAccessCurlClient: Cannot query because Grafana datasource was not established, check logaccess configuration!"
    ]
}

image

@rpastrana
Copy link
Member Author

Sample success:
image

@rpastrana
Copy link
Member Author

Sample successful ES Full report:
image

Formatted Debug report:

{
    "AvailableIndices": [
        {
            "pri.store.size": "977.7kb",
            "uuid": "zOLJ9_cZThG_sCsozXXPpg",
            "health": "yellow",
            "index": "hpcc-logs-2025.01.16-000001",
            "status": "open",
            "rep": "1",
            "docs.deleted": "0",
            "store.size": "977.7kb",
            "docs.count": "519",
            "pri": "1"
        }
    ],
    "TimestampField": {
        "hpcc-logs-2025.01.16-000001": {
            "mappings": null
        }
    },
    "ESStatus": {
        "number_of_in_flight_fetch": "0",
        "timed_out": "false",
        "active_shards_percent_as_number": "88.88888888888889",
        "number_of_data_nodes": "1",
        "active_shards": "8",
        "delayed_unassigned_shards": "0",
        "cluster_name": "elasticsearch",
        "number_of_pending_tasks": "0",
        "relocating_shards": "0",
        "status": "yellow",
        "unassigned_shards": "1",
        "number_of_nodes": "1",
        "active_primary_shards": "8",
        "initializing_shards": "0",
        "task_max_waiting_in_queue_millis": "0"
    }
}

- Changed status codes to success/warning/fail
- Updated All logaccess clients

Signed-off-by: Rodrigo Pastrana <[email protected]>
@rpastrana rpastrana force-pushed the HPCC-32874-wslogaccesshealth branch from 523ea18 to 45a365f Compare January 16, 2025 18:49
@rpastrana rpastrana requested a review from asselitx January 16, 2025 18:51
@rpastrana
Copy link
Member Author

@jeclrsg @asselitx please review latest version.
I added a few screenshots as well to aid in the review.

Copy link
Contributor

@asselitx asselitx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are just a couple spots (ecm and some of the cpp files) where it may make sense to remove commented code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants