Skip to content

Commit

Permalink
Fix parsing of timezones in timestamp strings (facebookincubator#9411)
Browse files Browse the repository at this point in the history
Summary:

When parsing strings into timestamps, timezone information was being
incorrectly handled. First, timezone offsets cannot be applied when parsing the
timestamp string, as the parsing function does not have access to the user
timezone - so one cannot simply assume GMT/UTC.
.
Therefore, this PR splits the parsing into two functions:
* fromTimestampString(): parses the string and returns a timestamp independent
  of timezones. Fails if timezone information is found on the string. In
  practice, the function returns a unix epoch relative to GMT, but logically it
  should not be "bound" to a specific timezone yet.
* fromTimestampWithTimezoneString(): parses the timestamp and timezone
  information, returning a pair with the independent values. The caller
  function is responsible for doing the right timezone adjustments.
.
The previous code used to simply adjust any timestamps parsed using the user
supplied session timezone. This is correct if no timezones are supplied in
the string, but not if there is a timezone in the string.
.
For example, "1970-01-01 00:00:00" is 0 if the user timezone is GMT, and 28800
if the user timezone is PST. But "1970-01-01 00:00:00Z" is always 0,
independent of the user timezone ("Z" is one of synonyms of GMT/UTC).

Differential Revision: D55906156
  • Loading branch information
pedroerp authored and facebook-github-bot committed Apr 9, 2024
1 parent dfd8df0 commit 29ddde0
Show file tree
Hide file tree
Showing 7 changed files with 205 additions and 97 deletions.
28 changes: 0 additions & 28 deletions velox/expression/CastExpr-inl.h
Original file line number Diff line number Diff line change
Expand Up @@ -727,34 +727,6 @@ void CastExpr::applyCastPrimitives(
});
}
}

// If we're converting to a TIMESTAMP, check if we need to adjust the
// current GMT timezone to the user provided session timezone.
if constexpr (ToKind == TypeKind::TIMESTAMP) {
const auto& queryConfig = context.execCtx()->queryCtx()->queryConfig();
// If user explicitly asked us to adjust the timezone.
if (queryConfig.adjustTimestampToTimezone()) {
auto sessionTzName = queryConfig.sessionTimezone();
if (!sessionTzName.empty()) {
// When context.throwOnError is false, some rows will be marked as
// 'failed'. These rows should not be processed further. 'remainingRows'
// will contain a subset of 'rows' that have passed all the checks (e.g.
// keys are not nulls and number of keys and values is the same).
exec::LocalSelectivityVector remainingRows(context, rows);
context.deselectErrors(*remainingRows);

// locate_zone throws runtime_error if the timezone couldn't be found
// (so we're safe to dereference the pointer).
auto* timeZone = date::locate_zone(sessionTzName);
auto rawTimestamps = resultFlatVector->mutableRawValues();

applyToSelectedNoThrowLocal(
context, *remainingRows, result, [&](int row) {
rawTimestamps[row].toGMT(*timeZone);
});
}
}
}
}

template <TypeKind ToKind>
Expand Down
17 changes: 16 additions & 1 deletion velox/expression/PrestoCastHooks.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,22 @@ PrestoCastHooks::PrestoCastHooks(const core::QueryConfig& config)
}

Timestamp PrestoCastHooks::castStringToTimestamp(const StringView& view) const {
return util::fromTimestampString(view.data(), view.size());
auto result = util::fromTimestampWithTimezoneString(view.data(), view.size());

// If the parsed string has timezone information, convert the timestamp at
// GMT at that time. For example, "1970-01-01 00:00:00 -00:01" is 60 seconds
// at GMT.
if (result.second != -1) {
result.first.toGMT(result.second);

}
// If no timezone information is available in the input string, check if we
// should understand it as being at the session timezone, and if so, convert
// to GMT.
else if (options_.timeZone != nullptr) {
result.first.toGMT(*options_.timeZone);
}
return result.first;
}

int32_t PrestoCastHooks::castStringToDate(const StringView& dateString) const {
Expand Down
20 changes: 19 additions & 1 deletion velox/expression/tests/CastExprTest.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -589,6 +589,24 @@ TEST_F(CastExprTest, stringToTimestamp) {
std::nullopt,
};
testCast<std::string, Timestamp>("timestamp", input, expected);

// Try with a different session timezone.
setTimezone("America/Los_Angeles");
input = {
"1970-01-01 00:00",
"1970-01-01 00:00 +00:00",
"1970-01-01 00:00 +01:00",
"1970-01-01 00:00 America/Sao_Paulo",
"2000-01-01 12:21:56Z",
};
expected = {
Timestamp(28800, 0),
Timestamp(0, 0),
Timestamp(-3600, 0),
Timestamp(10800, 0),
Timestamp(946729316, 0),
};
testCast<std::string, Timestamp>("timestamp", input, expected);
}

TEST_F(CastExprTest, timestampToString) {
Expand Down Expand Up @@ -765,7 +783,7 @@ TEST_F(CastExprTest, timestampAdjustToTimezone) {
Timestamp(946713600, 0),
Timestamp(0, 0),
Timestamp(946758116, 0),
Timestamp(-21600, 0),
Timestamp(-50400, 0),
std::nullopt,
Timestamp(957164400, 0),
});
Expand Down
124 changes: 77 additions & 47 deletions velox/type/TimestampConversion.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,7 @@
#include "velox/type/TimestampConversion.h"
#include "velox/common/base/CheckedArithmetic.h"
#include "velox/common/base/Exceptions.h"
#include "velox/type/tz/TimeZoneMap.h"

namespace facebook::velox::util {

Expand Down Expand Up @@ -161,6 +162,13 @@ inline bool validDate(int64_t daysSinceEpoch) {
daysSinceEpoch <= std::numeric_limits<int32_t>::max();
}

// Skip leading spaces.
inline void skipSpaces(const char* buf, size_t len, size_t& pos) {
while (pos < len && characterIsSpace(buf[pos])) {
pos++;
}
}

bool tryParseDateString(
const char* buf,
size_t len,
Expand All @@ -177,11 +185,7 @@ bool tryParseDateString(
int32_t year = 0;
bool yearneg = false;
int sep;

// Skip leading spaces.
while (pos < len && characterIsSpace(buf[pos])) {
pos++;
}
skipSpaces(buf, len, pos);

if (pos >= len) {
return false;
Expand Down Expand Up @@ -320,10 +324,8 @@ bool tryParseDateString(

// In strict mode, check remaining string for non-space characters.
if (mode == ParseMode::kStrict) {
// Skip trailing spaces.
while (pos < len && characterIsSpace(buf[pos])) {
pos++;
}
skipSpaces(buf, len, pos);

// Check position. if end was not reached, non-space chars remaining.
if (pos < len) {
return false;
Expand Down Expand Up @@ -353,11 +355,7 @@ bool tryParseTimeString(
if (len == 0) {
return false;
}

// Skip leading spaces.
while (pos < len && characterIsSpace(buf[pos])) {
pos++;
}
skipSpaces(buf, len, pos);

if (pos >= len) {
return false;
Expand Down Expand Up @@ -424,10 +422,7 @@ bool tryParseTimeString(

// In strict mode, check remaining string for non-space characters.
if (strict) {
// Skip trailing spaces.
while (pos < len && characterIsSpace(buf[pos])) {
pos++;
}
skipSpaces(buf, len, pos);

// Check position. If end was not reached, non-space chars remaining.
if (pos < len) {
Expand All @@ -438,6 +433,43 @@ bool tryParseTimeString(
return true;
}

// String format is "YYYY-MM-DD hh:mm:ss.microseconds" (seconds and microseconds
// are optional). ISO 8601
bool tryParseTimestampString(
const char* buf,
size_t len,
size_t& pos,
Timestamp& result) {
int64_t daysSinceEpoch = 0;
int64_t microsSinceMidnight = 0;

if (!tryParseDateString(
buf, len, pos, daysSinceEpoch, ParseMode::kNonStrict)) {
return false;
}

if (pos == len) {
// No time: only a date.
result = fromDatetime(daysSinceEpoch, 0);
return true;
}

// Try to parse a time field.
if (buf[pos] == ' ' || buf[pos] == 'T') {
pos++;
}

size_t timePos = 0;
if (!tryParseTimeString(
buf + pos, len - pos, timePos, microsSinceMidnight, false)) {
return false;
}

pos += timePos;
result = fromDatetime(daysSinceEpoch, microsSinceMidnight);
return true;
}

bool tryParseUTCOffsetString(
const char* buf,
size_t& pos,
Expand Down Expand Up @@ -702,55 +734,53 @@ namespace {

Timestamp fromTimestampString(const char* str, size_t len) {
size_t pos;
int64_t daysSinceEpoch;
int64_t microsSinceMidnight;
Timestamp resultTimestamp;

if (!tryParseDateString(
str, len, pos, daysSinceEpoch, ParseMode::kNonStrict)) {
if (!tryParseTimestampString(str, len, pos, resultTimestamp)) {
parserError(str, len);
}
skipSpaces(str, len, pos);

if (pos == len) {
// No time: only a date.
return fromDatetime(daysSinceEpoch, 0);
if (pos >= len) {
return resultTimestamp;
}
parserError(str, len);
}

// Try to parse a time field.
if (str[pos] == ' ' || str[pos] == 'T') {
pos++;
}
std::pair<Timestamp, int64_t> fromTimestampWithTimezoneString(
const char* str,
size_t len) {
size_t pos;
Timestamp resultTimestamp;

size_t timePos = 0;
if (!tryParseTimeString(
str + pos, len - pos, timePos, microsSinceMidnight, false)) {
if (!tryParseTimestampString(str, len, pos, resultTimestamp)) {
parserError(str, len);
}

pos += timePos;
auto timestamp = fromDatetime(daysSinceEpoch, microsSinceMidnight);
int64_t timezoneID = -1;
skipSpaces(str, len, pos);

// If there is anything left to parse, it must be a timezone definition.
if (pos < len) {
// Skip a "Z" at the end (as per the ISO 8601 specs).
if (str[pos] == 'Z') {
pos++;
size_t timezonePos = pos;
while (timezonePos < len && !characterIsSpace(str[timezonePos])) {
timezonePos++;
}
int hourOffset, minuteOffset;
if (tryParseUTCOffsetString(str, pos, len, hourOffset, minuteOffset)) {
int32_t secondOffset =
(hourOffset * kSecsPerHour) + (minuteOffset * kSecsPerMinute);
timestamp = Timestamp(
timestamp.getSeconds() - secondOffset, timestamp.getNanos());
std::string_view timezone(str + pos, timezonePos - pos);

if ((timezoneID = util::getTimeZoneID(timezone, false)) == -1) {
VELOX_USER_FAIL("Unknown timezone value: \"{}\"", timezone);
}

// Skip any spaces at the end.
while (pos < len && characterIsSpace(str[pos])) {
pos++;
}
pos = timezonePos;
skipSpaces(str, len, pos);

if (pos < len) {
parserError(str, len);
}
}
return timestamp;
return {resultTimestamp, timezoneID};
}

} // namespace facebook::velox::util
41 changes: 39 additions & 2 deletions velox/type/TimestampConversion.h
Original file line number Diff line number Diff line change
Expand Up @@ -131,14 +131,51 @@ inline int64_t fromTimeString(const StringView& str) {

// Timestamp conversion

/// Parses a full ISO 8601 timestamp string, following the format
/// "YYYY-MM-DD HH:MM[:SS[.MS]] +00:00"
/// Parses a full ISO 8601 timestamp string, following the format:
///
/// "YYYY-MM-DD HH:MM[:SS[.MS]]"
///
/// Seconds and miliseconds are optional.
///
/// This function does not accept any timezone information in the string (e.g.
/// UTC, Z, or a timezone offsets). This is because the returned timestamp does
/// not contain timezone information; therefore, it would either be required for
/// this function to convert the parsed timestamp (but we don't know the
/// original timezone), or ignore the timezone information, which would be
/// incorecct.
///
/// For a timezone-aware version of this function, check
/// `fromTimestampWithTimezoneString()` below.
Timestamp fromTimestampString(const char* buf, size_t len);

inline Timestamp fromTimestampString(const StringView& str) {
return fromTimestampString(str.data(), str.size());
}

/// Parses a full ISO 8601 timestamp string, following the format:
///
/// "YYYY-MM-DD HH:MM[:SS[.MS]] +00:00"
///
/// This is a timezone-aware version of the function above
/// `fromTimestampString()` which returns both the parsed timestamp and the
/// timezone ID. It is up to the client to do the expected conversion based on
/// these two values.
///
/// The timezone information at the end of the string may contain a timezone
/// name (as defined in velox/type/tz/*), such as "UTC" or
/// "America/Los_Angeles", or a timezone offset, like "+06:00" or "-09:30". The
/// white space between the hour definition and timestamp is optional.
///
/// -1 means no timezone information was found. Throws VeloxUserError in case of
/// parsing errors.
std::pair<Timestamp, int64_t> fromTimestampWithTimezoneString(
const char* buf,
size_t len);

inline auto fromTimestampWithTimezoneString(const StringView& str) {
return fromTimestampWithTimezoneString(str.data(), str.size());
}

Timestamp fromDatetime(int64_t daysSinceEpoch, int64_t microsSinceMidnight);

} // namespace facebook::velox::util
4 changes: 2 additions & 2 deletions velox/type/tests/ConversionsTest.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1012,14 +1012,14 @@ TEST_F(ConversionsTest, toTimestamp) {
"2000-01-01",
"1970-01-01 00:00:00",
"2000-01-01 12:21:56",
"1970-01-01 00:00:00-02:00",
"1970-01-01 00:01",
},
{
Timestamp(0, 0),
Timestamp(946684800, 0),
Timestamp(0, 0),
Timestamp(946729316, 0),
Timestamp(7200, 0),
Timestamp(60, 0),
});

// Invalid case.
Expand Down
Loading

0 comments on commit 29ddde0

Please sign in to comment.