feature/pronounce_digits #150

Draft
wants to merge 7 commits into
base: master
Changes from 2 commits
2 changes: 1 addition & 1 deletion .gitignore
@@ -104,6 +104,6 @@ venv.bak/
.mypy_cache/

# VSCod(e/ium)
.vscode/
.vscode*
vscode/
*.code-workspace
21 changes: 21 additions & 0 deletions lingua_franca/format.py
@@ -35,6 +35,7 @@
_REGISTERED_FUNCTIONS = ("nice_number",
                         "nice_time",
                         "pronounce_number",
                         "pronounce_digits",
                         "nice_response",
                         "nice_duration")

@@ -296,6 +297,26 @@ def pronounce_number(number, lang=None, places=2, short_scale=True,
    """


@localized_function()
def pronounce_digits(number, lang=None, places=2, all_digits=False):
    """
    Pronounce a number's digits, either colloquially or in full

    In English, the colloquial way is usually to read two digits at a time,
    treating each pair as a single number.

    Examples:
        >>> pronounce_number(127, all_digits=False)
Collaborator

pronounce_digits not pronounce_number

        'one twenty seven'
        >>> pronounce_number(127, all_digits=True)
        'one two seven'

    Args:
        number (int|float)
        all_digits (bool): read every digit, rather than two digits at a time
    """


def nice_date(dt, lang=None, now=None):
    """
    Format a datetime to a pronounceable date
39 changes: 39 additions & 0 deletions lingua_franca/lang/format_en.py
@@ -15,6 +15,8 @@
# limitations under the License.
#

from math import modf

from lingua_franca.lang.format_common import convert_to_mixed_fraction
from lingua_franca.lang.common_data_en import _NUM_STRING_EN, \
    _FRACTION_STRING_EN, _LONG_SCALE_EN, _SHORT_SCALE_EN, _SHORT_ORDINAL_EN, _LONG_ORDINAL_EN
@@ -302,6 +304,43 @@ def _long_scale(n):
    return result


def pronounce_digits_en(number, places=2, all_digits=False):
    decimal_part = ""
    op_val = ""
    result = []
    is_float = isinstance(number, float)
    if is_float:
        op_val, decimal_part = [part for part in str(number).split(".")]
        decimal_part = pronounce_number_en(
            float("." + decimal_part), places=places).replace("zero ", "")
    else:
        op_val = str(number)

    if all_digits:
        result = [pronounce_number_en(int(i)) for i in op_val]
        if is_float:
            result.append(decimal_part)
        result = " ".join(result)
    else:
        while len(op_val) > 1:
            idx = -2 if len(op_val) in [2, 4] else -3
Contributor

Without first reading this code I wrote the following tests:

self.assertEqual(pronounce_digits(238513096), "twenty three eighty five thirteen zero ninety six")
self.assertEqual(pronounce_digits(238513696), "twenty three eighty five thirteen sixty nine six")

I like that you go from the end rather than beginning so the final numbers can be read closer to what they actually are - "ninety six".

However, because this is a longer number, it ends up broken into multiple groups of three, so we get:

self.assertEqual(pronounce_digits(238513096), "two thirty eight five thirteen ninety six")

What's the intended outcome here?

Contributor

If we're aiming for speaking in two digit numbers, should we check for an odd number length, speak the first digit and then speak all remaining pairs? Something like:

if len(op_val) % 2 == 1:
  result.append(pronounce_number(op_val[0]))
  op_val = op_val[1:]
remaining_pairs = # some code
for pair in remaining_pairs:
  result.append(pronounce_number(pair))
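
A runnable version of that idea might look like the following. It is only a sketch: `pronounce_in_pairs`, `two_digit_words` and the small word tables are hypothetical stand-ins for `pronounce_number`, it handles non-negative integers only, and reading a pair with a leading zero digit by digit ("zero nine") is an assumption rather than agreed behaviour.

```python
# Hypothetical word tables; the real code would call pronounce_number() instead.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digit_words(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def pronounce_in_pairs(number):
    """Speak a leading single digit if the length is odd, then every pair."""
    digits = str(number)
    words = []
    if len(digits) % 2 == 1:
        words.append(ONES[int(digits[0])])
        digits = digits[1:]
    for i in range(0, len(digits), 2):
        pair = digits[i:i + 2]
        if pair[0] == "0":
            # Assumption: keep the zero audible instead of dropping it.
            words.extend(ONES[int(d)] for d in pair)
        else:
            words.append(two_digit_words(int(pair)))
    return " ".join(words)


print(pronounce_in_pairs(238513096))
# two thirty eight fifty one thirty ninety six
```

Splitting from the front like this keeps every group two digits wide, at the cost of the trailing digits no longer being grouped the way the current back-to-front slicing groups them.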

Contributor Author

@ChanceNCounter, Jun 10, 2021

It seems to be speaking in pairs slightly more often than intended. It doesn't really work on large numbers, but my intention was to "end with" three digit groupings in most cases, which just sounded most natural to me.

I'm gonna go over the code again top to bottom tomorrow, but the gist is:

123 -> "one twenty three"
1234 -> "twelve thirty four"
12345 -> "twelve three forty five"
123456 -> "one twenty three four fifty six"

It's definitely bugged on large numbers atm. The next case, 1234567, should give "one two thirty four five sixty seven", but I'm getting "twelve thirty four five sixty seven".
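
To see where the grouping goes wrong, here is a minimal sketch of just the slicing step from the diff, pulled into a standalone helper. The name `split_groups` is hypothetical; the loop mirrors the `idx = -2 if len(op_val) in [2, 4] else -3` logic and nothing else.

```python
def split_groups(digits):
    """Split a digit string from the back, the way the loop in the diff does."""
    groups = []
    while len(digits) > 1:
        # Take two digits when exactly 2 or 4 remain, otherwise three.
        idx = -2 if len(digits) in (2, 4) else -3
        groups.insert(0, digits[idx:])
        digits = digits[:idx]
    if digits:
        # A single leftover digit becomes its own leading group.
        groups.insert(0, digits)
    return groups


print(split_groups("12345"))      # ['12', '345']
print(split_groups("123456"))     # ['123', '456']
print(split_groups("1234567"))    # ['12', '34', '567']
print(split_groups("238513096"))  # ['238', '513', '096']
```

For seven digits, the four that remain after the first slice are split into two pairs, which is where "twelve thirty four" comes from instead of "one two thirty four".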

Once you're looking at 9+ digits, I don't think the function is much use without all_digits:

>>> assert(format.pronounce_digits(238513096, all_digits=True) == "two three eight five one three zero nine six")

(edit: "tomorrow" to commence mid-afternoon UTC)
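
For comparison, the `all_digits=True` path reads each digit on its own. A rough standalone equivalent, assuming a plain digit-name table in place of `pronounce_number_en` and non-negative integers only (`read_every_digit` and `DIGIT_NAMES` are hypothetical names):

```python
DIGIT_NAMES = ("zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine")


def read_every_digit(number):
    """Spell out each digit of a non-negative integer, left to right."""
    return " ".join(DIGIT_NAMES[int(d)] for d in str(number))


print(read_every_digit(238513096))
# two three eight five one three zero nine six
```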

            back_digits = op_val[idx:]
            op_val = op_val[:idx]
            result = pronounce_number_en(
                int(back_digits)).split(" ") + result
        if op_val:
            result.insert(0, pronounce_number_en(int(op_val)))
        if is_float:
            result.append(decimal_part)
        no_no_words = list(_SHORT_SCALE_EN.values())[:5]
Contributor

Why do we specifically care about the first 5 values? Is this just an optimisation because the chances of the rest being there are so slim?

Contributor Author

Because it slices 2 or 3 digits at a time, the rest can't be there. Right now, I'm trying to remember why I included anything but 'hundred'.

        no_no_words.append('and')
        print(no_no_words)
        print(result)
Collaborator

debug prints

        result = [word for word in result if word.strip() not in no_no_words]
Contributor

Is there any case where you think this might happen that we can test for? Or is it just a safety measure?

Contributor Author

This happens any time the input is longer than two digits. The algorithm works by running pronounce_number() on 2-3 digits at a time, which often returns the words "hundred" and "and".

The latter stray debug print (=P) is the result prior to this operation:

>>> pronounce_digits(234534)
['two', 'hundred', 'and', 'thirty', 'four', 'five', 'hundred', 'and', 'thirty', 'four']
'two thirty four five thirty four'

pronounce_number(534), prepended with pronounce_number(234), sanitized.
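
The sanitizing step on its own can be sketched like this, using the word list from the debug output above. `strip_filler` and `FILLER_WORDS` are hypothetical names, and limiting the filter to "hundred" and "and" is an assumption; the diff builds its list from `_SHORT_SCALE_EN` instead.

```python
# Assumption: only these two words ever need removing for 2-3 digit groups.
FILLER_WORDS = {"hundred", "and"}


def strip_filler(words):
    """Drop scale and joining words, keeping the digit words in order."""
    return [word for word in words if word not in FILLER_WORDS]


raw = ['two', 'hundred', 'and', 'thirty', 'four',
       'five', 'hundred', 'and', 'thirty', 'four']
print(" ".join(strip_filler(raw)))
# two thirty four five thirty four
```

One thing this sketch makes visible: a word list such as `['one', 'hundred']` is reduced to just `['one']`, so round hundreds may be worth a test case.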

Contributor Author

The strip, on the other hand, is probably unneeded.

        result = " ".join(result)
    return result


def nice_time_en(dt, speech=True, use_24hour=False, use_ampm=False):
    """
    Format a time to a comfortable human format