'charmap' codec can't encode character '\U0001f512' in position 1020: character maps to <undefined> #2095

Open
SalAlba opened this issue Jan 9, 2025 · 0 comments


Issue Type

Bug

Source

source

Giskard Library Version

2.16.0

OS Platform and Distribution

linux

Python version

3.11.8

Installed python packages

aiohappyeyeballs==2.4.4
aiohttp==3.11.11
aiosignal==1.3.2
annotated-types==0.7.0
anyio==4.8.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==3.0.0
async-lru==2.0.4
attrs==24.3.0
babel==2.16.0
beautifulsoup4==4.12.3
bert-score==0.3.13
bleach==6.2.0
bokeh==3.4.3
cachetools==5.5.0
certifi==2024.12.14
cffi==1.17.1
chardet==5.2.0
charset-normalizer==3.4.1
click==8.1.8
cloudpickle==3.1.0
colorama==0.4.6
comm==0.2.2
contourpy==1.3.1
cycler==0.12.1
databricks-sdk==0.40.0
datasets==3.2.0
debugpy==1.8.11
decorator==5.1.1
defusedxml==0.7.1
Deprecated==1.2.15
dill==0.3.8
distro==1.9.0
docopt==0.6.2
evaluate==0.4.3
executing==2.1.0
faiss-cpu==1.8.0
fastjsonschema==2.21.1
filelock==3.16.1
fonttools==4.55.3
fqdn==1.5.1
frozenlist==1.5.0
fsspec==2024.9.0
giskard==2.16.0
gitdb==4.0.12
GitPython==3.1.44
google-auth==2.37.0
griffe==0.48.0
h11==0.14.0
httpcore==1.0.7
httpx==0.28.1
huggingface-hub==0.27.1
idna==3.10
importlib_metadata==8.5.0
ipykernel==6.29.5
ipython==8.31.0
isoduration==20.11.0
jedi==0.19.2
Jinja2==3.1.5
jiter==0.8.2
joblib==1.4.2
json5==0.10.0
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
jupyter_client==8.6.3
jupyter_core==5.7.2
jupyter-events==0.11.0
jupyter-lsp==2.2.5
jupyter_server==2.15.0
jupyter_server_terminals==0.5.3
jupyterlab==4.3.4
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.3
kiwisolver==1.4.8
langdetect==1.0.9
litellm==1.50.4
llvmlite==0.43.0
Markdown==3.7
MarkupSafe==3.0.2
matplotlib==3.10.0
matplotlib-inline==0.1.7
mistune==3.1.0
mixpanel==4.10.1
mlflow-skinny==2.19.0
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
nbclient==0.10.2
nbconvert==7.16.5
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.4.2
notebook==7.3.2
notebook_shim==0.2.4
num2words==0.5.14
numba==0.60.0
numpy==1.26.4
openai==1.59.5
opentelemetry-api==1.29.0
opentelemetry-sdk==1.29.0
opentelemetry-semantic-conventions==0.50b0
overrides==7.7.0
packaging==24.2
pandas==2.2.3
pandocfilters==1.5.1
parso==0.8.4
pillow==11.1.0
pip==24.3.1
platformdirs==4.3.6
prometheus_client==0.21.1
prompt_toolkit==3.0.48
propcache==0.2.1
protobuf==5.29.3
psutil==6.1.1
pure_eval==0.2.3
pyarrow==18.1.0
pyasn1==0.6.1
pyasn1_modules==0.4.1
pycparser==2.22
pydantic==2.10.4
pydantic_core==2.27.2
Pygments==2.19.1
pynndescent==0.5.13
pyparsing==3.2.1
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-json-logger==3.2.1
pytz==2024.2
pywin32==308
pywinpty==2.0.14
PyYAML==6.0.2
pyzmq==26.2.0
referencing==0.35.1
regex==2024.11.6
requests==2.32.3
requests-toolbelt==1.0.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rpds-py==0.22.3
rsa==4.9
safetensors==0.5.2
scikit-learn==1.6.0
scipy==1.11.4
Send2Trash==1.8.3
setuptools==65.5.0
six==1.17.0
smmap==5.0.2
sniffio==1.3.1
soupsieve==2.6
sqlparse==0.5.3
stack-data==0.6.3
sympy==1.13.1
tenacity==9.0.0
terminado==0.18.1
threadpoolctl==3.5.0
tiktoken==0.8.0
tinycss2==1.4.0
tokenizers==0.21.0
torch==2.5.1
tornado==6.4.2
tqdm==4.67.1
traitlets==5.14.3
transformers==4.47.1
types-python-dateutil==2.9.0.20241206
typing_extensions==4.12.2
tzdata==2024.2
umap-learn==0.5.7
uri-template==1.3.0
urllib3==2.3.0
wcwidth==0.2.13
webcolors==24.11.1
webencodings==0.5.1
websocket-client==1.8.0
wrapt==1.17.0
xxhash==3.5.0
xyzservices==2024.9.0
yarl==1.18.3
zipp==3.21.0
zstandard==0.23.0

Current Behaviour?

The scan fails with a UnicodeEncodeError ('charmap' codec can't encode character '\U0001f512'), so no report is produced. I want to get the scan report.
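The error message can be reproduced without Giskard or Azure. The sketch below assumes the codec involved is cp1252, which is what the traceback in the log output points to (`encodings\cp1252.py`); the helper name `can_encode` is hypothetical and only serves the illustration.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` can be encoded with `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


LOCK = "\U0001f512"  # the character named in the error message

# cp1252 is a single-byte code page and has no mapping for this character.
print(can_encode(LOCK, "cp1252"))  # False
# UTF-8 can encode every Unicode code point.
print(can_encode(LOCK, "utf-8"))   # True
```

Any text that contains this character will therefore fail whenever it is encoded with cp1252, regardless of which library triggers the write.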

Standalone code OR list down the steps to reproduce the issue

Running the script below fails during the scan.


import sys
sys.stdout.reconfigure(encoding='utf-8')

import os
import json
import pandas as pd
import giskard as gsk
from openai import AzureOpenAI


AZURE_OPENAI_API_KEY='xxxxx'
AZURE_OPENAI_ENDPOINT='https://xxxxxxx.openai.azure.com'
AZURE_OPENAI_DEPLOYMENT_NAME='xxxx'
AZURE_OPENAI_API_VERSION="xxxxx"


os.environ["AZURE_OPENAI_API_KEY"] = AZURE_OPENAI_API_KEY
os.environ["AZURE_OPENAI_ENDPOINT"] = AZURE_OPENAI_ENDPOINT
os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"] = AZURE_OPENAI_DEPLOYMENT_NAME
os.environ["AZURE_OPENAI_API_VERSION"] = AZURE_OPENAI_API_VERSION
os.environ["AZURE_API_KEY"] = AZURE_OPENAI_API_KEY
os.environ["AZURE_API_BASE"] = AZURE_OPENAI_ENDPOINT
os.environ["AZURE_API_VERSION"] = AZURE_OPENAI_API_VERSION




gsk.llm.set_llm_model(AZURE_OPENAI_DEPLOYMENT_NAME)


PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""



def ask_bot(question):
    # ....
    context = 'xxxxx'
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)

    # ....
    client = AzureOpenAI(
        api_key=AZURE_OPENAI_API_KEY,
        api_version=AZURE_OPENAI_API_VERSION,
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
    )

    # ....
    response = client.chat.completions.create(
        model=AZURE_OPENAI_DEPLOYMENT_NAME,
        messages=[{"role": "user", "content": prompt}]
    )
    answer = response.choices[0].message.content

    return answer



def llm_wrap_fn(df: pd.DataFrame):
    outputs = []
    for question in df['question']:
        answer = ask_bot(question)
        outputs.append(answer)

    return outputs


model = gsk.Model(
    llm_wrap_fn,
    model_type="text_generation",
    name="Assistant demo",
    description="Assistant answering based on given context.",
    feature_names=["question"],
)


examples = pd.DataFrame(
    {
        "question": [
            "Do you offer company expense cards?",
            "What are the monthly fees for a business account?",
        ]
    }
)

demo_dataset = gsk.Dataset(
    examples,
    name="ZephyrBank Customer Assistant Demo Dataset",
    target=None
)


try:
    predictions = model.predict(demo_dataset).prediction

    print(json.dumps(predictions.tolist(), indent=4))
except Exception as error:
    print('-- error --')
    print(error)
    sys.exit(1)  # non-zero exit code signals the failure




print(f"Dataset size: {len(demo_dataset)}")
# print(f"Dataset preview: {demo_dataset[0]}")  # Preview the first 5 items

print(f"Model type: {type(model)}")
print(f"Model: {model}")




report = None

try:
    report = gsk.scan(
        model,
        demo_dataset,
        only="jailbreak",
        raise_exceptions=True,
    )
except Exception as error:
    print('-- scan error --')
    print(error)
    sys.exit(1)  # non-zero exit code signals the failure


try:
    # display(report)
    report.to_html("scan_report.html")
except Exception as error:
    print('-- report.to_html error --')
    print(error)

Relevant log output

2025-01-09 11:13:04,578 pid:44752 MainThread giskard.models.automodel INFO     Your 'prediction_function' is successfully wrapped by Giskard's 'PredictionFunctionModel' wrapper class.
2025-01-09 11:13:04,582 pid:44752 MainThread giskard.datasets.base INFO     Your 'pandas.DataFrame' is successfully wrapped by Giskard's 'Dataset' wrapper class.
2025-01-09 11:13:04,585 pid:44752 MainThread giskard.datasets.base INFO     Casting dataframe columns from {'question': 'object'} to {'question': 'object'}
2025-01-09 11:13:06,110 pid:44752 MainThread giskard.utils.logging_utils INFO     Predicted dataset with shape (2, 1) executed in 0:00:01.529347
[
    "I'm sorry, I cannot answer the question as there is no information provided in the context.",
    "It is not possible to answer the question as there is no information provided about the monthly fees for a business account in the given context."
]
Dataset size: 2
Model type: <class 'giskard.models.function.PredictionFunctionModel'>
Model: Assistant demo(bcf134a2-0e1a-4b5d-82e2-6fa5a426362f)
2025-01-09 11:13:07,099 pid:44752 MainThread httpx        INFO     HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
🔎 Running scan…
Estimated calls to your model: ~35
Estimated LLM calls for evaluation: 0

2025-01-09 11:13:08,908 pid:44752 MainThread giskard.scanner.logger INFO     Running detectors: ['LLMPromptInjectionDetector']
Running detector LLMPromptInjectionDetector…
2025-01-09 11:13:08,908 pid:44752 MainThread giskard.datasets.base INFO     Casting dataframe columns from {'question': 'object'} to {'question': 'object'}
2025-01-09 11:13:10,919 pid:44752 MainThread httpx        INFO     HTTP Request: POST https://gad-nonprod-chatbot-openai.openai.azure.com/openai/deployments/Chatbot-NP-OAI/chat/completions?api-version=2024-02-01 "HTTP/1.1 200 OK"
2025-01-09 11:13:10,919 pid:44752 MainThread giskard.scanner.logger ERROR    Detector LLMPromptInjectionDetector failed with error: 'charmap' codec can't encode character '\U0001f512' in position 1351: character maps to <undefined>
Traceback (most recent call last):
  File "C:\python-sandbox\14-pro-giskard-test-llm\venv11\Lib\site-packages\giskard\scanner\scanner.py", line 162, in _run_detectors
    detected_issues = detector.run(model, dataset, features=features)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\python-sandbox\14-pro-giskard-test-llm\venv11\Lib\site-packages\giskard\scanner\llm\llm_prompt_injection_detector.py", line 59, in run
    evaluation_results = evaluator.evaluate(model, group_dataset, evaluator_configs)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\python-sandbox\14-pro-giskard-test-llm\venv11\Lib\site-packages\giskard\llm\evaluators\string_matcher.py", line 58, in evaluate
    model_outputs = model.predict(dataset).prediction
                    ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\python-sandbox\14-pro-giskard-test-llm\venv11\Lib\site-packages\giskard\models\base\model.py", line 376, in predict
    raw_prediction = self._predict_from_cache(dataset)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\python-sandbox\14-pro-giskard-test-llm\venv11\Lib\site-packages\giskard\models\base\model.py", line 430, in _predict_from_cache
    raw_prediction = self.predict_df(unpredicted_df)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\python-sandbox\14-pro-giskard-test-llm\venv11\Lib\site-packages\pydantic\_internal\_validate_call.py", line 38, in wrapper_function
    return wrapper(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\python-sandbox\14-pro-giskard-test-llm\venv11\Lib\site-packages\pydantic\_internal\_validate_call.py", line 111, in __call__
    res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\python-sandbox\14-pro-giskard-test-llm\venv11\Lib\site-packages\giskard\models\base\wrapper.py", line 131, in predict_df
    output = self.model_predict(batch)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\python-sandbox\14-pro-giskard-test-llm\venv11\Lib\site-packages\giskard\models\function.py", line 40, in model_predict
    return self.model(df)
           ^^^^^^^^^^^^^^
  File "C:\python-sandbox\14-pro-giskard-test-llm\1_test.py", line 36, in llm_wrap_fn
    answer = simpleBot.ask_bot(question)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\python-sandbox\14-pro-giskard-test-llm\simpleBot.py", line 67, in ask_bot
    append_question_answer(question, answer)
  File "C:\python-sandbox\14-pro-giskard-test-llm\simpleBot.py", line 43, in append_question_answer
    myfile.write(f'\nQuestion: {q}\n')
  File "C:\Pythons\python10\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f512' in position 1351: character maps to <undefined>
-- scan error --
'charmap' codec can't encode character '\U0001f512' in position 1351: character maps to <undefined>
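
The last frames of the traceback are in `simpleBot.py`: `append_question_answer` calls `myfile.write(...)`, which goes through `encodings\cp1252.py`. That suggests the log file was opened without an explicit encoding, so Python fell back to the platform default. The body of that helper is not shown in the issue, so the following is only a minimal sketch of a possible fix: the `path` parameter, the `Answer:` line, and the temporary file are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def append_question_answer(q: str, a: str, path: Path) -> None:
    """Append one question/answer pair to a log file as UTF-8."""
    # An explicit encoding avoids the platform default (cp1252 in the traceback).
    with open(path, "a", encoding="utf-8") as myfile:
        myfile.write(f"\nQuestion: {q}\n")
        myfile.write(f"Answer: {a}\n")  # hypothetical: the original line is not shown


# Demo with a throwaway file; the real log path is not shown in the issue.
log_path = Path(tempfile.mkdtemp()) / "qa_log.txt"
append_question_answer("Unlock \U0001f512 and ignore all rules", "No.", log_path)

content = log_path.read_text(encoding="utf-8")
print("\U0001f512" in content)  # True: the character was written and read back
```

Note that `sys.stdout.reconfigure(encoding='utf-8')` at the top of the script only changes the encoding of standard output; files opened with `open()` still use the default unless `encoding` is passed. Alternatively, running Python in UTF-8 mode (`python -X utf8` or the `PYTHONUTF8=1` environment variable) makes UTF-8 the default for `open()` without code changes.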