DataBricks Unity Catalog and Cobrix #665

Open
schwalldorf opened this issue Apr 9, 2024 · 11 comments
Labels
accepted (Accepted for implementation), enhancement (New feature or request)

Comments

@schwalldorf

schwalldorf commented Apr 9, 2024

Hi guys,

thanks a lot for Cobrix. It's really great!

We're moving from Spark (Hadoop) on-premises to Databricks in the Azure cloud,
and have encountered a strange problem when using Unity Catalog.

Both the copybook and the data are stored in a managed volume in Unity Catalog. (The copybooks are simple, with no nested fields.) If we do something as simple as

df = spark.read.format("cobol"). \
        option("copybook", "/Volumes/dev/raw/copybook.cob"). \
        load("/Volumes/dev/raw/my.data")

in a Python notebook or script, everything works fine if the code runs on a compute cluster created by the same person who executes the code. If the code is run by Person A on a cluster created by Person B, an "Insufficient Permissions" exception is raised:

[INSUFFICIENT_PERMISSIONS] Insufficient privileges:
User does not have permission SELECT on any file. SQLSTATE: 42501
File <command-4018475800944646>, line 1
----> 1 cobol_import("bnktfili")
File /databricks/spark/python/pyspark/sql/connect/client/core.py:1874, in SparkConnectClient._handle_rpc_error(self, rpc_error)
   1871             info = error_details_pb2.ErrorInfo()
   1872             d.Unpack(info)
-> 1874             raise convert_exception(
   1875                 info,
   1876                 status.message,
   1877                 self._fetch_enriched_error(info),
   1878                 self._display_server_stack_trace(),
   1879             ) from None
   1881     raise SparkConnectGrpcException(status.message) from None
   1882 else:

Person A has full read permissions on every item in the catalog.
The problem only arises when using Cobrix. If we just load a CSV or Parquet file from a volume, no such problem occurs.

Any idea what goes on here or what we could do?
Any help is much appreciated. Thanks a lot.

@schwalldorf schwalldorf added the question Further information is requested label Apr 9, 2024
@schwalldorf
Author

Some more error message context:

2024-04-05 12:58:20,105 1607 ERROR _handle_rpc_error GRPC Error received
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/sql/connect/client/core.py", line 1485, in _execute_and_fetch_as_iterator
    for b in generator:
  File "/usr/lib/python3.10/_collections_abc.py", line 330, in __next__
    return self.send(None)
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 133, in send
    if not self._has_next():
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 194, in _has_next
    raise e
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 166, in _has_next
    self._current = self._call_iter(
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 280, in _call_iter
    raise e
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 263, in _call_iter
    return iter_fun()
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 167, in <lambda>
    lambda: next(self._iterator)  # type: ignore[arg-type]
  File "/databricks/python/lib/python3.10/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/databricks/python/lib/python3.10/site-packages/grpc/_channel.py", line 826, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.INTERNAL
	details = "[INSUFFICIENT_PERMISSIONS] Insufficient privileges:
User does not have permission SELECT on any file. SQLSTATE: 42501"
	debug_error_string = "UNKNOWN:Error received from peer unix:/databricks/sparkconnect/grpc.sock {grpc_message:"[INSUFFICIENT_PERMISSIONS] Insufficient privileges:\nUser does not have permission SELECT on any file. SQLSTATE: 42501", grpc_status:13, created_time:"2024-04-05T12:58:20.104583977+00:00"}"

@schwalldorf
Author

Do you read the copybook and the data file via the RDD API? If so, this is the likely cause, as Databricks does not support the RDD API in Unity Catalog shared access mode: https://learn.microsoft.com/en-us/azure/databricks/compute/access-mode-limitations#spark-api-limitations-for-unity-catalog-shared-access-mode

@yruslan
Collaborator

yruslan commented Apr 10, 2024

@schwalldorf, thanks for your interest in the project. Very glad you like it!

What is the Databricks-supported alternative for reading data files concurrently from Spark?

@schwalldorf
Author

Hi Ruslan,

thanks a lot for your reply.
Databricks supports both the DataFrame API and the Dataset API. I think the Dataset API should be closer to RDDs, but I'm not an expert in this, and I wouldn't know how to easily rewrite your code.

@yruslan
Collaborator

yruslan commented Apr 17, 2024

Sure. Let's keep this issue open. This is something we might look into at some point. In the meantime, somebody might suggest a workaround.

@meghanavemisetty

Hi there,
I am also encountering the issue described in #665. I'm looking forward to any updates or workarounds that might be available, and I'm following this thread for any progress.
Thanks!

@yruslan
Collaborator

yruslan commented May 7, 2024

So far there has been no progress on this, since I don't have access to a Databricks instance at the moment. This might change during the year; I will keep in mind to fix it.

@yruslan yruslan added enhancement New feature or request accepted Accepted for implementation and removed question Further information is requested labels May 7, 2024
@saikumare-a

Any luck with an update on this?

@yruslan
Collaborator

yruslan commented Aug 2, 2024

Not from our side, since we are not yet using Databricks volumes on Unity Catalog.

Has this issue been raised with Databricks support as well? If yes, please add a link to the issue.

A possible workaround is to use:

.option("enable_indexes", "true")

Let me know if it works.
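
For anyone trying this, a minimal sketch of the original read with the suggested option added might look as follows. The helper name `read_cobol` and its parameters are hypothetical, introduced only for illustration; the paths are the ones from the first post. Whether the option actually avoids the permission error has not been confirmed in this thread.

```python
def read_cobol(spark, copybook_path, data_path, enable_indexes="true"):
    """Read a mainframe file through Cobrix with the suggested option set.

    `spark` is an existing SparkSession (for example, the one a Databricks
    notebook provides). The option value is a parameter so that both
    settings can be compared; this is an assumption made for testing,
    not something the thread prescribes.
    """
    return (
        spark.read.format("cobol")
        .option("copybook", copybook_path)
        .option("enable_indexes", enable_indexes)
        .load(data_path)
    )


# Usage in a notebook where `spark` is already defined:
# df = read_cobol(spark, "/Volumes/dev/raw/copybook.cob", "/Volumes/dev/raw/my.data")
# df.show(5)
```

Wrapping the call in a function keeps the reader options in one place, which makes it easier to run the same read on clusters created by different users.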

@saikumare-a

Sure, will check and update. Thank you

@yruslan
Collaborator

yruslan commented Aug 2, 2024

@schwalldorf, @saikumare-a, @meghanavemisetty, if you have a stack trace that shows the lines of Cobrix Scala code where the error occurs, it would help a bit. It would at least confirm which API is used for file access at that location.

Also, you can try:

  • Loading an ASCII file via Cobrix - this uses a different API. If it works, this would give some additional information.
  • Checking whether there are differences between "record_format = F" and "record_format = V". Does access work for simple fixed-length files, or only for variable-length ones?
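
The experiments above could be run in one pass with a small probe like the sketch below. The function name `probe_reads`, the experiment names, and the file names in the usage comment are hypothetical. The `record_format` values `F` and `V` come from this thread; `D` for ASCII text files is taken from the Cobrix README as an assumption and should be checked against the installed version. Each variant needs its own data file in the matching format.

```python
import traceback


def probe_reads(spark, copybook_path, experiments):
    """Try several Cobrix reads and record which ones succeed.

    `experiments` is a list of (name, data_path, options) tuples.
    Returns a dict mapping each name to "ok" or to the full Python
    traceback text, which can be pasted into the issue.
    """
    results = {}
    for name, data_path, options in experiments:
        try:
            reader = spark.read.format("cobol").option("copybook", copybook_path)
            for key, value in options.items():
                reader = reader.option(key, value)
            # Fetch one row so that file access is actually exercised.
            reader.load(data_path).limit(1).collect()
            results[name] = "ok"
        except Exception:
            results[name] = traceback.format_exc()
    return results


# Usage in a notebook where `spark` is already defined (paths are placeholders):
# report = probe_reads(spark, "/Volumes/dev/raw/copybook.cob", [
#     ("fixed_length", "/Volumes/dev/raw/fixed.data", {"record_format": "F"}),
#     ("variable_length", "/Volumes/dev/raw/variable.data", {"record_format": "V"}),
#     ("ascii_text", "/Volumes/dev/raw/ascii.txt", {"record_format": "D"}),
# ])
# for name, outcome in report.items():
#     print(name, "->", outcome)
```

Catching the exception per variant keeps one failing read from hiding the outcome of the others, so the report shows directly which access paths are affected.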

4 participants