
Input Length Exceeds Maximum Limit in LLama:8B Model API (Deep Infra) #395

Open
sanchitsingh001 opened this issue Dec 31, 2024 · 4 comments

@sanchitsingh001


Hi,

I am using Deep Infra's model API, specifically the LLama:8B model, to scrape product data from e-commerce websites. However, for certain websites, such as Amazon, I encounter the following error:

```json
{
  "index": 0,
  "error": true,
  "tags": ["error"],
  "content": "litellm.APIError: APIError: DeepinfraException - Error code: 500 - {'error': {'message': 'Requested input length 10452 exceeds maximum input length 8191'}}"
}
```
Is there a way to increase the input length, or can the model's structure handle longer inputs? If not, do you recommend any strategies for managing this limitation?

@unclecode (Owner)

@sanchitsingh001 I assume you are using the LLM extraction strategy; such limits relate to the model itself. However, you can work around the issue in a few ways. The LLM extraction strategy can chunk the content into smaller pieces, send each chunk to the LLM in parallel, and then combine the results. You can't adjust the model's threshold size. Share your code snippet and the URL with me, and I will show you how to do that.
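To make the chunk-and-merge idea concrete, here is a minimal sketch using litellm directly (the same client crawl4ai uses under the hood). This is not the library's built-in mechanism; the model string, character-based chunk size, and merging logic are illustrative assumptions:

```python
import asyncio
from litellm import acompletion  # async chat-completion call; reads DEEPINFRA_API_KEY from the env

# Rough proxy: ~4 characters per token keeps each chunk well under the 8191-token cap.
CHUNK_CHARS = 16000

def chunk_text(text: str, size: int = CHUNK_CHARS) -> list[str]:
    """Split the page content into fixed-size pieces; a tokenizer-based split would be more precise."""
    return [text[i:i + size] for i in range(0, len(text), size)]

async def extract_chunk(chunk: str, instruction: str) -> str:
    """Run one extraction call on a single chunk."""
    resp = await acompletion(
        model="deepinfra/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",  # assumed provider prefix
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": chunk},
        ],
    )
    return resp.choices[0].message.content

async def extract_all(text: str, instruction: str) -> list[str]:
    # Fan out one request per chunk in parallel, then combine the partial results.
    return await asyncio.gather(*(extract_chunk(c, instruction) for c in chunk_text(text)))
```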

I am working on a set of new documents explaining the different strategies you can use. They are currently in draft form; here are the links so you can check them and get some ideas:

https://github.com/unclecode/crawl4ai/blob/main/docs/md_v3/tutorials/json-extraction-basic.md
https://github.com/unclecode/crawl4ai/blob/main/docs/md_v3/tutorials/json-extraction-llm.md
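The first tutorial covers schema-based extraction with CSS selectors, which sidesteps the token limit entirely because no LLM call is involved. A minimal sketch of that approach; the selectors and URL here are illustrative assumptions, not tested against any real page's markup:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Illustrative schema: one entry is extracted per element matching baseSelector.
schema = {
    "name": "Products",
    "baseSelector": "div.product-card",  # assumed per-product container
    "fields": [
        {"name": "product_name", "selector": "h2.title", "type": "text"},
        {"name": "price", "selector": "span.price", "type": "text"},
        {"name": "rating", "selector": "span.rating", "type": "text"},
    ],
}

async def run():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",  # placeholder URL
            extraction_strategy=JsonCssExtractionStrategy(schema),
            bypass_cache=True,
        )
        print(result.extracted_content)

asyncio.run(run())
```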

@unclecode unclecode self-assigned this Jan 1, 2025
@unclecode unclecode added the question Further information is requested label Jan 1, 2025
@sanchitsingh001 (Author)

Thank you for your detailed response and for sharing the helpful documentation links. I have attached the requested code snippet and the product page URL for reference.

To provide more context, I primarily scrape e-commerce websites like Amazon and eBay to extract product details. However, I have encountered some challenges:

1. Hallucinated responses: The structured data returned often includes hallucinated entries. For example, if I provide a page containing a single product, the response may include a list of products that do not exist.
2. Performance requirements: I need to scrape and process approximately 90-100 product pages at a time, converting the content into structured data within 5 seconds. Achieving this level of performance has been challenging.

Given these constraints, I have the following questions:

1. Is using a more advanced LLM the only way to ensure highly accurate and reliable structured data?
2. For the performance bottleneck, is hardware the primary limitation, or are there additional optimizations I could consider?
I know tools like Perplexity and some ChatGPT versions can retrieve and process web data quickly, so I believe this level of efficiency is achievable. Any guidance or resources you could provide to address these challenges would be greatly appreciated.

Here's my code:
```python
import os
import json
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field


# Define the schema for product details
class ProductDetails(BaseModel):
    product_name: str = Field(..., description="Name of the product.")
    price: str = Field(..., description="Price of the product.")
    rating: str = Field(..., description="Rating of the product.")
    reviews_count: str = Field(..., description="Number of reviews for the product.")
    availability: str = Field(..., description="Availability status of the product.")
    product_description: str = Field(..., description="Detailed description of the product.")
    features: list[str] = Field(..., description="List of features or specifications of the product.")


# Function to extract product details from the webpage
async def extract_product_details():
    url = 'https://www.amazon.com/s?k=shoes&crid=1M0PZKQQQ7OYT&sprefix=shoe%2Caps%2C191&ref=nb_sb_noss_2'

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=url,
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
                api_token="",
                base_url="https://api.deepinfra.com/v1/openai",
                schema=ProductDetails.model_json_schema(),
                extraction_type="schema",
                instruction=(
                    "From the crawled content, extract the product's name, price, rating, number of reviews, "
                    "availability status, detailed description, and list of features or specifications. "
                    "Ensure all information is accurate and comprehensive. "
                    'An example JSON format for a single product: '
                    '{ "product_name": "PUMA Tazon Running Shoe", "price": "$50.00", "rating": "4.5 stars", '
                    '"reviews_count": "1,200", "availability": "In Stock", '
                    '"product_description": "Durable and comfortable running shoes.", '
                    '"features": ["Rubber sole", "Mesh upper for breathability", "Padded collar and tongue"] }'
                )
            ),
            bypass_cache=True,
        )

    product_details = json.loads(result.extracted_content)
    print(f"Extracted product details: {product_details}")
    os.makedirs(".data", exist_ok=True)  # ensure the output directory exists before writing
    with open(".data/product_details.json", "w", encoding="utf-8") as f:
        json.dump(product_details, f, indent=2)


asyncio.run(extract_product_details())
```
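On the 90-100 pages question: one lever that does not depend on hardware is crawling all pages concurrently over a single shared crawler, since the per-page LLM call, not the crawl itself, usually dominates latency. A minimal sketch of the fan-out, assuming placeholder URLs and no extraction strategy for brevity:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

# Placeholder URLs for illustration; substitute the real product pages.
urls = [f"https://example.com/product/{i}" for i in range(100)]

async def crawl_all(urls: list[str]) -> list:
    # One crawler (one browser) shared across all pages; arun calls run concurrently.
    async with AsyncWebCrawler(verbose=False) as crawler:
        return await asyncio.gather(
            *(crawler.arun(url=url, bypass_cache=True) for url in urls)
        )

results = asyncio.run(crawl_all(urls))
print(sum(1 for r in results if r.success), "pages crawled")
```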

@unclecode (Owner)

@sanchitsingh001 You're welcome. Sure, I will take a look at this over the coming weekend.

@sanchitsingh001 (Author)

Thank You
