The Traders' Den - View Single Post - Any good Python coders here? Especially with regard to web scraping?

brentter · #2 2024-09-25, 11:26 AM

What's happening is that your scraper is grabbing the page data before the content has loaded, because there's a JS file that loads ahead of the content.
For example: http://www.thetradersden.org/forums/.../t-203252.html
If you open your browser's dev tools and go to the console, this is the error:

Quote:

Layout was forced before the page was fully loaded. If stylesheets are not yet loaded this may cause a flash of unstyled content. markup.js:250:53

Clicking on the markup.js link highlights this code:

Quote:

try {
  // If we didn't wait for the document to load, we want to force a layout update
  // to ensure the anonymous content will be rendered (see Bug 1580394).
  const forceSynchronousLayoutUpdate = !this.waitForDocumentToLoad;
  this._content = this.anonymousContentDocument.insertAnonymousContent(
    forceSynchronousLayoutUpdate
  );

The page hasn't been rendered yet. So there is a page there, hence the 200 response, just nothing on it yet. Adding a wait timer might fix it, or you could switch to a tool like Selenium, which works well with JavaScript and has a built-in way to wait until a certain page element has loaded:

Quote:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Path to your WebDriver executable (e.g., chromedriver)
driver_path = '/path/to/chromedriver'

# Initialize WebDriver (Chrome in this case).
# Selenium 4 takes the driver path through a Service object;
# the old executable_path argument has been removed.
driver = webdriver.Chrome(service=Service(driver_path))

try:
    # Open the web page
    driver.get('http://example.com')

    # Wait until a specific element is loaded (adjust selector as needed)
    wait = WebDriverWait(driver, 10)  # Raises TimeoutException after 10 seconds
    element = wait.until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'your-target-element-selector'))
    )

    # Now that we are sure the content has loaded, get the page source
    html_content = driver.page_source

    # You can now parse html_content with BeautifulSoup or any other parser as needed

finally:
    driver.quit()  # Close the browser when done
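
For the parsing step, here's a rough sketch that uses only the standard library's html.parser, assuming all you want is the href and text of every link on the page. LinkCollector, extract_links and the sample HTML string are names and data I made up for illustration; in the real scraper you'd pass in the html_content you got from driver.page_source.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects (href, text) pairs for every <a href=...> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag we are currently inside, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside a link that has an href
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html_content):
    """Return a list of (href, link text) tuples found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html_content)
    parser.close()
    return parser.links


# Made-up sample standing in for driver.page_source
sample_html = (
    '<html><body><a href="/t-1.html">First thread</a>'
    '<p>x</p><a href="/t-2.html">Second</a></body></html>'
)
print(extract_links(sample_html))
```

BeautifulSoup would do the same in a couple of lines; this version just avoids the extra install.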

This example uses Chrome, but Selenium can drive Firefox, Edge and other browsers too. It's obviously just a single-page scrape.
https://selenium-python.readthedocs.io/waits.html
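
If you want to try the plain wait timer idea first, it boils down to a polling loop: check a condition, sleep, and repeat until a timeout. A minimal sketch is below; wait_until and page_is_ready are names I made up, and page_is_ready is only a stand-in for "fetch the page and check whether the content is there yet". Keep in mind that a plain HTTP fetch never runs JavaScript, so retrying only helps if the server itself eventually returns the content in the HTML.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() until it returns something truthy, or raise after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in check: pretends the content shows up on the third attempt
attempts = {"count": 0}


def page_is_ready():
    attempts["count"] += 1
    return "<html>content</html>" if attempts["count"] >= 3 else None


html_content = wait_until(page_is_ready, timeout=2.0, interval=0.01)
print(html_content, "after", attempts["count"], "attempts")
```

This is roughly what Selenium's WebDriverWait does for you, except that there the condition is evaluated against a real browser that has run the page's scripts.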