Wednesday, February 17, 2021

Python - Get HTML Pages

1. Download Python

https://www.python.org/downloads/

2. Install Python

Create a Python folder and run the installer from it (the file name is likely to differ)

C:\Users\arrge\Python>C:\Users\arrge\Downloads\python-3.9.1-amd64.exe

3. Get PIP

https://bootstrap.pypa.io/get-pip.py

Save it in the Python folder and run it from there

C:\Users\arrge\Python>py get-pip.py

C:\Users\arrge\AppData\Local\Programs\Python\Python39\lib\site-packages\setuptools\distutils_patch.py:25: UserWarning: Distutils was imported before Setuptools. This usage is discouraged and may exhibit undesirable behaviors or errors. Please use Setuptools' objects directly or at least import Setuptools first.
  warnings.warn(
Collecting pip
  Downloading pip-21.0.1-py3-none-any.whl (1.5 MB)
     |████████████████████████████████| 1.5 MB 2.2 MB/s
Collecting wheel
  Downloading wheel-0.36.2-py2.py3-none-any.whl (35 kB)
Installing collected packages: wheel, pip
  WARNING: The script wheel.exe is installed in 'C:\Users\arrge\AppData\Local\Programs\Python\Python39\Scripts' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  Attempting uninstall: pip
    Found existing installation: pip 20.2.3
    Uninstalling pip-20.2.3:
      Successfully uninstalled pip-20.2.3
  WARNING: The scripts pip.exe, pip3.9.exe and pip3.exe are installed in 'C:\Users\arrge\AppData\Local\Programs\Python\Python39\Scripts' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed pip-21.0.1 wheel-0.36.2
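
Both warnings above say that the Scripts folder is not on PATH, so a bare pip command may not be found from a new prompt. A minimal sketch, standard library only, that reports whether the running interpreter's scripts folder is on PATH. The function name scripts_dir_on_path is made up for this illustration.

```python
import os
import sysconfig


def scripts_dir_on_path(path_value=None):
    """Return True if this interpreter's scripts folder is listed in PATH.

    path_value is an optional PATH-style string; by default the PATH
    environment variable of the current process is used.
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    # normcase makes the comparison case-insensitive on Windows
    scripts = os.path.normcase(os.path.normpath(sysconfig.get_path("scripts")))
    entries = [
        os.path.normcase(os.path.normpath(p))
        for p in path_value.split(os.pathsep)
        if p
    ]
    return scripts in entries


print(sysconfig.get_path("scripts"))
print(scripts_dir_on_path())
```

If this prints False, one option is to call pip through the launcher, for example py -m pip install requests, which does not depend on PATH.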

4. Install Beautiful Soup and Requests


pip install beautifulsoup4
pip install requests
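
Before writing the script it can help to confirm that both packages are importable by the same interpreter that py starts. A small sketch, assuming only the standard library. The helper name missing_packages is hypothetical.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# bs4 is the import name of the beautifulsoup4 package
print(missing_packages(["bs4", "requests"]))
```

An empty list means both modules were found. Any name that is printed still needs installing.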

5. Write script test.py


from bs4 import BeautifulSoup
import requests

for i in range(2, 136):
    # Sheet numbers in the URL are zero-padded to three digits (sheet002 to sheet135)
    xstr = str(i).zfill(3)
    url = "http://www.thearsenalhistory.com/stat/aftlu_files/sheet" + xstr + ".htm"
    print(url)
    page = requests.get(url)
    soup = BeautifulSoup(page.text, features="html.parser")
    # Name each output file after two consecutive years, e.g. 1886_1887.htm for sheet 2
    filename = str(1884 + i) + "_" + str(1885 + i) + ".htm"
    with open(filename, "w", encoding="utf-8") as f:
        # Write every table found on the page, with "-" replaced by "~"
        for tb in soup.find_all("table"):
            print(tb.prettify().replace("-", "~"), file=f)
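
The script maps each sheet number to a zero-padded URL and to an output file named after two consecutive years. A sketch of that mapping on its own, with no network access, so it can be checked before downloading anything. The names BASE_URL, sheet_url and output_filename are introduced here for illustration and are not part of the script.

```python
BASE_URL = "http://www.thearsenalhistory.com/stat/aftlu_files/sheet"


def sheet_url(i):
    """Build the page URL for sheet i, padding the number to three digits."""
    return f"{BASE_URL}{i:03d}.htm"


def output_filename(i):
    """Build the output file name for sheet i (sheet 2 gives 1886_1887.htm)."""
    return f"{1884 + i}_{1885 + i}.htm"


# Spot-check the boundaries where the number of digits changes
for i in (2, 9, 10, 99, 100, 135):
    print(sheet_url(i), "->", output_filename(i))
```

The format specification 03d does the same job as the if/elif padding in the script.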


6. Run script test.py

C:\Users\arrge\Python>py test.py
http://www.thearsenalhistory.com/stat/aftlu_files/sheet002.htm
http://www.thearsenalhistory.com/stat/aftlu_files/sheet003.htm
http://www.thearsenalhistory.com/stat/aftlu_files/sheet004.htm
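
The script prints one URL per sheet and writes one file per sheet into the current folder. A minimal sketch for checking afterwards that every expected file is there, assuming the file naming used in test.py. The function names expected_filenames and missing_outputs are hypothetical.

```python
from pathlib import Path


def expected_filenames(first=2, last=135):
    """List the file names test.py writes for sheets first..last inclusive."""
    return [f"{1884 + i}_{1885 + i}.htm" for i in range(first, last + 1)]


def missing_outputs(folder="."):
    """Return the expected file names that do not exist in folder."""
    folder = Path(folder)
    return [name for name in expected_filenames() if not (folder / name).is_file()]


missing = missing_outputs()
print(len(missing), "of", len(expected_filenames()), "files missing")
```

Run it from the same folder as test.py. It only checks that the files exist, not that the downloaded pages contained any tables.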



