Scraping all the texts of Luxun (鲁迅) from the Internet using Python (用Python爬取《鲁迅全集》)

2019-10-12
2 min read

I want to do some text mining practice on the texts of Luxun (鲁迅), a great Chinese writer. The first step is to get all of his texts, and I have no time to type them all out word by word, so I decided to scrape them from an online source.

Source of the texts

The texts of Luxun are scraped from 子夜星网. The site claims to contain all the texts in the Complete Works of Luxun (鲁迅全集); I checked, and it does.

Get the urls and titles of all the articles

The process starts with getting the titles and urls of Luxun's texts from the parent url http://www.ziyexing.com/luxun/. To access all the urls, I constructed a regular expression and selected all a nodes whose href matches the pattern.

import re
import requests
from bs4 import BeautifulSoup

base_url = "http://www.ziyexing.com/luxun/"
homepage_res = requests.get(base_url)
homepage_soup = BeautifulSoup(homepage_res.text, "html.parser")
href_re = re.compile(r"luxun_\w+_\w+_\d+\.htm")
hrefs = homepage_soup.find_all("a", {"href": href_re})

It turned out that although the regular expression covers most of the links, some urls are idiosyncratic and do not conform to it, so I constructed another regex.

others_re = re.compile(r"(zhunfengyuetan)|(gushixinbian)|(gujixubaji)|(zgxssl)|(luxun_shici)\w+")
other_hrefs = homepage_soup.find_all("a", {"href":others_re})

It is, of course, just as idiosyncratic, but effective. With the two regexes combined, I have all the urls.

links = [href.attrs["href"] for href in hrefs]
other_links = [line.attrs["href"] for line in other_hrefs]

The title of each article can also be read from the a nodes, simply via a.text, yet another problem appeared. The most notorious problem in dealing with non-Latin-alphabet languages, especially Chinese, is encoding. When applying a.text, the characters did not display properly. I was fortunate enough to have recently learnt that the encoding of an html page can be seen in the header of the page. I checked it and found that the page is encoded in gb2312, while requests had decoded the response as Latin-1, which is what garbled the characters. The fix is to encode the garbled string back to Latin-1, recovering the original bytes, and then decode those bytes as gb2312; in practice gbk, a superset of gb2312, works better for decoding.

titles = [href.text.encode("latin1").decode("gbk") for href in hrefs]
other_titles = [line.text.encode("latin1").decode("gbk") for line in other_hrefs]
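
As an aside, a simpler route (not the one I took above, but one that requests supports) is to declare the page encoding before reading res.text; then a.text comes out correctly and no Latin-1/gbk round trip is needed. A minimal sketch, reusing base_url and href_re from above:

# Alternative sketch: tell requests the encoding up front
homepage_res = requests.get(base_url)
homepage_res.encoding = "gbk"  # or homepage_res.apparent_encoding
homepage_soup = BeautifulSoup(homepage_res.text, "html.parser")
titles = [a.text for a in homepage_soup.find_all("a", {"href": href_re})]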

Having got all the urls and corresponding titles, I can proceed to the next step: scraping the articles themselves. A brief inspection of an article page shows that the contents of the article are embedded in a p node with the style line-height: 150%. A further inspection of other pages shows that the line-height can also be 130%, so another regex is needed here.

ps_re = re.compile(r"line-height: 1\d0%")

Get the texts

Putting all the pieces together, I wrote several functions to make the process modular and easy to understand.

The get_soup function requests the page at base_url + url and returns a parsed BeautifulSoup object.

def get_soup(base_url, url):
    res = requests.get(base_url + url)
    soup = BeautifulSoup(res.text, "html.parser")
    return soup

The get_ps function accepts the soup object and outputs the p nodes, which contain the texts.

def get_ps(soup):
    ps_re = re.compile(r"line-height: 1\d0%")
    ps = soup.find_all("p", {"style":ps_re})
    return ps

The clean_text function accepts the p nodes and outputs the cleaned text.

def clean_text(texts):
    # Repair the mis-decoded characters: back to Latin-1 bytes, then decode as gbk
    texts_decoded = [text.encode("latin1", "ignore").decode("gbk", "ignore") for text in texts]
    texts_decoded = [text.strip() for text in texts_decoded]
    # Drop lines that are empty after stripping
    cleaned_texts = [text for text in texts_decoded if text != ""]
    return cleaned_texts

The write_text function writes the text data in txt format to a file named after the title of the article.

def write_text(clean_text, titles, n):
    with open("luxun/" + titles[n].strip() + ".txt", "w", encoding="utf8") as file:
        file.write("\n".join(clean_text))
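
One practical note: write_text assumes that the luxun/ folder already exists. A small guard (using the same folder name as above) avoids a FileNotFoundError on the first run:

import os

# Create the output folder if it does not exist yet
os.makedirs("luxun", exist_ok=True)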

To wrap all these functions together, I wrote a main function which does everything at once.

import time

def main(links, titles):
    for i in range(len(links)):
        soup = get_soup(base_url, links[i])
        ps = get_ps(soup)
        # The article body sits in the first matching p node; split it into lines
        texts = ps[0].text.split("\n")
        cleaned_texts = clean_text(texts)
        write_text(cleaned_texts, titles, i)
        # Pause between requests to be gentle on the site
        time.sleep(3)

To avoid sending too much traffic to the site, I used the time.sleep function to pause for three seconds between urls.

Running the main function, I got all the articles posted on 子夜星网 by Luxun in one folder.
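
For completeness, here is a minimal sketch of how the whole thing can be driven; calling main once per pair of link/title lists is my own assumption about handling the two sets, and the names are the ones defined above:

# Run the scraper over both the regular and the idiosyncratic links
main(links, titles)
main(other_links, other_titles)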

Key points

There are a few traps in this toy project, some of which are interesting. I list the key points below.

  • Use a regex to capture the pattern of the desired urls;
  • If a regex cannot exhaust the pattern, write another one;
  • Look for encoding schemes in the header of html files;
  • Latin-1 works great for Chinese characters!
    • Use text.encode('latin1').decode('gbk').