Python Web Scraping with Multiple Processes - CJavaPy

1. Python multiprocessing

2. Python multi-process crawler

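A common way to speed up a Python program is to split the work across multiple processes. The multiprocessing module supports this kind of parallel programming, and it can be used to write a multi-process crawler that fetches several pages at the same time. The drawback is that every process has its own memory space, so memory usage grows with the number of processes and the amount of data each one holds.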

1) Multiprocessing usage example

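#!/usr/bin/python

from multiprocessing import Process, Semaphore, Lock, Queue
import time
from random import random

ROUNDS = 5  # number of items the producer generates before exiting


class Consumer(Process):
    def __init__(self, buffer, empty, full, lock, rounds):
        super().__init__()
        self.buffer, self.empty, self.full, self.lock = buffer, empty, full, lock
        self.rounds = rounds

    def run(self):
        for _ in range(self.rounds):
            self.full.acquire()   # wait until an item is available
            with self.lock:       # only one process touches the buffer at a time
                print('Consumer get', self.buffer.get())
                time.sleep(1)
            self.empty.release()  # tell the producer a slot is free


class Producer(Process):
    def __init__(self, buffer, empty, full, lock, rounds):
        super().__init__()
        self.buffer, self.empty, self.full, self.lock = buffer, empty, full, lock
        self.rounds = rounds

    def run(self):
        for _ in range(self.rounds):
            self.empty.acquire()  # wait until a slot is free
            with self.lock:
                num = random()
                print('Producer put ', num)
                self.buffer.put(num)
                time.sleep(1)
            self.full.release()   # tell the consumer an item is available


if __name__ == '__main__':
    # Create the shared objects here and pass them in explicitly, so the
    # example also works when child processes are spawned rather than forked.
    buffer = Queue(10)
    buffer.put('init')
    empty = Semaphore(0)  # free slots: none until the seed item is consumed
    full = Semaphore(1)   # available items: the seed item 'init'
    lock = Lock()

    p = Producer(buffer, empty, full, lock, ROUNDS)
    # The consumer reads the seed item plus the ROUNDS produced items.
    c = Consumer(buffer, empty, full, lock, ROUNDS + 1)
    p.daemon = c.daemon = True
    p.start()
    c.start()
    p.join()
    c.join()
    print('Finished')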
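As a point of comparison, a bounded multiprocessing Queue already blocks on put() when it is full and on get() when it is empty, so a simpler variant can drop the semaphores and the lock. The sketch below is an alternative, not the method used in the example above. The names `SENTINEL`, `produce`, and `consume` are hypothetical, and the consumer simply doubles each number to show some work being done.

```python
from multiprocessing import Process, Queue

SENTINEL = None  # hypothetical end-of-stream marker


def produce(buffer, items):
    for item in items:
        buffer.put(item)   # blocks while the queue is full
    buffer.put(SENTINEL)   # tell the consumer there is nothing more to read


def consume(buffer, results):
    while True:
        item = buffer.get()  # blocks while the queue is empty
        if item is SENTINEL:
            break
        results.put(item * 2)
    results.put(SENTINEL)


if __name__ == '__main__':
    buffer = Queue(2)  # at most 2 items waiting at any time
    results = Queue()
    p = Process(target=produce, args=(buffer, [1, 2, 3]))
    c = Process(target=consume, args=(buffer, results))
    p.start()
    c.start()
    # Read results until the consumer signals that it is done.
    for doubled in iter(results.get, SENTINEL):
        print('Result', doubled)
    p.join()
    c.join()
```

The sentinel gives both processes a clean way to stop, which the explicit semaphore version has to handle by counting rounds.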

2) Multi-process crawler

from multiprocessing import Pool
import requests
from requests.exceptions import RequestException


def scrape(url):
    try:
        # The timeout keeps one slow site from blocking a worker indefinitely.
        print(requests.get(url, timeout=10))
    except RequestException:
        # Covers connection errors and timeouts alike.
        print('Error Occurred ', url)
    finally:
        print('URL ', url, ' Processed')


if __name__ == '__main__':
    urls = [
        'https://www.baidu.com',
        'https://www.meituan.com/',
        'https://blog.csdn.net/',
        'https://www.zhihu.com'
    ]
    # Create a Pool with 3 worker processes. If processes is omitted,
    # the pool size defaults to the number of CPU cores.
    with Pool(processes=3) as pool:
        # map iterates over the URLs and calls scrape on each one.
        pool.map(scrape, urls)
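Pool.map also collects whatever the worker function returns, as a list in the same order as the input, which is usually more useful in a crawler than printing inside the worker. The sketch below shows this with a stand-in function: `host_of` is a hypothetical helper that only parses the URL, so the example runs without network access. A real crawler would return the response status or the parsed page instead.

```python
from multiprocessing import Pool
from urllib.parse import urlparse


def host_of(url):
    """Stand-in for a real fetch: return the host name without any network access."""
    return urlparse(url).netloc


if __name__ == '__main__':
    urls = [
        'https://www.baidu.com',
        'https://www.meituan.com/',
        'https://blog.csdn.net/',
        'https://www.zhihu.com'
    ]
    with Pool(processes=3) as pool:
        hosts = pool.map(host_of, urls)  # results come back in input order
    for url, host in zip(urls, hosts):
        print(url, '->', host)
```

Returning values keeps the worker free of side effects and lets the parent process decide how to store or report the results.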
