Crawler

information collection

Question: Crawl a page

  • request source file
  • Write a letter
  • Send the letter and get the reply
  • Read the reply
  • Save source file to local file system
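The letter metaphor above maps directly onto HTTP. A minimal sketch in Python; to keep it self-contained the network send/receive is stubbed out with a canned reply, and `write_letter` / `read_reply` are illustrative names, not a real API:

```python
# "Write a letter": compose a raw HTTP request by hand.
# "Read the reply": parse a raw HTTP response into status code and body.
# The actual send/receive over a TCP socket is stubbed out here.

def write_letter(host: str, path: str) -> bytes:
    """Compose a minimal HTTP/1.1 GET request."""
    return (f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            f"Connection: close\r\n\r\n").encode()

def read_reply(raw: bytes) -> tuple:
    """Split a raw HTTP response into (status code, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line = head.split(b"\r\n")[0]      # e.g. b"HTTP/1.1 200 OK"
    code = int(status_line.split()[1])
    return code, body.decode()

letter = write_letter("example.com", "/")
# In a real crawler the letter goes over a socket:
#   sock.sendall(letter); raw_reply = sock.recv(...)
raw_reply = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
code, source = read_reply(raw_reply)
print(code, source)   # 200 <html>hi</html>
```

After `read_reply`, the `source` string is what gets saved to the local file system.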

Question: What is the network process when you are crawling a webpage?

Question: What is HTML?

  • a markup language (not a programming language)
  • uses a tree of elements to represent a page
  • a standard for rendering web pages

Question: crawl all the news of a website

  • start from the host page
  • identify the list page: the easiest place to find the newest news
  • identify the links to the news pages

Question: crawl from more websites

  • build a list crawler for every host
  • build a scheduler to run the corresponding list crawler for each host's pages

Question: How to avoid wasting crawlers?

  • build a crawler pool
  • the scheduler hooks a crawler from the pool up to each task and runs it
  • record per task: taskID, crawler priority, type (list, page), state, link, available time (the scheduled time to run), start time, end time

Question: Design a scheduler with a lock

  • use a lock
  • when there is no new task, wait or sleep for n seconds, then check again
  • lock taskTable to change a task's state or to pick a crawler task to run
while (true) {
    Lock(taskTable)
    task = taskTable.FindOne("state=new")
    // no task needs to run in taskTable
    if task == nullptr {
        Release(taskTable)
        Sleep(1s)
    // a task needs to run
    } else {
        task.state = "working"
        Release(taskTable)

        // fetch res of crawler
        res = Crawl(task.url)

        // if res is list of news
        if task.type == "list" {
            Lock(taskTable)
            for newTask in res {
                taskTable.Add(newTask)
            }
            task.state = "done"
            Release(taskTable)
        } else {
        // if res is page of news
            Lock(pageTable)
            pageTable.Add(res)
            Release(pageTable)

            Lock(taskTable)
            task.state = "done"
            Release(taskTable)
        }
    }
}
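The lock-based loop above translates nearly line for line into Python's `threading.Lock`. A minimal runnable sketch with an in-memory task table and a stubbed `crawl` (all names are illustrative; unlike the pseudocode, the workers stop once every task is done so the demo terminates):

```python
import threading, time

task_table = [{"type": "list", "url": "host/list", "state": "new"}]
page_table = []
table_lock = threading.Lock()

def crawl(task):
    """Stub crawler: a list page yields two page tasks."""
    if task["type"] == "list":
        return [{"type": "page", "url": f"host/news/{i}", "state": "new"}
                for i in (1, 2)]
    return f"content of {task['url']}"

def worker():
    while True:
        with table_lock:
            task = next((t for t in task_table if t["state"] == "new"), None)
            finished = task is None and all(t["state"] == "done" for t in task_table)
            if task is not None:
                task["state"] = "working"
        if finished:
            return                    # every task done: stop (pseudocode loops forever)
        if task is None:
            time.sleep(0.01)          # no runnable task yet: back off and poll again
            continue
        res = crawl(task)
        with table_lock:
            if task["type"] == "list":
                task_table.extend(res)    # enqueue discovered links as new tasks
            else:
                page_table.append(res)    # store the fetched page
            task["state"] = "done"

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(page_table))
```

The `time.sleep` is the polling cost of this design: idle workers burn wakeups even when nothing changed, which is what the condition-variable version below the next question removes.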

Question: Design a scheduler with a condition variable

  • when the condition cannot be met (no new task), sleep and set a clock to ring (wait on the condition variable)
  • once a new task appears, ring the clock (signal the condition variable) to wake a sleeping thread

  • sketch:

while true {
    Lock(taskTable)
    // the condition cannot be met (no new task): sleep until the clock rings
    while taskTable.Find("state=new") == nullptr {
        Cond_Wait(cond, taskTable)
    }
    task = taskTable.FindOne("state=new")
    task.state = "working"
    Release(taskTable)

    res = Crawl(task.url)

    if task.type == "list" {
        Lock(taskTable)
        for newTask in res {
            taskTable.Add(newTask)
            // a new task appeared: ring the clock (signal the condition variable)
            Cond_Signal(cond)
        }
        task.state = "done"
        Release(taskTable)
    } else {
        Lock(pageTable)
        pageTable.Add(res)
        Release(pageTable)

        Lock(taskTable)
        task.state = "done"
        Release(taskTable)
    }
}
// imp of cond_wait
Cond_Wait(cond, mutex) {
    Lock(cond.threadWaitList)
    cond.threadWaitList.Add(this.thread)
    Release(cond.threadWaitList)

    // note: a real implementation releases the mutex and blocks atomically;
    // otherwise a signal arriving between Release and Block is lost
    Release(mutex)
    Block(this.thread)
    Lock(mutex)
}
// imp of cond_signal
Cond_Signal(cond) {
    Lock(cond.threadWaitList)
    if cond.threadWaitList.size() > 0 {
        thread = cond.threadWaitList.Pop()
        Wakeup(thread)
    }
    Release(cond.threadWaitList)
}
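Python's `threading.Condition` bundles the mutex and the thread wait list, so the sketch above can be tried directly. As before, `crawl` is stubbed, the names are illustrative, and a `done` flag is added so the demo terminates (the pseudocode loops forever):

```python
import threading

task_table = [{"type": "list", "url": "host/list", "state": "new"}]
page_table = []
done = {"flag": False}                 # set once every task is finished
cond = threading.Condition()           # owns the lock protecting the tables

def crawl(task):
    """Stub crawler: a list page yields two page tasks."""
    if task["type"] == "list":
        return [{"type": "page", "url": f"host/news/{i}", "state": "new"}
                for i in (1, 2)]
    return f"content of {task['url']}"

def worker():
    while True:
        with cond:
            # sleep until a new task appears or everything is finished
            while not done["flag"] and \
                  not any(t["state"] == "new" for t in task_table):
                cond.wait()
            if done["flag"]:
                return
            task = next(t for t in task_table if t["state"] == "new")
            task["state"] = "working"
        res = crawl(task)
        with cond:
            if task["type"] == "list":
                task_table.extend(res)
                cond.notify(len(res))  # ring the clock: one waiter per new task
            else:
                page_table.append(res)
            task["state"] = "done"
            if all(t["state"] == "done" for t in task_table):
                done["flag"] = True    # shut down (the pseudocode loops forever)
                cond.notify_all()

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(page_table))
```

Note the `while` (not `if`) around `cond.wait()`: a woken thread must re-check the condition, because another worker may have taken the task first.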

Question: Design scheduler with semaphore

  • wait(semaphore): decrement the count; if it goes negative, block and record this thread as waiting
  • signal(semaphore): once a new task is added, increment the count and wake a waiting thread
while true {
    // try to add a consumer
    // if numberOfNewTask <= 0, block current thread
    Wait(numberOfNewTask)
    Lock(taskTable)
    task = taskTable.FindOne("state=new")
    task.state = "working"
    Release(taskTable)

    page = Crawl(task.url)

    if task.type == "list" {
        Lock(taskTable)
        for newTask in page {
            taskTable.Add(newTask)
            // produce a new task: if the count was <= 0,
            // a consumer is blocked waiting and gets woken up
            Signal(numberOfNewTask)
        }
        task.state = "done"
        Release(taskTable)
    } else {
        Lock(pageTable)
        pageTable.Add(page)
        Release(pageTable)
        Lock(taskTable)
        task.state = "done"
        Release(taskTable)
    }
}
// add a consumer
// if no product, block current consumer
// if has product, consumer take product, product count - 1
Wait(semaphore) {
    Lock(semaphore)
    semaphore.value -= 1
    if semaphore.value < 0 {
        semaphore.processWaitList.Add(this.process)
        Release(semaphore)
        Block(this.process)
    } else {
        Release(semaphore)
    }
}
// add a product
// if has a consumer is waiting, hooks it up
// if no one is waiting, just product count += 1 
Signal(semaphore) {
    Lock(semaphore)
    semaphore.value += 1
    if semaphore.value <= 0 {
        process = semaphore.processWaitList.Pop()
        Wakeup(process)
    }
    Release(semaphore)
}
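The same loop with a counting semaphore, using Python's `threading.Semaphore`. The `pending` counter and the extra releases at the end are only there to shut the demo down; the pseudocode loops forever. Names are illustrative:

```python
import threading

task_table = [{"type": "list", "url": "host/list", "state": "new"}]
page_table = []
table_lock = threading.Lock()
new_tasks = threading.Semaphore(1)    # counts new tasks; 1 for the seed task
pending = {"n": 1}                    # tasks not yet done, for shutdown only
NUM_WORKERS = 3

def crawl(task):
    """Stub crawler: a list page yields two page tasks."""
    if task["type"] == "list":
        return [{"type": "page", "url": f"host/news/{i}", "state": "new"}
                for i in (1, 2)]
    return f"content of {task['url']}"

def worker():
    while True:
        new_tasks.acquire()           # Wait(): blocks while the count is 0
        with table_lock:
            task = next((t for t in task_table if t["state"] == "new"), None)
            if task is None:          # drained: the extra releases mean "shut down"
                return
            task["state"] = "working"
        res = crawl(task)
        with table_lock:
            if task["type"] == "list":
                task_table.extend(res)
                pending["n"] += len(res)
                for _ in res:
                    new_tasks.release()   # Signal(): one per produced task
            else:
                page_table.append(res)
            task["state"] = "done"
            pending["n"] -= 1
            if pending["n"] == 0:     # all work finished: wake every worker
                for _ in range(NUM_WORKERS):
                    new_tasks.release()

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(page_table))
```

Compared with the condition variable, the semaphore itself remembers how many tasks were produced, so no wakeup is lost and no explicit re-check loop is needed.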

Question: Design the fastest thread-safe consumers and producers

LMAX Disruptor
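The Disruptor is a Java library; its core idea is a pre-allocated ring buffer indexed by ever-increasing sequence counters, so a single producer and a single consumer coordinate without locks. A toy Python sketch of just the sequencing idea (under the GIL this is not genuinely lock-free; it only illustrates the indexing, and all names are illustrative):

```python
import threading

SIZE = 8                              # ring capacity, a power of two
ring = [None] * SIZE
head = [0]                            # next sequence the producer will write
tail = [0]                            # next sequence the consumer will read

def produce(value):
    while head[0] - tail[0] >= SIZE:  # ring full: spin until a slot frees up
        pass
    ring[head[0] % SIZE] = value
    head[0] += 1                      # publish: the consumer may now read this slot

def consume():
    while tail[0] >= head[0]:         # ring empty: spin until data is published
        pass
    value = ring[tail[0] % SIZE]
    tail[0] += 1                      # free the slot for the producer
    return value

results = []
consumer = threading.Thread(
    target=lambda: results.extend(consume() for _ in range(100)))
consumer.start()
for i in range(100):
    produce(i)
consumer.join()
print(sum(results))                   # 4950
```

Only the producer writes `head` and only the consumer writes `tail`, which is why no lock is needed between them; the real Disruptor adds memory barriers, batching, and multi-consumer sequencing on top of this.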

Question: Distribute crawlers across multiple machines

  • one connector per crawler

  • one sender/receiver serving many crawlers

Question: Work with a database


Tiny URL

URL -> shorten URL

How to?

  • Scenario:
    • func insert(longURL) -> shortURL: 1%
    • func lookup(shortURL) -> longURL : 100%
  • Necessary capacity
    • insert: daily active users * 1% * uses per user (10)
      • per day: 1M users -> 0.1M inserts
      • per year: 36.5M
      • per second: < 1000 (about 1 QPS on average)
    • lookup: daily active users * 100% * uses per user (3)
      • per day: 3M lookups
      • per second: < 1000 (about 35 QPS on average)
  • Algorithm
    • decode/ encode
    • counter:
      • from String(longURL) to Int: a global counter (0 -> 999999)
      • from Int to 0-9, a-z, A-Z: ConvertTo62 (base-62 encoding)
  • Storage
    • average size of longURL: 100B
    • average size of shortURL: 4B
    • state: 4B
    • daily new URLs: 0.1M * 108B ≈ 10MB
    • yearly new URLs: ~3.65GB, small enough to keep in memory
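The ConvertTo62 step above encodes the integer counter over the 62 characters 0-9, a-z, A-Z. A minimal sketch (the alphabet order is a free choice):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def convert_to_62(n: int) -> str:
    """Encode a non-negative counter value as a base-62 short URL key."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, r = divmod(n, 62)          # peel off the lowest base-62 digit
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

def convert_from_62(s: str) -> int:
    """Decode a base-62 key back to the counter value."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

print(convert_to_62(999999))          # "4c91"
```

Four base-62 characters cover 62^4 ≈ 14.8M keys, which matches the ~4B shortURL size estimated above.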

Question:

  • Support random short URLs: Random(0, range)
  • Avoid conflicts: on collision, try again; the mapping space is large, so repeated collisions are rare
  • Implement a time-limited service: expire time / state field
  • How to cache: reload on miss / cache replacement policy
  • How about 1M requests per second:
    • if only for a short burst: use a queue + async processing
    • rate limit, refuse some users
    • more servers -> load balancer
    • keep everything in memory, trading some durability for performance: write to memory first, then to disk
    • a single receptionist (front server) can die -> more receptionists -> more servers
  • How to support analysis
    • add a log entry for every request
    • including: date, time, host, IP, location, shortURL, browser, platform
