Crawler
information collection
Question: Crawl a page
- request the source file
- write a "letter" (build the HTTP request)
- send the letter and get the reply
- read the reply
- save the source file to the local file system
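The steps above can be sketched with Python's standard library; the URL, output path, and User-Agent string are placeholders, not part of the original notes.

```python
import urllib.request

def crawl_page(url, out_path):
    # "Write the letter": build an HTTP request for the source file
    req = urllib.request.Request(url, headers={"User-Agent": "tiny-crawler/0.1"})
    # "Send the letter and get the reply": perform the network round trip
    with urllib.request.urlopen(req) as resp:
        # "Read the reply": the response body is the page source
        html = resp.read()
    # Save the source file to the local file system
    with open(out_path, "wb") as f:
        f.write(html)
    return len(html)
```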
Question: What is the network process when you are crawling a webpage?
Question: What is HTML?
- a markup language (not a scripting language)
- uses a tree (the DOM) to represent a document
- a standard for rendering web pages
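The tree structure can be seen with Python's built-in `html.parser`; this small sketch prints each tag indented by its depth in the document tree.

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    # Record every opening tag, indented by its depth in the DOM tree
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

p = TreePrinter()
p.feed("<html><body><ul><li>a</li><li>b</li></ul></body></html>")
# p.lines now shows the nesting: html > body > ul > li, li
```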
Question: crawl all the news of a website
- start from the host's pages
- identify the list page: the easiest place to find the newest news
- identify the links to individual news articles
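Extracting the news links from a list page can be sketched with `html.parser`; the `/news/` URL pattern is a made-up example, since the real pattern is site-specific.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Collect href targets from <a> tags that match a site-specific predicate
    def __init__(self, is_news=lambda url: True):
        super().__init__()
        self.is_news = is_news
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and self.is_news(value):
                    self.links.append(value)

c = LinkCollector(is_news=lambda u: "/news/" in u)
c.feed('<ul><li><a href="/news/1">a</a></li><li><a href="/about">x</a></li></ul>')
# c.links keeps only the news-article URLs
```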
Question: crawl from more websites
- build a list crawler for every host
- build a scheduler that runs the corresponding list crawler for each host's pages
Question: What if crawlers sit idle (are wasted)?
- build a crawler pool
- the scheduler hooks an idle crawler up to run a task
- give each task: taskID, priority, type (list, page), state, link, available time (the scheduled time to run), start time, end time
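The task fields above can be modeled as a small record type; this dataclass is one possible shape, with field names adapted from the list.

```python
from dataclasses import dataclass

@dataclass
class Task:
    task_id: int
    link: str
    type: str                    # "list" or "page"
    state: str = "new"           # new -> working -> done
    priority: int = 0
    available_time: float = 0.0  # earliest time the task may run
    start_time: float = 0.0
    end_time: float = 0.0
```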
Question: Design Scheduler with lock
- use a lock
- when there is no new task, wait or sleep for n seconds, then check again
- lock the taskTable to change a task's state or claim a task to run
while (true) {
    Lock(taskTable)
    // no task needs to run in taskTable
    if taskTable.FindOne("state=new") == nullptr {
        Release(taskTable)
        Sleep(1s)
    } else {
        // a task needs to run
        task = taskTable.FindOne("state=new")
        task.state = "working"
        Release(taskTable)
        // fetch the crawl result
        res = Crawl(task.url)
        if task.type == "list" {
            // res is a list of news links
            Lock(taskTable)
            for newTask in res {
                taskTable.Add(newTask)
            }
            task.state = "done"
            Release(taskTable)
        } else {
            // res is a news page
            Lock(pageTable)
            pageTable.Add(res)
            Release(pageTable)
            Lock(taskTable)
            task.state = "done"
            Release(taskTable)
        }
    }
}
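The loop above can be made runnable with `threading.Lock`; here taskTable is a list of dicts, `crawl()` is a stub, and the deadline parameter exists only so the sketch terminates.

```python
import threading
import time

# Shared tables, one lock per table
task_table = [{"state": "new", "type": "list", "url": "host/list"}]
page_table = []
table_lock = threading.Lock()
page_lock = threading.Lock()

def crawl(url):
    # Stub: a "list" crawl yields two page tasks, a "page" crawl yields content
    if url.endswith("list"):
        return [{"state": "new", "type": "page", "url": url + "/a"},
                {"state": "new", "type": "page", "url": url + "/b"}]
    return "content of " + url

def worker(deadline):
    while time.time() < deadline:
        with table_lock:                      # Lock(taskTable)
            task = next((t for t in task_table if t["state"] == "new"), None)
            if task is not None:
                task["state"] = "working"     # claim the task atomically
        if task is None:
            time.sleep(0.01)                  # no new task: sleep, look again
            continue
        res = crawl(task["url"])
        if task["type"] == "list":
            with table_lock:
                task_table.extend(res)        # add newly found page tasks
                task["state"] = "done"
        else:
            with page_lock:
                page_table.append(res)        # store the fetched page
            with table_lock:
                task["state"] = "done"
```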
Question: Design scheduler with condition variable
- when the condition cannot be met (no new task), set an "alarm clock" and sleep
- once a new task appears, ring the alarm: signal the condition variable to wake a sleeping thread
while (true) {
    Lock(taskTable)
    // condition not met (no new task): sleep on the condition variable
    while taskTable.FindOne("state=new") == nullptr {
        Cond_Wait(cond, taskTable)
    }
    task = taskTable.FindOne("state=new")
    task.state = "working"
    Release(taskTable)
    res = Crawl(task.url)
    if task.type == "list" {
        Lock(taskTable)
        for newTask in res {
            taskTable.Add(newTask)
            // a new task appeared: ring the alarm, wake a waiting thread
            Cond_Signal(cond)
        }
        task.state = "done"
        Release(taskTable)
    } else {
        Lock(pageTable)
        pageTable.Add(res)
        Release(pageTable)
        Lock(taskTable)
        task.state = "done"
        Release(taskTable)
    }
}
// implementation of Cond_Wait
Cond_Wait(cond, mutex) {
    Lock(cond.threadWaitList)
    cond.threadWaitList.Add(this.thread)
    Release(cond.threadWaitList)
    Release(mutex)       // release the table lock before blocking
    Block(this.thread)   // sleep until signaled
    Lock(mutex)          // re-acquire the table lock after waking up
}
// implementation of Cond_Signal
Cond_Signal(cond) {
    Lock(cond.threadWaitList)
    if cond.threadWaitList.size() > 0 {
        thread = cond.threadWaitList.Pop()
        Wakeup(thread)
    }
    Release(cond.threadWaitList)
}
Question: Design scheduler with semaphore
- the semaphore counts the new tasks available; a negative value means threads are waiting
- Wait(semaphore): a worker consumes a task; if none is available, the thread blocks and is recorded as waiting
- Signal(semaphore): a new task was added; if a thread is waiting, wake it up
while (true) {
    // consumer side: if numberOfNewTask <= 0, block the current thread
    Wait(numberOfNewTask)
    Lock(taskTable)
    task = taskTable.FindOne("state=new")
    task.state = "working"
    Release(taskTable)
    page = Crawl(task.url)
    if task.type == "list" {
        Lock(taskTable)
        for newTask in page {
            taskTable.Add(newTask)
            // producer side: one more new task is available; wake a waiter if any
            Signal(numberOfNewTask)
        }
        task.state = "done"
        Release(taskTable)
    } else {
        Lock(pageTable)
        pageTable.Add(page)
        Release(pageTable)
        Lock(taskTable)
        task.state = "done"
        Release(taskTable)
    }
}
// Wait: add a consumer
// if there is no product, block the current consumer
// if there is a product, take it: product count -= 1
Wait(semaphore) {
    Lock(semaphore)
    semaphore.value -= 1
    if semaphore.value < 0 {
        semaphore.processWaitList.Add(this.process)
        Release(semaphore)
        Block(this.process)
    } else {
        Release(semaphore)
    }
}
// Signal: add a product
// if a consumer is waiting, wake it up
// otherwise just product count += 1
Signal(semaphore) {
    Lock(semaphore)
    semaphore.value += 1
    if semaphore.value <= 0 {
        process = semaphore.processWaitList.Pop()
        Wakeup(process)
    }
    Release(semaphore)
}
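Python's `threading.Semaphore` implements exactly this Wait/Signal pair (as `acquire`/`release`). A sketch of the scheduler with it; the task-count parameter `n` is an assumption added so the demo terminates.

```python
import threading

task_table = [{"state": "new", "type": "list", "url": "host/list"}]
page_table = []
table_lock = threading.Lock()
page_lock = threading.Lock()
new_tasks = threading.Semaphore(1)   # counts "new" tasks; one is preloaded

def crawl(url):
    # Stub: a "list" crawl yields two page tasks, a "page" crawl yields content
    if url.endswith("list"):
        return [{"state": "new", "type": "page", "url": url + "/a"},
                {"state": "new", "type": "page", "url": url + "/b"}]
    return "content of " + url

def worker(n):
    # Process exactly n tasks so the sketch terminates; a real worker loops forever
    for _ in range(n):
        new_tasks.acquire()          # Wait(): block until a new task exists
        with table_lock:
            task = next(t for t in task_table if t["state"] == "new")
            task["state"] = "working"
        res = crawl(task["url"])
        if task["type"] == "list":
            with table_lock:
                task_table.extend(res)
                task["state"] = "done"
            for _ in res:
                new_tasks.release()  # Signal(): one new task per discovered link
        else:
            with page_lock:
                page_table.append(res)
            with table_lock:
                task["state"] = "done"
```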
Question: Design the fastest thread-safe consumers and producers
LMAX Disruptor
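The Disruptor's core idea is a pre-allocated ring buffer where the producer and consumer each own a monotonically increasing sequence number, so a single-producer/single-consumer pair needs no lock. A toy Python sketch of just that idea (the real Disruptor is a Java library and adds batching, wait strategies, and cache-line padding):

```python
class RingBuffer:
    # Toy single-producer/single-consumer ring buffer in the spirit of the
    # LMAX Disruptor: fixed-size slots plus two monotonic sequence counters.
    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.head = 0   # next sequence the producer will write
        self.tail = 0   # next sequence the consumer will read

    def push(self, item):
        while self.head - self.tail >= self.size:
            pass        # buffer full: spin until the consumer catches up
        self.slots[self.head % self.size] = item
        self.head += 1  # publish: the consumer may now read this slot

    def pop(self):
        while self.tail >= self.head:
            pass        # buffer empty: spin until the producer publishes
        item = self.slots[self.tail % self.size]
        self.tail += 1
        return item
```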
Question: Distribute crawlers in multiple machines
- one connector per crawler, or
- one sender/receiver shared by many crawlers
Question: work with Database
Tiny URL
long URL -> short URL
How to?
- Scenario:
    - func insert(longURL) -> shortURL: used by 1% of users
    - func lookup(shortURL) -> longURL: used by 100% of users
- Necessary (estimate, assuming 1M daily active users):
    - insert: daily active users * 1% * uses per user (10)
        - per day: 1M users -> 0.1M inserts
        - per year: 36.5M
        - per second: < 1000
    - lookup: daily active users * 100% * uses per user (3)
        - per day: 1M users -> 3M lookups
        - per second: < 1000
- Algorithm
    - encode / decode
    - counter: map each longURL to an incrementing integer (0 -> 999999)
    - convert the integer to base 62 using digits 0-9, a-z, A-Z: ConvertTo62
- Storage
    - average longURL size: 100B
    - average shortURL size: 4B
    - state: 4B
    - new URLs per day: 0.1M * 108B ≈ 10MB
    - new URLs per year: ≈ 3.65GB, small enough to keep in memory
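ConvertTo62 can be sketched as below; the alphabet ordering (digits, then lowercase, then uppercase) is an assumption, since any fixed ordering of the 62 characters works.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def convert_to_62(n):
    # Repeatedly divide by 62, collecting remainders as base-62 digits
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

def convert_from_62(s):
    # Inverse: fold each character's index back into the integer
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n
```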
Question: follow-ups
- support random codes: Random(0, range)
- avoid conflicts: on a collision, try again; the mapping space is so large that repeated conflicts are unlikely
- time-limited service: add an expire time / state field
- caching: reload on a miss / a replacement (eviction) policy
- what about 1M requests per second:
    - if only a short burst: use a queue + async processing
    - rate-limit and refuse some users
    - more servers behind a load balancer
    - keep everything in memory, trading some durability for performance: write to memory first, flush to disk later
    - any single server (receptionist) can die -> add more servers
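The random-code-with-retry idea can be sketched as follows; the in-memory dict stands in for the real store, and the 6-character length is an assumption.

```python
import random

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
mapping = {}   # shortURL -> longURL; stands in for the real storage layer

def insert(long_url, length=6):
    # Draw random codes until one is free; with 62^6 ≈ 5.7e10 possible codes,
    # hitting the same code twice in a row is extremely unlikely at this scale.
    while True:
        short = "".join(random.choice(ALPHABET) for _ in range(length))
        if short not in mapping:   # conflict: loop and try again
            mapping[short] = long_url
            return short

def lookup(short_url):
    return mapping.get(short_url)
```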
Question: How to support analysis?
- add a log entry for every request
- including: date, time, host, IP, location, shortURL, browser, platform
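One request's log entry could be emitted as a structured (JSON) line for later analysis; the function name and field values here are illustrative.

```python
import json
import time

def log_request(short_url, host, ip, location, browser, platform):
    # One structured log line per request; fields match the list above
    record = {
        "date": time.strftime("%Y-%m-%d"),
        "time": time.strftime("%H:%M:%S"),
        "host": host,
        "ip": ip,
        "location": location,
        "shortURL": short_url,
        "browser": browser,
        "platform": platform,
    }
    return json.dumps(record)
```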