PhD Scholarship at Inria on Crawling Algorithms
PhD Opening at Inria Sophia Antipolis, France
at Team NEO
under the supervision of Prof. K. Avrachenkov
The project is in the framework of the joint Inria - Qwant Search Engine Research Lab.
Topic: Adaptive crawling with machine learning techniques
We shall consider the problem of web crawling with limited bandwidth and computational resources. Some web sites may be crawled too infrequently, leaving resources underutilized, while others may be crawled too frequently, wasting resources. We aim to design an adaptive crawling algorithm, based on machine learning techniques such as clustering and reinforcement learning, that dynamically finds optimal crawling frequencies based on web site classification, behavior, and changes.
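To make the problem concrete, here is a minimal sketch of a naive baseline, not the algorithm the project intends to develop. It assumes each site changes according to a Poisson process, estimates the change rate from past visits, and splits a fixed daily crawl budget in proportion to those estimates. The function names, site names, and numbers are hypothetical and chosen only for illustration.

```python
import math


def estimate_change_rate(num_visits, num_changes_seen, interval_days):
    """Estimate a site's change rate (changes per day).

    Assumes Poisson changes and equally spaced visits: the probability of
    seeing at least one change between two visits is 1 - exp(-rate * interval),
    so rate = -ln(1 - p) / interval, with p the observed fraction of visits
    on which a change was detected.
    """
    if num_visits <= 0 or interval_days <= 0:
        raise ValueError("num_visits and interval_days must be positive")
    if num_changes_seen == 0:
        return 0.0
    # Clamp p below 1 so a site that changed on every visit does not give log(0).
    p = min(num_changes_seen / num_visits, 1.0 - 1.0 / (2 * num_visits))
    return -math.log(1.0 - p) / interval_days


def allocate_budget(rates, total_crawls_per_day):
    """Split the crawl budget across sites in proportion to estimated rates."""
    total_rate = sum(rates.values())
    if total_rate == 0:
        # No observed changes anywhere: fall back to a uniform split.
        share = total_crawls_per_day / len(rates)
        return {site: share for site in rates}
    return {
        site: total_crawls_per_day * rate / total_rate
        for site, rate in rates.items()
    }


# Hypothetical crawl history: (visits, visits with a detected change, days between visits).
history = {
    "news.example": (30, 27, 1.0),
    "blog.example": (30, 6, 1.0),
    "static.example": (30, 0, 1.0),
}
rates = {site: estimate_change_rate(*h) for site, h in history.items()}
plan = allocate_budget(rates, total_crawls_per_day=100.0)
```

Proportional allocation is only a starting point: it ignores site classification and does not adapt online, which is where clustering and reinforcement learning would come in.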
References:
Lefortier, D., Ostroumova, L., Samosvat, E., and Serdyukov, P.,
"Timely crawling of high-quality ephemeral new content".
In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management
(pp. 745-750), 2013.
Faheem, M., and Senellart, P.,
"Adaptive Web Crawling Through Structure-Based Link Classification".
In Proceedings of ICADL (pp. 39-51), 2015.
Avrachenkov, K., and Borkar, V.,
"Whittle Index Policy for Crawling Ephemeral Content".
To appear in IEEE Trans. on Control of Network Systems.
Required skills: Solid knowledge of mathematics, in particular
probability and statistics; experience in machine learning or control
theory is a plus; knowledge of Python is another plus.
Application: Please apply with a CV, two reference letters, and an academic transcript.