PhD Scholarship at Inria on Crawling Algorithms

PhD Opening at Inria Sophia Antipolis, France

https://www.inria.fr/

at Team NEO

https://team.inria.fr/neo/presentation/

under the supervision of Prof. K. Avrachenkov

e-mail: K.Avrachenkov@inria.fr

http://www-sop.inria.fr/members/Konstantin.Avratchenkov/me.html

The project is in the framework of the joint Inria - Qwant Search Engine Research Lab.

Topic: Adaptive crawling with machine learning techniques

Summary:

We shall consider the problem of web crawling with limited bandwidth and computational resources. Some web sites may be crawled too infrequently, leaving available resources underutilized, while others may be crawled too frequently, wasting resources. We aim to design an adaptive crawling algorithm based on machine learning techniques, such as clustering and reinforcement learning, that dynamically finds optimal crawling frequencies based on web site classification, behavior and changes.
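
To make the trade-off concrete, here is a minimal sketch of a non-adaptive baseline, not the method to be developed in the project. It assumes that each site changes according to a Poisson process with a known rate and is crawled at regular intervals, and it splits a fixed crawl budget in proportion to those rates. The site names, rates, budget and function names are hypothetical, chosen only for illustration.

```python
import math


def proportional_allocation(change_rates, budget):
    """Split a total crawl budget (crawls per day) across sites
    in proportion to their assumed change rates (changes per day)."""
    total = sum(change_rates.values())
    return {site: budget * rate / total for site, rate in change_rates.items()}


def expected_freshness(change_rate, crawl_freq):
    """Time-averaged probability that the stored copy is up to date,
    assuming Poisson changes and evenly spaced crawls."""
    if crawl_freq <= 0:
        return 0.0  # never crawled: the copy is assumed stale
    if change_rate <= 0:
        return 1.0  # never changes: the copy is always fresh
    x = change_rate / crawl_freq  # expected changes between two crawls
    return (1.0 - math.exp(-x)) / x


# Hypothetical sites and rates, for illustration only.
change_rates = {"news.example": 24.0, "blog.example": 1.0, "docs.example": 0.05}
budget = 10.0  # total crawls per day

freqs = proportional_allocation(change_rates, budget)
for site, freq in freqs.items():
    fresh = expected_freshness(change_rates[site], freq)
    print(f"{site}: {freq:.3f} crawls/day, expected freshness {fresh:.3f}")
```

Under these assumptions, a proportional split gives every site the same expected freshness, whatever its importance. An adaptive algorithm of the kind described above would instead have to estimate the change rates from observed crawls and revise the allocation over time.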

Related references:

Lefortier, D., Ostroumova, L., Samosvat, E. and Serdyukov, P., 
"Timely crawling of high-quality ephemeral new content". 
In Proceedings of the 22nd ACM international conference on Information & Knowledge Management 
(pp. 745-750). 2013.

Faheem, M., and Senellart, P.,
"Adaptive Web Crawling Through Structure-Based Link Classification". 
In Proceedings of ICADL (pp. 39-51). 2015.

Avrachenkov, K., and Borkar, V.,
"Whittle Index Policy for Crawling Ephemeral Content".
To appear in IEEE Transactions on Control of Network Systems,
https://arxiv.org/pdf/1503.08558.pdf

Required skills: Solid knowledge of mathematics, in particular probability and statistics. Experience in machine learning or control theory is a plus; knowledge of Python is another plus.

Application: Please apply with a CV, two reference letters and an academic transcript.

Job location: 
Inria Sophia Antipolis
2004 Route des Lucioles
06902 Sophia Antipolis
France
Contact and application information
Deadline: Thursday, May 31, 2018
Contact name: Konstantin Avrachenkov