- Objective: This course covers web search engines and their foundations, principles, and components.
- Expectation: The class consists of lectures on search engine technologies, supplemented by paper reading assignments. Programming assignments will reinforce the theoretical aspects of information retrieval. Please understand that this is a graduate-level and upper-division engineering course. As such, students are expected to devote a large amount of time to the programming assignments and course project.
- Instructor: John J. Tran
- Lecture: Saturday 1:20 PM - 5:00 PM @ ET A309
- Office Hours: Saturday 5:15 PM - 6:15 PM or by appointment
- Textbook: Search Engines: Information Retrieval in Practice by Bruce Croft, Donald Metzler, and Trevor Strohman. ISBN: 978-0136072249. Available from Amazon or the CSULA bookstore.
- Quizzes (4) - 20 points
- Homework (4) - 60 points
- Project - 15 points
- Class Participation - 5 points
Final Project: The homework assignments culminate in a final project: a fully functional search engine. Students will work in teams or individually, applying the search engine techniques learned during the course (e.g., ranking, crawling, content analysis and detection, and query models). Successful completion of the course project is a requirement for passing this course.
The schedule below is tentative and is subject to change.
- 1/10/2015 Reading Assignment: "The anatomy of a large-scale hypertextual Web search engine" by S. Brin and L. Page. We will have a quiz on this paper on 1/17. Please be prepared to answer specific questions.
- 1/17/2015 Introduction to information retrieval, source code management (GitHub), software and applications (Java, Eclipse, and Maven), architecture of a search engine
- 1/24/2015 Architecture of a search engine (cont.), building a simple search engine in practice [Quiz 1]
- 1/31/2015 Crawls and feeds
- 2/7/2015 Reading Assignment: "Crawling the Hidden Web" by S. Raghavan and H. Garcia-Molina. We will have a quiz on this paper on 2/14. Please be prepared to answer specific questions.
- 2/14/2015 Data extraction and text processing [Quiz 2]
- 2/21/2015 Ranking and indexing
- 2/28/2015 Query interfaces [Quiz 3]. Reading Assignment: "Hive: a warehousing solution over a map-reduce framework" by A. Thusoo et al. and "Building a distributed full-text index for the web" by S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. We will have a quiz on these papers on 3/14. Please be prepared to answer specific questions.
- 3/14/2015 Building a complete search engine using open source software [Quiz 4]
- Final Presentation
Throughout the course, we will read and discuss a number of papers. Students are expected to read each paper before class and take part in the discussion.
- Abiteboul, S., M. Preda, and G. Cobena. Adaptive on-line page importance computation. Proceedings of the 12th International Conference on World Wide Web, pp. 280-290, Budapest, Hungary, 2003.
- Ben-Yitzhak et al. Beyond basic faceted search. Proceedings of the International Conference on Web Search and Web Data Mining, pp. 33-44, 2008.
- Brin, S. and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems. Vol. 30, No. 1-7, pp. 107-117, 1998.
- Chakrabarti, S., B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Hypersearching the Web. Scientific American, June 1999.
- Clough, P. Extracting metadata for spatially-aware information retrieval on the internet. Proceedings of the Workshop on Geographic Information Retrieval, pp. 25-30, 2005.
- DeCandia, G., D. Hastorun, et al. Dynamo: Amazon's Highly Available Key-Value Store. Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), pp. 205-220, 2007.
- Hogan, A., A. Harth, J. Umbrich, and S. Decker. Towards a scalable search and query engine for the web. Proceedings of the 16th International Conference on the World Wide Web, 2007.
- Kleinberg, J. Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.
- Maarek, Y. S. and F. Z. Smadja. Full text indexing based on lexical relations. An application: software libraries. ACM SIGIR Forum, Vol. 23, Issue SI, pp. 198-206, 1989.
- Manku, G., A. Jain, and A. Das Sarma. Detecting near-duplicates for web crawling. WWW 2007, pp. 141-150, 2007.
- Melnik, S., S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. ACM Transactions on Information Systems, Vol. 19, No. 3, pp. 217-241, 2001.
- Raghavan, S. and H. Garcia-Molina. Crawling the Hidden Web. 27th International Conference on Very Large Data Bases (VLDB 2001).
- Shi, R., K. Maly, and M. Zubair. Automatic Metadata Discovery from Non-cooperative Digital Libraries. Proceedings of the IADIS International Conference on e-Society, 2003.
- Thusoo, A. et al. Hive: a warehousing solution over a map-reduce framework. Proceedings of VLDB '09, 2009.
- Wolf, J. L., M. S. Squillante, P. S. Yu, J. Sethuraman, and L. Ozsen. Optimal crawling strategies for web search engines. WWW 2002, pp. 136-147, 2002.