COMP6490 Document Analysis

Offered By	Research School of Computer Science
Academic Career	Graduate Coursework
Course Subject	Computer Science
Offered in	Second Semester, 2012 and Second Semester, 2013
Unit Value	6 units
Course Description	Processing of semi-structured documents such as internet pages, RSS feeds and their accompanying news items, and PDF brochures is considered from the perspective of interpreting the content. This course considers the "document" and its various genres as a fundamental object for business, government and community. To this end, the course covers four broad areas: (A) information retrieval, (B) natural language processing, (C) machine learning for documents, and (D) relevant tools for the Web. Basic tasks covered include content collection and extraction, formal and informal natural language processing, information extraction, information retrieval, classification and analysis. Fundamental probabilistic techniques for performing these tasks and some common software systems will be covered, though no area will be covered in any depth.
Learning Outcomes	Upon successful completion of the course, the student will have an understanding of the role documents play in business and community, and the various digital resources available for document analysis. Moreover, the student will have the background theory and practical knowledge necessary to plan and execute a basic document analysis project. The student will be able to: Understand the basic requirements digital libraries and business processes have with respect to documents. Obtain documents from various sources and transform them into a common XML or RDF format with a knowledge of SAX and XPath. Understand the genres of documents available from the internet such as RSS feeds, social networks, blogs, wikis, archives, etc., and the role they play in the internet ecosystem. Understand the linguistic and semantic resources available from the internet and the so-called "web of data", such as dictionaries, repositories and ontologies. Understand basic probabilistic theories of language and document structure, and the basic algorithms and software available for them, and be able to use some common libraries for natural language processing to perform basic analysis tasks. Understand basic probabilistic theories of information retrieval, and be able to index a document collection for use in an information retrieval system. Understand basic theories and algorithms for large-scale named-entity matching and standardization of names within a collection. Understand basic probabilistic theories of classification, clustering, and document feature "engineering", and be able to perform automated classification.
Indicative Assessment	Two written assignments with programming option (40%), written final exam (60%).
Workload	Thirty one-hour lectures and six two-hour tutorial/laboratory sessions
Course Classification(s)	Advanced: Advanced courses are designed for students having reached 'first degree' level of assumed knowledge, which provide a deep understanding of contemporary issues; or 'second degree' and higher levels of knowledge; or for transition to research training programs.
Requisite Statement	None
Recommended Courses	Programming ability in C, C++ or Java, and basic mathematical and statistical knowledge, at undergraduate level
Prescribed Texts	The following reference books will be used. Introduction to Information Retrieval, C.D. Manning, P. Raghavan and H. Schütze, Cambridge University Press, 2008. Foundations of Statistical Natural Language Processing, C.D. Manning and H. Schütze, MIT Press, 1999.
Academic Contact	peter.christen@anu.edu.au

The information published on the Study at ANU 2012 website applies to the 2012 academic year only. All information provided on this website replaces the information contained in the Study at ANU 2011 website.