
COMP4650 Document Analysis

Later Year Course

Offered By Research School of Computer Science
Academic Career Undergraduate
Course Subject Computer Science
Offered in Second Semester, 2011 and Second Semester, 2012
Unit Value 6 units
Course Description

Processing of semi-structured documents such as internet pages, RSS feeds and their accompanying news items, and PDF brochures is considered from the perspective of interpreting their content. This course considers the "document" and its various genres as a fundamental object for business, government and community. To that end, the course covers four broad areas: (A) information retrieval, (B) natural language processing, (C) machine learning for documents, and (D) relevant tools for the Web. Basic tasks covered include content collection and extraction, formal and informal natural language processing, information extraction, information retrieval, classification and analysis. Fundamental probabilistic techniques for performing these tasks, and some common software systems, will be covered, though no area will be covered in depth.

Learning Outcomes

Upon successful completion of the course, the student will have an understanding of the role documents play in business and community, and the various digital resources available for document analysis. Moreover, the student will have the background theory and practical knowledge necessary to plan and execute a basic document analysis project. The student will be able to:

  • Understand the basic requirements digital libraries and business processes have with respect to documents. Obtain documents from various sources and transform them into a common XML or RDF format using SAX and XPath.
  • Understand the genres of documents available from the internet, such as RSS feeds, social networks, blogs, wikis and archives, and the role they play in the internet ecosystem. Understand the linguistic and semantic resources available from the internet and the so-called "web of data", such as dictionaries, repositories and ontologies.
  • Understand basic probabilistic theories of language and document structure, and the basic algorithms and software available for them, and be able to use some common libraries for natural language processing to perform basic analysis tasks.
  • Understand basic probabilistic theories of information retrieval, and be able to index a document collection for use in an information retrieval system. Understand basic theories and algorithms for large-scale named-entity matching and standardization of names within a collection.
  • Understand basic probabilistic theories of classification, clustering, and document feature "engineering", and be able to perform automated classification.
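Two of the outcomes above, extracting content from semi-structured documents with XPath and weighting terms for retrieval, can be sketched briefly in Python. The RSS-like snippet, the toy corpus, and the use of the standard tf * log(N/df) weighting are illustrative assumptions, not course materials:

```python
import math
import xml.etree.ElementTree as ET

# A toy RSS-like document; the structure here is an illustrative assumption.
XML = """
<rss>
  <channel>
    <item><title>Market news</title></item>
    <item><title>Weather report</title></item>
  </channel>
</rss>
"""

# Extract item titles using ElementTree's limited XPath support.
root = ET.fromstring(XML)
titles = [t.text for t in root.findall(".//item/title")]

# A minimal tf-idf weighting over a toy corpus: raw term count as tf,
# multiplied by log(N / df), where df is the term's document frequency.
corpus = [doc.lower().split() for doc in titles + ["market weather summary"]]
N = len(corpus)

def tf_idf(term, doc):
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(N / df) if df else 0.0

print(titles)
print(tf_idf("market", corpus[0]))
```

A full information retrieval system would build an inverted index and normalise document vectors, but the term-weighting step above is the probabilistic core that the course's IR outcome refers to.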
Indicative Assessment

Two written assignments with programming option (40%), written final exam (60%).

Workload

Thirty one-hour lectures and six two-hour tutorial/laboratory sessions.

Requisite Statement

12 units of 3000 series IT courses including COMP3410 or COMP3420 and 6 units of MATH/STAT courses or COMP2600.

Recommended Courses

See requisite statement.

Prescribed Texts

The following reference books will be used.

  • Introduction to Information Retrieval, C.D. Manning, P. Raghavan and H. Schütze, Cambridge University Press, 2008.
  • Foundations of Statistical Natural Language Processing, C.D. Manning and H. Schütze, MIT Press, 1999.
Programs Bachelor of Information Technology (Honours)
Science Group C
Academic Contact wray.buntine@nicta.com.au

The information published on the Study at ANU 2011 website applies to the 2011 academic year only. All information provided on this website replaces the information contained in the Study at ANU 2010 website.

Updated:   13 Nov 2015 / Responsible Officer:   The Registrar / Page Contact:   Student Business Solutions