Project #54113 - Java Programming

Background

In this assignment you will implement a tool called Proffinder. This tool is a mini search engine that helps you find a professor who works in an area matching your interests. Proffinder works by downloading the webpage of a professor and determining how similar it is to a query that consists of keywords (such as {"programming", "software", "engineering"}).

In addition to practicing the concepts seen in class to date, this assignment will expose you to web scraping, to some of the main ideas in information retrieval, and to the research areas of the professors in the School of Computer Science.

In this assignment, documents will be modeled as arrays of Strings, where each String is a word in the document (with duplicates). However, we will remove stop words from documents. These are words, like "the", that do nothing to help us estimate similarity.
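As an illustration of the stop-word step, here is a minimal sketch, not the assignment's reference solution. The class name, the method name, and the two-word stop list are all assumptions made for this example; the actual stop-word list is whatever the assignment package provides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopWordSketch {
    // Hypothetical stop-word list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "the"));

    // Returns a new array with the words of doc that are not stop words,
    // in their original order and with duplicates preserved.
    static String[] removeStopWords(String[] doc, Set<String> stopWords) {
        List<String> kept = new ArrayList<String>();
        for (String word : doc) {
            if (!stopWords.contains(word)) {
                kept.add(word);
            }
        }
        return kept.toArray(new String[kept.size()]);
    }

    public static void main(String[] args) {
        String[] doc = {"a", "design", "design", "design",
                        "engineering", "the", "theory", "testing"};
        // Prints [design, design, design, engineering, theory, testing]
        System.out.println(Arrays.toString(removeStopWords(doc, STOP_WORDS)));
    }
}
```

Keeping the stop words in a HashSet makes each membership check constant time on average, which matters once documents are full web pages rather than short arrays.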

To estimate the similarity Sim(D,Q) between a document D and a query Q, we will use two different measures: the Jaccard index and a home-grown measure we will call RelHits. The Jaccard index is the number of distinct words that appear in both the query and the document, divided by the number of distinct words that appear in either one (that is, the size of the intersection over the size of the union, with duplicates removed). RelHits is defined as the number of words in the document that match any query keyword, counting repeats, divided by the total number of words in the document. Both measures are computed after stop words have been removed from the document.
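The Jaccard definition maps naturally onto set operations. The following is a sketch under stated assumptions, not the expected solution: the class and method names are made up, the document is assumed to have had its stop words removed already, and returning 0 when both inputs are empty is a choice made here, not something the assignment specifies.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class JaccardSketch {
    // Size of the intersection over size of the union, on distinct words.
    // Assumes stop words were already removed from doc.
    static double jaccard(String[] doc, String[] query) {
        Set<String> docWords = new HashSet<String>(Arrays.asList(doc));
        Set<String> queryWords = new HashSet<String>(Arrays.asList(query));

        Set<String> union = new HashSet<String>(docWords);
        union.addAll(queryWords);
        if (union.isEmpty()) {
            return 0.0; // assumption: two empty inputs score 0
        }

        Set<String> intersection = new HashSet<String>(docWords);
        intersection.retainAll(queryWords);

        // Cast before dividing to avoid integer division.
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        String[] doc = {"armadillo", "camel", "cat", "cat", "dog",
                        "dog", "goat", "tiger", "tiger"};
        String[] query = {"camel", "tiger"};
        // Intersection has 2 words and union has 6, so this is 2/6.
        System.out.println(jaccard(doc, query));
    }
}
```

Copying each array into a HashSet removes duplicates, which is the "with duplicates removed" part of the definition.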

Examples

DOC = {armadillo, camel, cat, cat, dog, dog, goat, the, tiger, tiger}
DOC_NO_STOP_WORDS = {armadillo, camel, cat, cat, dog, dog, goat, tiger, tiger}
QUERY = {camel, tiger}
SIM_JACCARD = 2/6 = 0.33
SIM_RELHITS = 3/9 = 0.33

DOC = {a, design, design, design, engineering, the, theory, testing}
DOC_NO_STOP_WORDS = {design, design, design, engineering, theory, testing}
QUERY = {graphics, testing}
SIM_JACCARD = 1/5 = 0.20
SIM_RELHITS = 1/6 = 0.17
QUERY = {design}
SIM_JACCARD = 1/4 = 0.25
SIM_RELHITS = 3/6 = 0.50
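The RelHits figures in the examples above can be reproduced with a single pass over the document. This is a rough sketch rather than the intended solution: the class and method names are hypothetical, the document is assumed to be stop-word free, and scoring an empty document as 0 is an assumption the assignment text does not confirm.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class RelHitsSketch {
    // Number of document words matching any query keyword (repeats counted),
    // divided by the total number of words in the document.
    static double relHits(String[] doc, String[] query) {
        if (doc.length == 0) {
            return 0.0; // assumption: an empty document scores 0
        }
        Set<String> keywords = new HashSet<String>(Arrays.asList(query));
        int hits = 0;
        for (String word : doc) {
            if (keywords.contains(word)) {
                hits++;
            }
        }
        return (double) hits / doc.length;
    }

    public static void main(String[] args) {
        String[] doc = {"design", "design", "design",
                        "engineering", "theory", "testing"};
        // One hit out of six words: 1/6.
        System.out.println(relHits(doc, new String[]{"graphics", "testing"}));
        // Three hits out of six words: 3/6.
        System.out.println(relHits(doc, new String[]{"design"}));
    }
}
```

Unlike the Jaccard index, RelHits keeps duplicates in the document, so a page that repeats a keyword often scores higher.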

The processing of web pages will be handled by a special library called jsoup, so you don't have to worry about the details. The library should be automatically linked to your project once you import the assignment 1 package as described below.

Requirements

You may be able to make the project work in other environments, but we will only support the following cross-platform configuration:

  • Eclipse Luna
  • Java 7
  1. Download the assignment 1 package and import it into Eclipse (File -> Import -> General -> Existing Projects into Workspace). The code skeleton should build without errors. Run the main method to make sure everything works.
  2. Read the academic integrity statement at the bottom of this page and paste it into the header of Assignment1.java, in the space intended for this purpose.
  3. Complete the code provided as part of the assignment package. All further instructions are in the source code comments.
  4. Submit your answer as a single file called Assignment1.java that contains the academic integrity statement and your entire solution. Do not submit a zip file, class files, or any other kind of artifact besides that single file. At the very least, the file you submit should compile without any errors.
  5. Paste this statement in the header of your code. We will not grade your work without it. If you have any doubt about the validity of the content of your submission, contact an instructor or TA before submitting it.

    /* ACADEMIC INTEGRITY STATEMENT
     * 
     * By submitting this file, we state that all group members associated
     * with the assignment understand the meaning and consequences of cheating, 
     * plagiarism and other academic offenses under the Code of Student Conduct 
     * and Disciplinary Procedures (see www.mcgill.ca/students/srr for more information).
     * 
     * By submitting this assignment, we state that the members of the group
     * associated with this assignment claim exclusive credit as the authors of the
     * content of the file (except for the solution skeleton provided).
     * 
     * In particular, this means that no part of the solution originates from:
     * - anyone not in the assignment group
     * - Internet resources of any kind.
     * 
     * This assignment is subject to inspection by plagiarism detection software.
     * 
     * Evidence of plagiarism will be forwarded to the Faculty of Science's disciplinary
     * officer.
     */

Subject: Computer
Due by (Pacific Time): 01/30/2015 12:00 am