Information Retrieval System

A Google-style search engine for newspaper articles

Python
TypeScript
Information Retrieval
Web Scraping
Next.js
Docker

Project Overview

IR Evaluation on Social Event Detection
A Comparative Evaluation of a Vector Space and Boolean Model for Korean Election Information Retrieval

Developed by Patrik Lytschman, Lukas Welker, and Tony Meissner at Chung-Ang University for the course on Natural Language Processing and Information Retrieval.

Data Collection

Data collection focused on gathering articles related to “Korea” published within a fixed time window (March 12, 2024 to April 12, 2024). We used several APIs and web scraping methods:

  • New York Times API: Used initially, but full articles were behind a paywall, and our crawler was temporarily banned for bot-like behavior.
  • The Guardian API: Retrieved 96 articles using a similar approach with BeautifulSoup4.
  • GNews API: Leveraged a free 10-day trial to collect 627 articles from Google News.

All retrieved articles were stored in CSV files and then inserted into a PostgreSQL database (the “Post” table). See below for an example cutout of the database table:
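
The CSV-to-database step can be sketched roughly as follows. This is a minimal illustration, not the project's actual loader: it uses `sqlite3` from the standard library as a stand-in for PostgreSQL so that it runs without a database server, and the column names in `FIELDS` are hypothetical, since the real schema of the “Post” table is only shown in the figure.

```python
import csv
import io
import sqlite3

# Hypothetical column names; the real "Post" schema is not given in the text.
FIELDS = ["title", "source", "published_at", "url", "content"]


def write_articles_csv(articles, handle):
    """Write article dicts to CSV, ignoring keys that are not in FIELDS."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(articles)


def load_csv_into_post(handle, conn):
    """Insert every CSV row into a "Post" table and return the table's row count."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS "Post" ('
        "id INTEGER PRIMARY KEY, title TEXT, source TEXT, "
        "published_at TEXT, url TEXT, content TEXT)"
    )
    rows = [tuple(row[field] for field in FIELDS) for row in csv.DictReader(handle)]
    conn.executemany(
        'INSERT INTO "Post" (title, source, published_at, url, content) '
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn.execute('SELECT COUNT(*) FROM "Post"').fetchone()[0]


# Usage with an in-memory CSV buffer and an in-memory database.
sample = [
    {"title": "A", "source": "GNews", "published_at": "2024-04-10",
     "url": "https://example.com/a", "content": "first article"},
    {"title": "B", "source": "The Guardian", "published_at": "2024-04-11",
     "url": "https://example.com/b", "content": "second article"},
]
buffer = io.StringIO()
write_articles_csv(sample, buffer)
buffer.seek(0)
stored = load_csv_into_post(buffer, sqlite3.connect(":memory:"))
```

With PostgreSQL, the same pattern would apply through a driver such as psycopg, with that driver's placeholder syntax in place of `?`.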

Figure 1: Cutout of the Post database table

Crawler for GNews API

Implementation of IR System

The IR system was built in three major parts:

  1. Preprocessing:
    Across all collected documents, the corpus contains about 440,000 words, which illustrates how important preprocessing is. Table 1 lists the methods we implemented to reduce the vocabulary size.
    Table 1: Vocabulary Changes During Preprocessing
    | Method | Vocab before | Vocab after | Change (percentage) |
    | --- | --- | --- | --- |
    | Removing special characters and lowercasing everything | 440,000 | 438,309 | -0.38 % |
    | is_english function | 438,309 | 397,684 | -9.26 % |
    | Removing non-English words | 397,684 | 315,195 | -20.74 % |
    | Removing stop words | 315,195 | 165,091 | -47.62 % |
    | Lemmatization | 165,091 | 165,091 | 0 % |
    | Removing words that occur only once | 165,091 | 161,587 | -2.12 % |
    | Removing duplicate words | 161,587 | 6,834 | -95.77 % |
    After preprocessing, documents were inserted into a “Processed_Post” table.
    Figure 2: Cutout of the Processed_Post database table
  2. Boolean Model:
    An inverted index was built using a dictionary and custom linked list to store document IDs.
    The search utilizes set algebra (AND, OR, NOT) to compute intersections, unions, and differences of document sets.
    Optimization was achieved by sorting query tokens based on the size of their inverted list.
    Figure 3: Set processing (Venn diagrams) for Boolean model
  3. Vector Space Model:
    Documents and queries are represented as TF-IDF vectors. Key steps include:
    • Calculating TF and IDF values for each term.
    • Applying Singular Value Decomposition (SVD) to reduce the dimensionality of the term-document matrix (from over 6500 dimensions) and improve cosine similarity discrimination.
    • Adjusting the cosine similarity with a time-based multiplier to favor newer documents.
    After applying SVD, the cosine similarities were noticeably easier to tell apart (e.g. a pair of scores went from [0.07, 0.0] to [0.32, -0.28]).
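
To make parts 1 and 2 of the list above more concrete, here is a minimal sketch of a preprocessing pass followed by an inverted index with AND, OR, and NOT. It rests on several assumptions: Python sets stand in for the custom linked list, the stop-word list is a tiny illustrative sample, and lemmatization and the English-word filter are left out to keep the example dependency-free. All function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); a real run would use a full list.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "or", "to", "is", "for", "on"}


def tokenize(text):
    """Lowercase, replace special characters with spaces, and drop stop words."""
    words = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    return [word for word in words if word not in STOP_WORDS]


def build_index(docs, min_count=2):
    """Map each term to the set of document IDs containing it.

    Terms occurring fewer than min_count times in the whole corpus are dropped,
    mirroring the "words that occur only once" step.
    """
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    counts = Counter(term for toks in tokens.values() for term in toks)
    index = {}
    for doc_id, toks in tokens.items():
        for term in toks:
            if counts[term] >= min_count:
                index.setdefault(term, set()).add(doc_id)
    return index


def search_and(index, terms):
    """Intersect posting sets, smallest first, so intermediate results stay small."""
    postings = sorted((index.get(term, set()) for term in terms), key=len)
    if not postings:
        return set()
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
        if not result:
            break  # an empty intersection cannot grow again
    return result


def search_or(index, terms):
    """Union of the posting sets of all terms."""
    return set().union(*(index.get(term, set()) for term in terms))


def search_not(index, all_ids, term):
    """All documents that do not contain the term."""
    return set(all_ids) - index.get(term, set())


docs = {
    1: "Political corruption scandals shake the election",
    2: "Election turnout reaches a record high",
    3: "Corruption probe ahead of the election",
}
index = build_index(docs)
both = search_and(index, ["election", "corruption"])   # {1, 3}
without = search_not(index, docs, "corruption")        # {2}
```

Sorting by posting-list size matters most for AND queries: starting with the rarest term keeps every later intersection at most as large as the smallest list.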

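The vector space scoring from part 3 can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the project's code: the SVD step is omitted to avoid a linear-algebra dependency, the TF and IDF variants (relative frequency and `log(N / df)`) are one common choice among several, and the exponential `recency_multiplier` with its `half_life_days` parameter is a hypothetical form of the time-based multiplier, whose actual formula is not stated.

```python
import math
from collections import Counter
from datetime import date


def tfidf_vectors(docs):
    """Return sparse TF-IDF vectors (term -> weight) per document, plus the IDF table."""
    n_docs = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = {
        doc_id: {t: (c / len(tokens)) * idf[t] for t, c in Counter(tokens).items()}
        for doc_id, tokens in docs.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recency_multiplier(published, newest, half_life_days=14):
    """Hypothetical decay: 1.0 for the newest article, 0.5 one half-life earlier."""
    age_days = (newest - published).days
    return 0.5 ** (age_days / half_life_days)


def search(query_tokens, docs, dates):
    """Rank documents by cosine similarity scaled by the recency multiplier."""
    vectors, idf = tfidf_vectors(docs)
    known = Counter(t for t in query_tokens if t in idf)
    query_vec = {t: (c / len(query_tokens)) * idf[t] for t, c in known.items()}
    newest = max(dates.values())
    scores = {
        doc_id: cosine(query_vec, vec) * recency_multiplier(dates[doc_id], newest)
        for doc_id, vec in vectors.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    1: ["election", "turnout", "record"],
    2: ["election", "corruption", "scandal"],
    3: ["party", "corruption", "probe"],
}
dates = {1: date(2024, 4, 1), 2: date(2024, 4, 10), 3: date(2024, 4, 12)}
ranking = search(["corruption", "scandal"], docs, dates)
```

In this toy corpus, document 2 ranks first because it matches both query terms, even though document 3 is two days newer. A multiplier of this kind only reorders documents whose similarities are already close.
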
Exposing the Functions

The IR system’s functionalities are exposed through multiple channels:

  1. FastAPI Endpoints: Endpoints for both search_boolean_model and search_vector_space are implemented. All queries are lemmatized before processing.
    • GET /search/vector-space[?q=search_term]: Search Vector Space
    • POST /search/boolean: Search Token Actor

  2. User Interface: A web interface built with Next.js and NextUI allows interactive search. Users can switch between Boolean and Vector Space models.
    Figure 4: Searching with vector space on the UI
    Figure 5: Boolean model query interface
  3. Deployment: The complete application is deployed on Microsoft Azure App Service, which serves both the web interface and the API documentation.
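
As a rough illustration of how a client might call the vector space endpoint, the sketch below builds the request URL with the standard library. The base URL is a placeholder for a local development server (the deployed address is not given here), the helper names are hypothetical, and the assumption that the response is JSON is based only on FastAPI's default behavior. The request body expected by POST /search/boolean is not described in the text, so it is left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder for a local development server (assumption).
BASE_URL = "http://localhost:8000"


def vector_space_url(query, base_url=BASE_URL):
    """Build the GET URL; the q parameter is optional according to the route."""
    path = f"{base_url}/search/vector-space"
    return f"{path}?{urlencode({'q': query})}" if query else path


def fetch_results(query, base_url=BASE_URL):
    """Send the request and decode the response, assuming the API returns JSON."""
    with urlopen(vector_space_url(query, base_url), timeout=10) as response:
        return json.load(response)


url = vector_space_url("election 2024 turnout")
```

`urlencode` takes care of escaping spaces and special characters, so free-text queries can be passed through unchanged.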

Evaluation

The IR system was evaluated using standard metrics: precision, recall, and F1-score. Four queries were used to assess both models.
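
For reference, the three metrics can be computed per query from the set of retrieved document IDs and the set of relevant ones. The sketch below uses the standard definitions; the function name and the example IDs are illustrative, not taken from the project.

```python
def evaluate(retrieved, relevant):
    """Return (precision, recall, f1) for one query.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    f1        = harmonic mean of precision and recall
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Four documents retrieved, three relevant, two in common.
precision, recall, f1 = evaluate([1, 2, 3, 4], [2, 4, 5])
```

Here precision is 2/4 and recall is 2/3. A model that returns many documents tends to score high on recall and low on precision, which is the trade-off the metrics are meant to expose.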

Example Search Queries:

Search Queries for Evaluation
| Query # | Vector Space Model | Boolean Model |
| --- | --- | --- |
| 1 | political corruption scandals | AND political AND corruption OR scandals |
| 2 | election 2024 turnout | AND election OR 2024 OR turnout |
| 3 | democratic party | democratic OR party |
| 4 | women rights people power party | AND women AND rights OR people OR power AND party |

Overall Evaluation Metrics:
Boolean Model vs. Vector Space Model

Total Evaluation Metrics
| Metric | Boolean Model | Vector Space Model |
| --- | --- | --- |
| Recall (Mean) | 0.71 | 0.84 |
| Precision (Max) | 0.60 | 0.26 |
| F1-Score | ~0.34 | ~0.32 |

Additionally, the evaluation over time (see Figure 6) indicates that the election on April 10, 2024 led to a marked increase in retrieved documents. The query-specific evaluation (Figure 7) further details the differences in recall, precision, and F1 across the models.

Figure 6: Evaluation of temporal relevance
Figure 7: Query evaluation of recall, precision and F1