Information Retrieval System

Google Search based on newspaper articles

April 2024

Python

TypeScript

Information Retrieval

Web Scraping

Next.js

Docker

Table of Contents

Project Overview
Data Collection
Implementation of IR System
Exposing the Functions
Evaluation

Project Overview

IR Evaluation on Social Event Detection
A Comparative Evaluation of a Vector Space and Boolean Model for Korean Election Information Retrieval

Developed by Patrik Lytschman, Lukas Welker, and Tony Meissner at Chung-Ang University for the course on Natural Language Processing and Information Retrieval.

Data Collection

Data collection focused on gathering articles related to “Korea” published between March 12, 2024 and April 12, 2024. Several APIs and web scraping methods were used:

New York Times API: Tried first, but full articles were behind a paywall, and our requests were temporarily banned for bot-like behavior.
The Guardian API: Retrieved 96 articles using a similar approach with BeautifulSoup4.
GNews API: Leveraged a free 10-day trial to collect 627 articles from Google News.

All retrieved articles were stored in CSV files and then inserted into a PostgreSQL database (the “Post” table). See below for an example cutout of the database table:

Figure 1: Cutout of the Post database table
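The CSV-to-database step could look roughly like the sketch below. It is not the project's loader: the column names and sample rows are hypothetical, and an in-memory SQLite database stands in for PostgreSQL so the example runs without a server.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export; the column names are assumptions, not the project's schema.
SAMPLE_CSV = """title,source,published_at,content
Election day in Korea,The Guardian,2024-04-10,Voters went to the polls.
Turnout report,GNews,2024-04-11,Turnout was high.
"""


def load_posts(conn, csv_file):
    """Create the Post table if needed and insert every CSV row into it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Post ("
        "id INTEGER PRIMARY KEY, title TEXT, source TEXT, "
        "published_at TEXT, content TEXT)"
    )
    rows = [
        (r["title"], r["source"], r["published_at"], r["content"])
        for r in csv.DictReader(csv_file)
    ]
    # Parameterized inserts avoid quoting problems in article text.
    conn.executemany(
        "INSERT INTO Post (title, source, published_at, content) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL connection
inserted = load_posts(conn, io.StringIO(SAMPLE_CSV))
print(inserted, "rows inserted")
```

With a real PostgreSQL driver the structure would stay the same, though the placeholder syntax usually differs (for example %s instead of ?).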

Crawler for GNews API

Implementation of IR System

The IR system was built in three major parts:

Preprocessing:
Across all collected documents, we have a total of about 440,000 words, which illustrates how important preprocessing is. Table 1 lists the methods we implemented to reduce the vocabulary size.

Table 1: Vocabulary Changes During Preprocessing
Method	Vocab before	Vocab after	Change (%)
Removing special characters and lowercasing	440,000	438,309	-0.38 %
is_english function	438,309	397,684	-9.26 %
Removing non-English words	397,684	315,195	-20.74 %
Removing stop words	315,195	165,091	-47.62 %
Lemmatization	165,091	165,091	0 %
Removing words that occur only once	165,091	161,587	-2.12 %
Removing duplicate words	161,587	6,834	-95.77 %
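Several of the steps in the table can be sketched as below. This is a minimal illustration under assumptions, not the project's pipeline: the stop-word list is a tiny hypothetical one, the regular expression drops digits (the original tokenizer may have kept them), and the is_english check and lemmatization are left out because they need external language resources.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real run would use a full list.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "on", "to", "is", "was"}


def tokenize(text):
    """Lowercase the text and keep only runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(documents):
    """Return (tokens per document, sorted collection vocabulary)."""
    tokenized = [
        [tok for tok in tokenize(doc) if tok not in STOP_WORDS]
        for doc in documents
    ]
    # Count occurrences over the whole collection to drop words seen only once.
    counts = Counter(tok for doc in tokenized for tok in doc)
    kept = [[tok for tok in doc if counts[tok] > 1] for doc in tokenized]
    # Removing duplicates yields the vocabulary of unique terms.
    vocabulary = sorted({tok for doc in kept for tok in doc})
    return kept, vocabulary


docs = [
    "The election in Korea was held on April 10.",
    "Korea: election turnout and the parties!",
]
kept, vocabulary = preprocess(docs)
print(kept)
print(vocabulary)
```

The per-document token lists keep repeated terms, since term frequencies are needed later, while the vocabulary holds each term once.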

After preprocessing, documents were inserted into a “Processed_Post” table.

Figure 2: Cutout of the Processed_Post database table

Boolean Model:
An inverted index was built using a dictionary and custom linked list to store document IDs.
The search utilizes set algebra (AND, OR, NOT) to compute intersections, unions, and differences of document sets.
Queries were optimized by sorting the query tokens by the size of their inverted lists.
Figure 3: Set processing (Venn diagrams) for Boolean model
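The Boolean model described above can be sketched as follows. Note the substitution: this example stores postings in Python sets rather than the custom linked list used in the project, and all function names are hypothetical. The AND search intersects the smallest posting sets first, which is one common way to use the size-based ordering.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for tok in tokens:
            index[tok].add(doc_id)
    return index


def search_and(index, terms):
    """Intersect posting sets, smallest first, so intermediate results stay small."""
    postings = sorted((index.get(t, set()) for t in terms), key=len)
    if not postings:
        return set()
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
        if not result:  # nothing left to intersect
            break
    return result


def search_or(index, terms):
    """Union of the posting sets of all terms."""
    result = set()
    for t in terms:
        result |= index.get(t, set())
    return result


def search_not(index, all_ids, term):
    """All documents that do not contain the term."""
    return set(all_ids) - index.get(term, set())


docs = {
    1: ["election", "turnout"],
    2: ["election", "party"],
    3: ["party", "scandal"],
}
index = build_inverted_index(docs)
print(search_and(index, ["election", "party"]))
print(search_or(index, ["turnout", "scandal"]))
print(search_not(index, docs, "election"))
```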
Vector Space Model:
Documents and queries are represented as TF-IDF vectors. Key steps include:
- Calculating TF and IDF values for each term.
- Applying Singular Value Decomposition (SVD) to reduce the dimensionality of the term-document matrix (from over 6,500 dimensions) and improve cosine similarity discrimination.
- Adjusting the cosine similarity with a time-based multiplier to favor newer documents.
After applying SVD, cosine similarities separated documents more clearly (e.g. a pair of scores changed from [0.07, 0.0] to [0.32, -0.28]).
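A simplified version of the vector space scoring is sketched below, using only the standard library. It covers TF-IDF weighting, cosine similarity, and a time-based multiplier, and it skips the SVD step entirely. The source does not state the multiplier's formula, so the exponential half-life decay here is purely an assumption, as are all function names and the sample data.

```python
import math
from collections import Counter
from datetime import date


def tf_idf_vectors(docs):
    """Return (sparse TF-IDF dict per document, IDF per term)."""
    n = len(docs)
    df = Counter(term for tokens in docs for term in set(tokens))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in tf.items()})
    return vectors, idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recency_multiplier(published, today, half_life_days=30):
    """Hypothetical decay: 1.0 for today's article, 0.5 after half_life_days."""
    age = max((today - published).days, 0)
    return 0.5 ** (age / half_life_days)


def rank(query_tokens, docs, dates, today):
    """Return (document indices sorted by score, scores)."""
    vectors, idf = tf_idf_vectors(docs)
    q_tf = Counter(t for t in query_tokens if t in idf)  # ignore unknown terms
    q_vec = {t: (c / len(query_tokens)) * idf[t] for t, c in q_tf.items()}
    scores = [
        cosine(q_vec, vec) * recency_multiplier(d, today)
        for vec, d in zip(vectors, dates)
    ]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order, scores


docs = [
    ["election", "turnout", "korea"],
    ["election", "party", "korea"],
    ["football", "korea"],
]
dates = [date(2024, 4, 11), date(2024, 3, 12), date(2024, 4, 10)]
order, scores = rank(["election", "turnout"], docs, dates, today=date(2024, 4, 12))
print(order, [round(s, 3) for s in scores])
```

Because "korea" appears in every document, its IDF is zero and it contributes nothing to the scores, which is the intended effect of IDF weighting.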

Exposing the Functions

The IR system’s functionalities are exposed through multiple channels:

FastAPI Endpoints: Endpoints for both search_boolean_model and search_vector_space are implemented. All queries are lemmatized before processing.
Search
GET
Search Vector Space
POST
Search Token Actor
User Interface: A web interface built with Next.js and NextUI allows interactive search. Users can switch between Boolean and Vector Space models.
Figure 4: Searching with vector space on the UI
Figure 5: Boolean model query interface
Deployment: The complete application is deployed on Microsoft Azure App Service. Access the interface here and the API documentation here.

Evaluation

The IR system was evaluated using standard metrics: precision, recall, and F1-score. Four queries were used to assess both models.
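For reference, the three metrics can be computed per query as in this short sketch. The function name and the document IDs are illustrative only and are not taken from the project's evaluation data.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and F1 for one query from document ID collections."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative IDs: 4 documents retrieved, 3 relevant, 2 in common.
p, r, f1 = precision_recall_f1(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r, f1)
```

Precision penalizes irrelevant results in the returned set, recall penalizes relevant documents that were missed, and F1 is their harmonic mean.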

Example Search Queries:

Search Queries for Evaluation
Query #	Vector Space Model	Boolean Model
1	political corruption scandals	AND political AND corruption OR scandals
2	election 2024 turnout	AND election OR 2024 OR turnout
3	democratic party	democratic OR party
4	women rights people power party	AND women AND rights OR people OR power AND party

Overall Evaluation Metrics:
Boolean Model vs. Vector Space Model

Total Evaluation Metrics
Metric	Boolean Model	Vector Space Model
Recall (Mean)	0.71	0.84
Precision (Max)	0.60	0.26
F1-Score	~0.34	~0.32

Additionally, the evaluation over time (see Figure 6) indicates that the election on April 10, 2024 resulted in a marked increase in retrieved documents. The query-specific evaluation (Figure 7) further details differences in recall, precision, and F1 across the models.

Figure 6: Evaluation of temporal relevance

Figure 7: Query evaluation of recall, precision and F1