Open IRC Search Engine (OISE)

OISE is an open-source search engine for IRC servers and channels.

Left: OISE homepage | Right: Sample search for "lego"

Running

$ docker run -it --name oise -p 8080:8080 avojak/oise:latest

Creating a new index takes a very long time, so mounting a volume to persist the index is recommended. With a persisted index, you can restart the container and immediately search the previous index while the new one is being built.

$ docker run \
    -it \
    --name oise \
    -v lucene-index:/lucene-index \
    -p 8080:8080 \
    avojak/oise:latest

If you would like to customize the list of servers that are indexed, you can mount your own server list.

For demonstration purposes, a much shorter file (servers.txt) has been provided in this repository:

irc.freenode.net
irc.bsdunix.us

Mount the file into the container at /servers.txt:

$ docker run \
    -it \
    --name oise \
    -v lucene-index:/lucene-index \
    -v <full_working_directory_path>/oise/servers.txt:/servers.txt \
    -p 8080:8080 \
    avojak/oise:latest

Usage

REST API

Once running, the API documentation is viewable at http://localhost:8080/swagger-ui.html.

Example Request

$ curl -X GET "http://localhost:8080/api/v1/search?q=uiuc"
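
The same request can be issued from a script. The following is a minimal Python sketch, assuming OISE is running locally on port 8080 as in the commands above; the helper names `build_search_url` and `search` are illustrative and not part of OISE.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: OISE is reachable locally, as in the docker commands above.
BASE_URL = "http://localhost:8080"


def build_search_url(query, base_url=BASE_URL):
    """Return the search URL for a query, URL-encoding the q parameter."""
    return f"{base_url}/api/v1/search?{urlencode({'q': query})}"


def search(query, base_url=BASE_URL, timeout=10):
    """Send the GET request and return the decoded JSON list of results."""
    with urlopen(build_search_url(query, base_url), timeout=timeout) as response:
        return json.load(response)
```

Calling `search("uiuc")` against a running instance should return the parsed list of results; `build_search_url` is separate so that queries containing spaces or special characters are encoded before they reach the server.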

A web page for UI-based search is also available at http://localhost:8080. Type your query in the search field and press Enter, or select the “Search” button.

Implementation Details

High-level architecture diagram

OISE is implemented with Spring Boot as the application framework, and each key component of the application (crawling IRC servers and indexing channels) is implemented as a Guava service.

Technologies Used

Services

All services are started in the background when the application starts (see: ApplicationRunner).

Crawling Service

The crawling service is scheduled to run every 24 hours. An IRC bot (PircBot) is created for each server to be crawled, and then a crawling thread is submitted for background execution.
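
The fan-out of one crawl task per server can be sketched as follows. This is an illustration in Python rather than the project's Java code: `crawl_all` and the `crawl_server` callback are hypothetical names, and the default of 10 worker threads mirrors the `oise.crawler.max.threads` setting described under Configuration. The 24-hour schedule is left out.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_all(servers, crawl_server, max_threads=10):
    """Crawl every server on a bounded thread pool.

    crawl_server is a caller-supplied function (hypothetical here) that
    takes a server hostname and returns whatever the crawl produced.
    Returns a dict mapping each server to its result, or to the exception
    raised while crawling it, so one failing server does not stop the rest.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Submit one background task per server.
        futures = {server: pool.submit(crawl_server, server) for server in servers}
        for server, future in futures.items():
            try:
                results[server] = future.result()
            except Exception as error:
                results[server] = error
    return results
```

Bounding the pool keeps the number of simultaneous IRC connections predictable regardless of how long the server list is.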

Once crawling completes, the raw text responses from the IRC servers are converted into POJOs, and any URLs found in a channel topic are scraped for additional content to supplement that topic. Finally, the collection of models representing each channel found on the server is sent to the IndexingService for processing.
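
The conversion step can be illustrated with a short Python sketch. It assumes the raw responses are standard IRC LIST replies (numeric 322, in the form `:<server> 322 <nick> <channel> <user count> :<topic>`); the `ChannelListing` class, the `parse_list_reply` function, and the URL pattern are illustrative and are not taken from the OISE source.

```python
import re
from dataclasses import dataclass, field

# Simple illustrative pattern; real topics may need more careful URL handling.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


@dataclass
class ChannelListing:
    server: str
    name: str
    users: int
    topic: str
    urls: list = field(default_factory=list)


def parse_list_reply(server, line):
    """Parse one raw 322 (RPL_LIST) line; return None for any other line."""
    parts = line.split(" ", 5)
    if len(parts) < 5 or parts[1] != "322":
        return None
    try:
        users = int(parts[4])
    except ValueError:
        return None
    topic = parts[5] if len(parts) > 5 else ""
    if topic.startswith(":"):
        topic = topic[1:]  # drop the IRC trailing-parameter marker
    # Collect URLs from the topic, trimming common trailing punctuation.
    urls = [url.rstrip(".,;)") for url in URL_PATTERN.findall(topic)]
    return ChannelListing(server, parts[3], users, topic, urls)
```

The extracted `urls` are what a scraper would then fetch to supplement the topic text before the listing is handed to indexing.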

Indexing Service

The indexing service is idle until it receives an event from the crawling service indicating that the crawl of a server has completed. It then submits a background task to process the index update. In the index, the “documents” representing each IRC channel contain the following fields:

  • The server
  • The channel name
  • The topic
  • The web content of all URLs present in the topic
  • The number of users

Queries consider only the channel name, topic, and web content; the server and the number of users are stored and returned with each result.
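
To make the split between searched and stored fields concrete, here is a toy in-memory Python sketch. It is not how Lucene scores documents: the term-count scoring, the `ChannelDocument` class, and the `search_documents` function are simplifications introduced only for illustration.

```python
from dataclasses import dataclass

# Only these fields are consulted when matching a query.
SEARCHED_FIELDS = ("channel", "topic", "url_content")


@dataclass
class ChannelDocument:
    server: str
    channel: str
    topic: str
    url_content: str
    users: int


def score(document, query):
    """Count occurrences of each query term in the searched fields."""
    text = " ".join(getattr(document, name) for name in SEARCHED_FIELDS).lower()
    return sum(text.count(term) for term in query.lower().split())


def search_documents(documents, query, limit=10):
    """Return up to `limit` matching documents, best score first."""
    scored = [(score(document, query), document) for document in documents]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [document for _, document in hits[:limit]]
```

In this sketch a query that matches only the server name returns nothing, while the server and user count still travel with every hit.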

REST API

The REST API has a single endpoint: /api/v1/search (see: SearchController). The endpoint accepts a single query parameter (q) which contains the query.

An example query URL would look like: http://localhost:8080/api/v1/search?q=test.

The response contains all the data from the index for the 10 most relevant results. For example:

[
  {
    "server": "irc.freenode.net",
    "channel": "#QuantumZNC-Test",
    "topic": "#QuantumZNC-Test",
    "urlContent": "",
    "users": 5,
    "scoreDoc": {
      "score": 6.029067,
      "doc": 12987,
      "shardIndex": 0,
      "fields": [
        6.029067
      ]
    }
  },
  ...
]
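
A client might reduce each result to a one-line summary. The Python sketch below uses the single result shown above as sample data (with the elided entries omitted); the `summarize` helper is illustrative and not part of OISE.

```python
import json

# The one result shown in the example response, without the elided entries.
SAMPLE_RESPONSE = """
[
  {
    "server": "irc.freenode.net",
    "channel": "#QuantumZNC-Test",
    "topic": "#QuantumZNC-Test",
    "urlContent": "",
    "users": 5,
    "scoreDoc": {"score": 6.029067, "doc": 12987, "shardIndex": 0, "fields": [6.029067]}
  }
]
"""


def summarize(results):
    """Format each result as 'channel on server (N users, score S)'."""
    return [
        f'{r["channel"]} on {r["server"]} '
        f'({r["users"]} users, score {r["scoreDoc"]["score"]:.2f})'
        for r in results
    ]


summaries = summarize(json.loads(SAMPLE_RESPONSE))
```

Here `summaries` holds one string per result, built from the `server`, `channel`, `users`, and `scoreDoc.score` fields of the response.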

Configuration

Several parameters are available for configuration in application.properties:

# The file containing the list of servers
oise.serversfile=servers.txt
# The directory containing the Apache Lucene index
oise.index.directory=index/
# The maximum number of threads used for crawling servers for channels
oise.crawler.max.threads=10
# The maximum number of threads used for scraping web content from URLs mentioned in topics
oise.scraper.max.threads=500

OISE was developed to satisfy the Final Project requirement for CS410: Text Information Systems at the University of Illinois at Urbana-Champaign.
