Module: Literature Search
1. Overview
The Literature Search module provides configurable multi-source paper search. It comprises a DatabaseConfig (databases, cache, and API keys, with keys stored through SecureKeyManager), the Paper data model, a CacheManager, database adapters (e.g. ArXiv, IEEE), and a QueryParser for Boolean queries. A search facade (and an optional CLI in search_cli.py) runs queries across the enabled adapters, deduplicates the results, and can export them.
Roles:
- Configuration:
DatabaseConfig — load/save JSON config; per-database enabled/rate_limit/categories; cache dir and expiry; SecureKeyManager for API keys.
- Model & cache:
Paper dataclass (title, authors, abstract, year, venue, doi, arxiv_id, etc.); CacheManager for caching search results (e.g. by query hash).
- Adapters:
Abstract DatabaseAdapter; concrete adapters (e.g. ArXiv) implement search and map hits to Paper; AdapterFactory returns adapter by name.
- Query parsing:
QueryParser — parse Boolean queries (AND/OR, quoted phrases, parentheses) into a structured form for adapter-specific translation.
- Search orchestration:
High-level search that uses config, adapters, cache, and deduplication to return a list of Paper.
2. Basics
Terminology
| Term | Meaning |
|---|---|
| Paper | Dataclass: title, authors, abstract, year, venue, doi, arxiv_id, url, citations, keywords, database_source, pdf_url, publication_type, issue, volume, pages, publisher. Normalized on init. |
| DatabaseAdapter | Abstract interface: search capability returning papers. |
| AdapterFactory | Creates adapter instances by database name (from config). |
| QueryParser | Parses Boolean query string into quoted phrases and groups (AND/OR/SINGLE). |
| DatabaseConfig | Config dir, app name, SecureKeyManager; lit_search_config.json with databases, cache, search, export sections. |
| CacheManager | Caches search results (e.g. by query hash); expiry and max age from config. |
Entry points
- Programmatic:
Use DatabaseConfig, get adapters via factory, run QueryParser.parse_boolean_query, call adapter search methods, deduplicate/export as needed. Or use the high-level search API if exposed (e.g. a Search class that wraps config + adapters + cache).
- CLI:
ures.literature.search_cli (when used as script) for search/export commands.
- Import:
from ures.literature.search import Paper, CacheManager, DatabaseConfig; adapters and factory from ures.literature.search.adapters.
3. Architecture & Logic
- Core logic:
- DatabaseConfig: Loads JSON from config_dir; merges with defaults; ensures each DB has a key slot in SecureKeyManager (dummy if missing). Saves back to JSON.
- Paper: __post_init__ normalizes text (whitespace, HTML entities), author names, DOI, year bounds, citations; publication_type lowercased.
- QueryParser: Extract quoted phrases with placeholders; parse parentheses for nested groups; split on OR/AND; restore phrases in terms; return structure with quoted_phrases and groups.
- Adapters: Each adapter uses config (rate limit, categories) and optional API key; fetches from external API, parses response into list of Paper.
- Cache: CacheManager hashes query (or similar) and stores results; checks expiry/max_age before returning.
- Dependencies: Internal: ures.secrets.SecureKeyManager, StorageMethod. External: requests, BeautifulSoup, bibtexparser (if used). Standard library: pathlib, json, sqlite3 (CacheManager), etc.
4. UML & Structure
Class diagram (simplified)
classDiagram
class Paper {
+title: str
+authors: List
+abstract: str
+year: int
+venue: str
+doi: str
+arxiv_id: str
+database_source: str
+_normalize_fields()
}
class DatabaseConfig {
+config_dir: Path
+key_manager: SecureKeyManager
+config: Dict
+_load_config(): Dict
}
class DatabaseAdapter {
<<abstract>>
+search(query, ...): List~Paper~
}
class AdapterFactory {
+create(name, config): DatabaseAdapter
}
class QueryParser {
+parse_boolean_query(query): Dict
-_extract_parenthetical_groups()
}
class CacheManager {
+get(key): Optional
+set(key, value)
}
DatabaseConfig ..> SecureKeyManager : uses
AdapterFactory ..> DatabaseAdapter : creates
DatabaseAdapter ..> Paper : returns
Data flow (search)
sequenceDiagram
participant U as User
participant Cfg as DatabaseConfig
participant P as QueryParser
participant F as AdapterFactory
participant A as DatabaseAdapter
participant Cache as CacheManager
U->>Cfg: load config
U->>P: parse_boolean_query(q)
P-->>U: structured query
U->>Cache: get(query_key)
alt cache hit
Cache-->>U: cached papers
else cache miss
U->>F: create(adapter_name)
F-->>U: adapter
U->>A: search(query / structured)
A-->>U: List~Paper~
U->>Cache: set(query_key, papers)
end
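The cache-hit / cache-miss branch of the flow above can be sketched as follows. This is a minimal illustration, assuming a plain dict as the cache and a stub function in place of a real adapter; `make_query_key`, `search_with_cache`, and `stub_search` are hypothetical names, not part of the module's API.

```python
import hashlib
from typing import Callable, Dict, List


def make_query_key(adapter_name: str, query: str) -> str:
    """Hypothetical cache key: SHA-256 over adapter name and normalized query."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(f"{adapter_name}:{normalized}".encode("utf-8")).hexdigest()


def search_with_cache(
    query: str,
    adapter_name: str,
    search_fn: Callable[[str], List[dict]],
    cache: Dict[str, List[dict]],
) -> List[dict]:
    """Return cached results on a hit; otherwise call the adapter and store the result."""
    key = make_query_key(adapter_name, query)
    if key in cache:           # cache hit
        return cache[key]
    papers = search_fn(query)  # cache miss: ask the adapter
    cache[key] = papers
    return papers


calls: List[str] = []


def stub_search(query: str) -> List[dict]:
    """Stand-in for an adapter's search method; records each call."""
    calls.append(query)
    return [{"title": "Example paper", "database_source": "stub"}]


cache: Dict[str, List[dict]] = {}
first = search_with_cache("deep  learning", "stub", stub_search, cache)
second = search_with_cache("Deep learning", "stub", stub_search, cache)
```

Because the key is built from the normalized query, the second call is served from the cache and the stub adapter is invoked only once.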
5. Code-Level Understanding
Paper normalization
- Text: Strip, collapse whitespace, strip HTML-like tags, unescape HTML entities (&amp;, &lt;, &gt;).
- Authors: Clean each name; drop empty.
- Year: Clamp to 1900–2030 or set 0.
- DOI: Normalized (lowercase, strip prefix, etc. per _normalize_doi).
- Citations: int ≥ 0.
- publication_type: Lowercased.
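The normalization rules listed above can be sketched with a small dataclass. This is an illustration only: `PaperSketch`, `normalize_text`, and `normalize_doi` are hypothetical names, the set of DOI prefixes is an assumption, and out-of-range years are assumed to be reset to 0 rather than moved to the nearest bound.

```python
import html
import re
from dataclasses import dataclass, field
from typing import List


def normalize_text(text: str) -> str:
    """Strip HTML-like tags, unescape entities, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text or "")
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def normalize_doi(doi: str) -> str:
    """Lowercase and strip common resolver/`doi:` prefixes (assumed behaviour)."""
    doi = (doi or "").strip().lower()
    prefixes = ("https://doi.org/", "http://doi.org/",
                "https://dx.doi.org/", "http://dx.doi.org/", "doi:")
    for prefix in prefixes:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


@dataclass
class PaperSketch:
    title: str
    authors: List[str] = field(default_factory=list)
    year: int = 0
    doi: str = ""
    citations: int = 0
    publication_type: str = ""

    def __post_init__(self) -> None:
        self.title = normalize_text(self.title)
        # Clean each author name and drop empty entries.
        self.authors = [a for a in (normalize_text(x) for x in self.authors) if a]
        # Assumption: years outside 1900-2030 are reset to 0.
        self.year = self.year if 1900 <= self.year <= 2030 else 0
        self.doi = normalize_doi(self.doi)
        self.citations = max(0, int(self.citations))
        self.publication_type = self.publication_type.strip().lower()


p = PaperSketch(
    title="  Deep <b>Learning</b> &amp;   Vision ",
    authors=[" Ada  Lovelace ", "  "],
    year=1800,
    doi="https://doi.org/10.1000/ABC",
    citations=-3,
    publication_type="Journal",
)
```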
QueryParser
- Protects quoted phrases with placeholders __PHRASE_i__; finds matching parentheses for groups; within each group content, splits on OR then AND; restores phrases into terms. Output: quoted_phrases, groups (each with type and terms), original_query.
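The placeholder technique described above can be sketched as follows. This is a simplified illustration, not the module's parser: it handles only one level of parentheses, types a group by the first operator it finds (OR before AND), and turns text outside parentheses into SINGLE groups while discarding the operators that joined them. The function names mirror the documented ones, but the bodies are assumptions.

```python
import re
from typing import Dict, List


def _parse_group(content: str, phrases: List[str]) -> Dict[str, object]:
    """Classify one group as OR / AND / SINGLE and restore quoted phrases."""
    def restore(term: str) -> str:
        for i, phrase in enumerate(phrases):
            term = term.replace(f"__PHRASE_{i}__", phrase)
        return term.strip()

    if re.search(r"\sOR\s", content):
        kind, parts = "OR", re.split(r"\s+OR\s+", content)
    elif re.search(r"\sAND\s", content):
        kind, parts = "AND", re.split(r"\s+AND\s+", content)
    else:
        kind, parts = "SINGLE", [content]
    return {"type": kind, "terms": [restore(p) for p in parts if p.strip()]}


def parse_boolean_query(query: str) -> Dict[str, object]:
    phrases: List[str] = []

    def protect(match: "re.Match[str]") -> str:
        # Replace each quoted phrase with a placeholder so operators inside it are ignored.
        phrases.append(match.group(1))
        return f"__PHRASE_{len(phrases) - 1}__"

    protected = re.sub(r'"([^"]*)"', protect, query)
    if protected.count("(") != protected.count(")"):
        raise ValueError("unbalanced parentheses")

    contents = re.findall(r"\(([^()]*)\)", protected)  # innermost groups only
    leftover = ""
    if contents:
        leftover = re.sub(r"\([^()]*\)", " ", protected)
    else:
        contents = [protected]

    groups = [_parse_group(c.strip(), phrases) for c in contents if c.strip()]
    # Text outside parentheses: keep bare terms, drop the joining operators.
    for piece in re.split(r"\b(?:AND|OR)\b", leftover):
        piece = piece.strip(" ()")
        if piece:
            groups.append(_parse_group(piece, phrases))
    return {"quoted_phrases": phrases, "groups": groups, "original_query": query}


result = parse_boolean_query('("machine learning" AND deep) OR nlp')
flat = parse_boolean_query('transformer OR "graph neural network"')
```

Here `result["groups"]` holds an AND group with the terms `machine learning` and `deep`, followed by a SINGLE group for `nlp`. A production parser would also need to record how the groups relate to each other.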
Adapters
- Typically: build request from config + parsed query; rate limit; HTTP request; parse XML/HTML/JSON; map to Paper list; set database_source.
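The throttle-then-fetch-then-map pattern above might look like the sketch below. It assumes the rate limit is expressed as a minimum interval in seconds between requests (the real unit of `rate_limit` is not stated here), and it replaces the HTTP call with a fake `_fetch`. `RateLimitedAdapter`, `FakeAdapter`, and `FakeClock` are hypothetical names; the clock and sleep functions are injectable so the example runs without waiting.

```python
import time
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Optional


class RateLimitedAdapter(ABC):
    """Hypothetical base class enforcing a minimum interval between requests."""

    def __init__(
        self,
        name: str,
        min_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.name = name
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last_request: Optional[float] = None

    def _throttle(self) -> None:
        now = self._clock()
        if self._last_request is not None:
            wait = self._last_request + self.min_interval_s - now
            if wait > 0:
                self._sleep(wait)
                now = self._clock()
        self._last_request = now

    @abstractmethod
    def _fetch(self, query: str, max_results: int) -> List[Dict[str, str]]:
        """Call the external API and return raw hits (one dict per record)."""

    def search(self, query: str, max_results: int = 10) -> List[Dict[str, str]]:
        self._throttle()
        hits = self._fetch(query, max_results)
        # Map hits to paper-like records and tag them with their source.
        return [{**hit, "database_source": self.name} for hit in hits[:max_results]]


class FakeAdapter(RateLimitedAdapter):
    def _fetch(self, query: str, max_results: int) -> List[Dict[str, str]]:
        return [{"title": f"{query} result {i}"} for i in range(3)]


class FakeClock:
    """Deterministic clock: sleeping advances time instead of blocking."""

    def __init__(self) -> None:
        self.now = 0.0
        self.slept: List[float] = []

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.slept.append(seconds)
        self.now += seconds


clock = FakeClock()
adapter = FakeAdapter("fake", min_interval_s=2.0, clock=clock.time, sleep=clock.sleep)
first = adapter.search("graph", max_results=2)
second = adapter.search("graph", max_results=2)
```

The first call goes through immediately; the second one waits out the remaining interval before fetching.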
Config and keys
- Default config defines databases (arxiv, ieee, springer, etc.) with enabled, rate_limit, categories (where applicable). Cache section: directory, expire_days, max_age_hours. Search: max_results, deduplication, min_year, similarity_threshold. Export: default_format, include_abstracts, max_export_size. Keys are stored per db name via SecureKeyManager (dummy placeholder if missing).
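Loading a user config and merging it with defaults, as described above, could be sketched like this. The section and key names follow the description, but every default value below is invented for illustration, and `deep_merge` and `load_config` are hypothetical helpers rather than the module's own functions. Key handling through SecureKeyManager is left out.

```python
import copy
import json
import tempfile
from pathlib import Path
from typing import Any, Dict

# Illustrative defaults only; the real values are not documented here.
DEFAULT_CONFIG: Dict[str, Any] = {
    "databases": {
        "arxiv": {"enabled": True, "rate_limit": 3, "categories": ["cs.AI"]},
        "ieee": {"enabled": False, "rate_limit": 1},
    },
    "cache": {"directory": "cache", "expire_days": 7, "max_age_hours": 24},
    "search": {"max_results": 100, "deduplication": True,
               "min_year": 2000, "similarity_threshold": 0.9},
    "export": {"default_format": "bibtex", "include_abstracts": True,
               "max_export_size": 1000},
}


def deep_merge(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively overlay user values on defaults without mutating either."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def load_config(config_path: Path) -> Dict[str, Any]:
    """Read the JSON file if present, merge with defaults, and save the result back."""
    user: Dict[str, Any] = {}
    if config_path.exists():
        user = json.loads(config_path.read_text(encoding="utf-8"))
    merged = deep_merge(DEFAULT_CONFIG, user)
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(merged, indent=2), encoding="utf-8")
    return merged


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "lit_search_config.json"
    user_overrides = {"databases": {"arxiv": {"rate_limit": 5}},
                      "search": {"max_results": 20}}
    path.write_text(json.dumps(user_overrides), encoding="utf-8")
    config = load_config(path)
    saved = json.loads(path.read_text(encoding="utf-8"))
```

A recursive merge keeps defaults for any key the user file omits, so a partial config remains valid after new default keys are introduced.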
6. Usage & Examples
Integration
from ures.literature.search import DatabaseConfig, Paper, CacheManager
from ures.literature.search.adapters import AdapterFactory, DatabaseAdapter, QueryParser
from ures.literature.search.paper import PaperFormatter # if needed
Config and search (conceptual)
config = DatabaseConfig(app_name="literature-search")
parser = QueryParser()
structured = parser.parse_boolean_query('("machine learning" AND deep) OR nlp')
factory = AdapterFactory(config)
adapter = factory.create("arxiv", config.config)
# Whether search() takes the structured dict or a raw query string depends on the adapter.
papers = adapter.search(structured, max_results=50)
# Deduplicate/export per project conventions
7. Public API / Interfaces
Paper
| Attribute | Type | Description |
|---|---|---|
| title, authors, abstract, year, venue, doi, arxiv_id, url | str / list / int | Core metadata. |
| citations, keywords, database_source, pdf_url | int / list / str | Optional. |
| publication_type, issue, volume, pages, publisher | str | Publication details. |
DatabaseConfig
| Member | Description |
|---|---|
| config_dir, app_name, config | Config directory, application name, and loaded config dict. |
| config_path | Path to lit_search_config.json. |
| _load_config() | Load/merge with defaults; init key slots. |
QueryParser
| Method | Description |
|---|---|
| parse_boolean_query(query) | Returns dict: quoted_phrases, groups (type + terms), original_query. |
Adapters
| Symbol | Description |
|---|---|
| DatabaseAdapter | Abstract base; search returns list of Paper. |
| AdapterFactory.create(name, config) | Return adapter for name. |
CacheManager
| Method | Description |
|---|---|
| get(key) | Cached value or None. |
| set(key, value) | Store with expiry. |
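A get/set cache with expiry on top of sqlite3 can be sketched as below. The table layout, the JSON serialization, and the `max_age_hours` handling are assumptions made for illustration; `SqliteCacheSketch` is a hypothetical class, not the module's CacheManager. The clock is injectable so expiry can be shown without waiting.

```python
import hashlib
import json
import sqlite3
import time
from typing import Any, Callable, Optional


class SqliteCacheSketch:
    """Hypothetical key/value cache with a maximum entry age."""

    def __init__(
        self,
        db_path: str = ":memory:",
        max_age_hours: float = 24.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._conn = sqlite3.connect(db_path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(key TEXT PRIMARY KEY, value TEXT NOT NULL, created REAL NOT NULL)"
        )
        self._max_age_s = max_age_hours * 3600
        self._clock = clock

    @staticmethod
    def make_key(query: str, database: str) -> str:
        """Hash the query together with the database name."""
        return hashlib.sha256(f"{database}:{query}".encode("utf-8")).hexdigest()

    def get(self, key: str) -> Optional[Any]:
        row = self._conn.execute(
            "SELECT value, created FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            return None
        value, created = row
        if self._clock() - created > self._max_age_s:
            # Entry is too old: remove it and report a miss.
            self._conn.execute("DELETE FROM cache WHERE key = ?", (key,))
            self._conn.commit()
            return None
        return json.loads(value)

    def set(self, key: str, value: Any) -> None:
        self._conn.execute(
            "INSERT OR REPLACE INTO cache (key, value, created) VALUES (?, ?, ?)",
            (key, json.dumps(value), self._clock()),
        )
        self._conn.commit()


now = [1000.0]  # mutable fake time, in seconds
cache = SqliteCacheSketch(max_age_hours=1, clock=lambda: now[0])
key = cache.make_key('"machine learning" AND deep', "arxiv")
cache.set(key, [{"title": "A"}])
fresh = cache.get(key)
now[0] += 3601  # just over one hour later
stale = cache.get(key)
```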
8. Maintenance & Troubleshooting
- API keys: Stored via SecureKeyManager; ensure keys are set for enabled DBs that require them (e.g. IEEE, Elsevier). Dummy placeholder only reserves the slot.
- Rate limits: Config rate_limit per database; adapters should throttle requests.
- Deduplication: Search config has similarity_threshold; dedupe logic may use title/abstract similarity or DOI/arxiv_id.
- QueryParser: Complex nested Boolean may need testing; quoted phrases and parentheses must balance.
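One way to combine identifier matching with a title similarity threshold for deduplication is sketched below, using the standard library's `difflib.SequenceMatcher`. This is an assumption about how such logic could work, not a description of the module's dedupe code; `is_duplicate` and `deduplicate` are hypothetical names, and papers are plain dicts here for brevity.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def _norm_title(title: str) -> str:
    """Lowercase and collapse whitespace before comparing titles."""
    return " ".join(title.lower().split())


def is_duplicate(a: Dict[str, str], b: Dict[str, str], threshold: float = 0.9) -> bool:
    # Identical non-empty identifiers are treated as a definite match.
    for key in ("doi", "arxiv_id"):
        if a.get(key) and a.get(key) == b.get(key):
            return True
    title_a = _norm_title(a.get("title", ""))
    title_b = _norm_title(b.get("title", ""))
    if not title_a or not title_b:
        return False
    return SequenceMatcher(None, title_a, title_b).ratio() >= threshold


def deduplicate(papers: List[Dict[str, str]], threshold: float = 0.9) -> List[Dict[str, str]]:
    """Keep the first occurrence of each paper; O(n^2) pairwise comparison."""
    unique: List[Dict[str, str]] = []
    for paper in papers:
        if not any(is_duplicate(paper, kept, threshold) for kept in unique):
            unique.append(paper)
    return unique


papers = [
    {"title": "Attention Is All You Need", "doi": "10.1/abc", "arxiv_id": ""},
    {"title": "Attention is all you need.", "doi": "", "arxiv_id": "1706.03762"},
    {"title": "A Different Paper", "doi": "10.1/abc", "arxiv_id": ""},
    {"title": "Graph Neural Networks: A Review", "doi": "10.2/xyz", "arxiv_id": ""},
]
unique = deduplicate(papers)
```

The second entry is dropped on title similarity and the third on its shared DOI. The pairwise loop is quadratic, which is acceptable for result sets of a few hundred papers but would need indexing for larger ones.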
9. Execution Protocol
- Map logic:
ures/literature/search/search.py (DatabaseConfig, search orchestration), ures/literature/search/paper.py (Paper, CacheManager), ures/literature/search/adapters.py (DatabaseAdapter, adapters, AdapterFactory, QueryParser); ures/literature/search_cli.py for CLI.
- Abstract: Document purpose and flow; do not duplicate line-by-line code.
- Draft: Follow this template.
- Review: No API keys or sensitive paths in docs.