ResAIKit
Research Integrity Toolkit

Documentation

Everything you need to screen manuscripts for research integrity risks.

Getting Started

ResAIKit analyzes scientific manuscripts for potential research integrity issues across three dimensions: AI-generated text patterns, image and figure manipulation, and statistical fabrication signals. Upload a manuscript in any common format (PDF, DOCX, LaTeX) or paste text directly.

The analysis runs through four processing layers. Layer 1 applies deterministic rules (regex, dictionaries, GRIM tests). Layer 2 applies NLP (spaCy part-of-speech tagging and dependency parsing, Hunspell spell checking). Layer 3 queries external academic databases (CrossRef, PubMed, Semantic Scholar) to verify references. Layer 4 uses large language models to analyze semantic coherence and factual consistency.

Results are organized by indicator. Each indicator produces a 0-5 score. Scores are aggregated into a 0-100 composite per module. The output is designed for human review: you inspect flagged evidence, not automated verdicts.

Interpreting Scores

ResAIKit uses a 0-100 composite scale per module. This is a risk-screening score, not a probability of misconduct. Scores should be interpreted in context:

  • 0-30 (Low risk): Few indicators flagged. Pattern consistent with normally reported research. Routine editorial review is sufficient.
  • 30-60 (Moderate risk): Multiple indicators with elevated scores. Requires closer inspection. Look at which specific indicators contributed and whether they converge on the same passage or figure.
  • 60-100 (High risk): Strong convergence of indicators across multiple dimensions. Warrants detailed expert review. The "Review Priorities" section lists the most concerning findings.
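
The bands above share their boundary values (30 and 60 each appear in two ranges). The following is a minimal sketch of the mapping, assuming a boundary score falls in the higher band; that tie-breaking rule and the function name `risk_band` are illustrative assumptions, not documented ResAIKit behavior.

```python
def risk_band(composite: float) -> str:
    """Map a 0-100 module composite to the band names listed above.

    Assumption: a score exactly on a shared boundary (30 or 60)
    is placed in the higher band.
    """
    if not 0 <= composite <= 100:
        raise ValueError("composite must be between 0 and 100")
    if composite < 30:
        return "Low risk"
    if composite < 60:
        return "Moderate risk"
    return "High risk"


print(risk_band(42))  # Moderate risk
```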

The composite score weights indicator categories differently. Deterministic indicators (GRIM, Benford, digit checks) carry more weight because their results are reproducible and, for arithmetic checks such as GRIM, a flag means the reported values cannot all be correct as printed. LLM-based indicators carry less weight individually but gain significance when they corroborate lower-layer findings.
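
One way to picture this weighting is a weighted mean of the 0-5 indicator scores, rescaled to 0-100. The sketch below is illustrative only: the per-layer weights, the `IndicatorScore` type, and the `module_composite` function are assumptions, and it does not model the corroboration effect described for LLM-based indicators.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class IndicatorScore:
    code: str     # indicator identifier, e.g. "S1"
    score: float  # 0-5 indicator score
    layer: int    # processing layer 1-4 that produced it


# Hypothetical weights: deterministic layers count more than the LLM layer.
LAYER_WEIGHTS = {1: 3.0, 2: 2.0, 3: 2.0, 4: 1.0}


def module_composite(scores: Iterable[IndicatorScore]) -> float:
    """Weighted mean of 0-5 scores, rescaled to the 0-100 composite range."""
    scores = list(scores)
    if not scores:
        return 0.0
    total_weight = sum(LAYER_WEIGHTS[s.layer] for s in scores)
    weighted_sum = sum(LAYER_WEIGHTS[s.layer] * s.score for s in scores)
    return 100.0 * weighted_sum / (5.0 * total_weight)


example = [IndicatorScore("S1", 5, 1), IndicatorScore("C1", 0, 4)]
print(module_composite(example))  # 75.0
```

With these assumed weights, a maximal Layer 1 flag dominates an unflagged Layer 4 indicator, which is the behavior the paragraph above describes.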

Text Analysis (31 indicators)

The text module examines linguistic patterns characteristic of AI-generated academic prose:

  • Stylistic (C1-C10): Generality, verbosity, artificial structure, weak methodological anchoring, hedging patterns, promotional tone. These detect the generic, surface-level discourse patterns common in LLM outputs.
  • Forensic (E1-E7): Unicode anomalies, encoding artifacts, punctuation entropy, rhythm variation. These detect technical traces left by specific generation pipelines.
  • Fingerprints (F1-F5): Model-specific n-gram and token distribution patterns for ChatGPT, Claude, Gemini, Grok, and Perplexity.
  • Hallucination (G1-G3): Citation fabrication, statistical impossibility, factual inconsistency with known databases.
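
As an illustration of what a Unicode anomaly check in the Forensic group might look for, the sketch below counts invisible format characters and unusual space characters using Python's standard `unicodedata` module. The function name and the choice of character categories are assumptions; these characters also occur in legitimately typeset documents, so a count is evidence to inspect rather than a finding.

```python
import unicodedata
from collections import Counter


def unicode_anomalies(text: str) -> Counter:
    """Count invisible format characters and non-ASCII spaces in text."""
    found = Counter()
    for ch in text:
        category = unicodedata.category(ch)
        # "Cf": invisible format characters (zero-width space, soft hyphen, BOM).
        # "Zs" other than U+0020: no-break and other unusual space characters.
        if category == "Cf" or (category == "Zs" and ch != " "):
            found[unicodedata.name(ch, f"U+{ord(ch):04X}")] += 1
    return found


sample = "Results\u200b were\u00a0significant."
print(unicode_anomalies(sample))
```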

When text indicators are flagged, cross-reference them with the References tab. If citations are verified (Layer 3), the risk of fabricated references is low. If citations fail verification AND stylistic indicators are elevated, the text warrants close review.

Image & Figure Analysis (47 indicators)

Image indicators apply digital forensics techniques to figures extracted from manuscripts:

  • Generic forensics (I1-I10): Error Level Analysis, FFT, noise consistency, clone detection (ORB feature matching), metadata inspection. These run on every image regardless of type.
  • Western blots (W1-W4): Band duplication detection (SSIM), morphology analysis, background tile consistency.
  • Microscopy (M1-M10): Depth of field, panel overlap, inpainting detection, Poisson noise profile.
  • Charts & graphs (G1-G12): Vector/raster consistency, axis coherence, error bar analysis, p-value clustering, Benford test on reported values.
  • Tables (T1-T14): Arithmetic consistency, GRIM/GRIMMER/SPRITE on tabulated values, copy-paste detection across rows.
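
The Benford test mentioned for charts and graphs compares the leading digits of reported values with the distribution P(d) = log10(1 + 1/d). Below is a minimal sketch using a chi-square statistic; the function names are illustrative, and how ResAIKit implements or thresholds this check is not documented here. Benford's law is only a reasonable expectation for values spanning several orders of magnitude, so a large statistic on a narrow-range dataset is not informative.

```python
import math
from collections import Counter

# Chi-square critical value for 8 degrees of freedom at alpha = 0.05.
CHI2_CRITICAL_8DF_05 = 15.507


def first_digit(value: float) -> int:
    """Leading significant digit of a non-zero finite number."""
    # Scientific notation puts the leading digit first, e.g. "3.14...e+02".
    return int(f"{abs(value):.15e}"[0])


def benford_chi_square(values) -> float:
    """Chi-square statistic of observed first digits against Benford's law."""
    digits = [first_digit(v) for v in values if v != 0 and math.isfinite(v)]
    n = len(digits)
    if n == 0:
        raise ValueError("no usable values")
    observed = Counter(digits)
    chi2 = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        chi2 += (observed[d] - expected) ** 2 / expected
    return chi2


powers = [2.0 ** k for k in range(100)]  # known to follow Benford closely
print(benford_chi_square(powers) < CHI2_CRITICAL_8DF_05)  # True
```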

Statistical Analysis (65 indicators)

The statistics module is the most mathematically rigorous component. It applies three categories of checks:

  • Consistency (S1-S20): Mean × N integer plausibility (GRIM), standard deviation plausibility (GRIMMER), Monte Carlo distribution reconstruction (SPRITE), proportion verification (DEBIT), p-value recomputation (statcheck), digit preference analysis, Benford's law conformity.
  • Methodology (R1-R12): Design-test compatibility, sample size reporting adequacy, multiple comparison correction, pre-registration fidelity, missing data handling.
  • Fabrication (D1-D34): Individual participant data patterns: absent expected correlations, excessive Gaussianity, demographic anomalies, inlier clustering, cross-variable rule violations against 6 domain-specific dictionaries.
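
To make the GRIM check concrete: when the underlying data are integers (for example, Likert responses), the sum of N values must be an integer, so only certain means are possible for a given N. The sketch below tests whether any integer total reproduces the reported mean at the reported precision. It is an illustration of the general technique, not ResAIKit's implementation; the function name and the tolerance are assumptions.

```python
def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """Return True if some integer total divided by n rounds to reported_mean.

    Assumes the underlying observations are integers. The comparison uses a
    half-unit tolerance so it does not depend on the rounding convention
    (half-up or half-even) the authors used.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    half_unit = 0.5 * 10 ** (-decimals)
    nearest_total = round(reported_mean * n)
    # The nearest integer total gives the closest achievable mean; its
    # neighbours are checked as a guard against floating-point edge cases.
    for total in (nearest_total - 1, nearest_total, nearest_total + 1):
        if abs(total / n - reported_mean) <= half_unit + 1e-12:
            return True
    return False


print(grim_consistent(5.18, 28))  # True: 145 / 28 = 5.1786
print(grim_consistent(5.19, 28))  # False: 145/28 = 5.1786 and 146/28 = 5.2143
```

The check loses power as N grows: once N exceeds 10 to the power of `decimals`, every rounded mean is achievable and the function always returns True.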

The D10 composite fabrication score uses weighted aggregation across 4 sub-domains with multiplicative bonuses for critical combinations that co-occur in known fabrication cases (e.g., too-perfect data + absent correlations + identical SDs across groups).
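
The combination of weighted aggregation and multiplicative bonuses can be sketched as follows. Everything specific here is a placeholder: the sub-domain names, the equal weights, the flag names, and the 1.5 bonus factor are assumptions chosen for illustration, since the actual values are not given above.

```python
# Hypothetical placeholders: four equally weighted sub-domains, each scored 0-100.
SUBDOMAIN_WEIGHTS = {
    "sub_domain_1": 0.25,
    "sub_domain_2": 0.25,
    "sub_domain_3": 0.25,
    "sub_domain_4": 0.25,
}

# Hypothetical critical combination, modeled on the example in the text above.
CRITICAL_COMBINATIONS = [
    ({"too_perfect_data", "absent_correlations", "identical_sds"}, 1.5),
]


def fabrication_composite(subdomain_scores: dict, flags: set) -> float:
    """Weighted sum of sub-domain scores, boosted when a critical combination co-occurs."""
    base = sum(
        weight * subdomain_scores.get(name, 0.0)
        for name, weight in SUBDOMAIN_WEIGHTS.items()
    )
    multiplier = 1.0
    for combination, bonus in CRITICAL_COMBINATIONS:
        if combination <= flags:  # every flag in the combination is present
            multiplier *= bonus
    return min(100.0, base * multiplier)  # keep the result on the 0-100 scale


scores = {name: 40.0 for name in SUBDOMAIN_WEIGHTS}
print(fabrication_composite(scores, {"too_perfect_data"}))  # 40.0
print(fabrication_composite(
    scores, {"too_perfect_data", "absent_correlations", "identical_sds"}
))  # 60.0
```

A multiplicative bonus means a combination of flags raises the score by more than the flags would individually, which matches the stated intent of rewarding co-occurrence.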

Editorial Workflow Integration

ResAIKit is designed as a triage and documentation tool within editorial workflows:

  1. Pre-submission screening: Authors can self-check manuscripts before submission. The report documents due diligence.
  2. Editorial triage: Handling editors run a quick scan on incoming submissions. High-risk manuscripts are routed to integrity specialists.
  3. Reviewer support: Peer reviewers use the tool to verify specific concerns (suspicious figures, improbable statistics). The indicator-level evidence provides objective documentation for review comments.
  4. Institutional compliance: Research integrity officers maintain an audit trail of screened manuscripts with timestamped PDF reports.

Always document the human decision separately from the automated flags. The tool records what was detected; the editor records what was decided.
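
For teams that keep their own audit trail, this separation can be mirrored in the record structure: automated flags are stored as immutable evidence, and the editorial decision lives in separate fields filled in by a person. The sketch below is a hypothetical data model, not a ResAIKit export format; all class and field names are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class AutomatedFlag:
    """What the tool detected. Frozen so it cannot be edited after screening."""
    indicator: str  # e.g. "S1"
    score: int      # 0-5 indicator score
    evidence: str   # short description of the flagged passage or figure


@dataclass
class ScreeningRecord:
    manuscript_id: str
    flags: Tuple[AutomatedFlag, ...]
    screened_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    # What the editor decided, recorded separately from the flags.
    decision: Optional[str] = None
    decided_by: Optional[str] = None
    decided_at: Optional[datetime] = None

    def record_decision(self, editor: str, decision: str) -> None:
        self.decision = decision
        self.decided_by = editor
        self.decided_at = datetime.now(timezone.utc)


record = ScreeningRecord("MS-0001", (AutomatedFlag("S1", 4, "Table 2, row 3"),))
record.record_decision("handling editor", "Request raw data from authors")
```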

FAQ

Does ResAIKit make decisions about misconduct?
No. It flags patterns that warrant human review. The output is evidence for expert assessment, not an automated accusation. Always apply editorial judgment.

What file formats are supported?
PDF, DOCX, LaTeX source, and plain text. Images are extracted automatically from documents. Maximum file size is 50 MB.

How long does an analysis take?
Quick analysis (layers 1-2): 1-10 seconds. Full analysis (all 4 layers): 30-90 seconds depending on document length and image count. Results stream in real time, with status shown in the progress bar.

What LLM does the AI analysis use?
Free: no LLM. Plus: DeepSeek. Pro: Claude Sonnet 4.6. Max: Claude Sonnet 4.6. The LLM provider is fixed per plan to ensure predictable performance and cost.

Is my manuscript data secure?
Uploaded manuscripts are processed in memory and stored temporarily for analysis. Cached results are retained for 7-30 days (depending on analysis type) to avoid redundant processing. No manuscript content is shared with third parties beyond the API calls required for analysis (CrossRef, PubMed, LLM provider).

Ready to screen a manuscript?

Start with 10 free quick analyses. No credit card required.
