Welcome to DATAPHREAK, your comprehensive data analysis and cleaning tool. This guide covers all features and capabilities.
🚀 Getting Started - Your First 5 Minutes
New to DATAPHREAK? Follow this quick walkthrough to get started immediately!
📊 Step 1: Try Sample Data (30 seconds)
• Click the "✨ Try Sample Data" button in the Load Data section
• This loads 400 rows of customer data with some intentional issues to explore
• Perfect for learning without uploading your own files
👀 Step 2: See Your Data Overview (1 minute)
• Check the KPI cards that appear - they show row count, columns, missing values, and duplicates
• Look for the Data Quality Score - higher is better!
• Notice the Quick Actions like "Analyze All" and "Quick Fix"
📈 Step 3: Visualize Your Data (1 minute)
• Click the "Data Charts" button to see distribution charts for your columns
• Hover over bars to see exact counts and percentages
• Age data automatically groups into 5-year ranges
🧹 Step 4: Clean Your Data (2 minutes)
• Click "🧹 Clean All" for smart instant cleaning
• Or go to "Data Operations" section to choose specific fixes
• Watch the data quality score improve!
💾 Step 5: Export Results (30 seconds)
• Use "Export" to download your cleaned data
• Choose CSV, JSON, or create a Data Dictionary report
• Files are automatically named based on operations performed
🎯 Pro Tip: Start with sample data to learn, then upload your own CSV files. Everything happens locally in your browser - your data never leaves your computer!
Ready to dive deeper? Continue reading the sections below, or jump to any topic using the navigation above.
1. Loading Your Data
Getting your files into DATAPHREAK is easy!
- Drag & Drop: Simply drag your CSV or TSV file into the gray drop zone
- Browse Files: Click "Choose File" to select from your computer
- Try Sample Data: Click "✨ Try Sample Data" to practice with example data first
💡 File Requirements:
• First row should contain column names (like: Name, Email, Age)
• CSV files work best, but TSV (tab-separated) files work too
• Excel files (.xlsx/.xls) are supported directly
2. Understanding Your Results
Once your data loads, here's what you'll see:
- KPI Cards: Quick stats showing total rows, columns, missing values, and duplicate records
- Data Quality Score: AI-powered assessment (0-100%) measuring completeness, consistency, and pattern validity with A-F grading
- Quick Actions: One-click buttons for common tasks:
- Analyze All: Deep-dive into every column
- Clean All: AI-powered cleaning that auto-corrects emails, phones, and dates
- Health Check: Get a detailed data quality report
🎯 What's a good quality score?
• 90-100%: Excellent - very clean data
• 80-89%: Good - minor issues to fix
• 70-79%: Fair - several problems to address
• Below 70%: Needs significant cleaning
AI-Powered Data Quality Assessment
Comprehensive quality scoring that analyzes your entire dataset:
- Overall Score: 0-100% assessment with letter grades (A-F) like a report card
- Completeness: How much of your data is filled vs missing
- Consistency: Measures duplicate rows and data conflicts
- Validity: Checks if data follows expected patterns (emails, phones, dates)
- Visual Indicators: Color-coded quality cards (green = good, red = needs work)
- Actionable Insights: Shows exactly what needs attention
- Real-time Updates: Quality score improves as you clean your data
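To make the scoring idea concrete, here is a minimal Python sketch of how completeness and consistency could roll up into one score. The equal weighting, the function names, and the way validity is passed in are assumptions made for illustration; DATAPHREAK's actual formula is not described in this guide. The bands come from the "What's a good quality score?" list.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]

def completeness(rows: List[Row]) -> float:
    """Fraction of cells that are filled (not None and not blank)."""
    cells = [v for row in rows for v in row.values()]
    if not cells:
        return 1.0
    filled = sum(1 for v in cells if v is not None and str(v).strip() != "")
    return filled / len(cells)

def consistency(rows: List[Row]) -> float:
    """Fraction of rows that are not exact repeats of another row."""
    if not rows:
        return 1.0
    unique = {tuple(sorted(row.items(), key=lambda kv: kv[0])) for row in rows}
    return len(unique) / len(rows)

def quality_score(rows: List[Row], validity: float = 1.0) -> float:
    """Combine the three measures into a 0-100 score.

    Assumption: equal weights. The real tool may weight them differently.
    """
    return round(100 * (completeness(rows) + consistency(rows) + validity) / 3, 1)

def band(score: float) -> str:
    """Map a score to the bands listed in this guide."""
    if score >= 90:
        return "Excellent"
    if score >= 80:
        return "Good"
    if score >= 70:
        return "Fair"
    return "Needs significant cleaning"
```

Removing a duplicate row or filling a blank cell raises the matching component, which is why the score climbs as you clean.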
AI-Powered Duplicate Detection
Advanced duplicate detection that goes beyond basic matching:
- Exact Duplicates: Finds rows that are completely identical across all fields
- Fuzzy Duplicates: AI similarity matching finds near-duplicates like "John Smith" vs "Jon Smith"
- Smart Export: CSV includes both exact_duplicate_group and fuzzy_duplicate_group columns
- Performance Optimized: Handles large datasets with chunked processing
- Visual Grouping: Clear separation between exact matches and similar records
- Row References: Shows exact row numbers for easy verification in Excel
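The difference between exact and fuzzy grouping can be sketched in a few lines of Python. This is an illustration only: it uses the standard library's difflib similarity ratio, a hypothetical 0.85 threshold, and a simple all-pairs loop, none of which are confirmed to match what DATAPHREAK does internally. A group id of 0 means "no duplicate found".

```python
from difflib import SequenceMatcher
from typing import Dict, List

def exact_duplicate_groups(rows: List[Dict[str, str]]) -> List[int]:
    """Give rows that are identical across all fields the same group id."""
    seen: Dict[tuple, List[int]] = {}
    for i, row in enumerate(rows):
        seen.setdefault(tuple(sorted(row.items())), []).append(i)
    groups = [0] * len(rows)
    next_id = 1
    for indexes in seen.values():
        if len(indexes) > 1:
            for i in indexes:
                groups[i] = next_id
            next_id += 1
    return groups

def fuzzy_duplicate_groups(rows: List[Dict[str, str]], column: str,
                           threshold: float = 0.85) -> List[int]:
    """Group rows whose values in one column are similar but not identical.

    Assumption: similarity is difflib's ratio on trimmed, lowercased text.
    """
    values = [row[column].strip().lower() for row in rows]
    groups = [0] * len(rows)
    next_id = 1
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] == values[j]:
                continue  # exact matches are reported separately
            if SequenceMatcher(None, values[i], values[j]).ratio() >= threshold:
                gid = groups[i] or groups[j]
                if gid == 0:
                    gid = next_id
                    next_id += 1
                groups[i] = groups[j] = gid
    return groups
```

The two lists line up with the exact_duplicate_group and fuzzy_duplicate_group columns in the exported CSV. The all-pairs loop is quadratic, which is one reason a real implementation would process large files in chunks.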
Field Analysis & Profiling
Detailed analysis for each column in your data with AI pattern detection:
- Data Types: Number, date, boolean, string with confidence scores
- AI Patterns: Automatically detects emails, phone numbers, and dates with confidence indicators
- Pattern Confidence: Green = high confidence, yellow = partial match, gray = low confidence
- Coverage: Percentage of filled vs missing values
- Uniqueness: How many different values appear in each column
- Number Analysis: Minimum, maximum, average, and unusual values
- Date Ranges: Earliest and latest dates in date columns
- Text Analysis: Character length patterns and text formatting details
- Sample Values: Representative examples from each column
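As a rough picture of what a column profile contains, the Python sketch below computes coverage, uniqueness, and basic number statistics for one column. The function name and the returned keys are hypothetical, and type detection here is deliberately simple (a column is a number only if every filled value parses as one).

```python
from statistics import mean
from typing import Dict, List, Optional

def profile_column(values: List[Optional[str]]) -> Dict[str, object]:
    """Summarise one column: coverage, uniqueness, and numeric stats."""
    filled = [v.strip() for v in values if v is not None and v.strip() != ""]
    profile: Dict[str, object] = {
        "coverage_pct": round(100 * len(filled) / len(values), 1) if values else 0.0,
        "unique_values": len(set(filled)),
    }
    numbers: List[float] = []
    for v in filled:
        try:
            numbers.append(float(v))
        except ValueError:
            break  # one non-numeric value is enough to treat the column as text
    if filled and len(numbers) == len(filled):
        profile.update(type="number", min=min(numbers), max=max(numbers),
                       mean=round(mean(numbers), 2))
    else:
        profile["type"] = "string"
    return profile
```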
AI Pattern Detection
Intelligent recognition and standardization of common data formats:
- Email Detection: Automatically identifies and standardizes email addresses to lowercase
- Phone Numbers: Recognizes various phone formats and standardizes to consistent formatting
- Date Intelligence: Handles mixed international date formats (US, European, ISO) and converts to YYYY-MM-DD
- Smart Confidence: Shows green (high), yellow (partial), or gray (low) confidence indicators
- Auto-Correction: Clean All button applies pattern-based fixes automatically
- International Support: Properly handles global phone and date formats
- Column Intelligence: Uses column names for better pattern prioritization
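Here is a small Python sketch of pattern-based standardization for emails and dates, plus a confidence indicator. The regular expression, the three date formats, and the 90% / 50% confidence cut-offs are all assumptions chosen for the example, not DATAPHREAK's documented rules. To sidestep the ambiguity between US and European day/month order, the sketch treats slashes as US and dots as European.

```python
import re
from datetime import datetime
from typing import Callable, List, Optional

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
# Assumed formats: ISO (2024-03-15), US (03/15/2024), European (15.03.2024)
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y"]

def normalize_email(value: str) -> Optional[str]:
    """Return the trimmed, lowercased email, or None if it does not look like one."""
    v = value.strip()
    return v.lower() if EMAIL_RE.match(v) else None

def normalize_date(value: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

def confidence(values: List[str], normalizer: Callable[[str], Optional[str]]) -> str:
    """Green / yellow / gray based on the share of filled values that match."""
    filled = [v for v in values if v and v.strip()]
    if not filled:
        return "gray"
    share = sum(1 for v in filled if normalizer(v) is not None) / len(filled)
    if share >= 0.9:   # assumed cut-off for high confidence
        return "green"
    if share >= 0.5:   # assumed cut-off for a partial match
        return "yellow"
    return "gray"
```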
Data Distribution Charts
Professional-grade interactive charts accessible via the Data Charts button:
- Smart Binning: Automatic age grouping (5-year ranges) and intelligent numeric binning
- Visual Features: Gradient colors, grid lines, Y-axis labels with professional styling
- Statistical Overlays: Mean (μ) and median (M) lines with exact values displayed
- Interactive Tooltips: Hover for exact counts, ranges, and percentages
- Color Coding: Frequency-based colors (red=high frequency, blue=low frequency)
- Categorical Charts: Top 10 values for text columns with horizontal bars
- Smooth Animations: Loading transitions and hover effects for enhanced user experience
- Print Function: Export histograms to PDF with professional formatting
- Responsive Design: Charts automatically scale from 580px to 1160px based on screen size
- Age Detection: Automatically formats age data with meaningful 5-year groupings
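The 5-year age grouping and the mean/median overlays can be illustrated with a short Python sketch. The function names are hypothetical, and the sketch assumes bins start at multiples of 5 (20-24, 25-29, and so on), which may differ from how the charts pick their boundaries.

```python
from collections import Counter
from statistics import mean, median
from typing import Dict, List, Tuple

def age_bins(ages: List[int], width: int = 5) -> Dict[str, int]:
    """Count ages per fixed-width range, labelled like '25-29'."""
    counts = Counter((age // width) * width for age in ages)
    return {f"{start}-{start + width - 1}": counts[start] for start in sorted(counts)}

def overlays(ages: List[int]) -> Tuple[float, float]:
    """Mean and median, the two statistical lines drawn over the chart."""
    return round(mean(ages), 1), float(median(ages))
```

For example, ages 25 and 29 both land in the "25-29" bar, and the percentage shown in a tooltip is that bar's count divided by the total number of rows.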
Validation Rules & Quality Checks
Define and enforce data quality standards:
- Allowed Values: Comma-separated lists of valid values per column
- Pattern Matching: Custom rules for complex data validation
- Rule Persistence: Automatically saves rules using column header signatures
- Issue Detection: Highlights cells that don't match your rules
- Issue Reporting: "Rows with Issues" section shows problematic data
- Batch Processing: Apply rules across entire datasets efficiently
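An "Allowed Values" rule boils down to a set-membership check per cell. The Python sketch below shows one way this could work; the function names are made up for the example, and it assumes blank cells count as issues and that matching is case-sensitive, which the guide does not specify.

```python
from typing import Dict, List, Set, Tuple

def parse_allowed(rule: str) -> Set[str]:
    """Turn a comma-separated rule like 'Active, Closed' into a set of values."""
    return {item.strip() for item in rule.split(",") if item.strip()}

def find_issues(rows: List[Dict[str, str]],
                rules: Dict[str, str]) -> List[Tuple[int, str, str]]:
    """Return (row number, column, value) for every cell that breaks a rule.

    Row numbers start at 1 for the first data row.
    """
    allowed = {col: parse_allowed(rule) for col, rule in rules.items()}
    issues = []
    for number, row in enumerate(rows, start=1):
        for col, valid in allowed.items():
            value = row.get(col, "")
            if value not in valid:
                issues.append((number, col, value))
    return issues
```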
Export Options
Multiple export formats and customization options:
- CSV Export: Standard comma-separated format with proper escaping
- JSON Export: Array of objects with column headers as keys
- Data Dictionary: Complete field definitions with statistics and rules
- Enhanced Duplicate Exports: Includes both exact_duplicate_group and fuzzy_duplicate_group columns for comprehensive review
- Histogram Printing: Professional PDF reports of distribution charts
- Security: Formula injection protection (prefixes dangerous characters)
- File Naming: Intelligent naming based on operations performed
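The CSV escaping and formula injection protection can be sketched in Python as follows. The prefix characters (=, +, -, @) and the single-quote prefix are taken from the Security & Privacy section of this guide; the function names are hypothetical, and the standard csv module stands in for whatever writer the tool actually uses.

```python
import csv
import io
import json
from typing import Dict, List

DANGEROUS_PREFIXES = ("=", "+", "-", "@")

def sanitize_cell(value: str) -> str:
    """Prefix a single quote so spreadsheets treat the cell as plain text."""
    return "'" + value if value.startswith(DANGEROUS_PREFIXES) else value

def to_csv(rows: List[Dict[str, str]], columns: List[str]) -> str:
    """Write a header plus sanitised rows; csv handles quoting of commas and quotes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow([sanitize_cell(str(row.get(c, ""))) for c in columns])
    return buffer.getvalue()

def to_json(rows: List[Dict[str, str]]) -> str:
    """Array of objects with column headers as keys."""
    return json.dumps(rows, ensure_ascii=False)
```

One side effect of this rule is that a value such as "-5" also gets the quote prefix, so negative numbers arrive in the spreadsheet as text.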
Compare Files (Find Similar Records)
Have two files with similar but not identical data? This feature helps you find matching records across files.
📋 Common Use Cases:
- Customer Lists: Find "John Smith" in file A that matches "Jon Smith" in file B
- Email Variations: Match "j.smith@company.com" with "john.smith@company.com"
- Company Names: Find "ABC Corp" that matches "ABC Corporation"
🔧 How to Use:
Step 1: Load your first file (becomes your "primary" dataset)
Step 2: Use "Merge Files" to load a second file (becomes "secondary")
Step 3: Choose which columns to compare between the files
Step 4: Set similarity level (0.80 = very similar, 0.50 = somewhat similar)
Step 5: Click "Compare A↔B" to find matches
⚡ Performance Note: To keep your browser responsive, we limit comparisons for very large files. If you see a message about too many comparisons, try increasing the similarity setting or working with smaller datasets.
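To show why a higher similarity setting means less work, here is a minimal Python sketch of a cross-file comparison with a comparison cap. It is an illustration under assumptions: similarity is difflib's ratio, the cap defaults to the ~200,000 figure from the Security & Privacy section, and pairs are skipped early when their lengths alone make the threshold unreachable. DATAPHREAK's own blocking strategy is not documented here.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

def compare_files(primary: List[str], secondary: List[str],
                  threshold: float = 0.80,
                  max_comparisons: int = 200_000) -> List[Tuple[int, int, float]]:
    """Return (index_a, index_b, score) for pairs at or above the threshold."""
    matches = []
    comparisons = 0
    for i, a in enumerate(primary):
        a_norm = a.strip().lower()
        for j, b in enumerate(secondary):
            b_norm = b.strip().lower()
            total = len(a_norm) + len(b_norm)
            if total == 0:
                continue
            # The ratio can never exceed 2 * shorter / total, so a stricter
            # threshold lets more pairs be skipped without comparing them.
            if 2 * min(len(a_norm), len(b_norm)) / total < threshold:
                continue
            comparisons += 1
            if comparisons > max_comparisons:
                raise RuntimeError("Too many comparisons; raise the similarity "
                                   "level or use smaller files")
            score = SequenceMatcher(None, a_norm, b_norm).ratio()
            if score >= threshold:
                matches.append((i, j, round(score, 2)))
    return matches
```

With the default 0.80 setting, "John Smith" matches "Jon Smith", while "ABC Corp" and "ABC Corporation" only pair up at a looser setting such as 0.50.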
Cross-File Merging & ID Matching
The Merge Files feature allows combining the loaded dataset with another CSV or TSV file. Choose key columns from each dataset, select a join type (Left, Inner, Right or Full) and optionally enable Salesforce ID conversion for matching. The merged output is downloaded automatically and the second file persists in memory as your secondary dataset for subsequent cross-file analysis. You can switch between the primary and secondary datasets via the dataset selector next to the dataset name.
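If the four join types are unfamiliar, the Python sketch below shows what each one keeps. The function name and the choice to fill unmatched columns with empty strings are assumptions for the example; it also does not cover the optional Salesforce ID conversion.

```python
from typing import Dict, List

Row = Dict[str, str]

def merge_rows(primary: List[Row], secondary: List[Row],
               key_a: str, key_b: str, how: str = "left") -> List[Row]:
    """Join two datasets on key columns using a left, inner, right, or full join."""
    if how not in {"left", "inner", "right", "full"}:
        raise ValueError(f"Unknown join type: {how}")
    cols_a = list(primary[0]) if primary else []
    cols_b = list(secondary[0]) if secondary else []
    # Index the secondary rows by key so each lookup is fast
    index: Dict[str, List[int]] = {}
    for j, row in enumerate(secondary):
        index.setdefault(row[key_b], []).append(j)
    matched_b = set()
    merged: List[Row] = []
    for a in primary:
        hits = index.get(a[key_a], [])
        for j in hits:
            matched_b.add(j)
            merged.append({**a, **secondary[j]})
        if not hits and how in {"left", "full"}:
            # Keep the unmatched primary row, with blanks for secondary columns
            merged.append({**a, **{c: "" for c in cols_b if c not in a}})
    if how in {"right", "full"}:
        for j, b in enumerate(secondary):
            if j not in matched_b:
                # Keep the unmatched secondary row, with blanks for primary columns
                merged.append({**{c: "" for c in cols_a if c not in b}, **b})
    return merged
```

In short: Inner keeps only matched rows, Left adds unmatched primary rows, Right adds unmatched secondary rows, and Full keeps everything.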
Themes
Use the theme button to toggle between Dark, Light, and Matrix modes. Your selection is saved locally and persists between sessions.
Settings & Preferences
Open the Settings panel from the top navigation to customise DATAPHREAK. From this modal you can:
- Choose a theme – Dark, Light and Matrix Mode styles are available.
- Select default cleaning operations – decide which actions (trim spaces, fix letter case, remove accents and convert Salesforce IDs) are pre-selected when you open the Data Cleaning & IDs panel.
- Pick your language – switch the interface language once translations are available.
- Enable or disable persistence – save preferences and rules to your browser’s local storage or opt to discard them on page reload.
- Encrypt local data – optionally protect your preferences and rules with a passphrase using in-browser AES-GCM encryption.
Your preferences are stored locally only and never sent to a server. If persistence is disabled, settings revert to defaults when the page is refreshed.
Keyboard Navigation
You can tab through inputs and use keyboard shortcuts like Ctrl+L to reload the file or Esc to close modals. Most buttons also display an informative tooltip when hovered.
Quick Data Cleaning
Fix common data problems with one click! Select columns and choose which fixes to apply:
🧹 Available Cleaning Options:
- Trim Spaces: Removes extra spaces before/after text
Example: " John Smith " becomes "John Smith"
- Fix Letter Case: Smart formatting based on field type
Names → Title Case, Emails → lowercase, IDs → UPPERCASE
- Remove Accents: Converts special characters to regular letters
Example: "José" becomes "Jose"
- Convert Salesforce IDs: Extends 15-character IDs to 18-character format
Useful if you work with Salesforce data
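For readers curious about what these four operations do under the hood, here is a Python sketch. The function names and the "kind" argument are hypothetical. The Salesforce conversion follows the commonly published checksum scheme (three 5-character chunks, one suffix character per chunk based on which letters are uppercase); treat it as an illustration rather than DATAPHREAK's exact code.

```python
import unicodedata

def trim_spaces(value: str) -> str:
    """Remove spaces before and after the text."""
    return value.strip()

def fix_case(value: str, kind: str) -> str:
    """Names to Title Case, emails to lowercase, IDs to UPPERCASE."""
    if kind == "name":
        return value.title()
    if kind == "email":
        return value.lower()
    if kind == "id":
        return value.upper()
    return value

def remove_accents(value: str) -> str:
    """Decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

SUFFIX_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ012345"

def to_18_char_id(sf_id: str) -> str:
    """Extend a 15-character Salesforce ID to 18 characters.

    Each 5-character chunk yields one suffix character: bit i is set
    when the i-th character of the chunk is an uppercase letter.
    """
    if len(sf_id) != 15:
        return sf_id  # leave anything that is not a 15-character ID alone
    suffix = ""
    for start in (0, 5, 10):
        chunk = sf_id[start:start + 5]
        bits = sum(1 << i for i, ch in enumerate(chunk) if "A" <= ch <= "Z")
        suffix += SUFFIX_CHARS[bits]
    return sf_id + suffix
```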
🚀 Quick Start:
Step 1: Click checkboxes next to columns you want to clean
Step 2: Choose which operations to apply
Step 3: Click "Apply" to clean your data
Step 4: Watch your data quality score improve!
💡 Pro Tip: Use "🧹 Clean All" button for AI-powered instant cleaning that auto-corrects email, phone, and date formats, or customize specific operations below.
Unique Keys & IDs
The assistant in the Data Cleaning & IDs panel helps you find one or two fields that can uniquely identify each row. It scans for fields or field pairs with no missing values and no duplicates. These are ideal as primary keys or External IDs. If none exist, the assistant offers to add a new surrogate_id column with sequential values.
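The key search described above can be sketched in Python like this. The function names are hypothetical, and the sketch assumes single fields are preferred over pairs and that surrogate_id values start at 1; the guide only says the values are sequential.

```python
from itertools import combinations
from typing import Dict, List, Optional, Sequence, Tuple

Row = Dict[str, Optional[str]]

def is_unique_key(rows: List[Row], columns: Sequence[str]) -> bool:
    """True if the columns have no missing values and no repeated combinations."""
    seen = set()
    for row in rows:
        key = tuple(row.get(c) for c in columns)
        if any(v is None or str(v).strip() == "" for v in key):
            return False
        if key in seen:
            return False
        seen.add(key)
    return True

def suggest_keys(rows: List[Row], columns: List[str]) -> List[Tuple[str, ...]]:
    """Prefer single-field keys; fall back to field pairs if none exist."""
    singles = [(c,) for c in columns if is_unique_key(rows, (c,))]
    if singles:
        return singles
    return [pair for pair in combinations(columns, 2) if is_unique_key(rows, pair)]

def add_surrogate_id(rows: List[Row]) -> List[Row]:
    """Add a surrogate_id column with sequential values (assumed to start at 1)."""
    return [{"surrogate_id": str(i), **row} for i, row in enumerate(rows, start=1)]
```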
Security & Privacy
DATAPHREAK is designed to keep your data private and secure. All processing happens locally in your browser—no network calls are made. This section summarises the key security measures:
- Offline-only operation: your datasets never leave your computer.
- Sanitised exports: exported CSV values are escaped to prevent formula injection and malicious scripts; cells that begin with "=", "+", "-", or "@" are prefixed with a single quote to turn them into plain text.
- No macros or scripting engine: unlike traditional spreadsheets, DATAPHREAK has no macro support, eliminating a major attack vector.
- Strict Content Security Policy: the page is served with a Content Security Policy that blocks inline and remote scripts.
- Cross-file comparison cap: approximate matching across files is capped at ~200,000 comparisons with lightweight blocking to prevent browser lock-ups.
- Error handling: file parsing is wrapped in try/catch blocks with friendly error messages.
- Local persistence: rule sets and preferences are stored only in your browser’s localStorage; nothing is uploaded to any server. You can disable persistence entirely or encrypt your saved data with a passphrase in the Settings panel.
Legal Notice
This tool is currently in beta and is provided “as is” without warranty of any kind. It was created and is owned by Zachary Sluss. During the beta period it is made available as open source. For inquiries, contact zacsluss@yahoo.com.