url-metadata

laurengarcia | 69k | MIT | 5.2.2 | TypeScript support: included

Request a url and scrape the metadata from its HTML using Node.js or the browser.

html, metadata, meta tags, scrape, scraper, parser, seo, url, article, citations, node, node.js, browser, og, open graph, og: tags, json-ld, twitter cards

Request a url and scrape the metadata from its HTML using Node.js or the browser. An optional mode also lets you pass in a string of HTML or a Response object instead of a url (see Options section below).

Includes:

More details in the Returns section below.

v5.1.0+ Protects against:

To report a bug or request a feature, please open an issue or pull request on GitHub. Please read the Troubleshooting section below before filing a bug.

Install

Works with Node.js versions >=6.0.0 or in the browser when bundled with Webpack (see /example-typescript) or Vite (see /example-vite). For Next.js, see /example-nextjs. If your target environment has neither node-fetch nor window.fetch, use the earlier version 2.5.0, which relies on the now-deprecated request module.

npm install url-metadata --save

Usage

In your project file:

const urlMetadata = require('url-metadata');

(async function () {
  try {
    const url = 'https://www.npmjs.com/package/url-metadata';
    const metadata = await urlMetadata(url);
    console.log(metadata);
  } catch (err) {
    console.log(err);
  }
})();

Options & Defaults

To override the default options, pass in a second options argument. The default options are the values below.

const options = {

  // Customize the default request headers:
  requestHeaders: {
    'User-Agent': 'url-metadata (+https://www.npmjs.com/package/url-metadata)',
    From: 'example@example.com'
  },

  // (Node.js v18+ only)
  // To prevent SSRF attacks, the default option below blocks
  // requests to private network & reserved IP addresses via:
  // https://www.npmjs.com/package/request-filtering-agent
  // Browser security policies prevent SSRF automatically.
  requestFilteringAgentOptions: undefined,

  // (Node.js v6+ only)
  // Pass in your own custom `agent` to override the
  // built-in request filtering agent above
  // https://www.npmjs.com/package/node-fetch/v/2.7.0#custom-agent
  agent: undefined,

  // (Browser only) `fetch` API cache setting
  cache: 'no-cache',

  // (Browser only) `fetch` API mode (ex: 'cors', 'same-origin', etc)
  mode: 'cors',

  // Maximum redirects in request chain, defaults to 10
  maxRedirects: 10,

  // `fetch` timeout in milliseconds, default is 10 seconds
  timeout: 10000,

  // (Node.js v6+ only) max size of response in bytes (uncompressed)
  // Default set to 0 to disable max size
  size: 0,

  // (Node.js v6+ only) compression defaults to true
  // Support gzip/deflate content encoding, set `false` to disable
  compress: true,

  // Charset to decode response with (ex: 'auto', 'utf-8', 'EUC-JP')
  // defaults to auto-detect in `Content-Type` header or meta tag
  // if none found, default `auto` option falls back to `utf-8`
  // override by passing in charset here (ex: 'windows-1251'):
  decode: 'auto',

  // Number of characters to truncate description to
  descriptionLength: 750,

  // Force image urls in selected tags to use https,
  // valid for images & favicons with full paths
  ensureSecureImageRequest: true,

  // Include raw response body as string
  includeResponseBody: false,

  // Alternate use-case: pass in `Response` object here to be parsed
  // see example below
  parseResponseObject: undefined
};

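// Basic options usage
// (the snippets below use `await`, so run them inside an async
// function, as in the Usage example above)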
try {
  const url = 'https://www.npmjs.com/package/url-metadata';
  const metadata = await urlMetadata(url, options);
  console.log(metadata);
} catch (err) {
  console.log(err);
}

// Alternate use-case: parse a Response object instead
try {
  // fetch the url in your own code
  const response = await fetch('https://www.npmjs.com/package/url-metadata');
  // ... do other stuff with it...
  // pass the `response` object to be parsed for its metadata
  const metadata = await urlMetadata(null, {
    parseResponseObject: response
  });
  console.log(metadata);
} catch (err) {
  console.log(err);
}

// Similarly, if you have a string of html you can create
// a response object and pass the html string into it.
const html = `
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Metadata page</title>
    <meta name="author" content="foobar">
    <meta name="keywords" content="HTML, CSS, JavaScript">
  </head>
  <body>
    <h1>Metadata page</h1>
  </body>
</html>
`;
const response = new Response(html, {
  headers: {
    'Content-Type': 'text/html'
  }
});
const metadata = await urlMetadata(null, {
  parseResponseObject: response
});
console.log(metadata);
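
When only a few settings differ from the defaults, it can help to build the options object in one place. The sketch below shows one way to do that. `buildOptions` is a hypothetical helper, not part of url-metadata, and the values in it are illustrative. It assumes that any option you leave out falls back to the library default, as the "override the default options" wording above suggests.

```javascript
// Hypothetical helper, not part of url-metadata.
// Assumption: any option left out falls back to the library default.
function buildOptions(overrides = {}) {
  const projectDefaults = {
    timeout: 5000, // shorter than the 10000 ms default listed above
    maxRedirects: 3,
    descriptionLength: 200,
    requestHeaders: {
      // Illustrative value; use a string that identifies your own client.
      'User-Agent': 'my-crawler/1.0 (+https://example.com/bot)'
    }
  };
  return {
    ...projectDefaults,
    ...overrides,
    // Merge headers key by key so an override does not drop the others.
    requestHeaders: {
      ...projectDefaults.requestHeaders,
      ...(overrides.requestHeaders || {})
    }
  };
}

// Usage, inside an async function:
//   const metadata = await urlMetadata(url, buildOptions({ timeout: 2000 }));
```

Merging `requestHeaders` separately is a design choice of this sketch: a plain spread of `overrides` would replace the whole headers object rather than a single header.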

Returns

Returns a promise resolved with an object. Note that the url field returned will be the last hop in the request chain. If you pass in a url from a url shortener you'll get back the final destination as the url.
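
To check whether a request was redirected, you can compare the final `url` with the url you passed in. The sketch below assumes the returned object also carries a `requestUrl` field (the changelog entry for 3.4.6 describes it) and that both values are absolute URLs. `describeRedirect` is a hypothetical helper, not part of the package.

```javascript
// Hypothetical helper, not part of url-metadata.
// Assumes `requestUrl` and `url` are absolute URL strings.
function describeRedirect(metadata) {
  // Normalize through URL so 'https://a.com' and 'https://a.com/' compare equal.
  const requested = new URL(metadata.requestUrl).href;
  const finalUrl = new URL(metadata.url).href;
  return {
    redirected: requested !== finalUrl,
    from: requested,
    to: finalUrl
  };
}

// Usage, inside an async function:
//   const info = describeRedirect(await urlMetadata(shortLink));
//   if (info.redirected) console.log(`${info.from} -> ${info.to}`);
```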

A basic template for the returned metadata object can be found in lib/metadata-fields.js. Any additional meta tags found on the page are appended as new fields to the object.

The returned metadata object consists of key/value pairs as strings, with a few exceptions:

  • favicons is an array of objects containing key/value pairs of strings
  • jsonld is an array of objects
  • responseHeaders is an object containing key/value pairs of strings
  • all meta tags that begin with citation_ (ex: citation_author) return with keys as strings and values that are an array of strings to conform to the Google Scholar spec which allows for multiple citation meta tags with different content values. So if the html contains:
    <meta name="citation_author" content="Arlitsch, Kenning">
    <meta name="citation_author" content="OBrien, Patrick">
    ... it will return as:
    'citation_author': ["Arlitsch, Kenning", "OBrien, Patrick"],
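
Because most fields are strings while `citation_` fields are arrays, code that consumes the result may want to pull the citation fields out in one step. A minimal sketch follows; `collectCitations` is a hypothetical helper and the sample object is hand-written for illustration, not real scraper output.

```javascript
// Hypothetical helper, not part of url-metadata.
// Gathers every `citation_*` field into one object of string arrays.
function collectCitations(metadata) {
  const citations = {};
  for (const [key, value] of Object.entries(metadata)) {
    if (!key.startsWith('citation_')) continue;
    // The README says these values are arrays; the fallback is only defensive.
    citations[key] = Array.isArray(value) ? value : [String(value)];
  }
  return citations;
}

// Hand-written sample in the shape described above.
const sampleMetadata = {
  title: 'Example article',
  citation_title: ['Example article'],
  citation_author: ['Arlitsch, Kenning', 'OBrien, Patrick']
};

console.log(collectCitations(sampleMetadata).citation_author.join('; '));
// prints "Arlitsch, Kenning; OBrien, Patrick"
```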

Troubleshooting

Issue: DNS Lookup errors. By default, this package's SSRF filtering agent blocks requests to private, link-local, and reserved IP addresses. To change or disable this behavior, pass custom requestFilteringAgentOptions. See the request-filtering-agent package (linked in the Options section above) for the available settings.
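
As a rough sketch of what such custom options might look like: the helper below keeps the default block in place but lets a short list of addresses through. `optionsAllowingHosts` is hypothetical, and `allowIPAddressList` is assumed to be an option of request-filtering-agent; confirm the name and its behavior against that package's documentation before relying on it.

```javascript
// Sketch only. `allowIPAddressList` is assumed to be a
// request-filtering-agent option; check that package's docs.
function optionsAllowingHosts(allowedIps) {
  return {
    requestFilteringAgentOptions: {
      // Intent: keep blocking private ranges, but let these IPs through.
      allowIPAddressList: allowedIps.slice()
    }
  };
}

// Usage, inside an async function (Node.js v18+ per the Options section):
//   const metadata = await urlMetadata(url, optionsAllowingHosts(['127.0.0.1']));
```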

Issue: No fetch implementation found. You're in either an older browser that doesn't have the native fetch API or a Node.js environment that doesn't support node-fetch (Node.js < v6). File a GitHub issue or try downgrading to url-metadata version 2.5.0, which uses the now-deprecated request module.

Issue: Response status code 0 or CORS errors. The fetch request failed at either the network or protocol level. Possible causes:

  • CORS errors. Try changing the mode option (ex: cors, same-origin, etc) or setting the Access-Control-Allow-Origin header on the server response from the url you are requesting if you have access to it.

  • Trying to access an https resource that has invalid certificate, or trying to access an http resource from a page with an https origin.

  • A browser plugin such as an ad-blocker or privacy protector.

Issue: Request returns 404, 403 errors or a CAPTCHA form. Your request may have been blocked by the server because it suspects you are a bot or scraper. Check this list to ensure you're not triggering a block.

CHANGELOG

5.2.2

  • add og:image:alt to metadata returned
  • issue #103: improve Next.js support, add /example-nextjs directory
  • improve error messaging if fetch is undefined (browser vs node.js)

5.2.1

  • just README changes

5.2.0

  • add agent option for Node v6+

5.1.1

  • return relevant responseHeaders with metadata (see lib/extract-headers.js)
  • improve package.json browser bundling support

5.1.0

  • separate entry points for browser and node.js for more efficient bundling (see package.json) & SSRF support
  • switch to node-fetch v2 on node.js side to support proper SSRF filtering
  • add options.compress from node-fetch to our node.js user options
  • add options.size to set a max size for the response in Node.js envs
  • added npm url to the default User-Agent header
  • improve cleanup of memory leaks when a fetch attempt errors out
  • switch /example-typescript to webpack (vs. parcel)
  • update both /example-* dirs to use 5.1.0, ensure build works as expected

5.0.5

  • add /example-vite directory (per issue #99)

5.0.4

  • bugfix issue #99: conditional useAgent for vite (browser) builds & Node.js < v18

5.0.3

  • improve README & readability UX

5.0.2

  • README: clarify support for SSRF prevention / Node.js versioning
  • update default request headers User-Agent string

5.0.1

  • in /example-typescript: update parcel version in devDependencies

5.0.0

  • issue #97: prevent SSRF attacks
  • issue #97: add maxRedirects option to prevent infinite redirect loops

4.1.4

  • bugfix issue #90: ignore meta tags outside of <head> tag

4.1.3

  • issue #90: temporarily remove support for itemprop meta

4.1.2

  • update/fix failing tests

4.1.1

  • support favicons rel='shortcut icon'

4.1.0

  • support json-ld @graph syntax

4.0.1

  • update /example-typescript to use version 4.0.0

4.0.0

  • bugfix: allow multiple json-ld objects. This is a breaking change: previous versions returned jsonld as a single object, but it is now an array of objects.

3.5.6

  • update typescript def so url param can be null when in parseResponseObject option mode
  • add this mode to /example-typescript

3.5.5

  • update example browser usage dir /example-typescript to use parcel instead of browserify
  • README changes

3.5.4

  • add new test for parsing from string to /test/options.test.js
  • add new test as an example to README

3.5.3

  • bug: missing option parseResponseObject from Typescript definition
  • add Checklist to PR template so this doesn't happen again

3.5.2

  • README changes only

3.5.1

  • README changes only

3.5.0

  • new option: parseResponseObject
  • bug: 'unsupported content type' errors hang

3.4.9

  • bugs with ensureSecureImageRequest opt true
    • favicons not obeying opt when scheme is missing ex: '//:'
    • handle img tags w data: URIs

3.4.8

  • improve favicon support & tests

3.4.7

  • return imgTags on page (obey ensureSecureImageRequest opt)
  • bug: update TS Result definition to fit complex/varied json-ld use-cases in wild
  • change from 3.4.4: heading.content -> heading.text

3.4.6

  • return requestUrl (the url the user passed in to this module) alongside url, the final hop in request chain

3.4.5

  • handle multiple meta tags with same key, diff values by concatenating & comma-delimiting in one string
  • bug: fix meta tag charset
  • headings: strip newlines and extra spaces

3.4.4

  • h1-h6 headings

3.4.3

  • tighten up regex in extract-charset.js

3.4.2

  • return lang attribute

3.4.1

  • README, keyword changes

3.4.0

  • opts.decode defaults to auto-detecting charset & supports user-specified charset overrides

3.3.1

  • add troubleshooting section to README

3.3.0

  • citations handling & tests, explainer in README

3.2.0

  • fix title bug

3.1.1

  • update Typescript definitions index.d.ts to account for favicons

3.1.0

  • scrape favicon(s) & add test

3.0.3

  • test: add test for descriptionLength option

3.0.1

  • bug: missing option from index.d.ts

3.0

  • replace request, q modules with js-native fetch and async/await
  • update dependencies
  • add test suite

2.5

2.4

  • Typescript definitions index.d.ts

2.3

2.2.3

  • handle mixed case in options.sourceMap keys

2.2.2

  • fix YouTube source mapping by updating the DOM selector it is derived from
  • better sourceMap example in README

2.2.1

2.2.0

  • add support for metatags: price, pricecurrency, availability
  • add support for metatags with attribute itemprop, in addition to property

2.1.9

  • bugfix: truncated og:image (issue #9)

2.1.8

  • bugfix: bad responses neither rejecting nor resolving

2.1.7

  • README typo in usage instructions

2.1.6

  • add keywords to package.json

2.1.5

  • make options usage more explicit in README