This PowerShell script, named `FilteredLinks.psm1`, defines a module called `MyModule` that contains a function called `Get-FilteredLinks`. The purpose of this function is to scrape a webpage and extract specific hyperlinks based on a defined pattern.
The `Get-FilteredLinks` function performs the following steps:
- Sets the URL of the webpage to scrape as `https://www.esd.whs.mil/Directives/forms/`.
- Uses the `Invoke-WebRequest` cmdlet to send a request to the specified URL and stores the response in the `$response` variable.
- Extracts the hyperlinks from the HTML content of the response using the `$response.Links` property and stores them in the `$links` variable.
- Defines a regular expression pattern (`/Directives/forms/dd\d{4}_\d{4}/`) to match specific links.
- Filters the `$links` array based on the defined pattern using the `-match` operator and stores the filtered links in the `$filteredLinks` variable.
- Iterates over each filtered link and extracts the last section of the URL by splitting it on the `/` delimiter. The extracted last section is stored in the `$lastSection` variable.
- Outputs the last section of each filtered link using the `Write-Output` cmdlet.
- The `Export-ModuleMember` cmdlet is used to export the `Get-FilteredLinks` function from the module.
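
The filter-and-split steps above can be sketched offline as follows. This is a minimal illustration, not the module's actual code: the helper name `Select-FormLinkSection`, the `href` property lookup, and the sample links are assumptions made for the example, and the live `Invoke-WebRequest` call is replaced by hand-built objects so the snippet runs without network access. Trimming the trailing slash before splitting is also an assumption; without it, the last element of the split would be an empty string for URLs that end in `/`.

```powershell
# Hypothetical helper (not part of the original module): applies the same
# pattern filter and "last section" extraction to link objects passed in.
function Select-FormLinkSection {
    param(
        [Parameter(Mandatory)]
        [object[]]$Links,

        # Pattern taken from the description above.
        [string]$Pattern = '/Directives/forms/dd\d{4}_\d{4}/'
    )

    # Keep only the links whose href matches the pattern.
    $filteredLinks = $Links | Where-Object { $_.href -match $Pattern }

    foreach ($link in $filteredLinks) {
        # Assumption: drop the trailing slash so the final split element
        # is the section name rather than an empty string.
        $lastSection = ($link.href.TrimEnd('/') -split '/')[-1]
        Write-Output $lastSection
    }
}

# Made-up sample data standing in for $response.Links.
$sampleLinks = @(
    [pscustomobject]@{ href = '/Directives/forms/dd0001_0499/' },
    [pscustomobject]@{ href = '/Directives/forms/dd0500_0999/' },
    [pscustomobject]@{ href = '/Directives/issuances/' }
)

$sections = Select-FormLinkSection -Links $sampleLinks
```

With this sample input, only the first two links match the pattern, so `$sections` holds their final path segments.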

This PowerShell script, named `GetFormData.ps1`, imports a module named `FilteredLinks.psm1` and uses it to scrape a webpage for specific hyperlinks. It then processes these links to extract and store information about certain PDF files.
The script performs the following steps:
- Imports the `FilteredLinks.psm1` module, which contains the `Get-FilteredLinks` function.
- Sets the base URL and resource path for the webpage to scrape.
- Calls the `Get-FilteredLinks` function to get an array of specific hyperlinks from the webpage.
- Initializes an empty array to store information about the PDF files.
- Iterates over each link:
  - Sends a request to the link URL and stores the response.
  - Extracts all hyperlinks from the response.
  - Filters the hyperlinks to only include those that end in `.pdf` or contain the current link.
  - For each filtered link that does not end in `.pdf`, sends another request to the link URL and extracts all hyperlinks from the response.
  - For each nested link that ends in `.pdf`, extracts the filename from the link and checks if it starts with `dd`. If it does, prints the link URL and adds a file object with the filename and URL to the `files` array.
- Converts the `files` array to JSON format and writes it to a file named `FileMetadata.json`.
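
The PDF filtering and JSON export steps can be sketched roughly as shown below. This is an offline approximation under stated assumptions, not the script itself: the helper name `ConvertTo-FileRecord`, the property names `FileName` and `Url`, the base URL value, the sample hrefs, and the temp-directory output location are all hypothetical, and the web requests are replaced by a fixed list of hrefs.

```powershell
# Hypothetical helper (not from the original script): builds file records
# for hrefs that end in .pdf and whose filename starts with "dd".
function ConvertTo-FileRecord {
    param(
        [string[]]$Hrefs,
        [string]$BaseUrl
    )

    foreach ($href in $Hrefs) {
        # Skip anything that is not a PDF link.
        if ($href -notmatch '\.pdf$') { continue }

        # The filename is the last segment of the path.
        $fileName = ($href -split '/')[-1]

        # -like is case-insensitive by default, so "DD0001.pdf" would pass too.
        if ($fileName -like 'dd*') {
            [pscustomobject]@{
                FileName = $fileName          # assumed property name
                Url      = $BaseUrl + $href   # assumed property name
            }
        }
    }
}

# Made-up hrefs standing in for the links scraped from each page.
$sampleHrefs = @(
    '/example/dd0001.pdf',
    '/example/dd0002.pdf',
    '/example/readme.pdf',
    '/Directives/forms/dd0001_0499/'
)

# Wrap in @() so the result is an array even if only one record is returned.
$files = @(ConvertTo-FileRecord -Hrefs $sampleHrefs -BaseUrl 'https://www.esd.whs.mil')

# Assumption: write to the temp directory here to avoid cluttering the
# working directory; the description only names the file FileMetadata.json.
$outPath = Join-Path ([System.IO.Path]::GetTempPath()) 'FileMetadata.json'

# -InputObject keeps the output a JSON array rather than unrolling it.
ConvertTo-Json -InputObject $files | Set-Content -Path $outPath
```

Passing the array through `-InputObject` rather than the pipeline is a deliberate choice in this sketch: piping a single-element array to `ConvertTo-Json` produces a bare JSON object instead of an array, which would change the shape of the output file depending on how many PDFs were found.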